Multi-Objective Optimisation and the Pareto Front

Many software engineering tasks involve trade-offs (e.g., Cost vs Coverage in testing, or Cohesion vs Coupling in design). Since no single “perfect” solution exists, we look for a Pareto front.

Pareto Dominance Relation

A solution $x_1$ dominates a solution $x_2$ (denoted $x_1 \prec x_2$) if and only if:

  1. $f_i(x_1) \le f_i(x_2)$ for all objectives $i$ (it is at least as good in all objectives).
  2. There exists an objective $j$ such that $f_j(x_1) < f_j(x_2)$ (it is strictly better in at least one objective).
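The dominance relation can be sketched directly in code. This is a minimal illustration (assuming minimisation of every objective; the function names are mine, not from any library):

```python
def dominates(a, b):
    """Return True if objective vector `a` Pareto-dominates `b`.

    Assumes minimisation: lower values are better in every objective.
    """
    at_least_as_good = all(x <= y for x, y in zip(a, b))  # condition 1
    strictly_better = any(x < y for x, y in zip(a, b))    # condition 2
    return at_least_as_good and strictly_better


def pareto_front(solutions):
    """Filter a list of objective vectors down to the non-dominated set."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]
```

For example, `pareto_front([(1, 5), (2, 2), (3, 1), (4, 4)])` drops `(4, 4)`, which `(2, 2)` dominates, and keeps the three mutually non-dominated trade-offs.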

Tip

Multi-objective Search-Based Software Engineering (SBSE) aims to find a Pareto front approximation: a set of non-dominated solutions that represent the best possible trade-offs.

Quantitative Evaluation Indicators

To compare how well different algorithms perform, we use specific metrics that measure different quality aspects.

Key Metrics

  • Generational Distance (GD): Measures the average distance from the solutions in your set to the nearest solution on the true Pareto front.
    • Formula: $GD = \frac{1}{|S|} \left( \sum_{i=1}^{|S|} d_i^2 \right)^{1/2}$, where $d_i$ is the Euclidean distance from the $i$-th solution in your set $S$ to the nearest point on the true front.
  • Inverted Generational Distance (IGD): Measures the average distance from the true Pareto front to your solution set. Unlike GD, IGD reflects both convergence and spread.
  • Hypervolume (HV): Measures the total volume of the objective space “covered” by the solution set relative to a reference point. It is a comprehensive indicator reflecting convergence, spread, and uniformity.
  • C-Metric: A pairwise comparison between two sets, calculating the proportion of solutions in one set that are dominated by at least one solution in the other.
  • Spread: Measures the diversity and distribution of solutions across the front.
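GD and IGD are easy to compute once you have a reference front. Below is a small sketch (pure Python, minimisation assumed; `gd` follows the squared-sum form of the formula above, while `igd` uses a plain average, which is the common convention):

```python
import math


def _min_dist(point, front):
    """Euclidean distance from `point` to its nearest neighbour in `front`."""
    return min(math.dist(point, q) for q in front)


def gd(approx, true_front):
    """Generational Distance: how close the approximation set is, on
    average, to the true Pareto front (convergence only)."""
    return (sum(_min_dist(p, true_front) ** 2 for p in approx)) ** 0.5 / len(approx)


def igd(approx, true_front):
    """Inverted GD: average distance from each true-front point to the
    approximation set, so gaps in coverage are also penalised."""
    return sum(_min_dist(q, approx) for q in true_front) / len(true_front)
```

Note the asymmetry: a single well-converged solution can score a good GD, but its IGD will be poor because it leaves most of the true front uncovered.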

Decision Maker (DM) Preferences

Evaluation should align with the user’s needs.

  • Clear Preference: Use weighted sums or utility functions if the relative importance of objectives is known.
  • Vague Preferences: Attempt to integrate tolerances (e.g., “ideally up to 3000 users”) into indicators or solution sets.
  • Specific Regions: If the DM wants a balanced trade-off, use indicators that favour knee points (like HV). If they want extremes (best in one objective regardless of others), use HV with a distant reference point.

Important

If preferences are unavailable, a good solution set must excel in four areas: Convergence (closeness to the front), Spread (coverage), Uniformity (even distribution), and Cardinality (number of solutions).

Comparing Stochastic Algorithms

Intelligent algorithms are stochastic: they use pseudorandom numbers for initial populations, mutations, etc. Running an algorithm once is insufficient because the result depends on the random seed.

Best Practices for Comparison

  • Multiple Runs: Execute the algorithm at least 30 times with different seeds to observe “typical” behaviour.
  • Robust Statistics: Use Medians and Quartiles rather than Means and Standard Deviations. Medians are less affected by outliers (extreme, unusual values).
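The standard library is enough to summarise a batch of runs robustly. In this sketch the run values are hypothetical hypervolume scores, with one inflated outlier to show why the median is preferred:

```python
import statistics

# Hypothetical HV scores from repeated runs; 0.99 is an outlier.
runs = [0.71, 0.72, 0.70, 0.73, 0.69, 0.72, 0.71, 0.99]

mean = statistics.mean(runs)        # pulled upward by the outlier
median = statistics.median(runs)    # barely moves
q1, q2, q3 = statistics.quantiles(runs, n=4)  # quartiles
iqr = q3 - q1                       # robust spread measure
```

Here the mean lands near 0.75, well above every typical run, while the median stays at 0.715; reporting medians and the interquartile range keeps one lucky (or unlucky) seed from distorting the comparison.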

Statistical Hypothesis Testing

A scientific method to determine if Algorithm A is truly better than Algorithm B.

The P-Value and Level of Significance

  • Null Hypothesis ($H_0$): Assumes there is no difference between groups.
  • Alternative Hypothesis ($H_1$): Assumes a significant difference exists.
  • P-value: The probability of observing your results if $H_0$ were true.
  • If $p < \alpha$ (commonly $\alpha = 0.05$): Reject $H_0$. The difference is statistically significant.
  • If $p \ge \alpha$: Do not reject $H_0$. Any difference might be due to chance.
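In practice the named tests below come from a statistics library, but the p-value idea itself can be illustrated with a simple permutation test (a sketch of my own, not one of the tests in the table): if the two groups really came from the same distribution, relabelling the observations at random should produce differences as large as the observed one quite often.

```python
import random


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an approximate p-value: the fraction of random relabellings
    whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            count += 1
    return count / n_perm
```

Two identical samples yield a p-value of 1.0 (every relabelling is at least as extreme), while two clearly separated samples yield a p-value near 0, so $H_0$ would be rejected at $\alpha = 0.05$.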

Choosing the Right Test

The choice depends on your data distribution and experimental setup:

| Data Type | 2 Groups (Independent) | 2 Groups (Paired/Related) | N Groups ($N > 2$) |
| --- | --- | --- | --- |
| Parametric (Normal Dist.) | Unpaired t-test | Paired t-test | ANOVA |
| Non-Parametric (No Normality) | Wilcoxon Rank-Sum (Mann-Whitney U) | Wilcoxon Signed-Rank | Kruskal-Wallis / Friedman |

WARNING

Common Mistake: Using parametric tests (like t-tests) when your data is not normally distributed. Non-parametric tests (like Wilcoxon) are more popular in SE because they are more robust to violations of assumptions.

Multiple Comparisons and Corrections

Running many pairwise tests (e.g., comparing 10 different algorithms) increases the risk of a Type I error: finding a “significant” difference where none exists (a false positive).

Correction Methods:

  1. Bonferroni Correction: Divide your significance level by the number of comparisons.
    • Adjusted $\alpha' = \alpha / n$, where $n$ is the number of comparisons.
  2. Post-hoc Tests: $N$-group tests (like Kruskal-Wallis) tell you that a difference exists but not where. Use post-hoc tests like Dunn or Nemenyi to identify which specific pairs are different.