Multi-Objective Optimisation and the Pareto Front

Many software engineering tasks involve trade-offs (e.g., Cost vs Coverage in testing, or Cohesion vs Coupling in design). Since no single “perfect” solution exists, we look for a Pareto front.

Pareto Dominance Relation

A solution $x_1$ dominates a solution $x_2$ (denoted $x_1 \prec x_2$) if and only if:

  1. $f_i(x_1) \le f_i(x_2)$ for all objectives $i$ (it is at least as good in all objectives).
  2. There exists an objective $j$ such that $f_j(x_1) < f_j(x_2)$ (it is strictly better in at least one objective).
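The dominance relation can be sketched directly in code. This is a minimal illustration (assuming minimisation of every objective; the function names are mine, not from any library):

```python
def dominates(a, b):
    """Return True if objective vector `a` Pareto-dominates `b`.

    Assumes minimisation: lower values are better in every objective.
    """
    at_least_as_good = all(x <= y for x, y in zip(a, b))  # condition 1
    strictly_better = any(x < y for x, y in zip(a, b))    # condition 2
    return at_least_as_good and strictly_better


def pareto_front(solutions):
    """Filter a list of objective vectors down to the non-dominated set."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]
```

For example, `pareto_front([(1, 5), (2, 2), (3, 1), (4, 4)])` drops `(4, 4)`, which `(2, 2)` dominates, and keeps the three mutually non-dominated trade-offs.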

Tip

Multi-objective Search-Based Software Engineering (SBSE) aims to find a Pareto front approximation: a set of non-dominated solutions that represent the best possible trade-offs.

Quantitative Evaluation Indicators

To compare how well different algorithms perform, we use specific metrics that measure different quality aspects.

Key Metrics

  • Generational Distance (GD): Measures the average distance from the solutions in your set to the nearest solution on the true Pareto front.
    • Formula: $GD = \frac{1}{|S|} \left( \sum_{i=1}^{|S|} d_i^2 \right)^{1/2}$, where $d_i$ is the Euclidean distance from the $i$-th solution in your set $S$ to the nearest point on the true front.
  • Inverted Generational Distance (IGD): Measures the average distance from the true Pareto front to your solution set. Unlike GD, IGD reflects both convergence and spread.
  • Hypervolume (HV): Measures the total volume of the objective space “covered” by the solution set relative to a reference point. It is a comprehensive indicator reflecting convergence, spread, and uniformity.
  • C-Metric: A pairwise comparison between two sets, calculating the proportion of solutions in one set that are dominated by at least one solution in the other.
  • Spread: Measures the diversity and distribution of solutions across the front.
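GD and IGD are easy to compute once you have a reference front. Below is a small sketch (pure Python, minimisation assumed; `gd` follows the squared-sum form of the formula above, while `igd` uses a plain average, which is the common convention):

```python
import math


def _min_dist(point, front):
    """Euclidean distance from `point` to its nearest neighbour in `front`."""
    return min(math.dist(point, q) for q in front)


def gd(approx, true_front):
    """Generational Distance: how close the approximation set is, on
    average, to the true Pareto front (convergence only)."""
    return (sum(_min_dist(p, true_front) ** 2 for p in approx)) ** 0.5 / len(approx)


def igd(approx, true_front):
    """Inverted GD: average distance from each true-front point to the
    approximation set, so gaps in coverage are also penalised."""
    return sum(_min_dist(q, approx) for q in true_front) / len(true_front)
```

Note the asymmetry: a single well-converged solution can score a good GD, but its IGD will be poor because it leaves most of the true front uncovered.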

Decision Maker (DM) Preferences

Evaluation should align with the user’s needs.

  • Clear Preference: Use weighted sums or utility functions if the relative importance of objectives is known.
  • Vague Preferences: Attempt to integrate tolerances (e.g., “ideally up to 3000 users”) into indicators or solution sets.
  • Specific Regions: If the DM wants a balanced trade-off, use indicators that favour knee points (like HV). If they want extremes (best in one objective regardless of others), use HV with a distant reference point.

Important

If preferences are unavailable, a good solution set must excel in four areas: Convergence (closeness to the front), Spread (coverage), Uniformity (even distribution), and Cardinality (number of solutions).

Comparing Stochastic Algorithms

Intelligent algorithms are stochastic: they use pseudorandom numbers for initial populations, mutations, etc. Running an algorithm once is insufficient because the result depends on the random seed.

Best Practices for Comparison

  • Multiple Runs: Execute the algorithm at least 30 times with different seeds to observe “typical” behaviour.
  • Robust Statistics: Use Medians and Quartiles rather than Means and Standard Deviations. Medians are less affected by outliers (extreme, unusual values).
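The standard library is enough to summarise a batch of runs robustly. In this sketch the run values are hypothetical hypervolume scores, with one inflated outlier to show why the median is preferred:

```python
import statistics

# Hypothetical HV scores from repeated runs; 0.99 is an outlier.
runs = [0.71, 0.72, 0.70, 0.73, 0.69, 0.72, 0.71, 0.99]

mean = statistics.mean(runs)        # pulled upward by the outlier
median = statistics.median(runs)    # barely moves
q1, q2, q3 = statistics.quantiles(runs, n=4)  # quartiles
iqr = q3 - q1                       # robust spread measure
```

Here the mean lands near 0.75, well above every typical run, while the median stays at 0.715; reporting medians and the interquartile range keeps one lucky (or unlucky) seed from distorting the comparison.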

Statistical Hypothesis Testing

A scientific method to determine if Algorithm A is truly better than Algorithm B.

The P-Value and Level of Significance

  • Null Hypothesis ($H_0$): Assumes there is no difference between groups.
  • Alternative Hypothesis ($H_1$): Assumes a significant difference exists.
  • P-value: The probability of observing your results if $H_0$ were true.
  • If $p < \alpha$ (commonly $\alpha = 0.05$): Reject $H_0$. The difference is statistically significant.
  • If $p \ge \alpha$: Do not reject $H_0$. Any difference might be due to chance.
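In practice the named tests below come from a statistics library, but the p-value idea itself can be illustrated with a simple permutation test (a sketch of my own, not one of the tests in the table): if the two groups really came from the same distribution, relabelling the observations at random should produce differences as large as the observed one quite often.

```python
import random


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an approximate p-value: the fraction of random relabellings
    whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            count += 1
    return count / n_perm
```

Two identical samples yield a p-value of 1.0 (every relabelling is at least as extreme), while two clearly separated samples yield a p-value near 0, so $H_0$ would be rejected at $\alpha = 0.05$.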

Choosing the Right Test

The choice depends on your data distribution and experimental setup:

| Data Type | 2 Groups (Independent) | 2 Groups (Paired/Related) | N Groups ($N > 2$) |
| --- | --- | --- | --- |
| Parametric (Normal Dist.) | Unpaired t-test | Paired t-test | ANOVA |
| Non-Parametric (No Normality) | Wilcoxon Rank-Sum (Mann-Whitney U) | Wilcoxon Signed-Rank | Kruskal-Wallis / Friedman |

WARNING

Common Mistake: Using parametric tests (like t-tests) when your data is not normally distributed. Non-parametric tests (like Wilcoxon) are more popular in SE because they are more robust to violations of assumptions.

Multiple Comparisons and Corrections

Running many pairwise tests (e.g., comparing 10 different algorithms) increases the risk of a Type I error: finding a “significant” difference where none exists (a false positive).

Correction Methods:

  1. Bonferroni Correction: Divide your significance level by the number of comparisons.
    • Adjusted $\alpha' = \alpha / n$, where $n$ is the number of comparisons.
  2. Post-hoc Tests: $N$-group tests (like Kruskal-Wallis) tell you that a difference exists but not where. Use post-hoc tests like Dunn or Nemenyi to identify which specific pairs are different.