Fundamentals of Defect Prediction

The goal of defect prediction is to build a mathematical function that predicts whether a software component (a class, method, or file) contains a bug, based on its features.

Why Predict Defects?

  • Cost Efficiency: Finding and correcting defects early is significantly cheaper than post-release maintenance.
  • Resource Allocation: Helps teams focus testing efforts on “buggy” areas.

Core Challenges

  • Identifying reliable predictors/features.
  • Preventing overfitting to specific project data.
  • Validating predictions in real-time (JIT).
  • Adapting models to different project needs.

Classic Complexity Metrics

Early research focused on identifying code metrics that correlate with defect density.

McCabe Cyclomatic Complexity

This measures complexity based on the number of linearly independent paths through the code’s flow graph.

  • Complex Formula: V(G) = E - N + 2P
    • Where E is the number of edges, N the number of nodes, and P the number of connected components of the control-flow graph.
  • Simple Calculation: V(G) = number of decision points + 1.

Tip

Simple Explanation: count one decision point for each if, while, for and do...while without compound logic. For compound conditions (e.g., if (a > b && c > d)), you must count each individual condition separately.

  • Scope Limitation: When calculating complexity, do not “go deep” into external libraries or classes; treat them as single units.
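The counting rules above can be sketched as a small Python function. This is an illustrative approximation using Python's `ast` module, not a production metric tool: each `if`/`while`/`for` counts as one decision point, and each extra operand in a compound boolean condition adds one more.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: decision points + 1."""
    tree = ast.parse(source)
    decisions = 0
    for node in ast.walk(tree):
        # Each if/while/for counts as one decision point.
        if isinstance(node, (ast.If, ast.While, ast.For)):
            decisions += 1
        # Compound conditions: each extra and/or operand adds a path.
        elif isinstance(node, ast.BoolOp):
            decisions += len(node.values) - 1
    return decisions + 1

code = """
def check(a, b, c, d):
    if a > b and c > d:
        return True
    for i in range(10):
        pass
    return False
"""
print(cyclomatic_complexity(code))  # 4: if (+1), and (+1), for (+1), base (+1)
```

Note that external calls inside the function body are treated as single units, matching the scope limitation above.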

Halstead Complexity Measures

These focus on the physical structure of the code, specifically operators (e.g., +, %, if) and operands (e.g., variables, constants).

  • Definitions:
    • n1, n2: number of distinct operators/operands.
    • N1, N2: total count of operators/operands.
  • Calculations:
    • Vocabulary: n = n1 + n2.
    • Length: N = N1 + N2.
    • Volume: V = N × log2(n).
    • Difficulty: D = (n1 / 2) × (N2 / n2).
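The calculations above can be checked with a short Python sketch. The token lists here are illustrative stand-ins for what a real parser would extract:

```python
import math

# Illustrative pre-extracted tokens (a real tool derives these from a parser).
operators = ['=', '+', '=', '*', 'if', '>']
operands = ['x', 'a', 'b', 'y', 'x', '2', 'x', '0']

n1, n2 = len(set(operators)), len(set(operands))  # distinct counts
N1, N2 = len(operators), len(operands)            # total counts

vocabulary = n1 + n2                       # n = n1 + n2 -> 11
length = N1 + N2                           # N = N1 + N2 -> 14
volume = length * math.log2(vocabulary)    # V = N * log2(n)
difficulty = (n1 / 2) * (N2 / n2)          # D = (n1/2) * (N2/n2)
```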

Within-Project Prediction (WPDP)

WPDP involves training and testing a model using data from the same project to capture its specific data distribution.

Just-In-Time (JIT) Defect Prediction

This approach predicts bugs at the commit level, allowing developers to fix issues immediately as they write code.

The SZZ Algorithm Process:

  1. Identify Fixes: Search change logs for keywords like “fix” or “bug”.
  2. Extract Hunks: Use a delta tool to find modified code regions (hunks) in the bug fix.
  3. Trace Origins: Track modified lines back in the revision history to find when they were first introduced.
  4. Label Changes: Mark the discovered origin as a “bug-introducing change” and other changes as “clean”.
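Step 1 of SZZ is a keyword search over commit messages, which can be sketched as follows. The commit log is hard-coded here for illustration; a real implementation would read it from `git log`:

```python
import re

# Keywords typically used by SZZ to spot fix commits.
FIX_PATTERN = re.compile(r'\b(fix(e[sd])?|bug)\b', re.IGNORECASE)

# Illustrative (hash, message) pairs standing in for a real change log.
commits = [
    ("a1b2c3", "Add caching layer"),
    ("d4e5f6", "Fix null pointer in parser"),
    ("789abc", "Refactor tests"),
]

# Keep only commits whose message mentions a fix or a bug.
fix_commits = [sha for sha, msg in commits if FIX_PATTERN.search(msg)]
print(fix_commits)  # ['d4e5f6']
```

Steps 2-4 would then diff each fix commit and trace the modified lines backwards (e.g., via `git blame`) to label the originating commits.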

Machine Learning Model: Support Vector Machines (SVM)

SVM is commonly used for binary classification (Buggy vs. Clean).

  • Intuition: It tries to find the optimal “hyperplane” (separator) that maximises the margin between classes in the feature space.
  • Support Vectors: These are the training examples closest to the hyperplane; only these points determine the separator’s position.
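A minimal sketch with scikit-learn, using two made-up features (e.g., complexity and lines changed) and toy labels, shows the hyperplane and support vectors in practice:

```python
import numpy as np
from sklearn.svm import SVC

# Toy feature matrix: [complexity, lines changed] per component (illustrative).
X = np.array([[2, 10], [3, 15], [1, 5], [12, 200], [15, 250], [10, 180]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = clean, 1 = buggy

# A linear kernel finds the hyperplane maximising the margin.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[11, 190]]))  # predicted buggy: [1]
print(clf.support_vectors_)      # only these points fix the separator
```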

Cross-Project Prediction (HDP)

When a project lacks historical data, models must be trained on a source project and applied to a different target project. This is difficult because metrics often mismatch between projects.

Heterogeneous Defect Prediction (HDP) Algorithm

  1. Metric Selection: Identify the most important metrics in the source project using techniques like Chi-Square feature selection.
  2. Metric Matching: Pair each selected source metric with a candidate target metric.
  3. Selection Logic: Use maximum weighted bipartite matching to select pairs with the highest matching scores without duplicates.
  4. Matching Score Calculation:
    1. Percentile Method: Compare nine percentiles (10th-90th) of metric values.
    2. KS-Test: Use a non-parametric Kolmogorov–Smirnov test to return a p-value indicating similarity.
    3. Spearman Correlation: Measure how the ranks of two samples correlate.
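The KS-test scoring in step 4 can be illustrated with SciPy. The metric distributions here are synthetic: one target metric is drawn from the same distribution as the source, the other is shifted, and the similar pair gets the higher p-value:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Illustrative metric value distributions.
source = rng.normal(loc=10, scale=2, size=200)
target_similar = rng.normal(loc=10, scale=2, size=200)
target_shifted = rng.normal(loc=50, scale=2, size=200)

# Higher p-value means the two samples are harder to tell apart,
# so this pair would win in the weighted bipartite matching.
p_similar = ks_2samp(source, target_similar).pvalue
p_shifted = ks_2samp(source, target_shifted).pvalue
print(p_similar > p_shifted)  # True
```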

Learned Representation (Deep Learning)

Classic metrics are “hand-crafted” and often fail to capture the “semantics” (meaning) of code.

Deep Belief Networks (DBN)

DBNs automatically learn semantic features from source code, applicable to both WPDP and CPDP.

Step-by-Step Workflow:

  1. Parsing: Convert source code into an Abstract Syntax Tree (AST).
  2. Token Extraction: Extract nodes for method invocations, declarations, and control-flow (e.g., if, while).
  3. Mapping: Convert tokens into unique integer IDs.
  4. Normalisation: Pad vectors to equal lengths.
  5. Unsupervised Learning: The DBN adjusts weights to “reconstruct” the input data, generating condensed semantic features.
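Steps 3 and 4 of the workflow (mapping and normalisation) can be sketched directly. The token lists here are illustrative, standing in for tokens extracted from a real AST; ID 0 is reserved for padding:

```python
# Tokens extracted per file (illustrative; step 2's output).
files = [
    ["if", "parse", "while", "emit"],
    ["for", "init"],
]

# Step 3: map each distinct token to a unique integer ID (0 = padding).
vocab = {}
for tokens in files:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)

# Step 4: encode every file and pad to the longest vector length.
max_len = max(len(t) for t in files)
encoded = [[vocab[t] for t in tokens] + [0] * (max_len - len(tokens))
           for tokens in files]
print(encoded)  # [[1, 2, 3, 4], [5, 6, 0, 0]]
```

These equal-length integer vectors are what the DBN consumes in step 5.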

Explainable Defect Prediction

Black-box models are often distrusted by developers. Explainability provides the “why” behind a “buggy” prediction.

PyExplainer

A model-agnostic framework that provides local, rule-based explanations.

The Process:

  1. Neighbour Generation: Identify “actual neighbours” in training data and create synthetic neighbours using crossover and mutation.
  2. Global Observation: Use the complex global model to predict labels for these synthetic neighbours to learn its behaviour.
  3. Local Modelling: Build a RuleFit model (combination of decision trees and linear regression) on this local neighbourhood.
  4. Explanation Extraction: Provides rules, importance scores, and coefficients.
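Step 1 (synthetic neighbour generation) can be sketched as crossover and mutation over feature rows. The feature rows and parameters here are illustrative, not PyExplainer's actual implementation:

```python
import random

random.seed(42)

# Illustrative actual-neighbour rows: [churn, reviewers, complexity].
neighbours = [[120, 1, 8], [90, 2, 6], [150, 1, 9]]

def crossover(a, b):
    # Take each feature randomly from one of the two parent rows.
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(row, scale=0.1):
    # Perturb one randomly chosen feature by up to +/- scale of its value.
    i = random.randrange(len(row))
    out = row[:]
    out[i] += out[i] * random.uniform(-scale, scale)
    return out

synthetic = [mutate(crossover(random.choice(neighbours),
                              random.choice(neighbours)))
             for _ in range(5)]
print(len(synthetic))  # 5 synthetic neighbours
```

The global model then labels these rows (step 2), and a RuleFit model trained on them yields the local rules (steps 3-4).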

TIP

Example Explanation: Churn > 100 & #Reviewers < 2 => DEFECT. This indicates the commit is buggy because it changed too many lines and had too few reviewers.