Fundamentals of Defect Prediction
The goal of defect prediction is to build a mathematical function that predicts whether a software component (a class, method, or file) contains a bug, based on its features.
Why Predict Defects?
- Cost Efficiency: Finding and correcting defects early is significantly cheaper than post-release maintenance.
- Resource Allocation: Helps teams focus testing efforts on "buggy" areas.
Core Challenges
- Identifying reliable predictors/features.
- Preventing overfitting to specific project data.
- Validating predictions in real-time (JIT).
- Adapting models to different project needs.
Classic Complexity Metrics
Early research focused on identifying code metrics that correlate with defect density.
McCabe Cyclomatic Complexity
This measures complexity based on the number of linearly independent paths through the code's control-flow graph.
- Complex Formula: $M = E - N + 2P$
- Where $E$ is the number of edges, $N$ the number of nodes, and $P$ the number of connected components of the flow graph.
- Simple Calculation: $M = \text{number of decision points} + 1$
Tip
Simple Explanation:
Count one decision point for each `if`, `while`, `for` and `do...while` without compound logic. For compound conditions (e.g., `if (a > b && c > d)`), you must count each individual condition separately.
- Scope Limitation: When calculating complexity, do not "go deep" into external libraries or classes; treat them as single units.
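The counting rules above can be sketched in a few lines of Python. This is a minimal, keyword-based approximation (the function name and regex are illustrative, not a standard tool); per the tip, each extra `&&`/`||` clause in a compound condition adds its own decision point.

```python
import re

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 + number of decision points.

    Counts branching keywords, plus one extra decision point for each
    '&&' / '||' clause inside a compound condition.
    """
    decisions = len(re.findall(r"\b(if|while|for|case)\b", source))
    decisions += source.count("&&") + source.count("||")
    return decisions + 1

snippet = """
if (a > b && c > d) { x = 1; }
while (n > 0) { n--; }
"""
# if + one extra '&&' clause + while = 3 decisions -> complexity 4
print(cyclomatic_complexity(snippet))  # 4
```

A real tool would walk the control-flow graph rather than pattern-match text, but the decision-point count is the same idea.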
Halstead Complexity Measures
These focus on the physical structure of the code, specifically operators (e.g., +, %, if) and operands (e.g., variables, constants).
- Definitions:
    - $n_1$, $n_2$: Number of distinct operators / distinct operands.
    - $N_1$, $N_2$: Total count of operators / operands.
- Calculations:
    - Vocabulary: $n = n_1 + n_2$.
    - Length: $N = N_1 + N_2$.
    - Volume: $V = N \log_2 n$.
    - Difficulty: $D = \frac{n_1}{2} \cdot \frac{N_2}{n_2}$.
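Given the four counts, the measures are direct arithmetic. A minimal sketch (the `halstead` helper is illustrative; obtaining the counts from real code requires a lexer, which is omitted here):

```python
import math

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead measures from operator/operand counts.

    n1/n2: distinct operators/operands; N1/N2: total occurrences.
    """
    vocabulary = n1 + n2                      # n = n1 + n2
    length = N1 + N2                          # N = N1 + N2
    volume = length * math.log2(vocabulary)   # V = N * log2(n)
    difficulty = (n1 / 2) * (N2 / n2)         # D = (n1/2) * (N2/n2)
    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty}

# e.g. a snippet with 4 distinct operators, 3 distinct operands,
# 7 operator occurrences and 5 operand occurrences:
m = halstead(n1=4, n2=3, N1=7, N2=5)
print(m["vocabulary"], m["length"])  # 7 12
```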
Within-Project Prediction (WPDP)
WPDP involves training and testing a model using data from the same project to capture its specific data distribution.
Just-In-Time (JIT) Defect Prediction
This approach predicts bugs at the commit level, allowing developers to fix issues immediately as they write code.
The SZZ Algorithm Process:
- Identify Fixes: Search change logs for keywords like "fix" or "bug".
- Extract Hunks: Use a delta tool to find modified code regions (hunks) in the bug fix.
- Trace Origins: Track modified lines back in the revision history to find when they were first introduced.
- Label Changes: Mark the discovered origin as a "bug-introducing change" and other changes as "clean".
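The first and last SZZ steps can be illustrated on toy data. This sketch stands in for a real implementation: `log` plays the role of `git log` output and `blame` plays the role of per-line blame information; all names and the keyword regex are assumptions for illustration.

```python
import re

# Step 1: keyword search over commit messages (pattern is illustrative).
BUG_FIX_PATTERN = re.compile(r"\b(fix(e[sd])?|bug)\b", re.IGNORECASE)

def find_fix_commits(log):
    """Return ids of commits whose message mentions a fix/bug.

    `log` is a list of (commit_id, message) pairs, standing in for
    the project's real change log.
    """
    return [cid for cid, msg in log if BUG_FIX_PATTERN.search(msg)]

# Steps 3-4: trace the lines touched by a fix back to their origin.
def label_bug_introducing(fix_hunk_lines, blame):
    """`blame` maps a line to the commit that last touched it; the
    origins of lines modified by a fix are labelled bug-introducing."""
    return {blame[line] for line in fix_hunk_lines if line in blame}

log = [("c1", "add login feature"), ("c2", "fix NPE in parser")]
print(find_fix_commits(log))  # ['c2']
```

Step 2 (extracting hunks with a delta tool such as `diff`) is where the `fix_hunk_lines` input would come from in practice.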
Machine Learning Model: Support Vector Machines (SVM)
SVM is commonly used for binary classification (Buggy vs. Clean).
- Intuition: It tries to find the optimal "hyperplane" (separator) that maximises the margin between classes in the feature space.
- Support Vectors: These are the training examples closest to the hyperplane; only these points determine the separator's position.
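A minimal worked example, assuming scikit-learn is available. The toy metric vectors and labels are invented for illustration; note that only the support vectors (the points nearest the margin) determine the fitted separator.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy feature vectors, e.g. [complexity, churn], labelled clean (0) / buggy (1).
X = np.array([[1, 5], [2, 4], [8, 30], [9, 28], [2, 6], [10, 35]])
y = np.array([0, 0, 1, 1, 0, 1])

# Linear kernel: find the max-margin hyperplane in the feature space.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[9, 29]]))   # near the buggy cluster -> predicted 1
print(clf.support_vectors_)     # only these points fix the separator
```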
Cross-Project Defect Prediction (CPDP)
When a project lacks historical data, models must be trained on a source project and applied to a different target project. This is difficult because the two projects often collect different metric sets.
Heterogeneous Defect Prediction (HDP) Algorithm
- Metric Selection: Identify the most important metrics in the source project using feature-selection techniques like Chi-Square.
- Metric Matching: Pair each selected source metric with a candidate target metric.
- Selection Logic: Use maximum weighted bipartite matching to select pairs with the highest matching scores without duplicates.
- Matching Score Calculation:
- Percentile Method: Compare nine percentiles (10th-90th) of metric values.
- KS-Test: Use the non-parametric Kolmogorov-Smirnov test to return a p-value indicating similarity.
- Spearman Correlation: Measure how the ranks of two samples correlate.
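The KS-test variant of the matching score can be sketched with SciPy (assumed to be available; the helper name and synthetic metric distributions are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp  # assumes SciPy is installed

def ks_matching_score(source_metric, target_metric):
    """Matching score between one source and one target metric:
    the KS-test p-value (higher means more similar distributions)."""
    return ks_2samp(source_metric, target_metric).pvalue

rng = np.random.default_rng(0)
src = rng.normal(10, 2, 200)            # a source metric's values
tgt_similar = rng.normal(10, 2, 200)    # similarly distributed target metric
tgt_different = rng.normal(50, 5, 200)  # very differently distributed one

# The similar pair should score higher than the dissimilar pair.
print(ks_matching_score(src, tgt_similar) >
      ks_matching_score(src, tgt_different))  # True
```

Once every source/target pair has a score, the maximum weighted bipartite matching step can be solved with an assignment algorithm (e.g. `scipy.optimize.linear_sum_assignment` on the negated score matrix).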
Learned Representation (Deep Learning)
Classic metrics are "hand-crafted" and often fail to capture the "semantics" (meaning) of code.
Deep Belief Networks (DBN)
DBNs automatically learn semantic features from source code, applicable to both WPDP and CPDP.
Step-by-Step Workflow:
- Parsing: Convert source code into an Abstract Syntax Tree (AST).
- Token Extraction: Extract AST nodes for method invocations, declarations, and control-flow statements (e.g., `if`, `while`).
- Mapping: Convert tokens into unique integer IDs.
- Normalisation: Pad vectors to equal lengths.
- Unsupervised Learning: The DBN adjusts weights to "reconstruct" the input data, generating condensed semantic features.
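The mapping and normalisation steps (before the DBN itself) can be sketched directly; the helper name and toy token lists are illustrative, with `0` assumed as the padding ID:

```python
def tokens_to_padded_ids(token_lists):
    """Map extracted AST tokens to integer IDs and pad to equal length.

    Each token gets a unique ID on first sight; 0 is reserved for padding.
    Returns the padded vectors and the token-to-ID vocabulary.
    """
    vocab = {}
    encoded = []
    for tokens in token_lists:
        ids = []
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1  # IDs start at 1
            ids.append(vocab[tok])
        encoded.append(ids)
    max_len = max(len(ids) for ids in encoded)
    return [ids + [0] * (max_len - len(ids)) for ids in encoded], vocab

# Tokens extracted from two source files (toy example):
files = [["if", "foo()", "while"], ["bar()", "if"]]
vectors, vocab = tokens_to_padded_ids(files)
print(vectors)  # [[1, 2, 3], [4, 1, 0]]
```

These fixed-length integer vectors are what the DBN consumes in the unsupervised learning step.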
Explainable Defect Prediction
Black-box models are often untrusted. Explainability provides the "why" behind a "buggy" prediction.
PyExplainer
A model-agnostic framework that provides local, rule-based explanations.
The Process:
- Neighbour Generation: Identify "actual neighbours" in the training data and create synthetic neighbours using crossover and mutation.
- Global Observation: Use the complex global model to predict labels for these synthetic neighbours to learn its behaviour.
- Local Modelling: Build a RuleFit model (combination of decision trees and linear regression) on this local neighbourhood.
- Explanation Extraction: Provides rules, importance scores, and coefficients.
Tip
Example Explanation:
Churn > 100 & #Reviewers < 2 => DEFECT. This indicates the commit is buggy because it changed too many lines and had too few reviewers.
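An extracted rule like this is directly executable. A minimal sketch of checking it against a commit's metrics (the `rule_fires` helper and the field names are assumptions for illustration; the thresholds come from the example rule above):

```python
def rule_fires(commit: dict) -> bool:
    """Evaluate the example rule: Churn > 100 AND #Reviewers < 2."""
    return commit["churn"] > 100 and commit["reviewers"] < 2

print(rule_fires({"churn": 250, "reviewers": 1}))  # True  -> flag as DEFECT
print(rule_fires({"churn": 250, "reviewers": 3}))  # False -> rule does not apply
```

Because the rule names concrete, actionable conditions, a developer can respond directly, e.g. by splitting the commit or requesting another reviewer.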