
Development of a Rapid Software Mutation Testing Technique

November 17, 2021

Abstract

In software testing, one of the main drawbacks of mutation testing is the high processing time required to run analyses on large codebases. This work proposes an integration with the Git version control system so mutation testing can run faster by creating mutants only for newly introduced changes over the original codebase. To achieve this, an algorithm was developed to apply mutations only in source-code regions being updated in a new version (diffs), and metrics were collected to evaluate both runtime improvement and the impact on test-suite adequacy assessment. The work was consolidated through the implementation of an extension for the Stryker Mutator testing framework in JavaScript. The extension provided a viable proof of concept for improving the practical use of software mutation testing and was also evaluated in the context of software engineering and agile development. [4] [8] [18] [3]

1. Introduction

1.1 Context

This work is situated at the intersection of test-driven development (TDD), where code is developed in small units accompanied by tests; agile development, which prioritizes continuous delivery of working software and rapid adaptation to changing requirements; and high-reliability software engineering, where failures carry high cost. [3]

1.2 Test-Driven Development (TDD)

TDD consists of implementing software in small increments. First, tests are written to define the expected behavior of a unit; at this point, tests fail because implementation does not yet exist. Then implementation is added until tests pass, and the cycle repeats until the code reaches the desired level of quality and maintainability. [1] [2]

TDD cycle
Figure 1. Development cycle using TDD.

1.3 Mutation Testing

Even with strong unit-test coverage, tests can still be biased or incomplete. Mutation testing is a fault-based testing technique that evaluates test-suite adequacy by introducing small syntactic changes (mutants) and observing whether tests detect them. [4] [5]

A mutant contributes positively to the score when tests detect it (killed mutant). If it is not detected, it survives. The final mutation score is the ratio of killed mutants to total generated mutants.
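As a concrete illustration, the score computation can be sketched in a few lines of JavaScript. The result objects below are illustrative only, not Stryker's actual report format:

```javascript
// Hypothetical mutant results; shape and statuses are illustrative.
const results = [
  { id: 1, status: "killed" },
  { id: 2, status: "killed" },
  { id: 3, status: "survived" },
  { id: 4, status: "killed" },
];

// Mutation score = killed mutants / total generated mutants.
function mutationScore(mutants) {
  const killed = mutants.filter((m) => m.status === "killed").length;
  return killed / mutants.length;
}

console.log(mutationScore(results)); // 0.75
```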

The technique is based on the Competent Programmer Hypothesis (CPH): programmers usually produce code close to correct behavior, and many real faults can be represented by small syntactic deviations. [6]

Common mutation categories include:

// Original code
function fibonacci(n) {
  if (n <= 0) {
    return 0;
  } else if (n == 1) {
    return 1;
  }
  return fibonacci(n - 1) + fibonacci(n - 2);
}

// Comparative operator mutation
function fibonacci(n) {
  if (n > 0) {
    return 0;
  } else if (n == 1) {
    return 1;
  }
  return fibonacci(n - 1) + fibonacci(n - 2);
}

// Arithmetic operator mutation
function fibonacci(n) {
  if (n <= 0) {
    return 0;
  } else if (n == 1) {
    return 1;
  }
  return fibonacci(n - 1) * fibonacci(n - 2);
}
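To illustrate how a test kills a mutant, consider the arithmetic-operator mutant above: a test asserting the original behavior for n = 2 fails on the mutant, killing it. The mutant function is renamed here only to allow a side-by-side comparison:

```javascript
// Original implementation from the example above.
function fibonacci(n) {
  if (n <= 0) return 0;
  else if (n == 1) return 1;
  return fibonacci(n - 1) + fibonacci(n - 2);
}

// Arithmetic-operator mutant: `+` replaced by `*`.
function fibonacciMutant(n) {
  if (n <= 0) return 0;
  else if (n == 1) return 1;
  return fibonacciMutant(n - 1) * fibonacciMutant(n - 2);
}

// A test asserting fibonacci(2) === 1 kills this mutant:
// the mutant computes 1 * 0 = 0, so the assertion fails.
console.log(fibonacci(2));       // 1
console.log(fibonacciMutant(2)); // 0
```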

1.4 Main Drawbacks of Mutation Testing

Mutation testing is computationally expensive because the test suite must be re-executed for each mutation. This makes routine usage difficult in large projects and often limits adoption to contexts where failure cost is high enough to justify the expense. [4]

Another drawback tied to execution cost is that mutation testing is not naturally suited to integration or end-to-end scenarios with heavy external dependencies, since each mutant may force an expensive end-to-end re-execution.

1.5 Version Control Systems (VCS)

VCS tools are essential in modern software engineering: they manage change history, merge collaborative work, and support efficient source sharing. Git is one of the most widely used systems and is distributed, which allows local work without mandatory network access. [9] [8]

For software testing, VCS is useful because it tracks exactly where changes happen in source code, enabling tests to focus on modified regions. In this work, key Git terms are used as follows: a commit is a recorded snapshot of the codebase; a diff is the textual difference between two versions; a chunk is a contiguous block of changed lines within a diff; HEAD refers to the latest commit of the current branch; and the staging area holds changes selected for the next commit.

2. Previous Work

This section presents prior approaches that target mutation testing cost reduction. [4]

2.1 Mutant Reduction Techniques

Prior work includes the following mutant-reduction strategies.

2.1.1 Mutant Sampling

Mutant sampling randomly selects a subset of mutants, reducing the number of test-suite re-executions. Prior studies reported that reducing mutants to 10% can significantly reduce execution cost while preserving a relevant portion of adequacy information. [13] [14]

2.1.2 Mutant Clustering

Instead of random selection, clustering applies classification strategies to group mutants and eliminate equivalents automatically, aiming to reduce volume without major adequacy loss. [15]

2.1.3 Selective Mutants

Selective mutation is based on the observation that some mutation operators generate many equivalent mutants. The goal is to choose a smaller operator set that preserves most of the effectiveness of a larger set. [16]

2.2 Execution-Cost Reduction Techniques

Other strategies optimize execution itself, including weak mutation, compiler/interpreter optimizations, and distributed execution. This work focuses specifically on cost reduction by reducing the number of mutations actually executed in each development step. [7]

3. Proposal

Given the high computational cost of mutation testing, this work proposes a Git-based optimization technique and evaluates its effectiveness in an agile development setting. Unlike approaches that run mutations across the whole codebase every cycle, the proposed method reduces introduced mutations by using source-history evolution, mutating only work in progress.

3.1 Associated Technologies

ECMAScript (JavaScript) was chosen because it is a dominant language in web runtimes and open-source projects. Git was selected as the VCS due to widespread adoption, and Stryker Mutator was selected as the mutation-testing foundation for JavaScript. [10] [11] [12] [18]

Stryker report example
Figure 2. Example report generated by Stryker Mutator for a JavaScript file.

The technique was implemented as an extension to Stryker, applying mutations only to regions identified by Git as newly changed code. In Git diffs, removed content appears in red and added content appears in green. The proposal targets mutation analysis on those newly introduced regions.

Git diff example
Figure 3. Example Git diff for a new change in a versioned codebase.

3.2 Performance Measurement

To evaluate the technique, mutation adequacy evolution was compared throughout incremental feature introduction. Since Git stores change history, the following metrics were observed at each commit: total analysis execution time, number of generated mutants, and mutation score, for both the full and the diff-restricted modes.

Environmental parameters were also controlled: processor and memory details, plus operating-system configuration. A GCP virtual machine with a standardized setup, detailed in the Results section, was used.

To reduce interference, Stryker process concurrency was limited to one process, leaving one core available for operating-system background tasks.

4. Implementation

To make the technique practical, an extension was built for Stryker Mutator. Additional software-testing techniques were also used to validate the extension itself.

4.1 Building the Stryker Extension

The implementation started by analyzing raw output from git diff. This output is a concatenation of sections, one per changed file. Each file section starts with a header that identifies source and destination paths, including file moves and deletions. Deleted files are represented using /dev/null.

diff --git a/src/git-checker.ts b/src/git-checker.ts
index 824468b..726ac43 100644
--- a/src/git-checker.ts
+++ b/src/git-checker.ts
@@ -71,7 +71,7 @@ export class GitChecker implements Checker {
      exec(GIT_DIFF_COMMAND, (error, stdout, stderr) => {
        if (error) {
          this.logger.error(stderr);
-          this.logger.fatal("Error while executing the Git command.");
+          this.logger.fatal(`Error while executing the \`${GIT_DIFF_COMMAND}\` command.`);
          reject(error);
        }

After the file header, diff chunks appear with chunk headers such as:

@@ -1,7 +1,7 @@

The header encodes the starting line and line count of the changed region in the old version (after the minus sign) and in the new version (after the plus sign). Chunk lines follow a prefix convention: lines beginning with "+" were added, lines beginning with "-" were removed, and lines beginning with a space are unchanged context.

Programmatic execution used git diff --color=never HEAD^ HEAD to avoid terminal color codes and compare latest and previous commits. For local staged development, support was also provided for git diff --color=never --cached.

Helper functions parsed headers and chunk lines with regular expressions. They produced line ranges indicating altered source regions, and another helper checked whether each generated mutant intersects those ranges.

// Splits raw diff output into per-file sections.
const DIFFS_SPLIT_REGEX = new RegExp("diff\\s--git\\s.+\\n", "g");
// Captures the destination file path from the "+++ b/..." header line.
const FILE_PATH_REGEX = new RegExp("[+]{3}\\sb\\/.+", "g");
// Matches a whole chunk: its header plus every line up to the next header.
const CHUNKS_MATCH_REGEX = new RegExp(
  "@@\\s-\\d+(,\\d+)?\\s\\+\\d+(,\\d+)?\\s@@.*\\n(.|\\n(?!(@@\\s-\\d+(,\\d+)?\\s\\+\\d+(,\\d+)?\\s@@)))+",
  "gm"
);
// Extracts the "@@ ... @@" header from a chunk.
const CHUNK_HEADER_MATCH_REGEX = new RegExp("@@[^@]+@@", "g");
// Tokenizes header numbers separated by commas or whitespace.
const COMMA_OR_SPACE_REGEX = new RegExp("[,\\s]", "g");
// Strips the header line, leaving only the chunk body.
const CHUNK_HEADER_REMOVAL_REGEX = new RegExp(
  "@@\\s-\\d+(,\\d+)?\\s\\+\\d+(,\\d+)?\\s@@.*\\n",
  "gm"
);
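The header-parsing and intersection logic these helpers support can be sketched as follows. The function names here are illustrative, not the plugin's actual exports:

```javascript
// Parses a header like "@@ -71,7 +71,7 @@" into the changed line range
// on the new-file side: { start: 71, end: 77 }.
function parseChunkHeader(header) {
  const match = /@@\s-\d+(?:,\d+)?\s\+(\d+)(?:,(\d+))?\s@@/.exec(header);
  if (!match) throw new Error(`Invalid chunk header: ${header}`);
  const start = Number(match[1]);
  // Git omits the count when the chunk spans a single line.
  const count = match[2] === undefined ? 1 : Number(match[2]);
  return { start, end: start + count - 1 };
}

// A mutant is relevant when its line span intersects a changed range.
function intersects(mutant, range) {
  return mutant.startLine <= range.end && mutant.endLine >= range.start;
}

const range = parseChunkHeader("@@ -71,7 +71,7 @@");
console.log(range); // { start: 71, end: 77 }
console.log(intersects({ startLine: 75, endLine: 76 }, range)); // true
console.log(intersects({ startLine: 80, endLine: 80 }, range)); // false
```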

With these building blocks, the extension was integrated into Stryker's JavaScript package via framework interfaces. Auxiliary algorithms for diff parsing and intersection checks were documented in the thesis annexes.

4.2 Testing and Quality Assurance (QA)

After implementation, correctness was evaluated using automated unit tests and exploratory manual tests from the perspective of a library user.

4.2.1 Unit Tests

A unit-test suite was created to verify helper-function outputs and provide regression safety for future maintenance. Inputs included valid outputs from git diff --color=never HEAD^ HEAD in real repositories, while intentionally invalid inputs were used to validate expected failure behavior.

The following summarized test cases were used (input → expected output):

- Parse valid chunk: chunk from valid diff output → correct start/end positions for the changed range.
- Parse invalid chunk: empty chunk → failure is raised.
- Parse complete diff: full diff output → correct mapping between file names and changed ranges.
- Handle deleted files: diff containing a deleted file → deleted file is ignored without crashing.
- Handle moved/renamed files: diff containing a moved file → treated consistently as modified source.
- Mutant outside diffs: diff plus mutant outside changed range → detected as out-of-range.
- Mutant near boundary: diff plus mutant near changed boundary → detected as out-of-range.
- Boundary intersection: mutant intersecting exactly at a diff boundary → detected as in-range.
- Mutant fully inside: mutant fully contained in changed range → detected as in-range.

4.2.2 Exploratory Tests

Because Stryker provides limited public tooling for integration-testing plugins of this kind, exploratory tests were preferred for end-to-end validation. The objective was to verify that mutants were generated only for modified code regions and that overhead remained acceptable.

Because nesting Git repositories is inconvenient, a separate Git-managed project was created and the plugin library was imported remotely. This project implemented Conway's Game of Life and was used to simulate real development through a sequence of small commits. [17]

Conway's Game of Life in browser
Figure 4. Conway's Game of Life running in a web browser (white cells represent live state).

During the exploratory flow, mutation reports were inspected after each commit to verify alignment with commit diffs. Infrequently, some expected mutants were absent in modified regions relative to full analysis, but no cases were observed where mutants were generated outside changed source intervals.

Across the development flow, every mutant found in rapid analysis was also present in the complete-analysis mutant set. This validated the software for its intended purpose and showed that the implemented relevance-detection algorithm introduced no excessive processing overhead.

5. Results

As described in implementation, a separate project was used both for exploratory testing and for performance measurements. The measurement environment was chosen to reduce interference from unrelated system activity and include only required project dependencies.

5.1 Computing-Environment Preparation

A dedicated GCP virtual machine was created for obtaining performance and adequacy results. The configured parameters are shown below. Test execution was controlled over SSH.

VM configuration
Figure 5. VM configuration with 2 processing cores and 7.5 GB of memory.
Disk and operating system configuration
Figure 6. Disk and operating system configuration used in the VM.
The comparison target and both analysis modes were configured with the following commands:

# Compare only the most recent commit against the previous one
export STRYKER__GIT_COMPARISON="HEAD^ HEAD"

# Mutation test restricted to changed regions, concurrency 1
# getRanges.js uses the plugin to compute mutated ranges
yarn stryker run --concurrency=1 --mutate $(node scripts/getRanges.js)

# Full mutation test over entire codebase, concurrency 1
yarn stryker run --concurrency=1
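The getRanges.js script itself is not listed in the text; a minimal sketch of what such a script could do is shown below. The file:startLine-endLine range syntax for the mutate option is an assumption here; check the mutate documentation for your Stryker version.

```javascript
// Hypothetical sketch of a script like scripts/getRanges.js: it formats
// changed ranges (as computed by the diff-parsing helpers) into a
// comma-separated --mutate argument. The range syntax is an assumption.
function toMutateArgs(changedRanges) {
  return changedRanges
    .map(({ file, start, end }) => `${file}:${start}-${end}`)
    .join(",");
}

// Example input; in the real script these ranges would come from the plugin.
const ranges = [
  { file: "src/board.js", start: 12, end: 30 },
  { file: "src/rules.js", start: 4, end: 9 },
];

console.log(toMutateArgs(ranges));
// src/board.js:12-30,src/rules.js:4-9
```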

Before running measurements, VM CPU and memory usage in idle state were verified with htop to ensure no significant concurrent workload would bias execution-time measurements.

Idle VM resource usage
Figure 7. Resource usage verification on idle VM.

5.2 Execution-Time Comparison

As expected, execution time dropped significantly when mutation analysis was restricted to commit modifications instead of the whole codebase. By the end of the ninth analyzed commit, linear-regression estimates from collected measurements indicated an approximately 84% reduction in execution time.

Execution time comparison
Figure 8. Execution-time comparison between full mutation analysis and diff-restricted mutation analysis across code evolution.

Full mutation-testing runtime followed a growing profile as the codebase evolved, whereas diff-restricted runtime remained approximately constant. In this Game of Life project, rapid-analysis runtime averaged 86.34 seconds, with a standard deviation of 38.30 seconds, which is consistent with natural variation in the amount of code introduced per commit.

5.3 Mutation-Score Differences

Beyond runtime, effectiveness relative to complete analysis was also measured. The number of generated mutants and mutation scores were observed in both modes as commit history advanced.

Mutant count comparison
Figure 9. Mutant-count comparison between full mutation analysis and diff-restricted mutation analysis across commit evolution.
Mutation score comparison
Figure 10. Mutation-score comparison between full mutation analysis and diff-restricted mutation analysis across commit evolution.

The number of mutants in diffs followed the variation pattern seen in full execution. Small deviations were associated with deletions/edits of existing code and with boundary cases where mutations occurred exactly at diff limits. Boundary-case losses were linked to Stryker implementation details outside the plugin's direct control, were uncommon, and did not substantially compromise the technique. In the evaluated sequence, only the third and fifth commits showed such losses (4 and 1 mutants, respectively). One mitigation would be expanding diff boundaries slightly.
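The suggested mitigation can be sketched as a small range-padding helper; the margin value and function name are illustrative:

```javascript
// Pads a changed range by a small margin so mutants sitting exactly on
// a diff limit are still included in rapid analysis.
function padRange(range, margin = 1) {
  return {
    start: Math.max(1, range.start - margin), // never pad below line 1
    end: range.end + margin,
  };
}

console.log(padRange({ start: 71, end: 77 }));  // { start: 70, end: 78 }
console.log(padRange({ start: 1, end: 5 }, 2)); // { start: 1, end: 7 }
```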

Regarding scores, higher diff-level scores than the previous global score were associated with growth in global score, while lower diff-level scores were associated with decreases. This supports the technique's utility for continuous improvement cycles, where local quality of newly introduced code directly affects broader quality trends.

6. Final Remarks

Results from the plugin proof of concept indicate that restricting mutations to regions being changed is a promising way to integrate mutation testing into everyday software-engineering workflows. The approach reduced analysis time without major loss in adequacy signal.

The developed technique complements traditional mutation testing by effectively narrowing analysis to incremental software changes. However, because older unchanged code is not re-analyzed in rapid mode, faults associated with previously ignored mutants may remain undetected. Therefore, full mutation testing remains necessary for rigorous system-wide auditing. [6]

Another scenario where the technique is not suitable is when changes are made only to test files, since no source-code mutants are generated for unchanged production code. Large automated-test refactorings still require full mutation-test re-execution.

6.1 Applicability in Agile Contexts

The rapid-mutation technique aligns with agile principles by enabling analysis on small software increments and supporting early, continuous delivery of valuable software. Since working software is the primary progress measure in agile development, a tool that allows mutation analysis in shorter iterations helps teams detect quality risks sooner, instead of discovering large sets of surviving mutants only late in development. [3]

6.2 Future Work

Future work includes extending the technique to additional programming-language ecosystems, including compiled and purely functional languages, to evaluate performance impact in other runtime models. Another direction is integrating the approach into automated code review and CI/CD pipeline stages, producing alerts when diff-level mutation scores fall below defined quality thresholds.

To support future evolution, the extension source code was published in a Git repository and the compiled package was published on NPM, making adoption and external contribution easier. Open publication also helps quality by allowing broader bug reporting and feature requests. [18] [19]

References

  1. Jorgensen, P. C. Software Testing: A Craftsman's Approach. Taylor & Francis Group, 2014.
  2. Martin, R. C. Clean Code: A Handbook of Agile Software Craftsmanship. Pearson Education, 2009.
  3. Agile Manifesto. Principles behind the Agile Manifesto, 2001. https://agilemanifesto.org/principles.html
  4. Jia, Y.; Harman, M. An Analysis and Survey of the Development of Mutation Testing. IEEE Transactions on Software Engineering, 37(5):649-678, 2011. DOI: 10.1109/TSE.2010.62
  5. Morell, L. J. A Theory of Fault-Based Testing. IEEE Transactions on Software Engineering, 16(8):844-857, 1990. DOI: 10.1109/32.57623
  6. DeMillo, R. A.; Lipton, R. J.; Sayward, F. G. Hints on Test Data Selection: Help for the Practicing Programmer. Computer, 11(4):34-41, 1978. DOI: 10.1109/C-M.1978.218136
  7. Howden, W. E. Weak Mutation Testing and Completeness of Test Sets. IEEE Transactions on Software Engineering, SE-8(4):371-379, 1982. DOI: 10.1109/TSE.1982.235571
  8. Software Freedom Conservancy. Git, 2005. https://git-scm.com
  9. Zolkifli, N. N.; Ngah, A.; Deraman, A. Version Control System: A Review. Procedia Computer Science, 135:408-415, 2018. DOI: 10.1016/j.procs.2018.08.191
  10. Ecma International. ECMAScript 2022 Language Specification, 2021. https://tc39.es/ecma262/
  11. Mozilla. What JavaScript implementations are available?, 2021. MDN JavaScript implementations
  12. Bissyandé, T. F.; Thung, F.; Lo, D.; Jiang, L.; Réveillère, L. Popularity, Interoperability, and Impact of Programming Languages in 100,000 Open Source Projects. IEEE COMPSAC, 303-312, 2013. DOI: 10.1109/COMPSAC.2013.55
  13. Mathur, A. P.; Wong, W. E. An empirical comparison of data flow and mutation-based test adequacy criteria. Software Testing, Verification and Reliability, 4(1):9-31, 1994. DOI: 10.1002/stvr.4370040104
  14. Frankl, P. G.; Weiss, S. N.; Hu, C. All-uses vs mutation testing: An experimental comparison of effectiveness. Journal of Systems and Software, 38(3):235-253, 1997. DOI: 10.1016/S0164-1212(96)00154-9
  15. Ji, C.; Chen, Z.; Xu, B.; Zhao, Z. A Novel Method of Mutation Clustering Based on Domain Analysis. SEKE 2009, 422-425, 2009.
  16. Offutt, A. J.; Rothermel, G.; Zapf, C. An experimental evaluation of selective mutation. ICSE, 100-107, 1993. DOI: 10.1109/ICSE.1993.346062
  17. Berlekamp, E. R.; Conway, J. H.; Guy, R. K. Winning Ways For Your Mathematical Plays, Volume 2. CRC Press, 2018.
  18. Toma, L. B. Stryker Git Checker Repository, 2021. https://github.com/lbtoma/stryker-git-checker
  19. Toma, L. B. Stryker Git Checker Package, 2021. https://www.npmjs.com/package/stryker-git-checker