Under construction.
Aim of the project
Digital transformation is introducing new types of artifacts and development practices into software and cyber-physical systems. AI/ML components and LLM-generated code are now used throughout the system lifecycle—shaping requirements, implementing components, monitoring system behavior, and automating development tasks.
However, AI/ML software behaves differently from traditional software and introduces new types of dependencies, failure modes, and quality concerns. Modern systems must therefore integrate heterogeneous components—human-written code, ML models, and automatically generated code—while preserving reliability, transparency, and maintainability.
Goal
To provide experimental results, research prototypes, and insights that help companies working with AI/ML software and AI-generated code develop the capability to continuously assure the quality of their systems.
What has been achieved
The work consists of three PhD projects and one associated PhD project conducted at Linköping University in collaboration with Software Center, CoDig, and WASP.
P1-1: Early Crash Detection for ML Code
Researcher: Yiran Wang
Problem: Machine learning (ML) development often takes place in Jupyter notebooks, which support interactive execution and provide rich runtime feedback. However, ML notebooks are prone to crashes that can leave the kernel in an inconsistent state. Restoring a correct kernel state typically requires restarting the kernel and re-executing all prior cells, which reduces developer productivity. This makes early crash detection essential. Traditional static analysis relies solely on the source code, which limits its ability to detect crashes early in ML notebooks, where many crashes depend on runtime data attributes.
Approach: This project analyzes crashes in ML notebooks and develops LLM-based crash detection and diagnosis techniques that combine static code analysis with runtime information extracted from the current notebook kernel state. These techniques aim to detect crashes before a target cell is executed.
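To illustrate the core idea, the following is a minimal sketch (not the CRANE-LLM tool itself) of a pre-execution check that combines the static view of a cell (its AST) with runtime information from the live kernel namespace. The function name, the dict-based "kernel state", and the single KeyError check are hypothetical simplifications.

```python
import ast

def precheck_cell(cell_source: str, kernel_state: dict) -> list[str]:
    """Flag likely crashes in a notebook cell *before* executing it by
    combining static analysis (the cell's AST) with runtime information
    (the live objects in the kernel namespace). Illustrative sketch only."""
    warnings = []
    tree = ast.parse(cell_source)
    for node in ast.walk(tree):
        # Check dict-style accesses like row["price"] against live keys.
        if (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and isinstance(node.slice, ast.Constant)):
            obj = kernel_state.get(node.value.id)
            key = node.slice.value
            if isinstance(obj, dict) and key not in obj:
                warnings.append(f"KeyError likely: {node.value.id}[{key!r}]")
    return warnings

# The runtime state only exists after earlier cells ran; a purely static
# tool cannot know which keys the data actually contains.
state = {"row": {"id": 1, "name": "a"}}
print(precheck_cell('total = row["price"] * 2', state))
# -> ["KeyError likely: row['price']"]
```

The point of the sketch is the information asymmetry: the missing key is invisible in the source code alone, but trivially checkable once the kernel state is consulted.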
Results
- A new-ideas paper demonstrating the effectiveness of runtime information in the notebook kernel, published in the FSE 2024 Companion proceedings [YIR24a].
- An empirical study analyzing real crash data from ML notebooks on GitHub and Kaggle, published in IEEE Transactions on Software Engineering [YIR24b].
- Development of JunoBench, a benchmark dataset of real-world ML notebook crashes [YIR24c].
- Proposed CRANE-LLM, an LLM-based approach augmented with runtime information for early crash detection and diagnosis in ML notebooks.
P1-2: Quality Assurance in AI-Assisted Software Development
Researcher: Xin Sun
Problem: Large Language Models (LLMs) increasingly generate production code, but the non-functional quality characteristics (NFQCs)—performance, maintainability, security, etc.—of such code are poorly understood and vary widely.
Approach: This project investigates:
- the NFQCs of LLM-generated code,
- trade-offs between different quality properties,
- ways to integrate LLMs into the maintenance and QA pipeline.
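One concrete way such integration could look (a hypothetical sketch, not the project's actual method) is a QA gate that scores generated snippets against a simple quality proxy before they enter the codebase. Here the proxy is a crude branching count standing in for a maintainability NFQC; the threshold and metric are illustrative assumptions.

```python
import ast

def complexity_proxy(source: str) -> int:
    """Crude maintainability signal for a generated snippet: count
    branching constructs. Illustrative stand-in for a real NFQC metric."""
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.Try,
                    ast.BoolOp, ast.ExceptHandler)
    return 1 + sum(isinstance(n, branch_nodes) for n in ast.walk(tree))

# A snippet as an LLM might generate it; the gate would flag it for
# review if the proxy exceeded some project-specific threshold.
generated = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""
print(complexity_proxy(generated))
# -> 3
```

Running such a check continuously, rather than once at generation time, matches the project's observation that generated code needs monitoring throughout development and maintenance.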
Results
- A major article submitted to the Journal of Systems and Software, based on a literature review, two workshops, and an empirical industrial study.
- New insights into NFQC trade-offs, highlighting the need for continuous monitoring of generated code during development and maintenance.
P1-3: Quality Assurance of Generative ML Techniques for Multi-Disciplinary Simulations
Researcher: Masoud Sadrnezhaad
Problem: Generative ML can help construct the complex simulations used to verify and validate large cyber-physical systems. However, simulations must respect domain-specific constraints across multiple disciplines (mechanical, control, communication, etc.), and increasing model size and complexity makes modeling errors (e.g., syntax errors, inconsistent equations, singularities) hard to diagnose and fix.
Approach: This project develops:
- methods to diagnose and localize errors in simulation models,
- methods to generate model-level explanations of identified errors aligned with modeller intent,
- techniques for incremental integration of agentic LLM-based debugging and automated repair into existing engineering practices.
The work leverages agentic LLMs and fine-tuned open-source language models to analyze simulation-model artifacts at different levels of abstraction (e.g., flattened equations and model hierarchies) together with compiler/simulation results and logs, and to synthesize corrective patches.
Results
- A literature-based study and a cross-company workshop, published at PROFES 2025 [MAS25].
- Follow-up workshops with Saab (June 2025) and Volvo Cars (August 2025).
- An approach for automatically collecting a dataset of Modelica model repairs mined from well-known Modelica libraries.
Associated PhD Project: Continuous Quality Assurance Methods for Engineering ML Software
Researcher: Willem Meijer
Problem: ML pipelines contain many potential sources of functional errors—misused hyperparameters, incompatible algorithms, dataset mismatches—yet many faults remain undetected until late stages.
Approach: This project develops data-informed static analysis to automatically identify functional faults in ML pipelines as they are being developed.
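As a minimal sketch of the data-informed idea (hypothetical rule names and thresholds, not the project's prototype): a purely static linter sees only literal hyperparameters, while a data-informed check can also consult simple dataset statistics available during development.

```python
def lint_pipeline(config: dict, data_stats: dict) -> list[str]:
    """Flag pipeline configurations that are inconsistent with the data.
    Illustrative sketch: two toy rules combining hyperparameters with
    dataset statistics that static analysis alone cannot see."""
    findings = []
    n_samples = data_stats["n_samples"]
    n_classes = data_stats.get("n_classes", 0)
    # A cluster count larger than the dataset cannot be satisfied.
    if config.get("n_clusters", 0) > n_samples:
        findings.append("n_clusters exceeds number of samples")
    # Stratified splitting needs at least one sample per class per split.
    test_rows = int(config.get("test_size", 0.2) * n_samples)
    if config.get("stratify") and test_rows < n_classes:
        findings.append("test split too small for stratification")
    return findings

stats = {"n_samples": 40, "n_classes": 10}
print(lint_pipeline({"n_clusters": 100}, stats))
# -> ['n_clusters exceeds number of samples']
```

Both faults here would pass ordinary static checks and only surface at training time; coupling the analysis to data characteristics is what moves the detection to the early, pipeline-construction stage.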
Results
- A functional prototype tool for ML engineers performing early-stage QA during pipeline construction.
- A short paper describing the tool was accepted as a New Ideas paper at ICSE 2026 [MEJ26].
Participating companies (2026)
- Saab
- Ericsson
- Volvo Cars
- AB Volvo
- Scania
- Advenica
Participating researchers
- Dániel Varró, professor, LiU
- Kristian Sandahl, professor emeritus, LiU
- Yiran Wang, PhD student, LiU
- Xin Sun, PhD student, LiU
- Masoud Sadrnezhaad, PhD student, LiU
- Willem Meijer, PhD student, LiU
Publications
[MAS25] Masoud Sadrnezhaad, José Antonio Hernández López, Torvald Mårtensson, and Dániel Varró. Generative AI in Simulation-Based Test Environments for Large-Scale Cyber-Physical Systems: An Industrial Study. In Product-Focused Software Process Improvement (PROFES 2025), LNCS, vol. 16361, Springer, Cham, 2026, doi: 10.1007/978-3-032-12089-2_13.
[MEJ26] Willem Meijer, Kristian Sandahl, and Dániel Varró. Data-aware Static Analysis: Improving Detection of Semantic Faults in Machine Learning Code Using Data Characteristics. Accepted as a New Ideas paper at ICSE 2026; forthcoming.
[YIR24a] Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró. Using Run-time Information to Enhance Static Analysis of Machine Learning Code in Notebooks. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion ’24), 2024, Ipojuca, Brazil, doi: 10.1145/3663529.3663785.
[YIR24b] Yiran Wang, Willem Meijer, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró. Why do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks. In IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2025.3574500.
[YIR24c] Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró. JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks. arXiv preprint arXiv:2510.18013 (2025).