Trustworthy and Responsible AI-Centric Test Engineering

Aim of the project

AI and ML support for the testing process, delivered either by fully or partially autonomous software agents or as part of an interactive human-in-the-loop process (e.g., where developers and the tool "chat" with each other), could lead to major reductions in cost and improvements in the quality of the complex software that powers our society.

However, improperly integrating AI/ML tools into the testing process also carries risks, such as unpredictability and a lack of explainability. Given the mission-critical nature of tested systems, these tools must be effective, transparent, unbiased, and aligned with human needs and requirements.

Goal

The goal of this project is to explore, investigate, implement, and evaluate a trustworthy, human-centric AI-driven testing process. Our vision is that such a process:

  • Uses AI and ML to augment, rather than replace, developers.
  • Enables developers to make effective, data-driven decisions during testing.
  • Offers trustworthy, repeatable, and explainable results.

What has been achieved

The project currently consists of six core investigations:

1 LLM-enabled log analysis

We are exploring the use of LLM-based log summarization as part of the debugging workflow at Volvo Cars. The initial results of this project contribute insights into the integration of AI-driven tools within industrial software engineering processes, highlighting both the benefits and challenges of deploying LLMs in real-world fault analysis scenarios.

2 Benchmarking GenAI for Test Artifact Quality

We explored using LLMs to generate and augment test suites for industrial control software, aiming to improve coverage and fault detection. Across three strategies (generation, amplification by addition, and amplification by modification), the addition of new test cases, especially with few-shot prompting, improved fault detection, while pure generation and modification produced limited gains. Overall, the results suggest that LLMs can meaningfully support automated testing in industrial settings.

Together with Bosch, we are designing and evaluating practical metrics for LLM-based testing solutions, especially for the AI-assisted creation of structured test specifications, before adopting them in software engineering project work.

3 Exploring effective integration of automated testing tools into developer workflow

We are currently conducting two case studies at Zenseact related to this topic. In the first, we are exploring how test diversity measurements and visualizations can help developers identify redundant tests and gaps in test coverage. We will also explore how test diversity measurements can help developers assess whether requirements have been appropriately covered during testing, and how test diversity can help developers argue that their code meets safety standards in the automotive industry.
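To illustrate the idea (this is a minimal sketch, not the project's actual tooling), one common way to quantify test diversity is the pairwise Jaccard distance between the sets of items each test touches, such as covered branches or called functions. Identical coverage sets score 0.0 and flag candidate redundancy; the test names and coverage data below are hypothetical.

```python
# Sketch: pairwise Jaccard diversity over hypothetical coverage sets.

def jaccard_distance(a: set, b: set) -> float:
    """1.0 = completely diverse, 0.0 = identical coverage."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical coverage sets for three tests:
coverage = {
    "test_brake_engages": {"brake.init", "brake.apply", "log.write"},
    "test_brake_releases": {"brake.init", "brake.release", "log.write"},
    "test_brake_engages_again": {"brake.init", "brake.apply", "log.write"},
}

names = list(coverage)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        d = jaccard_distance(coverage[x], coverage[y])
        flag = "  <- candidate redundancy" if d == 0.0 else ""
        print(f"{x} vs {y}: {d:.2f}{flag}")
```

A diversity matrix like this can then be visualized (e.g., as a heatmap) so developers can spot clusters of near-identical tests and under-covered regions at a glance.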

In the second study, we are characterizing practices, tooling, knowledge, challenges, gaps, and solutions related to integration testing in the modern automotive industry. The traditional view of integration testing is simplistic and misses nuances related to automotive development, where there are multiple types and levels of software being integrated, as well as different forms of hardware. Each of these aspects may involve different testing methods and safety requirements.

4 Investigation of trust and ethical challenges in testing and test automation

We are investigating ethical challenges in AI-centric software testing. The study provides practical guidance for ethical AI use in testing with regard to ten themes: human control and responsibility, justice and fairness, explainability, logging and monitoring, human-centric risks, stakeholder involvement, impact on human relations, technical risks, sustainability and waste, and privacy.

5 Investigation of cognitive and human factors to improve software testing

We are investigating dynamic and static testing practices in software and system development in collaboration with Siemens. This work is informed by a broader conceptual exploration of cognitive problem solving, biases, and occupational folklore in software development and automation, emphasizing how informal knowledge, cultural narratives, and shared team practices shape professional work and test engineering. We are developing a practical review and test engineering support approach that makes implicit folklore knowledge explicit and mitigates confirmation bias, groupthink, and overconfidence through structured checklists, evidence-based review criteria, and workflows.

6 Integration of visualization and analysis techniques to support passive testing

We have investigated a bad-smell detection and visualization tool for requirements and test specifications, integrated into GitHub Actions for CI/CD use. We have also developed frontends for tools that support developers of industrial systems through automated checking of software design rules using DRACONIS, and, together with Volvo CE, we have evaluated some of the capabilities of the NAPKIN Studio visual passive testing tool. The visualizations helped engineers detect requirement violations and assess test coverage in real-world scenarios. Results showed that passive testing could test requirements in parallel when supported by clear logging and specific test logic. Stakeholder feedback confirmed its potential to complement verification workflows through improved traceability and usability.

Participating companies (2026)

  • Siemens
  • Bosch
  • Volvo Cars
  • Zenseact
  • Ericsson
  • Grundfos

Participating researchers


Publications

  • Eduard Enoiu, Jean Malm, and Gregory Gay. "Folklore in Software Engineering: A Definition and Conceptual Foundations." To appear, International Conference on Cooperative and Human Aspects of Software Engineering (CHASE 2026). Pre-print available from https://arxiv.org/pdf/2601.21814
  • Anton Ekström, Hampus Rhedin Stam, Francisco Gomes de Oliveira, Gregory Gay, and Sabina Edenlund. "From Logs to Lessons: An Exploration of LLM-based Log Summarization for Debugging Automotive Software." To appear, International Conference on Automation of Software Test (AST 2026). Pre-print available from https://greg4cr.github.io/pdf/26llmlog.pdf
  • Per Erik Strandberg, Eduard Paul Enoiu, and Mirgita Frasheri. "Ethical challenges and software test automation." AI and Ethics 5, no. 6 (2025): 6185-6206. https://doi.org/10.1007/s43681-025-00804-7
  • Aleksandra Nicaj, Daniel Flemström, Eduard Enoiu, and Wasif Afzal. "Passive Testing of Vehicular Embedded Systems: An Industrial Case Study with T-EARS and Napkin Studio." In IFIP International Conference on Testing Software and Systems, pp. 290-306. Springer, 2025. https://doi.org/10.1007/978-3-032-05188-2_19
  • Ayodele A. Barrett, Eduard Enoiu, and Wasif Afzal. "Gaps in software testing education: A survey of academic courses in Sweden." In IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T), pp. 108-117. IEEE, 2025.
  • Eduard Enoiu, Nasir Mehmood Minhas, Michael Felderer, and Wasif Afzal. "Automated Test Generation: Taxonomy and Tool Applications." In International Conference on Fundamentals of Software Engineering, pp. 27-41. Cham: Springer Nature Switzerland, 2025. https://doi.org/10.1007/978-3-031-87054-5_3
  • Muhammad Abbas, Mehrdad Saadatmand, Eduard Enoiu, and Bernd-Holger Schlingloff. "State of Test Optimization for Variability in Industry." In International Conference on Information Technology-New Generations, pp. 528-542. Springer, 2025.
  • Stefan Alexander Van Heijningen, Theo Wiik, Francisco Gomes de Oliveira Neto, Gregory Gay, Kim Viggedal, and David Friberg. "Integrating mutation testing into developer workflow: An industrial case study." In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 2110-2120. 2024.
  • Jingxiong Liu, Ludvig Lemner, Linnea Wahlgren, Gregory Gay, Nasser Mohammadiha, and Joakim Wennerberg. "Exploring the integration of large language models in industrial test maintenance processes." arXiv preprint arXiv:2409.06416 (2024).