TestEval: Benchmarking Large Language Models for Test Case Generation

Wenhan Wang1*, Chenyuan Yang2*, Zhijie Wang1*, Yuheng Huang3, Zhaoyang Chu4,
Da Song1, Lingming Zhang2, An Ran Chen1, Lei Ma3,1
1University of Alberta, 2University of Illinois Urbana-Champaign, 3The University of Tokyo,
4Huazhong University of Science and Technology
GitHub | Paper

🏆 TestEval Leaderboard

🙋 How to interpret the results?

  1. Overall coverage denotes the line/branch coverage achieved by generating N test cases.
  2. Coverage@k denotes the line/branch coverage achieved using only k of the N test cases.
  3. Target line/branch/path coverage denotes the accuracy of covering a specified line/branch/path when the model is instructed to do so.
  4. Baseline denotes the accuracy of covering the specified line/branch/path without such an instruction.
  5. "Size" refers to the number of model parameters activated during inference.
  6. 💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model is not data-open. In practice, 💚/💙 models release their training data, so one can concretely reason about data contamination.
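As a minimal sketch of how the coverage metrics above can be computed, the snippet below derives overall coverage and Coverage@k from per-test-case sets of covered lines. It assumes Coverage@k is taken over the first k of the N generated test cases; all names here are illustrative, not the benchmark's actual implementation.

```python
def line_coverage(covered_sets, total_lines):
    """Fraction of lines covered by the union of the given test cases."""
    covered = set().union(*covered_sets) if covered_sets else set()
    return len(covered) / total_lines

def coverage_at_k(covered_sets, total_lines, k):
    """Coverage achieved using only the first k of the N test cases (assumed ordering)."""
    return line_coverage(covered_sets[:k], total_lines)

# Example: three generated tests for a 10-line function
tests = [{1, 2, 3}, {2, 4, 5}, {6, 7}]
print(line_coverage(tests, 10))     # overall coverage: 0.7
print(coverage_at_k(tests, 10, 2))  # Coverage@2: 0.5
```

Branch coverage follows the same shape, with sets of covered branches in place of covered lines.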

📖 BibTeX

@inproceedings{wang2025testeval,
  title={TESTEVAL: Benchmarking Large Language Models for Test Case Generation},
  author={Wenhan Wang and Chenyuan Yang and Zhijie Wang and Yuheng Huang and Zhaoyang Chu and Da Song and Lingming Zhang and An Ran Chen and Lei Ma},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
  year={2025}
}
The website template was borrowed from EvalPlus.