CyberBench: A Multi-Task Benchmark for Evaluating Large Language Models in Cybersecurity

Zefang Liu, Jialei Shi, John F. Buford

AAAI 2024 Workshop on Artificial Intelligence for Cyber Security (AICS), 2024

Abstract

We present CyberBench and CyberInstruct, two innovative tools designed to advance the application of large language models (LLMs) in the cybersecurity field. First, CyberBench is a domain-specific multi-task benchmark tailored for assessing LLM performance on cybersecurity-related tasks. As the first benchmark suite for LLMs in cybersecurity, CyberBench fills a crucial gap in current practice by providing a general and consistent evaluation approach and addressing the coverage limitations of prior language model evaluations in this domain. We showcase the results of using CyberBench to evaluate more than ten generative LLMs. Second, CyberInstruct is a family of generative LLMs produced by instruction-tuning open LLMs on a cybersecurity corpus. Experiments show that CyberInstruct achieves performance comparable to large proprietary LLMs in the cybersecurity domain, underscoring the effectiveness of our fine-tuning strategy. Our work contributes to the understanding of LLMs’ potential in cybersecurity and establishes a solid foundation for future research and development.
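
For context, the sketch below illustrates how an open LLM might be instruction-tuned on a cybersecurity instruction corpus using Hugging Face Transformers. The base model name, dataset file, prompt template, and hyperparameters are hypothetical placeholders, not the authors' actual CyberInstruct recipe; see the released code and data for the real setup.

```python
# Hypothetical sketch: instruction-tuning an open LLM on a cybersecurity
# instruction corpus. Model, dataset, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # any open base LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus of {"instruction", "response"} pairs in JSON Lines.
data = load_dataset("json", data_files="cyber_instructions.jsonl")["train"]

def format_example(example):
    # Render each pair with a simple instruction-following template.
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = data.map(format_example, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cyberinstruct",
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           learning_rate=2e-5),
    train_dataset=tokenized,
    # Causal LM collator copies input_ids into labels for next-token loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, parameter-efficient methods such as LoRA are often substituted for full fine-tuning at this model scale; the choice does not change the overall recipe sketched above.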

Recommended citation: Liu, Zefang, Jialei Shi, and John F. Buford. "CyberBench: A Multi-Task Benchmark for Evaluating Large Language Models in Cybersecurity." AAAI 2024 Workshop on Artificial Intelligence for Cyber Security (AICS), 2024.
[Download Paper] [Download Slides] [Download Code] [Download Data]