CyberMetric is a benchmark dataset of 10,000 questions designed to evaluate the cybersecurity knowledge of Large Language Models (LLMs). The questions were generated using different LLMs and verified by human experts in the cybersecurity field to ensure their relevance and accuracy. They are compiled from various sources, including standards, certifications, research papers, books, and other publications within the cybersecurity field. We provide the dataset in four sizes (small, medium, big, and large) comprising 80, 500, 2000, and 10,000 questions, respectively. The smallest version is tailored for comparisons between LLMs and humans: the CyberMetric-80 dataset has been tested with 30 human participants, enabling a direct comparison between human and machine intelligence.
The CyberMetric paper "CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge" has been accepted for publication in the 2024 IEEE International Conference on Cyber Security and Resilience (IEEE CSR 2024).
IEEE Xplore link: https://ieeexplore.ieee.org/document/10679494
Cite the paper:
@INPROCEEDINGS{10679494,
author={Tihanyi, Norbert and Ferrag, Mohamed Amine and Jain, Ridhi and Bisztray, Tamas and Debbah, Merouane},
booktitle={2024 IEEE International Conference on Cyber Security and Resilience (CSR)},
title={CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge},
year={2024},
volume={},
number={},
pages={296-302},
keywords={Accuracy;Reverse engineering;Benchmark testing;NIST Standards;Risk management;Problem-solving;Computer security},
doi={10.1109/CSR61664.2024.10679494}}
The CyberMetric dataset was created by applying different language models using Retrieval-Augmented Generation (RAG), with human validation included in the process. The AI-driven generation framework is illustrated in the following figure.
We have evaluated and compared 25 state-of-the-art LLMs on the CyberMetric dataset.
We have developed a compact Python script called CyberMetric_evaluator.py to showcase how to utilize the dataset with OpenAI GPT. Simply insert your API key in the script by setting API_KEY="<YOUR-API-KEY-HERE>", and then execute the evaluator program.
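For readers who want to see the general shape of such an evaluation loop without calling any API, the following is a minimal sketch and not the code of CyberMetric_evaluator.py. It assumes each dataset file is a JSON object with a "questions" list whose entries have "question", "answers" (a mapping from option letter to text), and "solution" (the correct letter); check this against the actual files before relying on it. The ask_model callable is a hypothetical stand-in for whatever model client you use, and the two sample questions are made up for illustration.

```python
import json
import re
from typing import Callable, Dict, List, Optional


def load_questions(path: str) -> List[Dict]:
    """Load questions from a JSON file (layout is an assumption, see above)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["questions"]


def build_prompt(item: Dict) -> str:
    """Format one multiple-choice question as a plain-text prompt."""
    options = "\n".join(f"{k}) {v}" for k, v in sorted(item["answers"].items()))
    return (
        f"Question: {item['question']}\n"
        f"Options:\n{options}\n"
        "Reply with the letter of the correct option only."
    )


def extract_letter(reply: str) -> Optional[str]:
    """Return the first standalone A-D letter in the reply, or None.

    This is a naive heuristic: a reply such as "A good answer is C"
    would be read as "A", so keep the model's replies short.
    """
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else None


def evaluate(items: List[Dict], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of questions the model answers correctly."""
    if not items:
        return 0.0
    correct = sum(
        extract_letter(ask_model(build_prompt(item))) == item["solution"]
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # Made-up sample questions; these are not taken from the dataset.
    demo = [
        {
            "question": "Which port does HTTPS use by default?",
            "answers": {"A": "21", "B": "443", "C": "25", "D": "53"},
            "solution": "B",
        },
        {
            "question": "Which of these is a symmetric cipher?",
            "answers": {"A": "AES", "B": "RSA", "C": "ECDSA", "D": "Diffie-Hellman"},
            "solution": "A",
        },
    ]
    # Hypothetical stub model that always answers "B".
    print(f"Accuracy: {evaluate(demo, lambda prompt: 'B'):.2%}")
```

To evaluate a real model, replace the stub with a function that sends the prompt to your model and returns its text reply, and pass the list returned by load_questions instead of the demo list.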
Here's an example output generated by CyberMetric_evaluator.py using the CyberMetric-80 dataset: