cybermetric/CyberMetric

CyberMetric Dataset

Description

The CyberMetric Dataset is a benchmarking tool consisting of 10,000 questions designed to evaluate the cybersecurity knowledge of Large Language Models (LLMs). The dataset was created using different LLMs and verified by human experts in the cybersecurity field to ensure its relevance and accuracy. It is compiled from various sources, including standards, certifications, research papers, books, and other publications within the cybersecurity field. We provide the dataset in four distinct sizes (small, medium, big, and large) comprising 80, 500, 2,000, and 10,000 questions, respectively. The smallest version is tailored for comparisons between LLMs and humans: the CyberMetric-80 dataset has been tested with 30 human participants, enabling a direct comparison between human and machine performance.

Cite

The CyberMetric paper "CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge" has been accepted for publication in the 2024 IEEE International Conference on Cyber Security and Resilience (IEEE CSR 2024).

IEEE Xplore link: https://ieeexplore.ieee.org/document/10679494

Cite the paper:

@INPROCEEDINGS{10679494,
  author={Tihanyi, Norbert and Ferrag, Mohamed Amine and Jain, Ridhi and Bisztray, Tamas and Debbah, Merouane},
  booktitle={2024 IEEE International Conference on Cyber Security and Resilience (CSR)}, 
  title={CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge}, 
  year={2024},
  volume={},
  number={},
  pages={296-302},
  keywords={Accuracy;Reverse engineering;Benchmark testing;NIST Standards;Risk management;Problem-solving;Computer security},
  doi={10.1109/CSR61664.2024.10679494}}

Architecture

The CyberMetric dataset was created by applying different language models using Retrieval-Augmented Generation (RAG), with human validation included in the process. The AI-driven generation framework is illustrated in the following figure.

[Figure: Framework]

LLM Leaderboard on CyberMetric Dataset

We have evaluated and compared 25 state-of-the-art LLMs on the CyberMetric dataset.

[Figure: leaderboard results]

Usage

We provide a compact Python script, CyberMetric_evaluator.py, that shows how to use the dataset with OpenAI GPT. Insert your API key in the script by setting API_KEY="<YOUR-API-KEY-HERE>", then run the evaluator.

Here's an example output generated by the script using the CyberMetric-80 dataset:

[Figure: example evaluator output]
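For readers who want to score a different model, the evaluation loop can be sketched independently of any particular API. The following is a minimal sketch, not the contents of CyberMetric_evaluator.py. It assumes each dataset file is a JSON object with a "questions" list whose items have "question", "answers" (a mapping from option letter to text), and "solution" (the correct letter); check the actual files before relying on this layout. The helper names (load_questions, format_prompt, extract_letter, evaluate) and the ask callback are hypothetical and introduced only for illustration.

```python
import json
import re
from typing import Callable, Optional


def load_questions(path: str) -> list:
    """Load questions from a dataset file (assumed layout: {"questions": [...]})."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["questions"]


def format_prompt(item: dict) -> str:
    """Render one multiple-choice question as a plain-text prompt."""
    options = "\n".join(f"{key}) {text}" for key, text in sorted(item["answers"].items()))
    return (
        f"Question: {item['question']}\n"
        f"Options:\n{options}\n"
        "Reply with the letter of the correct option only."
    )


def extract_letter(reply: str) -> Optional[str]:
    """Return the option letter if the reply starts with one, else None."""
    match = re.match(r"\W*([A-Da-d])\b", reply)
    return match.group(1).upper() if match else None


def evaluate(questions: list, ask: Callable[[str], str]) -> float:
    """Return accuracy in [0, 1]; `ask` maps a prompt to the model's reply."""
    if not questions:
        return 0.0
    correct = sum(
        extract_letter(ask(format_prompt(item))) == item["solution"]
        for item in questions
    )
    return correct / len(questions)


# Tiny in-memory example with a stub "model" that always answers "C".
sample = [
    {
        "question": "Which port does HTTPS use by default?",
        "answers": {"A": "21", "B": "80", "C": "443", "D": "8080"},
        "solution": "C",
    }
]
print(evaluate(sample, lambda prompt: "C"))  # 1.0
```

To evaluate a real model, pass a function that sends the prompt to that model and returns its text reply as `ask`, and load the questions with `load_questions` from one of the dataset files. Replies that do not begin with an option letter are counted as incorrect in this sketch, which is a deliberately strict choice.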
