EDA Corpus

EDA Corpus is a data corpus for Electronic Design Automation (EDA) Large Language Model (LLM) research. In particular, the datapoints are tailored to OpenROAD and OpenROAD-flow-scripts. The corpus contains datasets for both question-answering and prompt-scripting for OpenROAD to improve user productivity.

Background

Recent works have shown that LLMs have tremendous potential in the chip design area in terms of writing code/script for hardware description language (HDL) or electronic design automation (EDA) flow. However, many of these works rely on data which are not publicly available and/or not permissively licensed for use in LLM training and distribution, especially in the EDA domain. To foster research in LLM-assisted physical design, we introduce EDA Corpus, a curated dataset for physical design automation tasks. EDA Corpus is based on OpenROAD, a widely utilized open-source EDA tool for automated place and route tasks. Leveraging OpenROAD mitigates obstacles associated with proprietary EDA tools, enabling the public release of our dataset and facilitating its use with LLMs without licensing constraints.

Dataset Description

EDA Corpus consists of two types of data: (1) question-answer (QA) pairs and (2) prompt-script (PS) pairs.

Dataset	Description	Non-augmented	Augmented*
`eda-corpus-qa-v1`	Question-answer	198 data pairs	590 data pairs
`eda-corpus-ps-v1`	Prompt-script	395 data pairs	943 data pairs
`eda-corpus-v1`	Combined QA/PS	593 data pairs	1533 data pairs

* The augmented dataset is a superset of the non-augmented dataset.
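The combined counts in the table are the sums of the QA and PS counts, which a quick check confirms. The dictionary below only restates the table; its name and layout are illustrative and not part of the corpus.

```python
# Pair counts copied from the table above: (non-augmented, augmented).
COUNTS = {
    "eda-corpus-qa-v1": (198, 590),
    "eda-corpus-ps-v1": (395, 943),
    "eda-corpus-v1": (593, 1533),
}


def combined(counts):
    """Sum the QA and PS counts column by column."""
    qa = counts["eda-corpus-qa-v1"]
    ps = counts["eda-corpus-ps-v1"]
    return (qa[0] + ps[0], qa[1] + ps[1])


# Compare the computed totals with the combined row of the table.
print(combined(COUNTS) == COUNTS["eda-corpus-v1"])
```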

Question-answer (QA) dataset

Contains pairs of question prompts and prose answers which are collected from The OpenROAD Project's GitHub issues, discussions, and documentation
Datapoints are categorized into three categories: OpenROAD general, OpenROAD tool, and OpenROAD flow
CSV file format and Microsoft Excel file format provided

An example question-answer pair:

Question:

What does the SKIP_PIN_SWAP variable in Clock Tree Synthesis indicate?

Answer:

Do not use pin swapping as a transform to fix timing violations (default: use pin swapping)
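Since the QA data ships as CSV, it can be read with Python's standard `csv` module. The sketch below is a minimal example under stated assumptions: the column names `category`, `question`, and `answer` are hypothetical, the category assigned to the sample row is chosen only for illustration, and an in-memory string stands in for a real file. Check the headers of the actual CSV files before relying on these names.

```python
import csv
import io
from collections import Counter

# Stand-in for a QA CSV file. The header names and the category value are
# assumptions for illustration; the question and answer come from the
# example pair above.
SAMPLE_CSV = io.StringIO(
    "category,question,answer\n"
    '"OpenROAD flow",'
    '"What does the SKIP_PIN_SWAP variable in Clock Tree Synthesis indicate?",'
    '"Do not use pin swapping as a transform to fix timing violations '
    '(default: use pin swapping)"\n'
)


def load_pairs(handle):
    """Read QA rows into a list of dicts keyed by the header names."""
    return list(csv.DictReader(handle))


qa_pairs = load_pairs(SAMPLE_CSV)

# Count how many pairs fall into each category.
per_category = Counter(row["category"] for row in qa_pairs)
print(len(qa_pairs), dict(per_category))
```

For a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`.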

Augmentation

The augmented dataset includes data pairs formed through paraphrasing questions and answers in order to enhance semantic diversity.

Prompt-script (PS) dataset

Contains pairs of scripting prompts and OpenROAD Python scripts
Datapoints are categorized into two categories: flow scripts and database (DB) scripts
CSV file format and Microsoft Excel file format provided

While Tcl is the normal interface for OpenROAD, leveraging Python allows the reuse of LLMs pretrained for Python code generation. Python code examples are also significantly more prevalent than Tcl examples, hence the focus on Python-based scripts.

An example prompt-script pair:

Prompt:

Show me how I can read a Verilog file into OpenROAD.

Script:

from openroad import Tech, Design
from pathlib import Path

tech = Tech()
# Make sure you have .lef files read into OpenROAD DB
design = Design(tech)

designDir = Path("design_path")
design_file_name = "design_filename"
design_top_module_name = "design_top_module_name"
verilogFile = designDir / (design_file_name + ".v")
# Pass the path held in verilogFile, not the literal string "verilogFile"
design.readVerilog(verilogFile.as_posix())
design.link(design_top_module_name)

Augmentation

While each augmented datapoint is distinct, the augmented points may perform similar functions with script parameter variations. For instance, the augmented set has a few instances of gate sizing, and the sizing differs between datapoints. The data is augmented through two methods:

Paraphrasing prompts: prompts are paraphrased to increase semantic diversity.
Variable and parameter changes: pairs are duplicated with changes to the prompt parameters and script variable names.
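The variable-and-parameter augmentation described in this section can be illustrated with a small templating sketch. This is not the procedure used to build the corpus: the templates, the placeholder names, and the design names below are hypothetical and serve only to show how one prompt-script pair can yield several variants.

```python
from string import Template

# Hypothetical templates; the wording and placeholders are illustrative
# and not taken from the corpus.
PROMPT = Template("Show me how I can read the Verilog file $name.v into OpenROAD.")
SCRIPT = Template('design.readVerilog("$name.v")\ndesign.link("$top")')


def vary(prompt, script, settings):
    """Return one (prompt, script) pair per parameter setting."""
    return [(prompt.substitute(s), script.substitute(s)) for s in settings]


# Example parameter settings (design and module names are assumptions).
settings = [
    {"name": "gcd", "top": "gcd"},
    {"name": "aes", "top": "aes_cipher_top"},
]

variants = vary(PROMPT, SCRIPT, settings)
for prompt_text, script_text in variants:
    print(prompt_text)
    print(script_text)
```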

Citing This Work

If you use this corpus in your work, please use the following citation:

@inproceedings{wu2024eda,
  title        = {EDA Corpus: A Large Language Model Dataset for Enhanced Interaction with OpenROAD},
  author       = {Wu, Bing-Yue and Sharma, Utsav and Kankipati, Sai Rahul Dhanvi and Yadav, Ajay and George, Bintu Kappil and Guntupalli, Sai Ritish and Rovinski, Austin and Chhabria, Vidya A.},
  booktitle    = {The First IEEE International Workshop on LLM-Aided Design (LAD'24)},
  month        = {June},
  year         = {2024},
  organization = {IEEE},
  address      = {New York, NY}
}

Taxonomy

The question-answer and prompt-script data should each be referred to as a "dataset". The two datasets combined should be referred to as a "corpus".

When referring to this work, name the data using the pattern corpusName-datasetName-version:

Data	Name
All of EDA Corpus	`eda-corpus-v1`
Only question-answer dataset	`eda-corpus-qa-v1`
Only prompt-script dataset	`eda-corpus-ps-v1`

Additionally, you can mention whether you use the augmented or non-augmented versions. For example, "We train on the augmented eda-corpus-ps-v1 dataset."
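The naming pattern can be captured in a small helper, shown here as a sketch. The function name and its arguments are hypothetical and are not part of the corpus or of any OpenROAD tooling.

```python
def dataset_name(dataset=None, version=1, corpus="eda-corpus"):
    """Build a corpusName-datasetName-version string.

    dataset is "qa", "ps", or None for the combined corpus.
    """
    if dataset not in (None, "qa", "ps"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    # The dataset part is omitted when naming the whole corpus.
    parts = [corpus] + ([dataset] if dataset else []) + [f"v{version}"]
    return "-".join(parts)


for key in (None, "qa", "ps"):
    print(dataset_name(key))
```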

License

EDA Corpus is licensed under a Creative Commons Attribution 4.0 International License. If you use EDA Corpus in a published scholarly work, please use the above citation. If you use EDA Corpus in another publication such as an article or blog post, please include a link to this repository.
