This repository contains artefacts for the paper "Schema Matching with Large Language Models".
To generate results and integrate them, make sure you have included the `fm_matcher` submodule. This should work automatically when cloning this repository.
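If the submodule is missing, it can be fetched with standard git commands; a minimal sketch (the repository URL below is a placeholder):

```sh
# Clone together with the fm_matcher submodule (URL is a placeholder)
git clone --recurse-submodules <repository-url>

# Or, in an existing clone that is missing the submodule:
git submodule update --init
```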
Afterwards, you need to run these notebooks in order:
- `generate_*` notebooks to generate the results. Note that these notebooks use our tool `fm_matcher`, which accesses the OpenAI API and therefore has to be paid for; this is why we include the results in the `results` folder. (See the sketch after this list for configuring API access.)
- `integration_decisions.ipynb` to integrate the results (`all_decisions_df.csv`)
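A minimal sketch for providing API credentials, assuming `fm_matcher` relies on the standard OpenAI Python client, which reads the `OPENAI_API_KEY` environment variable (an assumption; check the submodule's documentation for its actual configuration):

```sh
# Assumption: fm_matcher picks up the standard OpenAI environment variable
export OPENAI_API_KEY="sk-..."
```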
Afterwards, the integrated decisions can be analyzed with any of the remaining notebooks:
- `benchmark_overview.ipynb` gives an overview of the datasets and the ground truth
- `quality_*` notebooks generate the results with regard to (median) result quality
- `complementarity_*` notebooks investigate (average) results when combining two prompting approaches
Under `benchmark`, you can find the ground truth of matches that we aim to find using LLMs. In the `ground_truth.csv` file, each line corresponds to a match: `type` defines the type of match (currently only `one_to_one`), `source` is the fully qualified name of the source attribute, `relationship` is currently always `corresponds`, and `target` is the fully qualified name of the target attribute.
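A minimal sketch for inspecting the ground truth with pandas, assuming `ground_truth.csv` sits directly under `benchmark/`:

```python
import pandas as pd

# Load the ground truth; each row is one match
gt = pd.read_csv("benchmark/ground_truth.csv")

# List all one_to_one source/target attribute pairs
pairs = gt.loc[gt["type"] == "one_to_one", ["source", "target"]]
print(pairs.head())
```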
`results` contains our experimental results. `gpt35_results.csv`, `gpt4_results.csv`, and `baseline_results.csv` contain the actual votes per attribute pair; `all_decisions_df.csv` is an integrated version of the three files where majority voting has been applied.
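A minimal sketch of how such a majority vote could be computed, assuming columns named `source`, `target`, and a boolean `vote` (the actual column names in the CSV files may differ; `integration_decisions.ipynb` contains the authoritative logic):

```python
import pandas as pd

# Assumed columns: "source", "target", and one boolean "vote" per repetition
votes = pd.read_csv("results/gpt4_results.csv")

# A pair is declared a match if more than half of its votes are positive
decisions = (
    votes.groupby(["source", "target"])["vote"]
    .agg(lambda v: v.sum() > len(v) / 2)
    .rename("decision")
    .reset_index()
)
```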
The `schema_documentations` folder contains the relation and attribute descriptions we used for our experiments. The MIMIC descriptions are scraped from the MIMIC-IV documentation; the OMOP documentation stems from the OMOP CDM documentation. Refer to the corresponding documentation sites for more information and their publication licenses.
The prompt templates that we applied to generate prompts for our experiments are included in JSON format.
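A minimal sketch for loading one of these templates (the file name below is a placeholder; substitute one of the JSON files shipped with this repository):

```python
import json

# Placeholder path: use one of the prompt template JSON files
with open("prompt_template.json") as f:
    template = json.load(f)
print(template)
```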
Dependencies are listed in `requirements.txt`. Install them via `pip install -r requirements.txt`.