This repository presents team QMUL-SDS's participation in the SCIVER shared task. Compared to the baseline system, we achieved substantial improvements on the dev set; as a result, our team ranked No. 4 on the leaderboard at the time of submission.
We propose an approach that performs scientific claim verification through a series of binary classifications.
We trained a BioBERT-large classifier to select abstracts based on relevance assessments for each sentence pair `<claim, title of the abstract>`, and continued training it to select rationales from each retrieved abstract.
For label prediction, we propose a two-step setting: first predicting "NOT_ENOUGH_INFO" or "ENOUGH_INFO", then labeling those marked as "ENOUGH_INFO" as either "SUPPORT" or "CONTRADICT".
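A minimal sketch of this two-step logic (illustrative only; `neutral_detector` and `entail_detector` are hypothetical stand-ins for the two trained binary classifiers, not the repository's actual API):

```python
def predict_label(claim, rationale, neutral_detector, entail_detector):
    # Step 1: decide "NOT_ENOUGH_INFO" vs. "ENOUGH_INFO".
    if neutral_detector(claim, rationale):
        return "NOT_ENOUGH_INFO"
    # Step 2: only "ENOUGH_INFO" cases reach this point;
    # decide between "SUPPORT" and "CONTRADICT".
    return "SUPPORT" if entail_detector(claim, rationale) else "CONTRADICT"
```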
Due to time constraints, we only updated the label prediction module to use a more up-to-date version of the transformers library; the evidence retrieval module keeps the older version, as it is inherited from the SciFact baseline repository.
To do evidence retrieval, please create a virtual environment named `evidence` with the following command: `conda env create -f evidence/evidence.yml`
To do two-step label prediction, please create a virtual environment named `two-step` with the following command: `conda env create -f label/two-step.yml`
Please use the `evidence` environment if you want to reproduce results from the baseline system, and the `two-step` environment if you want to reproduce BERTscore-related results.
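After creating an environment, activate it before running the corresponding scripts, e.g. `conda activate evidence` or `conda activate two-step`.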
For your convenience, the preprocessed data is available here; please see detailed descriptions of the SciFact dataset here.
All of the model checkpoints are available on the Hugging Face model hub under the user name `xiazeng`. In general, the naming pattern is `xiazeng/sciverbinary-model_${training_data}-${base_model}-${task}`, where:

- `training_data` may be `train_data` or `train_dev_data`;
- `base_model` may be `biobertb` (BioBERT-base), `biobertb_wo` (BioBERT-base without continued pretraining on the SciFact corpus), `biobertl` (BioBERT-large), or `robertal` (RoBERTa-large);
- `task` may be `abstract` (trained on abstract retrieval), `rationale` (trained on rationale selection), `rationale_continue` (first trained on abstract retrieval, then continued training on rationale selection), `label-neutral_detector` (trained to detect whether there is enough info, i.e., the first step of two-step label prediction), or `label-entail_detector` (trained to detect whether there is entailment, i.e., the second step of two-step label prediction).
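For example, one of these checkpoints can be loaded with the transformers library as follows. This is a minimal sketch, assuming the checkpoints load as standard sequence-classification models; the component values shown are just one valid combination from the pattern above.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assemble a model id following the naming pattern described above.
training_data, base_model, task = "train_dev_data", "biobertl", "abstract"
model_id = f"xiazeng/sciverbinary-model_{training_data}-{base_model}-{task}"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
```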
The models used to generate our final submissions have the model ids `xiazeng/sciverbinary-model_train_dev_data-biobertl-abstract`, `xiazeng/sciverbinary-model_train_dev_data-biobertl-rationale_continue`, `xiazeng/sciverbinary-model_train_dev_data-robertal-label-neutral_detector`, and `xiazeng/sciverbinary-model_train_dev_data-robertal-label-entail_detector` (listed in pipeline order). They are trained on both the train set and the dev set.
Our paper reports the performance of various BERT variants trained on the train set and evaluated on the dev set; these models are also available under the corresponding model ids.
After downloading the dataset and the trained models, please use `script/pipeline.sh` to run the whole pipeline and obtain evaluation metrics. See detailed usage instructions in the script.
Please use the following scripts to train each component:

- `evidence/training/abstract_retrieval/classifier.py` to train a model for abstract retrieval;
- `evidence/training/rationale_selection/classifier.py` to train a model for rationale selection;
- `label/training/neutral_detector.py` to train a model to detect "NOT_ENOUGH_INFO" (step one of label prediction);
- `label/training/support_detector.py` to train a model to detect "SUPPORT" (step two of label prediction).
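All four scripts train binary sentence-pair classifiers. As a rough illustration of how such a pair is encoded and scored at inference time (a sketch, not the repository's actual code; the example claim and title strings are made up, and the assumption that class index 1 means "relevant" is ours):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "xiazeng/sciverbinary-model_train_dev_data-biobertl-abstract"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

claim = "Aspirin reduces the risk of myocardial infarction."  # made-up example
title = "Low-dose aspirin and cardiovascular outcomes: a meta-analysis."

# Encode the <claim, title> sentence pair and score its relevance.
inputs = tokenizer(claim, title, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
# Assumption: class index 1 corresponds to the "relevant" class.
is_relevant = probs[0, 1].item() > 0.5
```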
Email: [email protected]