Explaining main script
Pedro Azevedo authored Apr 16, 2020
1 parent 2010149 commit 994d4e7

# System Structure

![System Structure](https://github.com/DeFacto/DeFactoNLP/blob/master/images/work_structure.png)
The system is based on three major tasks (Document Retrieval, Sentence Retrieval, Label Classification). Each task is performed using different techniques:

* Document Retrieval
  * TF-IDF
  * NER
  * Triple-Based
* Sentence Retrieval
  * TF-IDF
  * Triple-Based Model
  * [Sentence-Transformers](https://github.com/UKPLab/sentence-transformers)
* Label Classification
  * RTE Model + Random Forest model

# Reproducing

To reproduce this work, an understanding of the files and scripts is needed.

# Run

You can run Document Retrieval and Sentence Retrieval by running the following script: [generate_rte_preds.py](/generate_rte_preds.py).

The script contains 6 boolean variables:
* **INCLUDE_NER** --> if the input file contains ***NER** Predicted DOCUMENTS* and you want to include them as relevant documents
* **INCLUDE_TRIPLE_BASED** --> if the input file contains ***Triple-Based** Predicted DOCUMENTS* and you want to include them as relevant documents
* **INCLUDE_SENTENCE_BERT** --> if the input file contains ***Sentence-BERT** Predicted SENTENCES* and you want to include them as relevant sentences
* **RUN_DOC_TRIPLE_BASED** --> to *predict **Triple-Based** Relevant DOCUMENTS*
* **RUN_SENT_TRIPLE_BASED** --> to *predict **Triple-Based** Relevant SENTENCES*
* **RUN_RTE** --> to run **Recognising Textual Entailment** and calculate entailment probabilities for every *Relevant Sentence*

Changing these variables allows every step to be run as required, making it possible to run each step separately, run everything at the same time, or even include other Retrieval techniques by supplying files with that information.
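
A minimal sketch of how these flags might be set, assuming they sit near the top of [generate_rte_preds.py](/generate_rte_preds.py) as plain booleans. The names come from the list above; the values and comments are only an illustrative combination, not the repository defaults:

```python
# Illustrative flag settings (assumed layout of generate_rte_preds.py).
# This combination reuses NER and Triple-Based documents already present in
# the input file and runs RTE over the retrieved sentences.
INCLUDE_NER = True             # input file already has NER-predicted documents
INCLUDE_TRIPLE_BASED = True    # input file already has Triple-Based predicted documents
INCLUDE_SENTENCE_BERT = False  # input file has no Sentence-BERT predicted sentences
RUN_DOC_TRIPLE_BASED = False   # do not predict Triple-Based documents in this run
RUN_SENT_TRIPLE_BASED = False  # do not predict Triple-Based sentences in this run
RUN_RTE = True                 # score every relevant sentence with the RTE model
```

Any combination is possible, as long as the input file actually contains the predictions that a given *INCLUDE_* flag expects.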

To generate the final predictions, run [Label Classification](#label-classification).

### Data

We also created a train subsample using the script [subsample_training_data.py](/subsample_training_data.py).

The files [subsample_train_relevant_docs.jsonl](/data/subsample_train_relevant_docs.jsonl), [shared_task_dev_public_relevant_docs.jsonl](/data/shared_task_dev_public_relevant_docs.jsonl) and [shared_task_test_relevant_docs.jsonl](/data/shared_task_test_relevant_docs.jsonl) contain the information from the TF-IDF part of Document Retrieval (*predicted_pages*) and Sentence Retrieval (*predicted_sentences*).

All the files use certain keywords. OIE stands for Open Information Extraction (used in Document Retrieval). SENTENCE indicates that a Triple-Based method was used for Sentence Selection. It is important to check the first line of every file to know which Retrieval Method was applied.
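
As an example of how these files can be inspected, here is a small sketch that reads the first record of one of the *_relevant_docs.jsonl* files, assuming each line is a JSON object. The field names *predicted_pages* and *predicted_sentences* come from the description above; the file path and anything else about the record structure are assumptions:

```python
import json

# Hypothetical inspection snippet (not part of the repository).
path = "data/shared_task_dev_public_relevant_docs.jsonl"

with open(path, "r", encoding="utf-8") as f:
    first = json.loads(f.readline())  # first line reveals which retrieval methods were run

print(sorted(first.keys()))              # lists all fields present in this file
print(first.get("predicted_pages"))      # output of the TF-IDF Document Retrieval
print(first.get("predicted_sentences"))  # output of Sentence Retrieval
```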

### TF-IDF (Document and Sentence Retrieval)

The TF-IDF results can be reproduced by running certain scripts inside the [fever-baselines](/fever-baselines/) folder. First, download the [database](/fever-baselines#data-preparation) and then run the [tf-idf](/fever-baselines#evidence-retrieval-evaluation) part.