Cross-Lingual Argumentative Relation Identification: from English to Portuguese

This project includes the source code accompanying the following paper:

Gil Rocha, Christian Stab, Henrique Lopes Cardoso and Iryna Gurevych. Cross-Lingual Argumentative Relation Identification: from English to Portuguese. In Proceedings of the 5th Workshop on Argument Mining. EMNLP 2018.

Abstract: Argument mining aims to detect and identify argument structures from textual resources. In this paper, we aim to address the task of argumentative relation identification, a subtask of argument mining, for which several approaches have been recently proposed in a monolingual setting. To overcome the lack of annotated resources in less-resourced languages, we present the first attempt to address this subtask in a cross-lingual setting. We compare two standard strategies for cross-language learning, namely: projection and direct-transfer. Experimental results show that by using unsupervised language adaptation the proposed approaches perform at a competitive level when compared with fully-supervised in-language learning settings.

Setup

> git clone THIS REPO
> cd multilingual-relation-detection
> conda create -n %ENV_NAME% python=2.7 anaconda (Optional)
> source activate %ENV_NAME% (Optional)
> pip install -r requirements.txt

Project structure:

Summary of the main files and folders of the project, with a brief description of each:

  • ArgRelationIdentification.py: Main file (contains the project logic/structure)
  • DatasetLoader.py: Loads the dataset for a specific experiment. Contains several utils to work with the dataset (e.g. cv splits, undersampling, translations, ...)
  • ExperimentSetups.py: Several functions that encapsulate all the experimental setups (e.g. in-language, direct transfer, projection)
  • RunExperiments.py: Scripts to run the code for the experiments reported in the project.
  • data/: Contains the data used in this project (corpora, datasets, ...) and the outputs produced when running the experiments (models, logs)
    • data/corpora/: contains the corpora used in these experiments, in the formats in which they are publicly available (Brat, OVA format, ...). Subdivided according to the language.
    • data/generatedDatasets/: contains the datasets for the task of Argumentative Relation Identification, generated by running the scripts in corpora_reader/. Subdivided according to the language.
      • data/generatedDatasets/WordEmbeddings/: files containing the fixed vocabulary and corresponding word embeddings that are automatically added here when running the code
      • data/generatedDatasets/FoldsPartition/: folds partitions (used for in-language cross-validation settings) are stored here (avoiding preprocessing repetitions)
    • data/logs/: log files and pickled content generated in each run (e.g. predictions, confusion matrices, ...) are added here.
      • As a convention, filenames follow the format: %NeuralNetworkArchitecture%_%SentenceEncoding%_%Language%_%SomethingDescribingFileContent%.
      • e.g. for the following configuration:

        {
            "neuralNetArchitectureName": "SumOfEmbeddings_Concatenation_1Layer",
            "sentenceEncodingType": "specialCharBeginOfClaim",
            "datasetPath": os.path.abspath("data/generatedDatasets/en/essays/"),
            "datasetFilename": "ArgEssaysCorpus_context_en",
            "datasetLanguage": "en",
        }

        when running the code, files whose names start with the following prefix will be created: SumOfEmbeddings_Concatenation_1Layer_specialCharBeginOfClaim_en (e.g. SumOfEmbeddings_Concatenation_1Layer_specialCharBeginOfClaim_en_Predictions.pkl)
    • data/models/: stores trained models (filenames follow the convention described for data/logs/)
    • data/wordEmbeddings/: pre-trained embeddings, which can be downloaded from CMU MultiLingual Embeddings. The files used here are the original GloVe-format embeddings from the CMU website converted to Word2Vec format. They were also renamed to match the filename format expected by the code: multilingualEmbeddings_%Language% (e.g. multilingualEmbeddings_en.txt for the English version)
  • corpora_reader/: Contains a set of utils that convert the corpora used in this project into the datasets used for Arg Relation Identification
    • corpora_reader/CorpusReader.py: Abstract class that encapsulates several functions used to process the corpora.
      • Classes inheriting from this abstract class are responsible for generating the dataset for Arg Relation Identification from the corpora files. The dataset must be saved in a .csv file with the following columns:

        • Article Id: identifier of the original file from which the ADU (Argumentative Discourse Unit) pairs were retrieved
        • Source ADU: i.e. premise
        • Target ADU: i.e. claim/conclusion
        • RelationType: can be one of the following: none, support, attack
        • Partition: can be one of the following: train, test, validation

        The remaining components of the project expect this .csv file as input.
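
The dataset .csv layout listed above can be checked before running an experiment. The following is a minimal sketch and not part of the project code: the function name validate_rows and the sample rows are hypothetical, and the exact header strings are assumed to match the column names as written in this README. It uses only the standard library and was written with Python 3 in mind.

```python
import csv
import io

# Column names as listed in this README; the exact header strings used in the
# generated files are an assumption.
EXPECTED_COLUMNS = ["Article Id", "Source ADU", "Target ADU", "RelationType", "Partition"]
RELATION_TYPES = {"none", "support", "attack"}
PARTITIONS = {"train", "test", "validation"}


def validate_rows(csv_file):
    """Return a list of (row_position, message) problems; the header is row 1."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        return [(1, "missing columns: " + ", ".join(missing))]

    problems = []
    for position, row in enumerate(reader, start=2):
        relation = row.get("RelationType")
        partition = row.get("Partition")
        if relation not in RELATION_TYPES:
            problems.append((position, "unknown RelationType: {!r}".format(relation)))
        if partition not in PARTITIONS:
            problems.append((position, "unknown Partition: {!r}".format(partition)))
    return problems


# Hypothetical sample: the second data row uses a relation type that is not allowed.
SAMPLE = io.StringIO(
    "Article Id,Source ADU,Target ADU,RelationType,Partition\n"
    "essay001,Cloning saves lives,Cloning should be allowed,support,train\n"
    "essay001,It is expensive,Cloning should be allowed,rebuttal,test\n"
)
problems = validate_rows(SAMPLE)
print(problems)
```

To check a generated file instead of the sample, open it in text mode and pass the file object to validate_rows.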

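The filename convention described for data/logs/ and data/models/ can be illustrated as follows. This is a sketch under assumptions, not the project's own code: the helper output_filename and its extension parameter are hypothetical, while the configuration keys and the resulting name are taken from the example given for data/logs/.

```python
def output_filename(config, content_description, extension="pkl"):
    """Build %Architecture%_%SentenceEncoding%_%Language%_%Description%.<extension>.

    Hypothetical helper: it only mirrors the naming convention from this README.
    """
    parts = [
        config["neuralNetArchitectureName"],
        config["sentenceEncodingType"],
        config["datasetLanguage"],
        content_description,
    ]
    return "_".join(parts) + "." + extension


# Subset of the configuration shown in the data/logs/ example.
config = {
    "neuralNetArchitectureName": "SumOfEmbeddings_Concatenation_1Layer",
    "sentenceEncodingType": "specialCharBeginOfClaim",
    "datasetLanguage": "en",
}
print(output_filename(config, "Predictions"))
```

With this configuration the helper reproduces the example name given for data/logs/, SumOfEmbeddings_Concatenation_1Layer_specialCharBeginOfClaim_en_Predictions.pkl.
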
Contacts:

If you have any questions regarding the code/project, don't hesitate to contact the authors or report an issue.