Keyphrase Generation for Scientific Document Retrieval

This repository contains the code for reproducing the experiments from the paper:

Keyphrase Generation for Scientific Document Retrieval. Florian Boudin, Ygor Gallina, Akiko Aizawa. Association for Computational Linguistics (ACL), 2020.

Table of contents

Data
Installing anserini
Indexing
- Converting documents to TREC format
- Creating indexes
Retrieval
- Converting topics to TREC format
- Retrieving documents
Evaluation

Data

Here, we use the NTCIR-2 ad-hoc monolingual (English) IR test collection. The test collection contains 322,058 documents, 49 search topics and relevance judgments.

|-- data
    |-- docs
        |-- ntc1.e1.gz  // NTCIR-1 (#187,080) collection converted with ACCN-e.pl 
        |-- ntc2-e1g.gz // NTCIR-2 (#77,433) NACSIS Academic Conference Papers Database
        |-- ntc2-e1k.gz // NTCIR-2 (#57,545) NACSIS Grant-in-Aid Scientific Research Database
    |-- rels
        |-- rel1_ntc2-e2_0101-0149 // judgments for relevant documents 
        |-- rel2_ntc2-e2_0101-0149 // judgments for partially relevant documents 
    |-- topics
        |-- topic-e0101-0149 // English topics for NTCIR-2
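
Before running the scripts, it can help to check that the files listed in the tree above are in place. The following is a minimal sketch, assuming the `data` directory sits in the current working directory; the helper name `missing_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "docs/ntc1.e1.gz",
    "docs/ntc2-e1g.gz",
    "docs/ntc2-e1k.gz",
    "rels/rel1_ntc2-e2_0101-0149",
    "rels/rel2_ntc2-e2_0101-0149",
    "topics/topic-e0101-0149",
]


def missing_files(data_dir="data"):
    """Return the expected files that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```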

Installing anserini

Here, we use Anserini, an open-source information retrieval toolkit built on Lucene. Below are the installation steps for macOS (tested on macOS 10.14), based on the Anserini Colab demo.

# install a JDK and maven
# (recent Homebrew versions replaced `brew cask install` with `brew install --cask`)
brew cask install adoptopenjdk
brew install maven

# cloning / installing anserini
git clone https://github.com/castorini/anserini.git --recurse-submodules
cd anserini/
# manual step: change the jacoco version from 0.8.2 to 0.8.3 in pom.xml so that the build succeeds
mvn clean package appassembler:assemble

# compile evaluation tools and other scripts
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

Indexing

Converting documents to TREC format

First, we convert the NTCIR SGML-formatted documents to TREC format for easier indexing.

From

<REC>
    <ACCN>...</ACCN> // doc_id
    <TITE>...</TITE> or <PJNE>...</PJNE> // Title
    <AUPE>...</AUPE> // Authors
    <CNFE>...</CNFE> // Conference name
    <CNFD>...</CNFD> // Conference date
    <ABSE>           // Abstract
        <ABSE.P>...</ABSE.P> // paragraph
    </ABSE>
    <KYWE>...</KYWE> // Keywords
    <SOCE>...</SOCE> // Host society
</REC>

to

<DOC>
    <DOCNO>...</DOCNO>  // doc_id
    <TITLE>...</TITLE>  // title
    <TEXT>...</TEXT>    // abstract
    <HEAD>...</HEAD>    // keywords (optional)
</DOC>

by doing:

sh src/0_create_data.sh
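
The mapping between the two formats can be illustrated in a few lines of Python. This is a minimal sketch and not the code run by `src/0_create_data.sh`: it assumes each record is available as a string, uses regular expressions instead of a real SGML parser, and the function names and the sample record are invented for illustration.

```python
import re


def tag_content(record, *tags):
    """Return the raw content of the first tag found in the record, or ''."""
    for tag in tags:
        pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
        match = re.search(pattern, record, re.DOTALL)
        if match:
            return match.group(1)
    return ""


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def record_to_trec(record, keep_keywords=True):
    """Convert one <REC> string to a TREC <DOC> string (hypothetical helper)."""
    doc_id = normalize(tag_content(record, "ACCN"))
    title = normalize(tag_content(record, "TITE", "PJNE"))
    # The abstract is split into <ABSE.P> paragraphs; drop those inner tags.
    abstract = normalize(re.sub(r"</?ABSE\.P>", " ", tag_content(record, "ABSE")))
    keywords = normalize(tag_content(record, "KYWE"))

    lines = [
        "<DOC>",
        f"<DOCNO>{doc_id}</DOCNO>",
        f"<TITLE>{title}</TITLE>",
        f"<TEXT>{abstract}</TEXT>",
    ]
    if keep_keywords and keywords:
        lines.append(f"<HEAD>{keywords}</HEAD>")
    lines.append("</DOC>")
    return "\n".join(lines)


# Invented record, for illustration only.
SAMPLE = """<REC>
<ACCN>doc-1</ACCN>
<TITE>An example title</TITE>
<ABSE><ABSE.P>First paragraph.</ABSE.P><ABSE.P>Second paragraph.</ABSE.P></ABSE>
<KYWE>keyword one // keyword two</KYWE>
</REC>"""

if __name__ == "__main__":
    print(record_to_trec(SAMPLE))
```

Setting `keep_keywords=False` omits the optional `<HEAD>` field.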

Some statistics about the generated data:

ntc1-e1: 187,080 documents, 185,061 with keywords
ntc2-e1g: 77,433 documents, 75,081 with keywords
ntc2-e1k: 57,545 documents, 57,443 with keywords

all: 322,058 documents, 317,585 with keywords (98.6%)
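
The totals and the keyword coverage follow directly from the per-collection counts. The short check below only reuses the numbers listed above; the variable names are arbitrary.

```python
# (documents, documents with keywords) per collection, copied from the statistics above.
COUNTS = {
    "ntc1-e1": (187_080, 185_061),
    "ntc2-e1g": (77_433, 75_081),
    "ntc2-e1k": (57_545, 57_443),
}

total_docs = sum(docs for docs, _ in COUNTS.values())
total_with_keywords = sum(with_kw for _, with_kw in COUNTS.values())
coverage = 100 * total_with_keywords / total_docs

if __name__ == "__main__":
    # prints "322,058 documents, 317,585 with keywords (98.6%)"
    print(f"{total_docs:,} documents, {total_with_keywords:,} with keywords ({coverage:.1f}%)")
```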

Creating indexes

We are now ready for indexing!

sh src/1_create_indexes.sh

Retrieval

Converting topics to TREC format

Again, we have to convert NTCIR topics to TREC format for easier retrieval.

From

<TOPIC q=0101> // topic number is an attribute here

    <TITLE>       // title part
    ...
    </TITLE>
    
    <DESCRIPTION> // sentence-length description
    ...
    </DESCRIPTION>
    
    <NARRATIVE>  // longer narrative
    ...
    </NARRATIVE>
     
    <CONCEPT>  // concepts (?)
    ...
    </CONCEPT>
    
    <FIELD>  // fields (?)
    ...
    </FIELD>

</TOPIC>

to

<top>
    <num> Number: XXX 
    <title> ...
    
    <desc> Description: 
    ...
    
    <narr> Narrative: 
    ...
</top>

by doing:

# create topic file with title / description / narrative
python3 src/topics_to_trec.py \
        --input data/topics/topic-e0101-0149 \
        --output data/topics/topic-e0101-0149.title+desc+narr.trec \
        --keep_narrative
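
To make the mapping concrete, here is a minimal sketch of such a conversion. It is not the code of `src/topics_to_trec.py`: the function names, the regular expressions, the choice to keep the topic number unchanged, and the sample topic are all assumptions made for illustration. The `keep_narrative` argument only mirrors the name of the `--keep_narrative` flag used above.

```python
import re


def field(topic, tag):
    """Return the whitespace-normalized content of a tag, or '' if absent."""
    match = re.search(r"<{0}>(.*?)</{0}>".format(tag), topic, re.DOTALL)
    return " ".join(match.group(1).split()) if match else ""


def topic_to_trec(topic, keep_narrative=False):
    """Convert one NTCIR <TOPIC> string to a TREC <top> string (hypothetical helper)."""
    match = re.search(r'<TOPIC\s+q="?(\d+)"?\s*>', topic)
    if match is None:
        raise ValueError("no topic number found")
    parts = [
        "<top>",
        f"<num> Number: {match.group(1)}",  # assumption: number kept as-is
        f"<title> {field(topic, 'TITLE')}",
        "",
        "<desc> Description:",
        field(topic, "DESCRIPTION"),
        "",
    ]
    if keep_narrative:
        parts += ["<narr> Narrative:", field(topic, "NARRATIVE"), ""]
    parts.append("</top>")
    return "\n".join(parts)


# Invented topic, for illustration only.
SAMPLE_TOPIC = """<TOPIC q=0101>
<TITLE>
example title
</TITLE>
<DESCRIPTION>
A one-sentence description.
</DESCRIPTION>
<NARRATIVE>
A longer narrative.
</NARRATIVE>
</TOPIC>"""

if __name__ == "__main__":
    print(topic_to_trec(SAMPLE_TOPIC, keep_narrative=True))
```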

Topics are categorized into fields:

Electricity, information and control
Chemistry
Architecture, civil engineering and landscape gardening
Biology and agriculture
Science
Engineering
Medicine and dentistry
Cultural and social science

Retrieving documents

We are now ready to retrieve!

sh src/2_retrieve.sh

Note that the topic field used for retrieving documents defaults to title, according to the help message of Anserini's SearchCollection:

 -topicfield VAL             : Which field of the query should be used, default
                               "title". For TREC ad hoc topics, description or
                               narrative can be used. (default: title)

Evaluation

sh src/3_evaluate.sh

Results

Results for retrieval models using keyphrase generation are reported in the table below. Two initial indexing configurations are examined: title and abstract only (T+A), and title, abstract and author keywords (T+A+K).

MAP                        BM25     BM25+RM3   QL       QL+RM3
T+A                        0.2916   0.3193     0.2898   0.3147
+s2s-copy-top5-all         0.3045   0.3356     0.3012   0.3233
+s2s-corr-top5-all         0.3010   0.3306     0.2941   0.3079
+multipartiterank-top5     0.2924   0.3227     0.2956   0.3269

T+A+K                      0.3138   0.3517     0.3063   0.3300
+s2s-copy-top5-all         0.3157   0.3652     0.3163   0.3367
+s2s-corr-top5-all         0.3137   0.3526     0.3101   0.3260
+multipartiterank-top5     0.3138   0.3518     0.3123   0.3347
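
The scores in the table are mean average precision (MAP) values. As a reminder of what the metric measures, here is a simplified sketch of how it can be computed from ranked lists and relevance judgments. It is not a reimplementation of trec_eval: it assumes binary relevance, averages over the topics that have at least one relevant document, and uses invented document identifiers.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of the average precision over topics with relevant documents."""
    topics = [topic for topic, relevant in qrels.items() if relevant]
    if not topics:
        return 0.0
    total = sum(average_precision(run.get(t, []), qrels[t]) for t in topics)
    return total / len(topics)


# Invented run and judgments, for illustration only.
RUN = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d9", "d7"]}
QRELS = {"t1": {"d1", "d3", "d5"}, "t2": {"d7"}}

if __name__ == "__main__":
    # t1: (1/1 + 2/3) / 3 = 0.5556, t2: (1/2) / 1 = 0.5
    print(f"MAP = {mean_average_precision(RUN, QRELS):.4f}")  # prints "MAP = 0.5278"
```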
