2021SP FORWARD Lab Project
The goal of this project is to rank academic works for a given keyword. The rank is calculated according to each paper's Field of Study. The ranking algorithm is inspired by The PageRank Citation Ranking: Bringing Order to the Web, with the assumption that the similarity between the papers and the target keywords can only be distributed once. Currently, to keep the ranking accurate, the program only handles keywords made up of multiple words.
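For readers unfamiliar with PageRank, the following is a minimal sketch of the textbook power-iteration form over citation edges. It is not this project's ranking code: it leaves out the keyword-similarity term entirely, and the function name `pagerank`, the damping value, and the toy edges are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by papers without outgoing references is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy citation graph (hypothetical paper IDs): A and B cite C, C cites D.
edges = [("A", "C"), ("B", "C"), ("C", "D")]
scores = pagerank(edges)
```

In this toy graph the cited papers C and D end up with higher scores than A and B, which nobody cites.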
Install the dependencies listed in requirements.txt:
pip3 install -r requirements.txt
The Microsoft Academic Graph (MAG) is a heterogeneous graph containing scientific publication records, citation relationships between those publications, and fields of study. The schema of the dataset can be found here. Among the dataset files, we use:
- FieldsOfStudy
- PaperFieldsOfStudy
- PaperReferences
The downloaded data can be found on owl3 server, path.
The CS keywords were collected from Springer by Yanghui Pang. The dataset can be found here.
The word2vec model was trained by Edward Ma on the abstracts of papers in the arXiv dataset. The model can be found here.
To speed up the ranking algorithm, we first need to prune out the Fields of Study (FoS) that are not CS keywords.
python3 prune_fos.py
The resulting FoS list will be in pruned_FOS.txt.
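As a rough illustration of this pruning step, the sketch below keeps only the FoS rows whose name matches a CS keyword. It is not the code in prune_fos.py: the function name `prune_fos`, the tab-separated layout, the position of the name column, and the sample rows are all assumptions.

```python
def prune_fos(fos_lines, cs_keywords, name_column=2):
    """Keep (id, name) pairs for FoS rows whose name is a CS keyword.

    Assumes tab-separated rows with the FoS ID in column 0 and a
    lowercase name in `name_column`; check this against the real schema.
    """
    keywords = {keyword.strip().lower() for keyword in cs_keywords}
    kept = []
    for line in fos_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > name_column and fields[name_column].lower() in keywords:
            kept.append((fields[0], fields[name_column]))
    return kept


# Made-up rows for illustration only.
sample_rows = [
    "1\t100\tdata mining\tData mining",
    "2\t200\torganic chemistry\tOrganic chemistry",
]
pruned = prune_fos(sample_rows, ["Data Mining", "machine learning"])
```

With real data, `fos_lines` would be the open FieldsOfStudy file and `cs_keywords` the lines of the Springer keyword list.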
We further need to prune out the papers and references that do not relate to CS.
python3 prune_paper_edge.py
The resulting files are cspapers.txt and pruned_PR.txt.
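The idea behind this second pruning step can be sketched as follows. This is not the code in prune_paper_edge.py: the function name, the in-memory tuples, and the choice to keep a reference only when both of its endpoints are CS papers are assumptions made for the example.

```python
def prune_papers_and_edges(paper_fos, references, cs_fos_ids):
    """Return the CS paper IDs and the references among them.

    paper_fos:  iterable of (paper_id, fos_id) pairs
    references: iterable of (citing_paper_id, cited_paper_id) pairs
    cs_fos_ids: set of FoS IDs kept by the FoS pruning step
    """
    cs_papers = {paper for paper, fos in paper_fos if fos in cs_fos_ids}
    # Assumption: drop any edge that touches a non-CS paper.
    kept_edges = [
        (citing, cited)
        for citing, cited in references
        if citing in cs_papers and cited in cs_papers
    ]
    return cs_papers, kept_edges


# Made-up IDs for illustration only.
paper_fos = [("p1", "f1"), ("p2", "f1"), ("p3", "f9")]
references = [("p1", "p2"), ("p1", "p3"), ("p3", "p2")]
cs_papers, pruned_refs = prune_papers_and_edges(paper_fos, references, {"f1"})
```

Here `p3` belongs only to a non-CS field, so it and both edges touching it are removed.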
If any issue arises when running prune_fos.py or prune_paper_edge.py, please check the original code, which is more stable.
The preparation work only needs to be done once. To calculate the rank of papers given keywords, run
python3 academic_rank.py [keyword1,keyword2,...]
where keywords are separated by ',' and the words of a multi-word keyword are joined by '_'. For example:
python3 academic_rank.py computer_science,data_mining
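The argument format can be decoded with a few lines of Python. This is only a sketch of the convention described above: the helper name `parse_keywords` is hypothetical, and turning '_' back into a space is an assumption about how academic_rank.py treats multi-word keywords.

```python
def parse_keywords(argument):
    """Split a 'kw1,kw2' argument and turn '_' back into spaces."""
    return [part.replace("_", " ") for part in argument.split(",") if part]


keywords = parse_keywords("computer_science,data_mining")
```

For the example command, `keywords` is `["computer science", "data mining"]`.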
Since academic_rank.py outputs a list of paper IDs, the paper titles can be looked up from those IDs using the MAG API. See the methods and examples in visualization.ipynb for more information.
The accuracy of this program is not guaranteed: the vocabulary of the word2vec model is not large enough, so in most cases the keyword similarity cannot be calculated. Currently, the program assigns a dummy similarity to keywords that are not in the word2vec model.
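The fallback behaviour described above can be sketched as follows. This is not the project's code: a plain dict stands in for the word2vec vocabulary, and the function names, the toy vectors, and the `DUMMY_SIMILARITY` value are assumptions chosen for illustration.

```python
import math

DUMMY_SIMILARITY = 0.5  # hypothetical placeholder, not the project's value


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_similarity(vectors, first, second, fallback=DUMMY_SIMILARITY):
    """Use the fallback when either word is missing from the vocabulary."""
    if first not in vectors or second not in vectors:
        return fallback
    return cosine(vectors[first], vectors[second])


# Toy two-dimensional "embeddings" for illustration only.
vectors = {"graph": [1.0, 0.0], "network": [1.0, 1.0]}
known = keyword_similarity(vectors, "graph", "network")
unknown = keyword_similarity(vectors, "graph", "quantum")
```

Because every out-of-vocabulary keyword receives the same score, papers cannot be told apart on that keyword, which is why the ranking degrades when the vocabulary is small.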
- Haozhe Si
- Instructed by Professor Kevin Chang