Ehzoahis/AcademicRank

AcademicRank

2021SP FORWARD Lab Project

Introduction

The goal of this project is to rank academic works for a given keyword. The rank is calculated according to each paper's Fields of Study. The ranking algorithm is inspired by "The PageRank Citation Ranking: Bringing Order to the Web", with the assumption that the similarity between the papers and the target keywords can only be distributed once. Currently, the program only handles multi-word keywords, to ensure ranking accuracy.
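For readers unfamiliar with PageRank, here is a minimal sketch of a personalized PageRank by power iteration over a citation graph. It is a generic illustration, not the code in academic_rank.py: the function name, the `personalization` weights (standing in for keyword similarity), the damping factor, and the iteration count are all assumptions, and the sketch does not attempt to model the "distributed once" assumption described above.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=50):
    """Generic personalized PageRank (illustrative, not the project's code).

    edges: list of (citing_paper, cited_paper) pairs.
    personalization: dict mapping paper -> non-negative weight, used here
        as a hypothetical stand-in for keyword similarity.
    """
    nodes = set(personalization)
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
    nodes = sorted(nodes)

    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Normalize the weights into a teleport distribution.
    total = sum(personalization.values())
    if total > 0:
        teleport = {n: personalization.get(n, 0.0) / total for n in nodes}
    else:
        teleport = {n: 1.0 / len(nodes) for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by papers with no outgoing citations is redistributed
        # according to the teleport distribution.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {
            n: (1 - damping) * teleport[n] + damping * dangling * teleport[n]
            for n in nodes
        }
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for dst in out_links[n]:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

In this sketch a paper cited by many relevant papers accumulates rank from them, while the teleport weights bias the result toward papers that match the keyword.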

Installation

Install the dependencies listed in requirements.txt:

pip3 install -r requirements.txt

Datasets

Microsoft Academic Graph

The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, and fields of study. The schema of the dataset can be found here. Among the dataset files, we use:

  • FieldsOfStudy
  • PaperFieldsOfStudy
  • PaperReferences

The downloaded data can be found on owl3 server, path.

Springer-83K CS Keywords

CS keywords collected from Springer by Yanghui Pang. The dataset can be found here.

word2vec Model

The word2vec model is trained on the abstracts of papers in the arXiv dataset by Edward Ma. The model can be found here.

Usage

Build the Pruned MAG Dataset

To speed up the ranking algorithm, we first need to prune out the Fields of Study (FoS) that are not CS keywords.

python3 prune_fos.py

The resulting FoS list will be in pruned_FOS.txt.
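As a rough illustration of this pruning step, the sketch below keeps only the FoS rows whose name appears in a CS keyword list. It is not the code in prune_fos.py: the function name is hypothetical, and the tab-separated layout with the ID in column 0 and the normalized name in column 2 is an assumption that should be checked against the MAG schema.

```python
def prune_fields_of_study(fos_lines, cs_keywords, id_col=0, name_col=2):
    """Return {fos_id: name} for FoS rows whose name is a CS keyword.

    fos_lines: iterable of tab-separated rows (column layout is assumed).
    cs_keywords: iterable of keyword strings, matched case-insensitively.
    """
    keywords = {k.strip().lower() for k in cs_keywords if k.strip()}
    kept = {}
    for line in fos_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= max(id_col, name_col):
            continue  # skip malformed rows
        name = cols[name_col].strip().lower()
        if name in keywords:
            kept[cols[id_col]] = name
    return kept
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so the full FoS file is never loaded into memory at once.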

We further need to prune out the papers and references that do not relate to CS.

python3 prune_paper_edge.py

The resulting files are cspapers.txt and pruned_PR.txt.
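The second pruning step can be sketched in the same spirit: keep the papers tagged with at least one retained FoS, then keep only the citation edges whose endpoints are both retained papers. This is an illustrative sketch rather than the code in prune_paper_edge.py; the function name and the assumed column order (paper ID then FoS ID, and citing paper ID then cited paper ID) are assumptions.

```python
def prune_papers_and_edges(paper_fos_lines, reference_lines, kept_fos_ids):
    """Return (cs_papers, edges) restricted to the retained FoS.

    paper_fos_lines: tab-separated rows, assumed "paper_id<TAB>fos_id...".
    reference_lines: tab-separated rows, assumed "citing_id<TAB>cited_id".
    kept_fos_ids: set of FoS IDs that survived the first pruning step.
    """
    cs_papers = set()
    for line in paper_fos_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and cols[1] in kept_fos_ids:
            cs_papers.add(cols[0])

    edges = []
    for line in reference_lines:
        cols = line.rstrip("\n").split("\t")
        # Keep an edge only when both endpoints are retained papers.
        if len(cols) >= 2 and cols[0] in cs_papers and cols[1] in cs_papers:
            edges.append((cols[0], cols[1]))
    return cs_papers, edges
```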

If you run into issues when running prune_fos.py or prune_paper_edge.py, please check the original code, which is more stable.

Perform AcademicRank

The preparation work only needs to be done once. To calculate the rank of papers for the given keywords, run

python3 academic_rank.py [keyword1,keyword2,...]

where keywords are separated by ',' and the words of a multi-word keyword are joined by '_'. For example:

python3 academic_rank.py computer_science,data_mining
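The argument format above could be parsed along the following lines. This is a sketch only: the helper name is hypothetical, and turning '_' back into a space is an assumption about how the script treats multi-word keywords.

```python
def parse_keywords(arg):
    """Split a 'kw1,kw2' argument and turn '_' into spaces (assumed behavior)."""
    return [
        part.replace("_", " ").strip()
        for part in arg.split(",")
        if part.strip()  # ignore empty entries such as a trailing comma
    ]
```

With this helper, `parse_keywords("computer_science,data_mining")` yields `["computer science", "data mining"]`.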

Visualization

Since academic_rank.py outputs a list of paper IDs, we can look up the names of the papers from their IDs using the MAG API. See the methods and examples in visualization.ipynb for more information.

Limitations

The accuracy of this program is not guaranteed because the vocabulary of the word2vec model is not large enough, so the keyword similarity cannot be calculated in most cases. Currently, the program assigns a dummy similarity to keywords that are not in the word2vec model's vocabulary.
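The fallback behavior described above can be illustrated with a small sketch. It uses a plain dictionary of vectors as a stand-in for the word2vec model, and the function names and the default dummy value of 0.0 are assumptions; the value the program actually assigns is not stated here.

```python
import math

DUMMY_SIMILARITY = 0.0  # hypothetical placeholder, not the project's value


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def keyword_similarity(vectors, kw_a, kw_b, dummy=DUMMY_SIMILARITY):
    """Return the cosine similarity, or `dummy` if either keyword is
    missing from the vocabulary (here a dict standing in for the model)."""
    if kw_a not in vectors or kw_b not in vectors:
        return dummy
    return cosine(vectors[kw_a], vectors[kw_b])
```

Counting how often the dummy branch is taken for a given keyword set is one way to gauge how much a ranking depends on placeholder similarities.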

Author

2021SP FORWARD Lab Project by Haozhe Si
