- Download the Semantic Scholar corpus dump:

  ```sh
  aws s3 cp --no-sign-request --recursive s3://ai2-s2-research-public/open-corpus/2020-05-27/ corpus_data
  ```
- Upload this data to your bucket, which is specified in `~/.metaflowconfig/config.json`, to a path like `$S3_PATH/datasets/semantics_scholar_corpus_data/`:

  ```python
  from metaflow import S3
  import os

  # Derive the datasets root from the Metaflow datastore root configured in
  # ~/.metaflowconfig/config.json.
  data_path = '/'.join(S3.get_root_from_config([]).split('/')[:-1])
  s3_root = os.path.join(data_path, 'datasets')

  def sync_folder_to_s3(root_path, base_path='', s3_root=s3_root):
      sync_path = os.path.join(s3_root, base_path)
      # Collect (s3_key, local_path) pairs for every file under root_path.
      file_paths = [(os.path.normpath(os.path.join(r, file)), os.path.join(r, file))
                    for r, d, f in os.walk(root_path) for file in f]
      for normpth, pth in file_paths:
          with S3(s3root=s3_root) as s3:
              s3.put_files([(normpth, pth)])
              print(f"Finished Writing Path : {pth}")
      sync_path = os.path.join(sync_path, os.path.normpath(root_path))
      return sync_path, file_paths

  sync_path, file_paths = sync_folder_to_s3('corpus_data', base_path='semantics_scholar_corpus_data')
  ```
The flows follow this pattern:

- Create clean data from the large Semantic Scholar corpus; here, cleaning means removing all uncited material (a rough sketch of this step follows this list).
- Use the cleaned data to extract the Computer Science (CS) related articles.
- Use the CS related data to build a citation graph and classify articles against the CS ontology.
- Use the graph to compute PageRank and collate those results with the ontology results.
- Dump the final processed information to Elasticsearch.
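As a rough illustration of the cleaning step, the snippet below filters one corpus chunk down to cited material with pandas. This is a minimal sketch, not the actual `SemScholarCorpusFlow` code; it assumes the corpus chunks are gzipped JSON-lines files with `inCitations`/`outCitations` list fields, and the file name is only an example.

```python
import gzip
import json

import pandas as pd

def clean_chunk(chunk_path):
    """Keep only papers that participate in the citation graph."""
    records = []
    with gzip.open(chunk_path, "rt") as f:
        for line in f:
            paper = json.loads(line)
            # Drop papers that neither cite nor are cited by anything.
            if paper.get("inCitations") or paper.get("outCitations"):
                records.append(paper)
    return pd.DataFrame(records)

# Example: clean one downloaded chunk and store it as a CSV.
df = clean_chunk("corpus_data/s2-corpus-000.gz")
df.to_csv("usefull_citations.csv", index=False)
```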
- `citation_harvest_flow.py` contains `SemScholarCorpusFlow`, which extracts and removes all uncited material and saves a dataframe for each chunk under the bucket at `$CONFIG_S3_ROOT/processed_data/SemScholarCorpusFlow/s2-corpus-{i}/usefull_citations.csv`.
`CSDataExtractorFlow` ----Requires--> `SemScholarCorpusFlow`
- `cs_citation_extractor.py` contains `CSDataExtractorFlow`, which uses the chunks set by `SemScholarCorpusFlow` and extracts the Computer Science related articles (see the field-of-study filtering sketch after this list). It stores each chunk under the bucket at `$CONFIG_S3_ROOT/processed_data/CSDataExtractorFlow/s2-corpus-{i}/category_citations.csv`.
`CSDataConcatFlow` ----Requires--> `CSDataExtractorFlow`
- `concat_cs_flow.py` contains `CSDataConcatFlow`, which concatenates the chunks into a single dataframe. This doesn't turn out to be that useful, as the CSV grows to 22GB. The concatenated CSV is stored at `$CONFIG_S3_ROOT/processed_data/CSDataConcatFlow/cs-concat-data.csv`.
`CSOntologyClassificationFlow` ----Requires--> `CSDataExtractorFlow`
- `ontology_classify_flow.py` contains `CSOntologyClassificationFlow`. It uses the `cso_classifier` (TODO: push the submodule that works as a relative import and link it to the repo) to classify the data in the chunked CSVs according to the ontology described here. The chunked dataframes are stored at `$CONFIG_S3_ROOT/processed_data/CSOntologyClassificationFlow/s2-corpus-{i}/ontology_processed.csv`.
`CSGraphBuilderFlow` ----Requires--> `CSDataExtractorFlow`
- `ontology_classify_flow.py` contains `CSGraphBuilderFlow`, which uses the `inCitations` and `outCitations` information from the chunked CSVs stored by `CSDataExtractorFlow` and stores the citation graph to S3 (a graph-building sketch follows this list). The graph is stored under `$CONFIG_S3_ROOT/processed_data/CSGraphBuilderFlow/citation_network_graph.json`.
`CSPageRankFinder` ----Requires--> `CSGraphBuilderFlow`
- `page_rank_flow.py` uses the graph stored via `CSGraphBuilderFlow` and performs PageRank based on the flow's parameters (see the PageRank sketch after this list). It finally stores the rank dictionary to `$CONFIG_S3_ROOT/processed_data/CSPageRankFinder/page-rank-{run-id}.json`.
`PageRankCollateFlow` ----Requires--> `CSPageRankFinder`
`PageRankCollateFlow` ----Requires--> `CSOntologyClassificationFlow`
- `collate_page_rank_flow.py` contains the flow that collates the PageRank results from the `CSPageRankFinder` flow with the ontology classification results from `CSOntologyClassificationFlow` (a collation sketch follows this list). It finally stores chunked CSVs to `$CONFIG_S3_ROOT/processed_data/PageRankCollateFlow/s2-corpus-{i}/ontology_processed.csv`.
- Extracts the CSV chunks created by `PageRankCollateFlow` (which contain the collated ontology and PageRank results) and pushes the data to Elasticsearch (a bulk-indexing sketch follows this list).
- Utility module for data syncing, S3 lookups, and data-root path fixing when creating the flows.
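The sketches below illustrate, in heavily simplified form, the steps referenced in the list above; none of them is the actual flow code.

For the CS article extraction, something like the following pandas filter would do; the `fieldsOfStudy` column name (and it being a stringified list in the CSV) is an assumption about the chunk schema, not taken from the flow itself.

```python
import pandas as pd

def extract_cs_articles(chunk_csv):
    """Keep only rows tagged with Computer Science (assumed schema)."""
    df = pd.read_csv(chunk_csv)
    # fieldsOfStudy is assumed to look like "['Computer Science']" after CSV round-tripping.
    mask = df["fieldsOfStudy"].fillna("").str.contains("Computer Science")
    return df[mask]

cs_df = extract_cs_articles("usefull_citations.csv")
cs_df.to_csv("category_citations.csv", index=False)
```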
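The graph-building step can be sketched with NetworkX (the library the TODO below wants to replace). The `id` column name and the assumption that `inCitations`/`outCitations` survive as stringified lists in the chunk CSVs are guesses about the schema.

```python
import ast
import json

import networkx as nx
import pandas as pd

def parse_ids(cell):
    """inCitations/outCitations are assumed to be stringified lists in the CSV."""
    return ast.literal_eval(cell) if isinstance(cell, str) else []

def add_chunk_to_graph(graph, chunk_csv):
    """Add citation edges from one CSDataExtractorFlow chunk to the graph."""
    df = pd.read_csv(chunk_csv)
    for _, row in df.iterrows():
        paper_id = row["id"]
        for citing in parse_ids(row["inCitations"]):
            graph.add_edge(citing, paper_id)   # citing paper -> this paper
        for cited in parse_ids(row["outCitations"]):
            graph.add_edge(paper_id, cited)    # this paper -> cited paper
    return graph

graph = nx.DiGraph()
graph = add_chunk_to_graph(graph, "category_citations.csv")

# Serialise the graph as JSON, mirroring citation_network_graph.json.
with open("citation_network_graph.json", "w") as f:
    json.dump(nx.node_link_data(graph), f)
```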
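The PageRank step then boils down to loading that JSON graph and running PageRank over it; `alpha` stands in for whatever parameters `CSPageRankFinder` actually exposes, and the output file name omits the `{run-id}` suffix.

```python
import json

import networkx as nx

# Load the graph produced by the graph-building step.
with open("citation_network_graph.json") as f:
    graph = nx.node_link_graph(json.load(f))

# PageRank over the citation network; alpha is the usual damping factor.
ranks = nx.pagerank(graph, alpha=0.85)

# Persist the {paper_id: score} dictionary, mirroring page-rank-{run-id}.json.
with open("page-rank.json", "w") as f:
    json.dump(ranks, f)
```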
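Collating the PageRank scores with the ontology chunks can be as simple as mapping the score dictionary onto each chunk; the `id` and `page_rank` column names here are illustrative, not the flow's actual schema.

```python
import json

import pandas as pd

# Scores produced by the PageRank step: {paper_id: score}.
with open("page-rank.json") as f:
    ranks = json.load(f)

def collate_chunk(ontology_csv, out_csv):
    """Attach each paper's PageRank score to its ontology-classified row."""
    df = pd.read_csv(ontology_csv)
    df["page_rank"] = df["id"].map(ranks).fillna(0.0)
    df.to_csv(out_csv, index=False)

collate_chunk("ontology_processed.csv", "ontology_with_pagerank.csv")
```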
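Finally, pushing the collated chunks into Elasticsearch could use the standard bulk helper from the `elasticsearch` Python client; the host, index name, and document shape are assumptions.

```python
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def index_chunk(chunk_csv, index_name="cs-papers"):
    """Bulk-index one collated chunk into Elasticsearch."""
    df = pd.read_csv(chunk_csv).fillna("")  # avoid NaN values in the JSON documents
    actions = (
        {"_index": index_name, "_source": row.to_dict()}
        for _, row in df.iterrows()
    )
    bulk(es, actions)

index_chunk("ontology_with_pagerank.csv")
```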
TODO:

- Clean up READMEs.
- Add more information about running the flows properly.
- Create a flow dependency graph.
- Replace the memory-hogging NetworkX with https://github.com/VHRanger/CSRGraph for PageRank.