cu-library/etdscraper

etdscraper

This project scrapes metadata with Scrapy, writes the data to JSON files, and then compares those files to check certain fields, reporting any that do not match.

Setup:

Make sure you are working inside your virtual environment, then install Scrapy:

pip install scrapy

The main files are curvespider.py and hyraxspider.py, both of which scrape the data we are looking for.

Both the hyrax and curve spiders can be configured to pull every thesis with all of its data: change the callback in the parse method to parse_all_pages, and the spider will grab every thesis on curve or hyrax. The other two parse methods handle a single page or a provided JSON file. Run each spider with the commands below; each one writes its output to a JSON file.

scrapy crawl curvespider -o curve_output.json

scrapy crawl hyraxspider -o hyrax_output.json

The last part is cuirator_compare.py.

It compares all the fields in the two output files and then dumps any mismatches into a JSON file.
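The comparison step could look roughly like the sketch below. It is not the code in cuirator_compare.py. It assumes each output file is a JSON array of records (which is what Scrapy's JSON feed export writes), and the key name `identifier`, the field names `title` and `author`, and every function name are hypothetical placeholders chosen for illustration.

```python
import json
from pathlib import Path


def index_by(records, key):
    """Map each record's `key` value to the record, skipping records without it."""
    return {rec[key]: rec for rec in records if key in rec}


def compare_records(curve_records, hyrax_records, key, fields):
    """Return a list of mismatch reports for two lists of scraped records."""
    curve = index_by(curve_records, key)
    hyrax = index_by(hyrax_records, key)
    errors = []

    # Records found on both sides: check each requested field.
    for ident in sorted(curve.keys() & hyrax.keys()):
        for field in fields:
            left = curve[ident].get(field)
            right = hyrax[ident].get(field)
            if left != right:
                errors.append(
                    {key: ident, "field": field, "curve": left, "hyrax": right}
                )

    # Records found on only one side are reported as missing from the other.
    for ident in sorted(curve.keys() ^ hyrax.keys()):
        missing_from = "hyrax" if ident in curve else "curve"
        errors.append({key: ident, "field": None, "missing_from": missing_from})

    return errors


def compare_files(curve_path, hyrax_path, out_path,
                  key="identifier", fields=("title", "author")):
    """Load both spider outputs, compare them, and write the report as JSON."""
    # `key` and `fields` defaults are assumed names, not the project's real schema.
    curve_records = json.loads(Path(curve_path).read_text(encoding="utf-8"))
    hyrax_records = json.loads(Path(hyrax_path).read_text(encoding="utf-8"))
    errors = compare_records(curve_records, hyrax_records, key, list(fields))
    Path(out_path).write_text(json.dumps(errors, indent=2), encoding="utf-8")
    return errors
```

With the two files produced by the crawl commands, a call such as `compare_files("curve_output.json", "hyrax_output.json", "errors.json")` would write one entry per mismatched field, plus one entry for each record that appears in only one of the files. The output file name `errors.json` is also only an example.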
