etdscraper

This project scrapes thesis metadata with Scrapy, writes the results to JSON files, and then compares those files field by field, reporting any fields that do not match.

Setup:

Make sure you are working inside your virtual environment, then install Scrapy:

pip install scrapy

The main files are curvespider.py and hyraxspider.py, both of which scrape the data we are looking for.

Both the hyrax and curve spiders can be configured to pull every thesis and all of its data: change the parse method so that its callback is parse_all_pages, and the spider will grab all theses on curve or hyrax. The other two parse methods handle a single page or a provided JSON file. Run each spider with the commands below; the results are written to a JSON file.

scrapy crawl curvespider -o curve_output.json

scrapy crawl hyraxspider -o hyrax_output.json

The last part is cuirator_compare.py.

It compares all the fields in the two output files and writes any mismatches it finds to a JSON file afterwards.
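The comparison step can be sketched roughly as follows. This is an illustration of the idea, not the contents of cuirator_compare.py: the function names, the "identifier" key used to pair records, and the field names passed in are assumptions made for the example.

```python
import json


def load_records(path, key="identifier"):
    """Load a Scrapy JSON export (a list of objects) and index it by `key`.

    The "identifier" key is an assumed field name; records without it are skipped.
    """
    with open(path, encoding="utf-8") as fh:
        return {rec[key]: rec for rec in json.load(fh) if key in rec}


def compare_records(curve, hyrax, fields):
    """Return a list of mismatches between two dicts of records."""
    mismatches = []
    for ident, curve_rec in curve.items():
        hyrax_rec = hyrax.get(ident)
        if hyrax_rec is None:
            # The record exists on one side only.
            mismatches.append({"identifier": ident, "error": "missing from hyrax"})
            continue
        for field in fields:
            if curve_rec.get(field) != hyrax_rec.get(field):
                mismatches.append({
                    "identifier": ident,
                    "field": field,
                    "curve": curve_rec.get(field),
                    "hyrax": hyrax_rec.get(field),
                })
    return mismatches


def write_report(curve_path, hyrax_path, report_path, fields):
    """Compare two export files and dump the mismatches to a JSON report."""
    mismatches = compare_records(
        load_records(curve_path), load_records(hyrax_path), fields
    )
    with open(report_path, "w", encoding="utf-8") as fh:
        json.dump(mismatches, fh, indent=2)
    return mismatches
```

With the output files from the commands above, a call such as write_report("curve_output.json", "hyrax_output.json", "mismatches.json", ["title", "author"]) would produce the report; the report filename and field list here are placeholders.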
