NYT Crawler

Media Analytics is a web app that lets anyone query a large corpus of journalistic data using natural language processing tools. My role in the project centered on data collection, specifically articles from the New York Times archive. The NLP model needed to support word-usage frequency over the last 100+ years, which required collecting millions of articles. To accomplish this, I learned the Scrapy web crawling framework in depth and built a Spider that crawls the NYT archive, follows the relevant article links, and scrapes the needed items from each page.
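To illustrate why each scraped article needs both a publication date and its text, here is a minimal sketch of the kind of per-year word-frequency count the collected data is meant to support. It uses only the Python standard library, not Scrapy or the project's actual NLP model. The item shape (`date` and `text` keys), the sample articles, and the function names `word_counts_by_year` and `relative_frequency` are all assumptions made for illustration; the real spider's fields are not described in this README.

```python
import re
from collections import Counter, defaultdict

# Hypothetical scraped items: the field names and contents are made up here.
articles = [
    {"date": "1923-05-01", "text": "The railway strike ended. The strike lasted a week."},
    {"date": "1923-11-12", "text": "Radio sets sell quickly."},
    {"date": "2001-03-09", "text": "The internet boom ended."},
]


def word_counts_by_year(items):
    """Return a mapping of year -> Counter of lowercase words."""
    counts = defaultdict(Counter)
    for item in items:
        year = int(item["date"][:4])  # assumes ISO-style YYYY-MM-DD dates
        words = re.findall(r"[a-z']+", item["text"].lower())
        counts[year].update(words)
    return counts


def relative_frequency(counts, word, year):
    """Share of all words seen in `year` that are `word` (0.0 if no data)."""
    total = sum(counts[year].values())
    return counts[year][word] / total if total else 0.0


counts = word_counts_by_year(articles)
print(counts[1923]["strike"])
print(relative_frequency(counts, "strike", 1923))
```

Normalizing by the total number of words per year matters because the number of archived articles varies widely from year to year, so raw counts alone would not be comparable across a century.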

Built With

Scrapy - Web crawling framework

Authors

Fawaz Dinnunhan

fawazd/NYT-Crawler
