Skip to content

jeremyholcombe/pdf-scraping-R

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

pdf-scraping-R

Small project to extract the majors from university commencement programs stored in PDF format, using R. PDFs are notoriously difficult to scrape as there is often little structure to how the information is displayed on the page.

This program extracts the data from Bowdoin College, first converting the PDF to raw text, then employing a variety of techniques to parse the text using regular expressions and statistical measures. This program could be extended to similarly formatted PDFs, although each new PDF would likely need to receive individual treatment.

About

Scraping PDFs using R

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages