SEC EDGAR Oil Contracts Finder

This project aims to find full oil contract bodies among the filings that oil companies submit to the U.S. Securities and Exchange Commission (SEC), the American stock regulator.

We've used a variety of approaches:

Download and store SEC filings in the appropriate SIC classes using a Hadoop cluster. The resulting corpus is a JSON stream with an entry for each document filed since 1995.
Score the documents using a second Hadoop cluster by counting terms that indicate an oil contract. Each score is considered in two forms: normalized by the total number of words in the filed document, and as an absolute count (we deliberately want to bias towards long texts).
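
The scoring step can be sketched in plain Python as follows. This is a minimal illustration, not the project's Hadoop job: the search terms, the JSON field names (`id`, `text`) and the function names are all assumptions made for the example.

```python
import json
import re

# Hypothetical search terms, chosen only for illustration.
SEARCH_TERMS = ["production sharing", "royalty", "concession"]

# Simple word pattern used to count the total words in a document.
WORD_RE = re.compile(r"[a-z0-9']+")


def score_text(text, terms=SEARCH_TERMS):
    """Return (absolute, normalized) scores for one document.

    absolute   -- total number of occurrences of all search terms
    normalized -- absolute count divided by the document's word count
    """
    lowered = text.lower()
    total_words = len(WORD_RE.findall(lowered))
    # Substring counting: a term also matches inside longer words.
    absolute = sum(lowered.count(term) for term in terms)
    normalized = absolute / total_words if total_words else 0.0
    return absolute, normalized


def score_stream(lines, text_field="text", id_field="id"):
    """Score every document in an iterable of JSON lines.

    The field names are assumptions; the real corpus may use others.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        absolute, normalized = score_text(doc.get(text_field, ""))
        yield doc.get(id_field), absolute, normalized


if __name__ == "__main__":
    sample = [
        json.dumps({"id": "a", "text": "The royalty under this production "
                                       "sharing agreement is payable monthly."}),
        json.dumps({"id": "b", "text": "Annual report on general corporate matters."}),
    ]
    for row in score_stream(sample):
        print(row)
```

Keeping both numbers lets a ranking favor long documents through the absolute count while still using the normalized score to spot short texts that are dense in contract language.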

We've also used a set of confirmed positive and negative matches to generate a set of "watershed" terms, which occur only in the contract documents and not in any others. These terms were used to generate a search list automatically for a second phase of ranking.
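
One way to derive such watershed terms is a set difference between the vocabularies of the two training sets. The sketch below assumes that a term qualifies if it appears in at least one positive document and in no negative document; the tokenizer, the stopword handling and the function names are hypothetical and may differ from what the project does.

```python
import re

# Simple tokenizer: lowercase alphabetic words only.
WORD_RE = re.compile(r"[a-z]+")


def tokenize(text, stopwords=frozenset()):
    """Return the set of distinct non-stopword terms in a text."""
    return {w for w in WORD_RE.findall(text.lower()) if w not in stopwords}


def watershed_terms(positives, negatives, stopwords=frozenset()):
    """Terms seen in at least one positive text and in no negative text."""
    pos_vocab = set().union(*(tokenize(t, stopwords) for t in positives))
    neg_vocab = set().union(*(tokenize(t, stopwords) for t in negatives))
    return sorted(pos_vocab - neg_vocab)


if __name__ == "__main__":
    positives = [
        "The contractor shall pay royalty",
        "Royalty on crude oil",
    ]
    negatives = ["The company shall pay dividends on oil revenue"]
    print(watershed_terms(positives, negatives, stopwords={"the", "on"}))
```

A stricter variant could require a term to appear in a minimum share of the positive documents, which would make the resulting search list less sensitive to one-off words.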

Contact:

@Open_Oil, @pudo


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SEC EDGAR Oil Contracts Finder

About

Releases

Packages

Languages

pudo/edgar-oil-contracts

Folders and files

Latest commit

History

Repository files navigation

SEC EDGAR Oil Contracts Finder

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages