The UBC Sauder School of Business group has a large database of U.S. Securities and Exchange Commission (SEC) filings, Wharton financial fundamental data, and stock price history. In this project, we seek to build a framework for leveraging SEC filings to obtain industry intelligence. Specifically, we are given two prediction problems: classifying firm survival and predicting firm performance.
This is a data-intensive project. All files required to run it are in the /data/ folder.
Many of these files can and should be changed depending on the target of your analysis. Please see the data README here for more information.
This project can be executed in two main ways: by using Make (documented here) or by running the Python scripts individually (documented here).
A brief overview of the data pipeline can be seen below:
An overview of the unit tests for every function in the project can be found here.
Topic Analysis: We use Latent Dirichlet Allocation (LDA) and non-negative matrix factorization (NMF) to model the topics found in item1 and item7 of the SEC filings.
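As a rough illustration of the two topic models named above, here is a minimal sketch using scikit-learn, which is listed among the project's dependencies. The toy documents, the `top_words` helper, the vectorizer settings, and the choice of three topics are all assumptions made for this example; they are not taken from the project's own scripts.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for item1/item7 text (assumption: real filings are far longer).
docs = [
    "clinical trials drug patients medical device approval",
    "drug approval patients hospital medical treatment",
    "oil gas drilling exploration pipeline reserves",
    "gas pipeline drilling reserves oil production",
    "software cloud subscription customers platform revenue",
    "cloud platform software customers subscription growth",
]


def top_words(model, vectorizer, n_words=5):
    """Return the n_words highest-weighted terms for each topic (hypothetical helper)."""
    vocab = vectorizer.vocabulary_
    terms = sorted(vocab, key=vocab.get)  # position i holds the term with column index i
    return [
        [terms[i] for i in topic.argsort()[::-1][:n_words]]
        for topic in model.components_
    ]


n_topics = 3  # assumption; the README's own experiment uses 10 topics

# LDA is fitted on raw term counts.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
lda_doc_topics = lda.fit_transform(counts)

# NMF is fitted here on TF-IDF weights, a common but not mandatory pairing.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)
nmf = NMF(n_components=n_topics, random_state=0, max_iter=500)
nmf_doc_topics = nmf.fit_transform(tfidf)

for name, model, vec in [("LDA", lda, count_vec), ("NMF", nmf, tfidf_vec)]:
    for k, words in enumerate(top_words(model, vec)):
        print(name, "topic", k, ":", " ".join(words))
```

Each row of `lda_doc_topics` is a topic distribution for one document, while `nmf_doc_topics` holds non-negative topic weights that are not normalized.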
Word distribution testing with 10 topics: Below is the word distribution of CV-LDA with 1000 filings. The selected topic is topic 2 - medical.
Sentiment Analysis: We extract polarity, subjectivity, and certainty scores from item1 and item7 of the SEC filings.
An example is shown below:
- pandas >= 0.22
- numpy >= 1.14.3
- matplotlib >= 2.1.1
- mongoengine >= 0.15
- missingno >= 0.4.0
- gensim >= 3.4.0
- tqdm >= 4.23.2
- keras >= 2.1.6
- nltk >= 3.3
- tensorflow >= 1.8.0
- sklearn >= 0.19.1
- textblob >= 0.15.1
- pattern >= 2.6
- seaborn >= 0.8.1
- plotly >= 2.7.0
- bokeh >= 0.12.16
To contribute to this project, please see our Contributing Guidelines.
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
For more information, please see our Code of Conduct.