This project analyzes Berkshire Hathaway's annual letters using Natural Language Processing in Python. The approaches include three types of extractive summarization (LexRank, TextRank, and Latent Semantic Analysis) and topic modeling with both the Mallet wrapper from Gensim and a Java version of Mallet LDA.
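To give a feel for what extractive summarization does, here is a minimal TextRank-style sketch in plain Python. It is not the code used in the notebook. The function names, the naive sentence splitter, and the parameter defaults are all assumptions chosen for illustration. It scores each sentence by its word overlap with the others and returns the highest-ranked sentences in their original order.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity, normalized by the log of each sentence's size."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    """Rank sentences with a weighted PageRank and keep the top ones."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return sentences

    # Symmetric similarity graph with no self-loops.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]

    # Power iteration for a fixed number of steps.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]  # restore original order
```

Calling `textrank_summary(letter_text, n_sentences=5)` on the text of one letter would return its five most central sentences under this simplified scoring. A real run would use a proper sentence tokenizer and stop-word removal.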
If you plan to run this code, set the file locations and shortcuts for Mallet to the corresponding paths on your computer; otherwise, the code will not run. I recommend not running the notebook that scrapes the letters and using the letters included here instead, because Berkshire's website usually denies me access when I scrape multiple letters at once.
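One way to avoid hard-coding the Mallet location is to read it from an environment variable and fail early with a clear message. The sketch below is only a suggestion and is not how the notebook is set up: the helper name `resolve_mallet_path` is hypothetical, and it assumes a standard Mallet download where the launcher sits under `bin/` (`mallet` on Unix-like systems, `mallet.bat` on Windows).

```python
import os
from pathlib import Path


def resolve_mallet_path(env_var="MALLET_HOME"):
    """Return the path to the Mallet launcher, or raise if it cannot be found.

    Assumes env_var points at the unpacked Mallet directory.
    """
    home = os.environ.get(env_var)
    if not home:
        raise RuntimeError(f"Set {env_var} to your Mallet installation directory.")

    # Pick the launcher that matches the current operating system.
    launcher = "mallet.bat" if os.name == "nt" else "mallet"
    binary = Path(home) / "bin" / launcher
    if not binary.is_file():
        raise FileNotFoundError(f"No Mallet launcher found at {binary}")
    return str(binary)
```

The returned string can then be used wherever the notebook expects the Mallet path, so only the environment variable needs to change from one machine to another.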
Also note that the final topics change from run to run, even with the same random seed, so the final topic distributions in the notebook differ slightly from those in the report.