WebScraping_TextMining

This repository contains some of my practice notebooks on web scraping and text mining of patent citations and abstracts.

'WebScrapingandAPI_patentcitationdata': Uses web scraping and APIs (pandas, BeautifulSoup, google-patent-api) to collect citation data on GAA patents.

'WebScraping_PatentAbstracts_CPCcodes': Uses web scraping to collect patent abstracts and CPC codes, focusing on patents related to 'renewable energy'.
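
As a rough illustration of the extraction step, the sketch below pulls an abstract and CPC codes out of an HTML snippet. It is not the notebook's code: it uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra installs, and the markup, the class names `abstract` and `cpc`, and the sample text are all hypothetical. Real patent pages are structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup and illustrative sample data, not taken from a real patent page.
SAMPLE_HTML = """
<html><body>
  <div class="abstract">A wind turbine blade with improved lift.</div>
  <ul>
    <li class="cpc">F03D1/0675</li>
    <li class="cpc">Y02E10/72</li>
  </ul>
</body></html>
"""


class PatentPageParser(HTMLParser):
    """Collects text from elements whose class is 'abstract' or 'cpc'."""

    def __init__(self):
        super().__init__()
        self.abstract = ""
        self.cpc_codes = []
        self._current = None  # class of the element being read, if relevant

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("abstract", "cpc"):
            self._current = cls

    def handle_endtag(self, tag):
        # The sample markup has no nesting inside the target elements,
        # so any closing tag ends the current capture.
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "abstract":
            self.abstract += text
        elif self._current == "cpc":
            self.cpc_codes.append(text)


parser = PatentPageParser()
parser.feed(SAMPLE_HTML)
print(parser.abstract)
print(parser.cpc_codes)
```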

'TextMining_spaCyTokenizer_TopicModeling_TextClassification': Builds a custom tokenizer with spaCy and integrates it into scikit-learn's CountVectorizer to create bag-of-words (BoW) and tf-idf features. Performs topic modeling on the tf-idf features. Trains Logistic Regression and Support Vector Classifier models to predict patent CPC codes, and compares test accuracy between models using BoW and tf-idf as input features.
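
To make the two feature types concrete, here is a minimal standard-library sketch of what a custom tokenizer feeding a BoW and tf-idf pipeline produces. It is not the notebook's code: `tokenize` is a regex stand-in for the spaCy tokenizer, the stopword list and sample documents are made up, and the tf-idf weighting assumes a smoothed idf of the form ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization, which is the form scikit-learn documents for its default settings.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a spaCy pipeline would supply its own.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "with", "to", "in"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (stand-in for spaCy)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and a document-by-term count matrix."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def tfidf(matrix):
    """Weight counts by smoothed idf, then L2-normalize each row."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in matrix:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result


# Illustrative sample documents, not real patent abstracts.
docs = [
    "A solar panel with a tracking mount",
    "Wind turbine blade for a wind farm",
    "Solar cell and solar panel coating",
]
vocab, bow = bag_of_words(docs)
weights = tfidf(bow)
print(vocab)
print(bow[0])
print([round(w, 3) for w in weights[0]])
```

In the first document, "tracking" and "solar" both occur once, but "tracking" appears in fewer documents, so tf-idf gives it the larger weight while BoW treats the two equally. That difference is what the BoW versus tf-idf accuracy comparison probes.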

'TextGraph_Token&Document': Uses bag-of-words and tf-idf to generate a token-token graph and a document-document graph.
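
One possible way to derive both graphs from BoW counts is sketched below. The edge definitions are assumptions, not necessarily the ones used in the notebook: token-token edges are weighted by the number of documents in which both tokens appear, and document-document edges by cosine similarity of the count vectors. The document names and texts are made up.

```python
import math
from collections import Counter
from itertools import combinations

# Illustrative, already-tokenized sample documents.
docs = {
    "d1": "solar panel tracking mount",
    "d2": "wind turbine blade",
    "d3": "solar panel coating",
}
bow = {name: Counter(text.split()) for name, text in docs.items()}

# Token-token graph: edge weight = number of documents containing both tokens.
token_edges = Counter()
for counts in bow.values():
    for a, b in combinations(sorted(counts), 2):
        token_edges[(a, b)] += 1


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Document-document graph: keep an edge only when the similarity is non-zero.
doc_edges = {}
for a, b in combinations(sorted(bow), 2):
    sim = cosine(bow[a], bow[b])
    if sim > 0:
        doc_edges[(a, b)] = sim

print(token_edges.most_common(3))
print(doc_edges)
```

Swapping the raw counts for tf-idf weights changes only the vectors passed to `cosine`; the edge lists can then be loaded into a graph library for layout and analysis.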