WebScraping_TextMining

This repository contains my practice notebooks on web scraping and text mining of patent citations and abstracts.

'WebScrapingandAPI_patentcitationdata': Uses web scraping and APIs (with Pandas, BeautifulSoup, and google-patent-api) to collect citation data on GAA patents.

'WebScraping_PatentAbstracts_CPCcodes': Uses web scraping to collect patent abstracts and CPC codes, focusing on patents related to 'renewable energy'.
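As a rough illustration of the scraping step, the sketch below parses an abstract and a list of CPC codes out of an HTML page with BeautifulSoup. The HTML snippet, the `abstract` and `cpc` class names, and the `parse_patent_page` helper are all made up for this example; a real patent page will use different markup, so the selectors would need to be adapted.

```python
from bs4 import BeautifulSoup

# Hypothetical page markup, used only so the example is self-contained.
SAMPLE_HTML = """
<html><body>
  <div class="abstract">A photovoltaic module with improved
     heat dissipation.</div>
  <ul class="cpc">
    <li>H01L31/052</li>
    <li>Y02E10/50</li>
  </ul>
</body></html>
"""


def parse_patent_page(html):
    """Extract the abstract text and CPC codes from one page (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    abstract_node = soup.select_one("div.abstract")
    # Collapse line breaks and repeated spaces inside the abstract.
    abstract = " ".join(abstract_node.get_text().split()) if abstract_node else ""
    cpc_codes = [li.get_text(strip=True) for li in soup.select("ul.cpc li")]
    return {"abstract": abstract, "cpc_codes": cpc_codes}


record = parse_patent_page(SAMPLE_HTML)
print(record["abstract"])
print(record["cpc_codes"])
```

In practice the HTML would come from an HTTP request rather than a string literal, and the resulting records could be collected into a Pandas DataFrame.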

'TextMining_spaCyTokenizer_TopicModeling_TextClassification': Builds a custom tokenizer with spaCy and integrates it into scikit-learn's CountVectorizer to create bag-of-words (BoW) and tf-idf features. Performs topic modeling on the tf-idf features. Trains Logistic Regression and Support Vector Classifier models to predict patent CPC codes, and compares test accuracy between models that use BoW and tf-idf as input features.
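The sketch below shows one way a custom tokenizer can be plugged into scikit-learn vectorizers and how the four feature/model combinations might be compared. It is not the notebook's code: a small regex tokenizer stands in for the spaCy-based one so that the example runs without a language model, and the toy abstracts, the stop-word list, the labels, and the `compare_models` helper are assumptions made for illustration.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny illustrative stop-word list (a spaCy tokenizer would supply its own).
STOP_WORDS = {"a", "an", "and", "for", "from", "in", "of", "the", "to", "with"}


def custom_tokenizer(text):
    """Lowercase, keep alphabetic tokens, drop stop words (stand-in for spaCy)."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOP_WORDS]


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Return test accuracy for every (features, classifier) combination."""
    vectorizers = {"bow": CountVectorizer, "tfidf": TfidfVectorizer}
    classifiers = {
        "logreg": lambda: LogisticRegression(max_iter=1000),
        "svc": lambda: SVC(kernel="linear"),
    }
    results = {}
    for vec_name, vec_cls in vectorizers.items():
        for clf_name, make_clf in classifiers.items():
            # token_pattern=None because the custom tokenizer replaces the default regex.
            vectorizer = vec_cls(tokenizer=custom_tokenizer, token_pattern=None)
            model = make_pipeline(vectorizer, make_clf())
            model.fit(train_texts, train_labels)
            results[(vec_name, clf_name)] = model.score(test_texts, test_labels)
    return results


# Made-up abstracts with illustrative class labels.
train_texts = [
    "A solar panel for converting sunlight to electricity",
    "Wind turbine blade with improved energy capture",
    "A transistor with a gate electrode and a channel region",
    "Semiconductor device with a doped silicon substrate",
]
train_labels = ["Y02E", "Y02E", "H01L", "H01L"]
test_texts = ["Solar energy from sunlight", "Gate electrode of a transistor"]
test_labels = ["Y02E", "H01L"]

for combo, accuracy in compare_models(train_texts, train_labels, test_texts, test_labels).items():
    print(combo, accuracy)
```

With a real corpus, the texts would be split with a proper train/test split, and the regex tokenizer would be replaced by a function that runs each text through a spaCy pipeline and returns the filtered tokens.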

'TextGraph_Token&Document': Uses bag-of-words and tf-idf representations to generate a token-token graph and a document-document graph.
