AidenJiang01/WebScraping_TextMining

WebScraping_TextMining

This repository contains my practice notebooks on web scraping and text mining of patent citations and abstracts.

'WebScrapingandAPI_patentcitationdata': Uses web scraping and APIs (Pandas, BeautifulSoup, google-patent-api) to collect citation data on GAA patents.

'WebScraping_PatentAbstracts_CPCcodes': Uses web scraping to collect patent abstracts and CPC codes, focusing on patents related to 'renewable energy'.
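
As a rough illustration of the extraction step, here is a minimal sketch that pulls an abstract and CPC codes out of an HTML page. It is not the notebook's code: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra dependencies, and the HTML snippet, the `abstract` and `cpc-code` class names, and the `PatentPageParser` class are all hypothetical. Real patent pages use different markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the class names are assumptions for this sketch.
SAMPLE_HTML = """
<html><body>
  <div class="abstract">A wind turbine blade with improved efficiency.</div>
  <ul>
    <li class="cpc-code">F03D1/0675</li>
    <li class="cpc-code">Y02E10/72</li>
  </ul>
</body></html>
"""


class PatentPageParser(HTMLParser):
    """Collects the abstract text and the list of CPC codes from one page."""

    def __init__(self):
        super().__init__()
        self.abstract = ""
        self.cpc_codes = []
        self._field = None  # which field the parser is currently inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "abstract" in classes:
            self._field = "abstract"
        elif "cpc-code" in classes:
            self._field = "cpc"

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "abstract":
            self.abstract += text
        else:
            self.cpc_codes.append(text)

    def handle_endtag(self, tag):
        # Leave the current field when any element closes.
        self._field = None


parser = PatentPageParser()
parser.feed(SAMPLE_HTML)
print(parser.abstract)
print(parser.cpc_codes)
```

With BeautifulSoup the same idea would typically be expressed through element lookups by tag and class, applied to pages fetched over HTTP.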

'TextMining_spaCyTokenizer_TopicModeling_TextClassification': Builds a custom tokenizer with spaCy and plugs it into scikit-learn's CountVectorizer to create bag-of-words (BoW) and tf-idf features. Performs topic modeling on the tf-idf features. Trains Logistic Regression and Support Vector Classifier models to predict patent CPC codes, and compares test accuracy between models using BoW and tf-idf as input features.
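
The comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: a plain regex tokenizer (`simple_tokenizer`, hypothetical) stands in for the spaCy-based one, the abstracts and labels are invented toy data, and `SVC(kernel="linear")` is one possible choice of Support Vector Classifier. It requires scikit-learn.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Invented toy abstracts with made-up CPC section labels (illustration only).
train_docs = [
    "solar cell with improved photovoltaic efficiency",
    "photovoltaic module for solar energy conversion",
    "solar panel mounting and energy storage",
    "wind turbine blade with reduced noise",
    "rotor and generator for a wind turbine",
    "offshore wind turbine tower foundation",
]
train_labels = ["H", "H", "H", "F", "F", "F"]
test_docs = ["solar photovoltaic panel", "wind turbine rotor blade"]
test_labels = ["H", "F"]

STOPWORDS = {"a", "an", "the", "of", "for", "and", "to", "in", "with"}


def simple_tokenizer(text):
    """Stand-in for a spaCy tokenizer: lowercase, alphabetic tokens, no stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def fit_and_score(vectorizer, model):
    """Fit the vectorizer and model on the training set, return test accuracy."""
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    model.fit(X_train, train_labels)
    return model.score(X_test, test_labels)


results = {}
for feat_name, make_vec in [("bow", CountVectorizer), ("tfidf", TfidfVectorizer)]:
    for model_name, make_model in [
        ("logreg", lambda: LogisticRegression(max_iter=1000)),
        ("svc", lambda: SVC(kernel="linear")),
    ]:
        # token_pattern=None because the custom tokenizer replaces the default pattern.
        vec = make_vec(tokenizer=simple_tokenizer, token_pattern=None)
        results[(feat_name, model_name)] = fit_and_score(vec, make_model())

for key, acc in results.items():
    print(key, acc)
```

Passing a callable as `tokenizer` is how a custom tokenizer, spaCy-based or otherwise, can be plugged into either vectorizer. On a dataset this small the accuracies say nothing about which features work better.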

'TextGraph_Token&Document': Uses bag-of-words and tf-idf representations to generate a token-token graph and a document-document graph.
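
One way to read the two graphs, sketched in plain Python under assumptions of my own: token-token edges are weighted by how many documents contain both tokens, and document-document edges by the cosine similarity of tf-idf vectors. The toy documents, the whitespace tokenization, and the unsmoothed `log(N / df)` idf are illustrative choices and may differ from what the notebook does.

```python
import math
from collections import Counter
from itertools import combinations

# Toy documents, invented for illustration.
docs = {
    "d1": "solar panel energy storage",
    "d2": "solar cell energy conversion",
    "d3": "wind turbine blade",
}
tokens = {name: text.split() for name, text in docs.items()}

# Token-token graph: edge weight = number of documents containing both tokens.
token_edges = Counter()
for toks in tokens.values():
    for a, b in combinations(sorted(set(toks)), 2):
        token_edges[(a, b)] += 1

# Document frequency and a plain (unsmoothed) idf for each token.
n_docs = len(docs)
df = Counter(t for toks in tokens.values() for t in set(toks))


def tfidf(toks):
    """Map each token to term frequency times log(N / df)."""
    tf = Counter(toks)
    return {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = {name: tfidf(toks) for name, toks in tokens.items()}

# Document-document graph: keep only pairs with nonzero similarity.
doc_edges = {}
for a, b in combinations(sorted(vectors), 2):
    sim = cosine(vectors[a], vectors[b])
    if sim > 0:
        doc_edges[(a, b)] = sim

print(token_edges[("energy", "solar")])
print(doc_edges)
```

The resulting edge dictionaries can be loaded into a graph library for layout or community detection.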
