This repository contains some of my practice notebooks on web scraping and text mining of patent citations and abstracts.
'WebScrapingandAPI_patentcitationdata': Uses web scraping and APIs (Pandas, BeautifulSoup, google-patent-api) to collect citation data on GAA patents.
'WebScraping_PatentAbstracts_CPCcodes': Uses web scraping to collect patent abstracts and CPC codes, focusing on patents related to 'renewable energy'.
'TextMining_spaCyTokenizer_TopicModeling_TextClassification': Builds a custom tokenizer with spaCy and integrates it into scikit-learn's CountVectorizer to create bag-of-words (BoW) and TF-IDF features. Performs topic modeling with TF-IDF. Trains Logistic Regression and Support Vector Classifier models to predict patent CPC codes, and compares test accuracy between models that use BoW and TF-IDF as input features.
'TextGraph_Token&Document': Uses bag-of-words (BoW) and TF-IDF representations to generate token-token and document-document graphs.
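
As a rough illustration of the idea behind 'TextMining_spaCyTokenizer_TopicModeling_TextClassification', here is a minimal sketch of plugging a spaCy-based tokenizer into scikit-learn vectorizers. It is not copied from the notebook: the function name `spacy_tokenizer`, the filtering rules, the toy abstracts, and the placeholder labels are all assumptions made for the example, and `TfidfVectorizer` is used here as one convenient way to get TF-IDF features.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Blank English pipeline: tokenizer and lexical attributes only, no model download.
nlp = spacy.blank("en")


def spacy_tokenizer(text):
    """Return lowercase alphabetic tokens, dropping stop words and single characters."""
    return [
        tok.lower_
        for tok in nlp(text)
        if tok.is_alpha and not tok.is_stop and len(tok) > 1
    ]


# Hypothetical toy abstracts, not taken from the notebooks.
docs = [
    "A wind turbine blade with improved aerodynamic efficiency.",
    "A solar cell module comprising a photovoltaic layer.",
    "Wind turbine tower and foundation assembly.",
]
# Placeholder labels standing in for CPC codes.
labels = ["wind", "solar", "wind"]

# token_pattern=None because the default regex pattern is unused once a
# tokenizer is supplied; lowercase=False because the tokenizer lowercases.
bow = CountVectorizer(tokenizer=spacy_tokenizer, token_pattern=None, lowercase=False)
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer, token_pattern=None, lowercase=False)

X_bow = bow.fit_transform(docs)      # raw token counts
X_tfidf = tfidf.fit_transform(docs)  # TF-IDF weighted counts

# Either matrix can be used as classifier input.
clf = LogisticRegression().fit(X_tfidf, labels)
```

Swapping `X_tfidf` for `X_bow` in the last line, or `LogisticRegression` for a support vector classifier, gives the other model variants that the notebook compares on test accuracy.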
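
For 'TextGraph_Token&Document', the sketch below shows one simple way token-token and document-document graphs can be derived from bag-of-words counts, using only the standard library. The edge definitions are assumptions chosen for illustration (co-occurrence within a document for tokens, dot product of count vectors for documents), and the pre-tokenized documents are made up; the notebook may weight edges differently, for example with TF-IDF.

```python
from collections import Counter
from itertools import combinations

# Hypothetical pre-tokenized documents.
docs = {
    "d1": ["wind", "turbine", "blade"],
    "d2": ["solar", "cell", "module"],
    "d3": ["wind", "turbine", "tower"],
}

# Bag of words: token counts per document.
bow = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# Token-token edges: weight = number of documents in which both tokens appear.
token_edges = Counter()
for counts in bow.values():
    for a, b in combinations(sorted(counts), 2):
        token_edges[(a, b)] += 1

# Document-document edges: weight = dot product of the two BoW vectors,
# kept only when the documents share at least one token.
doc_edges = {}
for (id_a, bow_a), (id_b, bow_b) in combinations(bow.items(), 2):
    weight = sum(bow_a[t] * bow_b[t] for t in bow_a.keys() & bow_b.keys())
    if weight > 0:
        doc_edges[(id_a, id_b)] = weight

print(token_edges[("turbine", "wind")])  # 2
print(doc_edges)                         # {('d1', 'd3'): 2}
```

Both edge dictionaries can be loaded into a graph library as weighted edge lists for plotting or further analysis.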