Current Events Scraper & Featurizer

Using OpenSearch and Google News APIs, this tool pulls news stories and extracts features from the text. The features are then stored in a CSV file.

The tool can gather stories from multiple sources and in multiple languages. GNews is capped at roughly 3,000 stories per day; OpenSearch has no such limit and uses scroll and slice to pull large numbers of stories.
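As an illustration of the scroll-and-slice pattern mentioned above, here is a minimal sketch, not the code used in dots_feat.py. It assumes an opensearch-py style client exposing `search`, `scroll`, and `clear_scroll`; the function names `scroll_slice` and `fetch_stories`, the page size, and the keep-alive value are hypothetical choices made for this example.

```python
from contextlib import closing
from typing import Any, Dict, Iterator, List, Optional


def scroll_slice(client: Any, index: str, slice_id: int, max_slices: int,
                 query: Optional[Dict[str, Any]] = None,
                 page_size: int = 500,
                 keep_alive: str = "2m") -> Iterator[Dict[str, Any]]:
    """Yield every hit from one slice of a sliced scroll (hypothetical helper)."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": query or {"match_all": {}},
    }
    if max_slices > 1:
        # Each slice sees a disjoint subset of the matching documents.
        body["slice"] = {"id": slice_id, "max": max_slices}

    resp = client.search(index=index, body=body, scroll=keep_alive)
    scroll_id = resp.get("_scroll_id")
    try:
        # An empty page signals that this slice is exhausted.
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context, even on early exit.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)


def fetch_stories(client: Any, index: str, n: int,
                  max_slices: int = 2) -> List[Dict[str, Any]]:
    """Collect up to n story documents by walking the slices in turn."""
    stories: List[Dict[str, Any]] = []
    for slice_id in range(max_slices):
        with closing(scroll_slice(client, index, slice_id, max_slices)) as hits:
            for hit in hits:
                stories.append(hit["_source"])
                if len(stories) >= n:
                    return stories
    return stories
```

With a real cluster, the client would typically be built with `opensearchpy.OpenSearch(hosts=[...])`; the host, credentials, and index name depend on your deployment and are not specified here. The slices are read one after another for simplicity, although they are independent and could be consumed in parallel.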

Clone current version & run dots_feat.py

Requirements: pytest, pyarrow, spacy, python-dotenv, bs4, pandas, scikit-learn, transformers, torch, opensearch-py, requests, nltk, numpy, graphistry[umap-learn], umap-learn, validators, pytesseract, selenium, webdriver_manager, undetected_chromedriver, gliner

The example below pulls 100 GNews stories from OpenSearch (OS), once for each `-e` value (0, 1, 2), and writes each story's extracted features, together with its location and date, to a CSV file.

    git clone https://github.com/graphistry/dots
    python dots/dots_feat.py -n 100 -e 0 -d 0 -o dots_drba_feats.csv
    python dots/dots_feat.py -n 100 -e 1 -d 0 -o dots_gpy_feats.csv  
    python dots/dots_feat.py -n 100 -e 2 -d 0 -o dots_glnr_feats.csv  

"'Gaza Strip', '16-01-2024', ","['neighborhoods', 'rebels', 'widespread famine', 'egypt', 'disease']"
"'Miseno, Campania, Italy', '16-01-2024', ","['disasters', 'mount vesuvius', 'ancient cataclysm', 'costruzione', 'beach']"
"'Clarendon, Clarendon, Jamaica', '16-01-2024', ","['new bowen', 'fight', 'whatsapp', 'st catherine', 'jamaica']"
"'Philadelphia, Pennsylvania, United States', '16-01-2024', ","['meteorologists', 'snow shovels', 'snowstorm', 'accuweather alerts', 'accuweather meteorologists']"
"'New Bedford, Massachusetts, United States', '16-01-2024', ","['massachusetts law', 'saturday', 'ariel dorsey', 'traffic', 'united states']"
"'Corofin, Clare, Ireland', '16-01-2024', ","['emergency services', 'breathing', 'rescue service', 'firefighters', 'afternoon']"
"'United States', '16-01-2024', ","['preparedness', 'earthquake', 'quake', 'morning', 'disaster']"
"'Syria', '16-01-2024', ","['neighboring countries', 'early recovery', 'cholera', 'symptom', 'mohamad katoub']"
"'Iceland', '16-01-2024', ","['lava flows', 'evacuation', 'eruptions', 'jóhannesson', 'lúðvík pétursson']"
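The rows above can be read back into Python with the standard library. The sketch below is only an illustration: it assumes the file has no header row and that every line has exactly two quoted fields, as in the sample (a location and date pair, then a list of features). The name `parse_rows` is hypothetical, and a location or feature containing an apostrophe may need extra handling.

```python
import ast
import csv
import io
from datetime import datetime

# Two rows copied from the sample output above.
SAMPLE = '''\
"'Gaza Strip', '16-01-2024', ","['neighborhoods', 'rebels', 'widespread famine', 'egypt', 'disease']"
"'Miseno, Campania, Italy', '16-01-2024', ","['disasters', 'mount vesuvius', 'ancient cataclysm', 'costruzione', 'beach']"
'''


def parse_rows(text):
    """Parse feature rows into dicts with location, date, and features."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 2:
            continue  # skip blank or malformed lines
        meta, feats = record
        # The first field is a Python-style tuple: 'location', 'DD-MM-YYYY',
        location, date_str = ast.literal_eval(meta)
        rows.append({
            "location": location,
            "date": datetime.strptime(date_str, "%d-%m-%Y").date(),
            # The second field is a Python-style list of feature strings.
            "features": ast.literal_eval(feats),
        })
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row["date"].isoformat(), row["location"], row["features"])
```

To parse a real output file such as dots_drba_feats.csv, pass its contents to `parse_rows`, for example `parse_rows(open("dots_drba_feats.csv", encoding="utf-8").read())`.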

An example is produced every day via GitHub Actions by parsing GNews stories and extracting features: Feature Table and Full Table
