Scrapes data from CPCB's CCR dashboard.
The work in this repo builds on top of the work done by Thejesh GN. I take no responsibility for Thej's work and Thej takes no responsibility for the work I have done in this repo. Please contact individual authors for any queries.
This code uses the `data.db` file for everything. The first order of business is to set up the `sites` table in the db.
## Add your sites to CSV

- Go to CPCB's CCR website and select the state, city and station of your choice.
- Open the Network tab in Dev Tools and click 'Submit' on the webpage.
- In the Network tab, click on the POST request called `fetch_table_data`. Under the Request tab you'll see the payload of the request. Scroll to `filtersToApply > parameterNames > station` and copy the station code, which should look something like `site_123`.
- Edit `sites.csv` and add the state, city, site and site_name. Leave the header row as is. Leave the remaining columns blank.
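For illustration only, a filled-in `sites.csv` row might look like this (the header shown is an assumption based on the columns named above; your file may carry extra columns that stay blank for now):

```
state,city,site,site_name
Delhi,Delhi,site_123,Example Station Name
```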
## Get available parameters for each site

- Use csvjson.com's csv2json tool to create a `sites.json` file out of your edited `sites.csv`. Save that JSON in your root directory.
- Run `yarn` or `npm install` in your root directory.
- Run `node cpcb_station_params.js` and you'll get a `sites_with_params.json` file, which expands `sites.json` by adding the list of available parameters for each site.
- Use csvjson.com's json2csv tool to create a CSV and save it as `sites_with_params.csv` in your root directory.
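If you'd rather avoid csvjson.com, both conversions can be done locally with Python's standard library. This is a sketch that assumes flat objects; the parameter lists added to `sites_with_params.json` may need joining into a single cell before the json2csv step:

```python
import csv
import json

def csv_to_json(csv_path, json_path):
    """Turn a CSV with a header row into a JSON array of objects."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)

def json_to_csv(json_path, csv_path):
    """Flatten a JSON array of flat objects back into a CSV."""
    with open(json_path) as f:
        rows = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```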
## Add the sites data to db

- Download and install a tool like DB Browser for SQLite.
- Open the `data.db` file using DB Browser.
- Click on Import > Table from CSV and select the `sites_with_params.csv` file from before. Save this table as `sites`. You can delete the pre-existing `sites` table to replace it.
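If you prefer to skip DB Browser, the same import can be scripted with Python's stdlib `sqlite3`. A minimal sketch — the `import_sites` helper is hypothetical, and it creates untyped columns straight from the CSV header:

```python
import csv
import sqlite3

def import_sites(csv_path="sites_with_params.csv", db_path="data.db"):
    """Replace the sites table in data.db with the contents of the CSV."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS sites")  # replace the pre-existing table
    con.execute(f"CREATE TABLE sites ({cols})")  # untyped columns, like DB Browser's import
    con.executemany(f"INSERT INTO sites ({cols}) VALUES ({placeholders})", rows)
    con.commit()
    con.close()
```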
Now that your sites are set up, you can begin to scrape data.

- Use Python 3 and install the `requests` and `dataset` modules using pip, ideally inside a virtualenv using `requirements.txt`. (`sqlite3` ships with Python's standard library, so it needs no pip install.)
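A minimal `requirements.txt` for this step might look like the following (`sqlite3` is omitted because it is part of the standard library; pin versions as you see fit):

```
requests
dataset
```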
Run the following scripts in the given order:

1. `get_availability.py`: gets the months for which data is available for each site
2. `check_availability.py`: parses the JSON response from #1 into a list
3. `expedite.py`: populates the `params_query` and `params_ids` columns in the `sites` table
4. `setup_pull.py`: edit this script to set up the dates for which you need to get data (lines 37-39); running this script sets up all the requests that need to be called to pull the data
5. `pull.py`: pulls the data set up in the previous script; the data received is JSON
6. `parse.py`: parses the JSON data and creates the final data table in the db
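The sequence above can be driven by a small wrapper. This is a hypothetical convenience script, not part of the repo, and it assumes the six filenames listed above sit in the current directory:

```python
import subprocess
import sys

# Order matters: each script feeds the next (see the list above).
PIPELINE = [
    "get_availability.py",
    "check_availability.py",
    "expedite.py",
    "setup_pull.py",
    "pull.py",
    "parse.py",
]

def run_pipeline(scripts=PIPELINE):
    for script in scripts:
        # check=True stops the pipeline on the first failing script
        subprocess.run([sys.executable, script], check=True)

if __name__ == "__main__":
    run_pipeline()
```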
While all scripts should run quite swiftly, `pull.py` is going to be the slowest. Pinging the CPCB server takes time, so be patient. And be kind: leave some timeout between subsequent pings.
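The pacing advice above can be sketched as a small helper. This is illustrative only — `polite_get` is not a function in this repo — and it assumes the scripts use `requests`:

```python
import time

import requests

def polite_get(session, url, delay=2.0, retries=3):
    """GET with a pause between attempts, to avoid hammering the CPCB server.

    Sleeps progressively longer before each retry; re-raises on final failure.
    """
    for attempt in range(retries):
        try:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # back off before retrying
```

Between *successful* requests you would also sleep for a couple of seconds in the calling loop.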
You can browse the data for all stations in Delhi, Mumbai and Chennai from 01-01-2010 till 31-12-2020 in the `reports` directory. No need to fetch that again.
- This code is licensed under GNU GPL v3.
- Please credit by linking to https://thatgurjot.com and https://thejeshgn.com