dr-boulder

Script(s) for the Data Rescue Boulder event

Assumptions

Disclaimer

  • These scripts have only been tested on a Mac. The use of virtualenv and pip should result in the same behavior on Windows/Linux systems, but you might need to tweak the code. Very open to suggestions and/or pull requests!

  • For Linux distributions where Python 2 and Python 3 coexist, use the Python 2 version (see the virtualenv note in the setup steps below).

Set things up

  • In your terminal/iTerm (Mac/Unix) or Command Prompt/Git Shell (Windows):

Clone the repo and create a python virtualenv:

git clone https://github.com/rchakra3/dr-boulder.git
cd dr-boulder
virtualenv env
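
If your system has both Python 2 and Python 3 (see the disclaimer above), you can point virtualenv at the Python 2 interpreter explicitly; the interpreter name here is an assumption and may differ on your distribution:

virtualenv -p python2 env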

Activate your virtualenv:

  • Mac/Unix:
source env/bin/activate
  • Windows:
.\env\Scripts\activate

** You should now see (env) in your terminal/command prompt, before the folder path **

  • Download all the requirements:
pip install -r requirements.txt

That should have everything set up in your virtualenv.

Scripts

There are currently 3 important scripts in this repository:

  1. Generate a list of all the files available on an FTP server:

    1. To run:

      python -W ignore ftp_utils/get_all_files_from_ftp_server.py --server=<server domain name or IP> --output_file=<output file name>
      
    2. This will generate a list of all the files that are available for download from a particular server (a rough sketch of the approach appears after this list)

    3. The name/IP of the server is required. If the output file is not specified, it will write to ftp_files.txt

  2. Download all the URLs listed in a file [NON FTP]:

    1. This helps download a large list of URLs (PDFs, JSON, XML, etc.)

    2. Put the list of URLs in a file - 1 URL per line

    3. Help:

      python download_data.py -h
      
    4. To run:

      python -W ignore download_data.py --filename=<name of file specified in 2> --max_space=<max disk space to use (defaults to 5GB)> --downloads_folder=<name of folder where you want to store the data>
      
  3. Download a list of files at FTP endpoints:

    1. Same as the previous script, but for FTP files
    2. You can use the file generated by the ftp_utils/get_all_files_from_ftp_server.py script as the input file for this script, or create a new file with one FTP URL per line
    3. FTP downloads seem to be much slower in general, so we recommend running the script over a small number of files at a time
    4. Help:
      python ftp_utils/download_ftp_files.py -h
      
    5. Run:
      python -W ignore ftp_utils/download_ftp_files.py --filename=<file containing the FTP URLs, one per line> --downloads_folder=<folder where you want to save the files>
      
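For reference, here is a minimal sketch of the kind of recursive listing the first script performs, using only Python's standard ftplib. It is not the actual get_all_files_from_ftp_server.py; the server name is a placeholder and real servers may need extra error handling:

# Minimal sketch (not the actual script): walk an FTP server and write
# every file path found to ftp_files.txt, one URL per line.
import ftplib

def list_files(ftp, path, out):
    # Recursively list everything under `path` on an open FTP connection.
    ftp.cwd(path)
    for name in ftp.nlst():
        if name in ('.', '..'):          # some servers include these entries
            continue
        child = path.rstrip('/') + '/' + name
        try:
            ftp.cwd(child)               # succeeds -> directory, recurse into it
            list_files(ftp, child, out)
            ftp.cwd(path)                # go back before the next sibling
        except ftplib.error_perm:        # cwd failed -> treat it as a file
            out.write('ftp://' + ftp.host + child + '\n')

if __name__ == '__main__':
    ftp = ftplib.FTP('ftp.example.gov')  # placeholder server name
    ftp.login()                          # anonymous login
    with open('ftp_files.txt', 'w') as out:
        list_files(ftp, '/', out)
    ftp.quit()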

Domain-specific scripts:

So far there's only one, for edg.epa.gov/data/public.

Generate the list of files:
cd edg_epa_data_public
python -W ignore find_data_edg_epa.py
  • This script will generate 3 files:
    1. edg_epa_file_list.txt: The list of all the files that aren't ftp://
    2. edg_epa_ftp_file_list.txt: The list of all the files that are ftp://
    3. edg_epa_skipped_file_list.txt: The list of files that weren't downloaded for various reasons, including running out of disk space, exceeding the specified space limit, or 404s
Downloading the files:

Use the scripts described above to download the URLs listed in edg_epa_file_list.txt and edg_epa_ftp_file_list.txt.
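
As a rough, standalone illustration of what the download step does (this is not the actual download_data.py), the following reads URLs from a file and stops once a disk-space budget is used up; urllib handles both http(s):// and ftp:// URLs. The file and folder names and the 5GB budget simply mirror the options described above, and everything else is an assumption:

# Rough illustration only (not the actual download_data.py): download every
# URL listed in a file, one per line, until a disk-space budget is used up.
import os

try:                                     # Python 3
    from urllib.request import urlretrieve
except ImportError:                      # Python 2
    from urllib import urlretrieve

MAX_SPACE = 5 * 1024 ** 3                # 5GB, like the --max_space default
DOWNLOADS_FOLDER = 'downloads'           # assumed folder name
URL_FILE = 'edg_epa_file_list.txt'       # any file with one URL per line

if not os.path.isdir(DOWNLOADS_FOLDER):
    os.makedirs(DOWNLOADS_FOLDER)

used = 0
with open(URL_FILE) as f:
    for url in f:
        url = url.strip()
        if not url:
            continue
        target = os.path.join(DOWNLOADS_FOLDER, url.rstrip('/').split('/')[-1])
        try:
            urlretrieve(url, target)     # urllib also understands ftp:// URLs
        except IOError as err:           # 404s, connection errors, etc.
            print('skipped %s: %s' % (url, err))
            continue
        used += os.path.getsize(target)
        if used >= MAX_SPACE:            # stop once the space budget is exhausted
            print('reached the disk space limit, stopping')
            break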
