DataLoaders
PyForecast allows users to create their own dataloaders in order to import custom datasets. The following section discusses how to write a dataloader so that PyForecast can understand it.
A basic structure for a PyForecast Dataloader is shown below:
```python
# SCRIPT NAME: ExampleLoader
# SCRIPT AUTHOR: Jane Doe
# DESCRIPTION: brief description

# IMPORT LIBRARIES
import pandas as pd
import ...

# DATA LOADER INFORMATION FUNCTION
def dataLoaderInfo():

    # OPTIONS DISPLAYED WHEN USER WANTS TO ADD A STATION USING THE DATALOADER
    optionsDict = {
        "stationNumber" : "",
        "parameterCode" : "",
        "statCode" : ""
    }

    # OPTIONAL DESCRIPTION
    description = "Example Description"

    return optionsDict, description

# DATA LOADER
def dataLoader(stationDict, startDate, endDate):

    # SCRIPT TO RETRIEVE DAILY DATA BETWEEN START AND END DATES

    # RETURN DATAFRAME
    return df
```
At minimum, a dataloader must define two top-level functions: a 'dataLoaderInfo' function and a 'dataLoader' function. PyForecast will not allow the user to save a dataloader that does not include both functions.
The 'dataLoaderInfo' function describes the parameters that the dataloader needs in order to operate. When a user wants to add a dataset using a custom dataloader, they will be shown the description defined in this function and will be required to provide a value for each option in the 'optionsDict'. In the above example, when the user tries to add a station using this dataloader, they will be required to provide a 'stationNumber', a 'parameterCode', and a 'statCode'.
The 'dataLoader' function reads the parameters supplied by the user in the optionsDict, and downloads data for that dataset. The required arguments for this function are presented above and cannot be changed. The stationDict argument is essentially the completed 'optionsDict' from the 'dataLoaderInfo' function. The PyForecast software provides start and end dates to the function.
When writing a 'dataLoader' function, it is important to keep a few things in mind. The start and end dates are supplied to the function as python datetime objects and must be converted to the format required by your data source; the 'datetime' library provides numerous methods for formatting python datetimes as strings. The retrieved data must be processed into daily data (one data value per day). The function must return a pandas dataframe containing a datetime index (one index value per day) and one column of NUMERIC data. You must ensure that the entire column of data is in a numeric format (i.e. integer, float64, float32, etc.). The returned dataframe must be an actual dataframe, and not a pandas Series.
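The post-processing steps above can be sketched as follows. This is a minimal, illustrative example, not a real loader: the sub-daily data is hard-coded where a real dataloader would download it, and the station option name and column label are made up.

```python
import pandas as pd
from datetime import datetime

def dataLoader(stationDict, startDate, endDate):

    # Convert the python datetimes to the string format a web service
    # would typically expect (shown here for illustration)
    start = datetime.strftime(startDate, '%Y-%m-%d')
    end = datetime.strftime(endDate, '%Y-%m-%d')

    # Hypothetical raw sub-daily data, standing in for a downloaded dataset.
    # Note the values arrive as strings and there are two values per day.
    raw = pd.Series(
        ['1.0', '2.0', '4.0', '3.0'],
        index=pd.to_datetime(['2020-01-01 06:00', '2020-01-01 18:00',
                              '2020-01-02 06:00', '2020-01-02 18:00']))

    # Ensure the values are numeric, not strings
    raw = pd.to_numeric(raw, errors='coerce')

    # Process into daily data: one value per day
    daily = raw.resample('D').mean()

    # Return an actual DataFrame (not a Series) with one numeric column
    df = daily.to_frame(
        name='{0} | Example | Units'.format(stationDict['stationNumber']))
    return df
```

The same pattern (convert dates, coerce to numeric, resample to daily, convert any Series to a dataframe) applies regardless of where the raw data comes from.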
Lastly, PyForecast is distributed with a wide range of packages and standard-library modules, including:
- 'requests' (for GET/POST web service calls)
- 'zeep' (for SOAP protocol web service calls)
- 'bs4' (beautiful soup 4, for HTML parsing)
- 'urllib3' (alternative way to retrieve webpages)
- 'ftplib' (for interacting with ftp sites)
- 'zipfile' (for unzipping retrieved files)
- 'json' (for parsing JSON data)
- 'numpy' (for mathematical processing and array processing)
- 'scipy' (for statistical processing)
- 'pandas' (for creating and operating on dataframes)
- 'PyQt5' (advanced users: allows user input during data downloading through dialogs)
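For instance, a loader built on the 'json' package might parse a service response as shown below. The payload here is a made-up illustration hard-coded as a string (a real loader would retrieve it with 'requests'), and the field names are hypothetical.

```python
import json
import pandas as pd

# Hypothetical JSON payload, as a web service might return it;
# hard-coded here so the sketch runs without a network call.
payload = '''
{
  "values": [
    {"date": "2020-01-01", "flow": "10.2"},
    {"date": "2020-01-02", "flow": "11.7"}
  ]
}
'''

# Parse the JSON and build a dataframe with a datetime index
records = json.loads(payload)["values"]
df = pd.DataFrame(records)
df.index = pd.to_datetime(df.pop("date"))

# Coerce the string values to a numeric type, as PyForecast requires
df["flow"] = pd.to_numeric(df["flow"])
```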
An acceptable example dataloader that loads streamflow data from the Wyoming State Engineer's Office is shown below:
```python
'''
Script Name: WYSEO_Loader
Script Author: Jane Doe
Description: Loads streamflow data from the WY_SEO.
'''

# Import Libraries
import pandas as pd
import numpy as np
import requests
from datetime import datetime
from io import BytesIO
from zipfile import ZipFile

# Dataloader information function
def dataLoaderInfo():

    # Define the required options for the dataLoader.
    # You only need a locationID to download data for a station.
    optionsDict = {
        "locationID":""
    }

    # Write a short description
    description = "Downloads daily streamflow data for a station described by a locationID"

    return optionsDict, description

# Dataloader function
def dataLoader(stationDict, startDate, endDate):

    # Generate the URL to retrieve the data
    url = "http://seoflow.wyo.gov/Data/Export_DataLocation/?location={0}&date={1}&endDate={2}&calendar=1&exportType=csv".format(
        stationDict['locationID'],
        datetime.strftime(startDate, '%Y-%m-%d'),
        datetime.strftime(endDate, '%Y-%m-%d'))

    # Retrieve the data using a GET request
    data = requests.get(url)

    # Check to make sure the web service call was successful
    if data.status_code != 200:
        return pd.DataFrame()  # return an empty dataframe

    # The data is returned as csv files in a zipped folder.
    # Write the zipped byte-content into an in-memory buffer.
    zipData = BytesIO()
    zipData.write(data.content)

    # Turn the buffer into a zipfile object
    zipData = ZipFile(zipData)

    # Iterate through each csv and read it into the dataframe
    df = pd.DataFrame()
    for info in zipData.infolist():

        # Get the csv filename
        fileName = info.filename

        # Read the csv file into a dataframe and append it
        df2 = pd.read_csv(zipData.open(fileName), header=1, parse_dates=True, index_col=0)
        df = pd.concat([df, df2], axis=0)

    # Isolate the discharge column
    df = pd.DataFrame(df['Value (Cubic Feet Per Second)'], index=df.index)

    # Give the data a meaningful name
    df.columns = ['{0} | Streamflow | CFS'.format(stationDict['locationID'])]

    # Return the dataframe
    return df
```