
Generic interface #70

Closed · wants to merge 17 commits
33 changes: 23 additions & 10 deletions builders/tslice.py
@@ -116,12 +116,9 @@ def common_parser(filepath, local_attrs, glob_attrs):
         for gv in glob_attrs.keys():
             fileparts[gv] = glob_attrs[gv]
         # add the keys that are common just to the particular glob string
-        for lv in local_attrs.keys():
-            if 'glob_string' not in v:
-                fileparts[lv] = local_attrs[lv]
+        fileparts.update(local_attrs[filepath])
     except Exception:
         pass

     return fileparts
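
A minimal sketch of what the new common_parser logic assumes (illustrative values, not from the PR): local_attrs is now keyed by file path, so a single dict.update replaces the old per-key loop and its 'glob_string' check, with the glob string already popped upstream in build_df.

    # hypothetical inputs for illustration
    local_attrs = {'/data/ocn/hist.0001.nc': {'component': 'ocn', 'frequency': 'monthly'}}
    fileparts = {'experiment': 'b40.20th'}  # global attributes already applied
    fileparts.update(local_attrs['/data/ocn/hist.0001.nc'])
    # fileparts -> {'experiment': 'b40.20th', 'component': 'ocn', 'frequency': 'monthly'}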

@@ -138,24 +138,40 @@ def build_df(
     if verify(input_yaml):
         # loop over datasets
         df_parts = []
+        entries = {}
         for dataset in input_yaml.keys():
             ds_globals = {}
             # get a list of keys that are common to all files in the dataset
             for g in input_yaml[dataset].keys():
-                if 'data_sources' not in g:
+                if 'data_sources' not in g and 'ensemble' not in g:
                     ds_globals[g] = input_yaml[dataset][g]
+            # loop over ensemble members, if they exist
+            if 'ensemble' in input_yaml[dataset].keys():
+                for member in input_yaml[dataset]['ensemble']:
+                    filelist = get_asset_list(member['glob_string'], depth=0)
+                    member.pop('glob_string')
+                    for f in filelist:
+                        entries[f] = member
             # loop over all of the data_sources for the dataset, create a dataframe
             # for each data_source, append that dataframe to a list that will contain
             # the full dataframe (or catalog) based on everything in the yaml file.
-            for stream_info in input_yaml[dataset]['data_sources']:
+            #for stream_info in input_yaml[dataset]['data_sources']:
+            for i,stream_info in enumerate(input_yaml[dataset]['data_sources']):
                 filelist = get_asset_list(stream_info['glob_string'], depth=0)
+                stream_info.pop('glob_string')
Review comment: Make use of the pop call:

Suggested change:
-                filelist = get_asset_list(stream_info['glob_string'], depth=0)
-                stream_info.pop('glob_string')
+                glob_string = stream_info.pop('glob_string')
+                filelist = get_asset_list(glob_string, depth=0)
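
For context, a minimal illustration of the suggested idiom (hypothetical values): dict.pop removes the key and returns its value, so the glob string is captured and stripped from stream_info in one step instead of two separate lookups.

    stream_info = {'glob_string': '/data/*.nc', 'frequency': 'monthly'}  # hypothetical
    glob_string = stream_info.pop('glob_string')  # returns '/data/*.nc' and removes the key
    # stream_info is now {'frequency': 'monthly'}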

Author: There are no tests for this yet. I just wanted to push my latest working version.

Author: Thanks for the review @kmpaul

+                for f in filelist:
+                    if f in entries.keys():
+                        entries[f].update(stream_info)
+                    else:
+                        entries[f] = stream_info
                 if columns is None:
                     columns = []
                 b = Builder(columns, exclude_patterns)
-                df_parts.append(b(filelist, parser, d=stream_info, g=ds_globals))
-        # create the combined dataframe from all of the data_sources and datasets from
-        # the yaml file
-        df = pd.concat(df_parts,sort=False)
+                df_parts.append(b(filelist, parser, d=entries, g=ds_globals))
+        # create the combined dataframe from all of the data_sources and datasets from
+        # the yaml file
+        df = pd.concat(df_parts,sort=False)
+        print(df)
         return df.sort_values(by=['path'])
     else:
         print("ERROR: yaml file is not formatted correctly. See above errors for more information.")