-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Infer metadata from the data *.csv file #173
Conversation
Each key map to a list of descriptors which themselves contain a "fields" attribute. We map the values under this "fields" attribute to the keys of FOREIGN_KEY_DESCRIPTORS
The function infer_metadata_from_data can be run on a datapackage where only the data/elements and data/sequences have been defined by a user and will generate automatically the datapackage.json file containing the metadata.
p.add_resource(r.descriptor) | ||
|
||
|
||
def infer_metadata_from_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea was to make this feature optional, therefore I did not want to modify infer_metadata
function
if "/elements/" in r.descriptor["path"]: | ||
infer_resource_basic_foreign_keys(r) | ||
# this function saves the metadata of the package in json format | ||
infer_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this function does already part of the job if provided a dict foreign_keys
, I used it and therefore just wrote the function infer_resource_basic_foreign_keys
to fill this foreign_keys
dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice feature! Thanks for implementing.
Tested it with existing investment datapackage from examples - files were almost identical (only no empty line add end of file was missing 😛).
Would be nice to such test in tests folder too (aka build metadata for existing datapackage in examples and check if files are identical).
I also found three smaller issues.
|
||
def infer_metadata_from_data( | ||
package_name="default-name", | ||
path=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default parameter "None" throws an error and makes no sense IMO.
I would make it mandatory and first argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metadata_filename="datapackage.json", | ||
): | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small docstring would be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# write an error message here | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An error should be thrown here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review and your comments :) I will implement your suggestions and run those on the example datapackages :) |
sequence_labels = [] | ||
duplicated_labels = [] | ||
for r in p.resources: | ||
if "/sequences/" in r.descriptor["path"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here one has to use os.sep
instead of "/"
sequences_profiles_to_resource = map_sequence_profiles_to_resource_name(p) | ||
|
||
for r in p.resources: | ||
if "/elements/" in r.descriptor["path"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here one has to use os.sep
instead of "/"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did run the new function on files and some of them are just changing the order of the resources within the datapackage. There was a difference which helped me find a bug and solve it in c7ab4bf |
Those were generated by infer_metadata_from_data and had no differences except the package name
Those were generated by infer_metadata_from_data and had only a different order of resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thx for the changes
Description
The function infer_metadata_from_data can be run on a datapackage where only the data/elements and data/sequences have been defined by a user and will generate automatically the datapackage.json file containing the metadata.
Type of change
Please tick or delete options that are not relevant.
Checklist:
Please tick or delete options that are not relevant.
pre-commit
hooks--> I couldn't get the pre-commit
local
tagged object to run, it told meisort
is unknown command even after I pip installed it ....