Import DataPackages #29

Closed
5 of 6 tasks
vitorbaptista opened this issue Dec 8, 2015 · 6 comments

@vitorbaptista
Contributor

  • Add UI for importing
  • Import the Data Package's metadata
  • Import the resources, including their metadata
  • Upload local resources
  • Upload inline resources
  • Make sure there's no way to upload a datapackage.json with a resource like { "path": "/etc/shadow" } or any other local file outside the package. Resource paths MUST be relative to the root of the zip file.
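
A minimal sketch of the path check in the last task, assuming resource paths use `/` as the separator inside the zip. The function name `is_safe_resource_path` is hypothetical, and this is not necessarily how the extension or datapackage-py implements it. It also rejects anything with a scheme or drive letter in the first segment, which is an assumption that goes beyond the task as written.

```python
import posixpath


def is_safe_resource_path(path):
    """Return True if `path` is relative and stays inside the package root."""
    # Empty paths and absolute paths (POSIX or Windows style) are rejected.
    if not path or path.startswith(("/", "\\")):
        return False
    # Assumption: reject drive letters ("C:") and schemes ("file:") up front.
    if ":" in path.split("/")[0]:
        return False
    # Collapse "a/../b" style segments, then check we never climb above root.
    normalized = posixpath.normpath(path.replace("\\", "/"))
    return not (normalized == ".." or normalized.startswith("../"))
```

With this sketch, `is_safe_resource_path("/etc/shadow")` and `is_safe_resource_path("data/../../etc/passwd")` are both rejected, while `is_safe_resource_path("data/file.csv")` is accepted. It does not cover symlinks stored inside the archive.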
@vitorbaptista vitorbaptista self-assigned this Dec 8, 2015
@vitorbaptista vitorbaptista added this to the Importing and Exporting Data Packages on CKAN 2.4 milestone Dec 8, 2015
@vitorbaptista
Contributor Author

What should we do with the homepage attribute? According to the specification, it refers to the "URL string for the data packages web site". When exporting to a DataPackage, we set the homepage to be the ckan_url (i.e. the URL of the dataset in CKAN). That doesn't work in the other direction (i.e. we can't set the ckan_url on import).

It sounds like the source, but we also have the sources field. Maybe we need a more complicated rule, like:

  1. If homepage exists, set source = homepage;
  2. Else, set source = sources[0].web.

I'm not sure about that, though, because sources in datapackages seem to refer to the authors of the data (i.e. the author in CKAN).
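
For illustration, the two-step rule above could be sketched as follows. This assumes the data package is available as a plain dict; the function name `derive_ckan_source` is hypothetical, and whether this rule is the right mapping is exactly the open question here.

```python
def derive_ckan_source(datapackage):
    """Pick a value for CKAN's `source` from a data package dict."""
    # Rule 1: prefer the homepage when it exists.
    homepage = datapackage.get("homepage")
    if homepage:
        return homepage
    # Rule 2: otherwise fall back to the first source's `web` attribute.
    sources = datapackage.get("sources") or []
    if sources:
        return sources[0].get("web")
    # Neither is present, so there is nothing to map.
    return None
```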

@vitorbaptista
Contributor Author

What about the datapackage sources? As we've done in #27, the CKAN dataset's author, author_email, and source become the first item in the sources array of the datapackage. What should we do if there are multiple sources?

The solution seems to involve extras. We could simply add any remaining sources to an extra with the key sources (or another key that is less likely to clash with existing names), but then we have to make sure that its value is valid JSON containing a list of objects.
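
A rough sketch of that idea, assuming CKAN's usual extras shape of `{"key": ..., "value": ...}` dicts with string values. Both function names and the default key `"sources"` are hypothetical.

```python
import json


def remaining_sources_extra(sources, key="sources"):
    """Return an extra holding every source after the first, or None."""
    remaining = list(sources[1:])
    if not remaining:
        return None
    if not all(isinstance(item, dict) for item in remaining):
        raise ValueError("each source must be a JSON object")
    # Extras values are strings, so the list is serialized as JSON.
    return {"key": key, "value": json.dumps(remaining)}


def parse_sources_extra(value):
    """Validate that an extra's value is a JSON list of objects."""
    parsed = json.loads(value)
    if not isinstance(parsed, list) or not all(isinstance(i, dict) for i in parsed):
        raise ValueError("expected a JSON list of objects")
    return parsed
```

The validation on the way back in matters because anyone can edit the extra by hand after the import.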

@vitorbaptista
Contributor Author

There are a few issues with the datapackage's keywords. They're mapped to CKAN's tags. My idea was to use the slugified keyword as the tag name and the keyword itself as the display_name. I planned this because a tag name can only contain alphanumeric characters plus _.-, while a keyword has no such limitation. The problem is that I can't set the display_name through CKAN's package_create. I created an issue about it at ckan/ckan#2781.
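
A small sketch of the slugifying step, following the character set described above. The function name `keyword_to_tag_name` is hypothetical. Lowercasing and replacing every non-ASCII character with a hyphen are assumptions of this sketch, not something CKAN requires.

```python
import re


def keyword_to_tag_name(keyword):
    """Turn a free-form keyword into a tag name using only a-z, 0-9 and _.-"""
    # Any run of characters outside the allowed set becomes a single hyphen.
    slug = re.sub(r"[^a-z0-9_.-]+", "-", keyword.strip().lower())
    # Drop hyphens left over at either end.
    return slug.strip("-")
```

For example, `keyword_to_tag_name("Public Spending (2015)")` gives `"public-spending-2015"`. Two different keywords can map to the same slug, so a real implementation would need to decide how to handle collisions.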

vitorbaptista added a commit that referenced this issue Dec 9, 2015
Currently, it just creates a dataset, not its resources.
vitorbaptista added a commit that referenced this issue Dec 9, 2015
@vitorbaptista
Contributor Author

I've been trying to upload the datapackage's resources to CKAN but haven't figured it out yet. I've found a couple of issues:

  • The package_create() docs say that it accepts a resources list whose attributes are the same as those defined on resource_create(). In turn, resource_create() says that it accepts an upload attribute to upload a file, which is true, but not when the resource goes through package_create(). As far as I can see, package_create() doesn't use resource_create() at all; it uses model_save.package_dict_save(), which calls model_save.resource_dict_save(), which doesn't accept the upload attribute. That being the case, we would have to create the package with package_create, then call resource_create on each resource (at least the ones we want to upload).
  • Uploading a resource requires a cgi.FieldStorage, which seems to be created by Pylons when receiving a POST request with data. I was hoping to be able to manually create one of those and pass it to CKAN, but haven't been able to so far. I might need to write a custom class that inherits from cgi.FieldStorage and pass that instead.

@amercader do you know a better way? https://github.com/ckan/ckanapi is able to upload datasets because it uses HTTP requests, which I'm trying to avoid because I think it would open another can of worms.
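
The two-step flow from the first bullet (package_create first, then resource_create per resource) could be sketched like this. `call_action(name, data_dict)` is a hypothetical stand-in for however the caller invokes CKAN actions, and `import_package` is a hypothetical name. The sketch deliberately leaves out how the upload value itself is built, since that is the unresolved part.

```python
def import_package(call_action, dataset, resources):
    """Create the dataset first, then attach each resource separately.

    `call_action(name, data_dict)` is an assumed callable that runs a CKAN
    action and returns its result dict.
    """
    # The dataset is created without resources so uploads are not dropped.
    created = call_action("package_create", dataset)
    created_resources = []
    for resource in resources:
        # resource_create needs to know which dataset it belongs to.
        data = dict(resource, package_id=created["id"])
        created_resources.append(call_action("resource_create", data))
    return created, created_resources
```

Passing the action caller in as an argument keeps the sketch independent of HTTP and makes it easy to exercise with a fake in tests.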

vitorbaptista added a commit that referenced this issue Dec 15, 2015
We'll need the object itself to gather information on the local paths for local
resources. It's also way better to pass classes around instead of using pure
dicts.
@vitorbaptista
Contributor Author

The tests are currently failing because of an issue with CKAN solved in ckan/ckan#2801

vitorbaptista added a commit that referenced this issue Dec 21, 2015
If the inlined data is a string, upload it as is. If not, dump it as a json
string and upload. This leaves a few important cases unhandled, like a list that
represents a CSV.
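
The rule in this commit message can be sketched as below, assuming Python 3 strings. The function name `inline_data_to_text` is hypothetical and this is not the commit's actual code.

```python
import json


def inline_data_to_text(data):
    """Return the text to upload for a resource's inline `data`."""
    # Strings are uploaded as they are.
    if isinstance(data, str):
        return data
    # Anything else is serialized as JSON, including a list of rows that
    # really represents a CSV (the unhandled case mentioned above).
    return json.dumps(data)
```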
vitorbaptista added a commit that referenced this issue Dec 21, 2015
CKAN's datasets can only have lowercase names
vitorbaptista added a commit that referenced this issue Dec 21, 2015
This code doesn't actually ensure the `name` uniqueness, as there can be a
chance the random name exists as well. That chance is slim, though, as it
generates the name based both on its data package's name and a random number
out of 10 billion possibilities.
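
A sketch of the naming scheme this commit message describes, combined with the lowercase requirement from the previous commit. The `name-number` format and the function name `candidate_dataset_name` are assumptions; the actual commit may format the name differently.

```python
import random


def candidate_dataset_name(base_name, rng=random):
    """Build a lowercase name from the package name plus a random suffix."""
    # One of 10 billion possible suffixes; collisions are unlikely but
    # still possible, so this does not guarantee uniqueness on its own.
    suffix = rng.randrange(10 ** 10)
    return "{}-{}".format(base_name.lower(), suffix)
```

A caller that needs a hard guarantee would still have to check the candidate against existing datasets and retry.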
vitorbaptista added a commit that referenced this issue Dec 21, 2015
They weren't being passed down to the package create method, and `private`
wasn't handled by the API.
vitorbaptista added a commit that referenced this issue Dec 21, 2015
Since d75c376, the API takes care of guaranteeing that the `name` is unique. If
the user wants to change it, she can edit the dataset after importing.
vitorbaptista added a commit that referenced this issue Dec 22, 2015
For a better explanation about what is an unsafe datapackage, check
frictionlessdata/datapackage-py#24
vitorbaptista added a commit that referenced this issue Dec 22, 2015
The format of the extras was wrong.
vitorbaptista added a commit that referenced this issue Dec 22, 2015
This is required for CKAN resources
vitorbaptista added a commit that referenced this issue Dec 23, 2015
@vitorbaptista
Contributor Author

Uploading inline resources works only partially. The issue is described in #34.
