Bagnowka should be added as a DBS source #67

Closed
Libisch opened this issue Jul 23, 2017 · 3 comments


Libisch commented Jul 23, 2017

Overview

  • Bagnowka is an old website which will not be updated.
  • As such, there is no need to process its data more than once.
  • All the data has already been scraped and is currently available in a single JSON file.
  • All images were uploaded to an AWS S3 bucket.
    • The JSON file already contains image URLs, but only for the main_image.
    • Since this is an external source, the main_image is simply the first image (or the only image, if there is just one).
    • Images other than the main_image have IDs, which can be used to build the full image_url path (using the same base URL as the main_image).
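
A minimal sketch of how the URL for a non-main image could be derived from its ID, under stated assumptions. The function name `image_url_from_id`, the example bucket URL, and the idea that a file is named `<id><extension>` with the same extension as the main image are all hypothetical; the actual naming scheme in the bucket should be checked before relying on this.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def image_url_from_id(main_image_url, image_id, extension=None):
    """Build an image URL that shares the main image's base URL.

    Assumption (not confirmed by the issue): each image is stored as
    "<image_id><extension>" in the same folder as the main image.
    """
    parts = urlsplit(main_image_url)
    # Split the URL path into its folder and the main image's file name.
    base_dir, main_name = posixpath.split(parts.path)
    if extension is None:
        # Reuse the main image's extension when none is given.
        extension = posixpath.splitext(main_name)[1]
    path = posixpath.join(base_dir, f"{image_id}{extension}")
    # Drop any query string or fragment from the main image URL.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# Hypothetical URL, for illustration only.
main_url = "https://example-bucket.s3.amazonaws.com/bagnowka/100.jpg"
print(image_url_from_id(main_url, "101"))
```

Keeping the base URL derived from the main_image (rather than hard-coded) means the JSON file remains the only place where the bucket location is recorded.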

Next Step
Determine how to convert the data and sync it to ES: either build a new pipeline or handle it differently, since this is a one-time job.


Libisch commented Jul 23, 2017

@OriHoch?


OriHoch commented Jul 24, 2017

you need to add a bagnowka pipeline spec (under bagnowka/pipeline-spec.yaml)

this pipeline should load the bagnowka json file and return common docs which can be synced to ES

this is very similar to what we are doing with clearmash - just instead of loading the data by making API calls, you load it from a json file

you can see examples of pipeline specs here - https://github.com/OpenBudget/budgetkey-data-pipelines/tree/master/budgetkey_data_pipelines/pipelines
and this is the documentation of the datapackage_pipelines framework - https://github.com/frictionlessdata/datapackage-pipelines
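
A rough sketch of the load step described above: read the JSON file once and yield one document per item. It is plain Python and does not use the datapackage_pipelines API; wiring it into a processor and the pipeline spec is left out. The function name `bagnowka_to_docs`, the assumption that the file holds a list of objects, and the field names (`id`, `title`, `main_image_url`, `source`) are hypothetical and would need to match the real file and the common doc schema.

```python
import json


def bagnowka_to_docs(json_path):
    """Yield one document per scraped item in the JSON file.

    Assumptions (hypothetical): the file contains a JSON list of objects,
    and each object has an "id" plus optional "title" and "main_image_url".
    """
    with open(json_path, encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        yield {
            "source": "bagnowka",  # marks where the doc came from
            "id": str(item["id"]),  # normalize IDs to strings
            "title": item.get("title", ""),
            "main_image_url": item.get("main_image_url"),
        }
```

Because the source will not change, the generator only needs to run once; the resulting docs can then be handed to whatever step syncs to ES.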

Libisch self-assigned this Jul 24, 2017

Libisch commented Aug 9, 2017

Done (PR #71), moving on to #87
