-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Bagnowka pipeline #71
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, wrote some minor comments
regarding images - best to wait until I'm finished with #24 - that way we will know what's the best way to represent images
@@ -15,25 +15,41 @@ def _get_schema(self): | |||
def _filter_resource(self, resource_descriptor, resource_data): | |||
for cm_row in resource_data: | |||
dbs_row = self._cm_row_to_dbs_row(cm_row) | |||
logging.info("hello CONVERT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove, bear in mind that everything you log here appears in the pipeline dashboard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
bagnowka/pipeline-spec.yaml
Outdated
@@ -18,6 +18,43 @@ download: | |||
parameters: | |||
input-resource: bagnowka | |||
output-resource: dbs-docs | |||
- run: add_metadata | |||
parameters: | |||
name: bagnowka-convert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove all the add_metadata steps - they were required in older versions of the framework, but now I think they are not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
bagnowka/pipeline-spec.yaml
Outdated
# source: bagnowka | ||
# - run: dump.to_path | ||
# parameters: | ||
# out-path: ../data/bagnowka/post |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
} | ||
"id": cm_row["id"], | ||
"version": "one", | ||
"title": {"en": cm_row["name"]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title attribute is meant for languages other then en / he
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -15,25 +15,41 @@ def _get_schema(self): | |||
def _filter_resource(self, resource_descriptor, resource_data): | |||
for cm_row in resource_data: | |||
dbs_row = self._cm_row_to_dbs_row(cm_row) | |||
logging.info("hello CONVERT") | |||
yield dbs_row | |||
|
|||
def _cm_row_to_dbs_row(self, cm_row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cm
in cm_row is meant to signify clearmash
better to rename it to bagnowka_row
or something like that to prevent confusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
"content_html": {}, | ||
"content_html_en": cm_row["desc"], | ||
"content_html_he": "", | ||
"related_documents": self.creat_img_urls(cm_row), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related_documents in dbs docs is meant to contain a list of
field name => list of document ids
where field_name is source dependant to group related documents by type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, all images data (other than main image & thumbnail URL) is now only in "source_doc".
@@ -27,29 +32,32 @@ def _get_resource(self): | |||
for item_data in all_docs: | |||
new = all_docs[item_data] | |||
doc = self.download(new) | |||
logging.info("hello world") | |||
logging.info("hello DOWNLOAD") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no point to log the same string every row
think about which log info will be useful in the future, and if it's needed to log every row at all, or better to log once before / after the iterations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
Update download and convert to allow sync and use all available data.