Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Bagnowka pipeline #71

Merged
merged 7 commits into from
Aug 2, 2017
Merged

Conversation

Libisch
Copy link
Contributor

@Libisch Libisch commented Aug 1, 2017

Update download and convert to allow sync and use all available data.

@Libisch Libisch requested a review from OriHoch August 1, 2017 11:48
OriHoch
OriHoch previously requested changes Aug 1, 2017
Copy link
Contributor

@OriHoch OriHoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, wrote some minor comments

regarding images - best to wait until I'm finished with #24 - that way we will know what's the best way to represent images

@@ -15,25 +15,41 @@ def _get_schema(self):
def _filter_resource(self, resource_descriptor, resource_data):
for cm_row in resource_data:
dbs_row = self._cm_row_to_dbs_row(cm_row)
logging.info("hello CONVERT")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove, bear in mind that everything you log here appears in the pipeline dashboard

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

@@ -18,6 +18,43 @@ download:
parameters:
input-resource: bagnowka
output-resource: dbs-docs
- run: add_metadata
parameters:
name: bagnowka-convert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove all the add_metadata steps - they were required in older versions of the framework, but now I think they are not needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

# source: bagnowka
# - run: dump.to_path
# parameters:
# out-path: ../data/bagnowka/post
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

}
"id": cm_row["id"],
"version": "one",
"title": {"en": cm_row["name"]},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

title attribute is meant for languages other then en / he

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -15,25 +15,41 @@ def _get_schema(self):
def _filter_resource(self, resource_descriptor, resource_data):
for cm_row in resource_data:
dbs_row = self._cm_row_to_dbs_row(cm_row)
logging.info("hello CONVERT")
yield dbs_row

def _cm_row_to_dbs_row(self, cm_row):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cm in cm_row is meant to signify clearmash
better to rename it to bagnowka_row or something like that to prevent confusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"content_html": {},
"content_html_en": cm_row["desc"],
"content_html_he": "",
"related_documents": self.creat_img_urls(cm_row),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related_documents in dbs docs is meant to contain a list of
field name => list of document ids

where field_name is source dependant to group related documents by type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, all images data (other than main image & thumbnail URL) is now only in "source_doc".

@@ -27,29 +32,32 @@ def _get_resource(self):
for item_data in all_docs:
new = all_docs[item_data]
doc = self.download(new)
logging.info("hello world")
logging.info("hello DOWNLOAD")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no point to log the same string every row
think about which log info will be useful in the future, and if it's needed to log every row at all, or better to log once before / after the iterations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

@Libisch Libisch requested a review from OriHoch August 1, 2017 14:52
@Libisch Libisch merged commit b55405b into Beit-Hatfutsot:master Aug 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants