Update Bagnowka pipeline #71

Libisch · 2017-08-01T11:48:31Z

Update download and convert to allow sync and use all available data.

…ipelines into bagnowka_pipe

OriHoch

looks good, wrote some minor comments

regarding images - best to wait until I'm finished with #24 - that way we will know what's the best way to represent images

OriHoch · 2017-08-01T12:25:03Z

datapackage_pipelines_mojp/bagnowka/processors/convert.py

@@ -15,25 +15,41 @@ def _get_schema(self):
    def _filter_resource(self, resource_descriptor, resource_data):
        for cm_row in resource_data:
            dbs_row = self._cm_row_to_dbs_row(cm_row)
+            logging.info("hello CONVERT")


please remove this; bear in mind that everything you log here appears in the pipeline dashboard

OriHoch · 2017-08-01T12:25:34Z

bagnowka/pipeline-spec.yaml

@@ -18,6 +18,43 @@ download:
      parameters:
        input-resource: bagnowka
        output-resource: dbs-docs
+    - run: add_metadata
+      parameters:
+        name: bagnowka-convert


I think you can remove all the add_metadata steps - they were required in older versions of the framework, but now I think they are not needed

OriHoch · 2017-08-01T12:25:54Z

bagnowka/pipeline-spec.yaml

+    #       source: bagnowka
+    # - run: dump.to_path
+    #   parameters:
+    #     out-path: ../data/bagnowka/post


please remove comments

OriHoch · 2017-08-01T12:26:32Z

datapackage_pipelines_mojp/bagnowka/processors/convert.py

-                }
+                   "id": cm_row["id"],
+                   "version": "one",
+                   "title": {"en": cm_row["name"]},


the title attribute is meant for languages other than en / he
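
A minimal sketch of how the title fields could be split, assuming the schema has dedicated `title_en` / `title_he` fields that mirror the `content_html_en` / `content_html_he` fields in this diff. Those two field names and the helper `build_title_fields` are assumptions for illustration, not confirmed by this PR.

```python
def build_title_fields(name_en, name_he="", other_titles=None):
    """Split titles by language.

    Hypothetical helper: `title_en` and `title_he` are assumed field names;
    `title` holds only languages other than en / he, keyed by language code.
    """
    return {
        "title": dict(other_titles or {}),
        "title_en": name_en,
        "title_he": name_he,
    }


# The English name from the source row goes into the assumed title_en field,
# leaving the generic title mapping empty.
fields = build_title_fields("example name")
```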

OriHoch · 2017-08-01T12:27:33Z

datapackage_pipelines_mojp/bagnowka/processors/convert.py

@@ -15,25 +15,41 @@ def _get_schema(self):
    def _filter_resource(self, resource_descriptor, resource_data):
        for cm_row in resource_data:
            dbs_row = self._cm_row_to_dbs_row(cm_row)
+            logging.info("hello CONVERT")
            yield dbs_row

    def _cm_row_to_dbs_row(self, cm_row):


the cm in cm_row is meant to signify clearmash
better to rename it to bagnowka_row or something like that to prevent confusion
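
A rough sketch of the suggested rename, written as standalone functions so it runs on its own. The function names are hypothetical stand-ins for the methods in convert.py; the field values are the ones that appear in the diff of this PR.

```python
def bagnowka_row_to_dbs_row(bagnowka_row):
    """Convert one Bagnowka source row into a dbs-docs row (partial sketch)."""
    return {
        "id": bagnowka_row["id"],
        "version": "one",
        "content_html_en": bagnowka_row["desc"],
        "content_html_he": "",
    }


def filter_resource(resource_data):
    """Yield a converted row for every source row, without per-row logging."""
    for bagnowka_row in resource_data:
        yield bagnowka_row_to_dbs_row(bagnowka_row)
```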

OriHoch · 2017-08-01T12:30:31Z

datapackage_pipelines_mojp/bagnowka/processors/convert.py

+                   "content_html": {},
+                   "content_html_en": cm_row["desc"],
+                   "content_html_he": "",
+                   "related_documents": self.creat_img_urls(cm_row),


related_documents in dbs docs is meant to contain a mapping of
field_name => list of document ids

where field_name is source-dependent and groups related documents by type

fixed: all image data (other than the main image & thumbnail URL) is now stored only in "source_doc".
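
For reference, the shape described above might look like the sketch below. The helper name and the example field names (`places`, `photos`) are made up for illustration; the real field names depend on the source.

```python
def build_related_documents(related_ids_by_type):
    """Return a mapping of field_name => list of document ids.

    Hypothetical helper: it only normalizes each group of ids into a list,
    so the result is a plain dict of lists.
    """
    return {
        field_name: list(doc_ids)
        for field_name, doc_ids in related_ids_by_type.items()
    }


# Each key groups related documents of one type; values are document ids.
related_documents = build_related_documents({
    "places": ("doc-1", "doc-2"),
    "photos": ["doc-3"],
})
```

Image URLs do not fit this shape, because they are not document ids.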

OriHoch · 2017-08-01T12:31:47Z

datapackage_pipelines_mojp/bagnowka/processors/download.py

@@ -27,29 +32,32 @@ def _get_resource(self):
            for item_data in all_docs:
                new = all_docs[item_data]
                doc = self.download(new)
-                logging.info("hello world")
+                logging.info("hello DOWNLOAD")


there is no point in logging the same string for every row
think about which log info will be useful in the future, and whether it needs to be logged for every row at all or is better logged once before / after the iteration

fixed
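
One way to log once before and once after the loop, as a sketch only. `download_all`, `all_docs` and `download` are stand-ins for the code in download.py, and the message wording is an assumption.

```python
import logging


def download_all(all_docs, download):
    """Yield every downloaded doc, logging once before and once after the loop.

    `all_docs` is assumed to be a dict of items to fetch and `download`
    a callable that fetches a single item.
    """
    logging.info("bagnowka download: starting, %d items queued", len(all_docs))
    count = 0
    for item_key in all_docs:
        doc = download(all_docs[item_key])
        count += 1
        yield doc
    # Reached only when the caller consumes the generator to the end.
    logging.info("bagnowka download: finished, %d items downloaded", count)
```

The item count in the final message is usually more useful on the dashboard than a fixed string repeated per row.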

Libisch added 4 commits July 31, 2017 10:31

Update download and convert

f302382

Merge branch 'master' of https://github.com/Beit-Hatfutsot/mojp-dbs-p…

0ebac83

…ipelines into bagnowka_pipe

Update convert and download

b4ca601

Merge branch 'master' of https://github.com/Beit-Hatfutsot/mojp-dbs-p…

a7ccca7

…ipelines into bagnowka_pipe

Libisch requested a review from OriHoch August 1, 2017 11:48

OriHoch previously requested changes Aug 1, 2017

fix convert.py

2347d0a

fix test_convert

c0842e5

Libisch requested a review from OriHoch August 1, 2017 14:52

OriHoch approved these changes Aug 2, 2017

update to current merge

ee638cc

Libisch merged commit b55405b into Beit-Hatfutsot:master Aug 2, 2017

Libisch mentioned this pull request Aug 9, 2017

Bagnowka should be added as a DBS source #67

Closed
