Bugfix/facilitate redownload on source url change (#39)
If a document's source URL changes, we want to facilitate a re-download from the source URL. This bugfix enables that.

This PR looks like a lot of file changes, but only two modules have (simple) changes: the parse function itself and the test for that function. The rest of the changes are to the expected test data for the integration tests.

REVIEWERS, PLEASE CONFIRM THE BELOW

Storing documents in the CDN: if the source URL changes for a document, we treat it as a new document. A new CDN key / path is generated and the document is uploaded to the CDN. The CDN key / path is a function of the document title and md5sum, so we should only ever get exactly the same CDN path if the content of the document is exactly the same. In that case we would silently overwrite the PDF document stored in the CDN with identical content.

---------

Co-authored-by: Mark <mark@climatepolicyradar.org>
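The CDN reasoning in the commit message can be illustrated with a minimal sketch. The commit does not show how the key is actually built, so the `cdn_key` function, the slug rule, and the `.pdf` suffix below are all assumptions for illustration only. The point is simply that a key derived from the title and the content's md5sum only repeats when the content is byte-for-byte the same.

```python
import hashlib
import re


def md5_hex(content: bytes) -> str:
    """Return the hex md5sum of the document bytes."""
    return hashlib.md5(content).hexdigest()


def cdn_key(title: str, content: bytes) -> str:
    """Hypothetical key scheme: slugified title plus the content md5sum.

    This is an assumed layout, not the repository's real implementation.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}_{md5_hex(content)}.pdf"


# Same title and same bytes give the same key, so an upload overwrites
# an object with identical content.
same_a = cdn_key("One Stop Shop Service", b"%PDF-1.7 example bytes")
same_b = cdn_key("One Stop Shop Service", b"%PDF-1.7 example bytes")

# Same title but different bytes give a different key, so nothing is overwritten.
changed = cdn_key("One Stop Shop Service", b"%PDF-1.7 different bytes")
```

Under this assumed scheme, a document fetched from a new source URL only lands on an existing CDN path when its content is unchanged, which is why the overwrite described above is harmless.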
- v2.5.1-beta
- v2.5.0-beta
- v2.4.8-beta
- v2.4.7-beta
- v2.4.6-beta
- v2.4.5-beta
- v2.4.4-beta
- v2.4.3-beta
- v2.4.2-beta
- v2.4.1-beta
- v2.4.0-beta
- v2.3.5-beta
- v2.3.4-beta
- v2.3.3-beta
- v2.3.2-beta
- v2.3.1-beta
- v2.3.0-beta
- v2.2.6-beta
- v2.1.6-beta
- v2.1.5-beta
- v2.1.4-beta
- v2.1.3-beta
- v2.1.2-beta
- v2.1.1-beta
- v2.1.0-beta
- v2.0.10-beta
- v2.0.9-beta
- v2.0.8-beta
- v2.0.7-beta
- v2.0.6-beta
- v2.0.5-beta
- v2.0.4-beta
- v2.0.3-beta
- v2.0.2-beta
Showing 22 changed files with 421 additions and 66 deletions.
1 change: 0 additions & 1 deletion
...rchive/ingest_unit_test_embeddings_input/TESTCCLW.executive.3.3/2023-03-29-17-29-45..json
This file was deleted.
32 changes: 32 additions & 0 deletions
...rchive/ingest_unit_test_embeddings_input/TESTCCLW.executive.3.3/2023-04-12-13-01-01..json
@@ -0,0 +1,32 @@
{
  "document_name": "name",
  "document_description": "description",
  "document_id": "TESTCCLW.executive.3.3",
  "document_source_url": "http://existing.com",
  "document_cdn_object": null,
  "document_content_type": "text/html",
  "document_md5_sum": null,
  "document_metadata": {},
  "document_slug": "fake_slug",
  "languages": [
    "en"
  ],
  "translated": false,
  "html_data": {
    "detected_title": "One Stop Shop Service",
    "detected_date": null,
    "has_valid_text": true,
    "text_blocks": [
      {
        "text": [
          "Why use a One Stop Shop"
        ],
        "text_block_id": "b0",
        "language": "en",
        "type": "Text",
        "type_confidence": 1.0
      }
    ]
  },
  "pdf_data": null
}
1 change: 0 additions & 1 deletion
...rchive/ingest_unit_test_embeddings_input/TESTCCLW.executive.4.4/2023-03-29-17-29-47..json
This file was deleted.
1 change: 1 addition & 0 deletions
...rchive/ingest_unit_test_embeddings_input/TESTCCLW.executive.4.4/2023-04-12-13-01-06..json
@@ -0,0 +1 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://existing.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug", "languages": ["en"], "translated": false, "html_data": {"detected_title": "One Stop Shop Service", "detected_date": null, "has_valid_text": true, "text_blocks": [{"text": ["Why use a One Stop Shop"], "text_block_id": "b0", "language": "en", "type": "Text", "type_confidence": 1.0}]}, "pdf_data": null}
File renamed without changes.
File renamed without changes.
1 change: 0 additions & 1 deletion
...t/archive/ingest_unit_test_indexer_input/TESTCCLW.executive.3.3/2023-03-29-17-29-45..json
This file was deleted.
32 changes: 32 additions & 0 deletions
...t/archive/ingest_unit_test_indexer_input/TESTCCLW.executive.3.3/2023-04-12-13-01-01..json
@@ -0,0 +1,32 @@
{
  "document_name": "name",
  "document_description": "description",
  "document_id": "TESTCCLW.executive.3.3",
  "document_source_url": "http://existing.com",
  "document_cdn_object": null,
  "document_content_type": "text/html",
  "document_md5_sum": null,
  "document_metadata": {},
  "document_slug": "fake_slug",
  "languages": [
    "en"
  ],
  "translated": false,
  "html_data": {
    "detected_title": "One Stop Shop Service",
    "detected_date": null,
    "has_valid_text": true,
    "text_blocks": [
      {
        "text": [
          "Why use a One Stop Shop"
        ],
        "text_block_id": "b0",
        "language": "en",
        "type": "Text",
        "type_confidence": 1.0
      }
    ]
  },
  "pdf_data": null
}
File renamed without changes.
1 change: 0 additions & 1 deletion
...t/archive/ingest_unit_test_indexer_input/TESTCCLW.executive.4.4/2023-03-29-17-29-47..json
This file was deleted.
File renamed without changes.
1 change: 1 addition & 0 deletions
...t/archive/ingest_unit_test_indexer_input/TESTCCLW.executive.4.4/2023-04-12-13-01-06..json
@@ -0,0 +1 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://existing.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug", "languages": ["en"], "translated": false, "html_data": {"detected_title": "One Stop Shop Service", "detected_date": null, "has_valid_text": true, "text_blocks": [{"text": ["Why use a One Stop Shop"], "text_block_id": "b0", "language": "en", "type": "Text", "type_confidence": 1.0}]}, "pdf_data": null}
11 changes: 11 additions & 0 deletions
...ut/archive/ingest_unit_test_parser_input/TESTCCLW.executive.3.3/2023-04-12-13-01-01..json
@@ -0,0 +1,11 @@
{
  "document_name": "name",
  "document_description": "description",
  "document_id": "TESTCCLW.executive.3.3",
  "document_source_url": "http://existing.com",
  "document_cdn_object": null,
  "document_content_type": "text/html",
  "document_md5_sum": null,
  "document_metadata": {},
  "document_slug": "fake_slug"
}
1 change: 1 addition & 0 deletions
...ut/archive/ingest_unit_test_parser_input/TESTCCLW.executive.4.4/2023-04-12-13-01-06..json
@@ -0,0 +1 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://existing.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug"}
57 changes: 56 additions & 1 deletion
...gration_tests/data/pipeline_out/ingest_unit_test_parser_input/TESTCCLW.executive.3.3.json
@@ -1 +1,56 @@
{"document_name": "name", "document_description": "description", "document_id": "TESTCCLW.executive.3.3", "document_source_url": "http://new.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug"}
{
  "document_name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
  "document_description": "The Decision no 1386/2013/EU sets up the General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet'. It adopts the '7th Environment Action programme' or \u20187th EAP'. The priority objectives of the 7th EAP are: (a) to protect, conserve and enhance the Union's natural capital; (b) to turn the Union into a resource-efficient, green and competitive low-carbon economy; (c) to safeguard the Union's citizens from environment-related pressures and risks to health and well-being; (d) to maximise the benefits of Union environment legislation by improving implementation; (e) to improve the knowledge and evidence base for Union environment policy; (f) to secure investment for environment and climate policy and address environmental externalities; (g) to improve environmental integration and policy coherence; (h) to enhance the sustainability of the Union's cities; (i) to increase the Union's effectiveness in addressing inter\u00ad national environmental and climate-related challenges.",
  "document_id": "TESTCCLW.executive.3.3",
  "document_source_url": "http://existing.com",
  "document_cdn_object": null,
  "document_content_type": null,
  "document_md5_sum": null,
  "document_metadata": {
    "name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
    "description": "The Decision no 1386/2013/EU sets up the General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet'. It adopts the '7th Environment Action programme' or \u20187th EAP'. The priority objectives of the 7th EAP are: (a) to protect, conserve and enhance the Union's natural capital; (b) to turn the Union into a resource-efficient, green and competitive low-carbon economy; (c) to safeguard the Union's citizens from environment-related pressures and risks to health and well-being; (d) to maximise the benefits of Union environment legislation by improving implementation; (e) to improve the knowledge and evidence base for Union environment policy; (f) to secure investment for environment and climate policy and address environmental externalities; (g) to improve environmental integration and policy coherence; (h) to enhance the sustainability of the Union's cities; (i) to increase the Union's effectiveness in addressing inter\u00ad national environmental and climate-related challenges.",
    "import_id": "TESTCCLW.executive.3.3",
    "slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017",
    "publication_ts": "2013-01-01T00:00:00",
    "source_url": "http://existing.com",
    "type": "EU Decision",
    "source": "CCLW",
    "category": "Law",
    "geography": "EUR",
    "frameworks": [],
    "instruments": [
      "Capacity building|Governance",
      "Education, training and knowledge dissemination|Information"
    ],
    "hazards": [],
    "keywords": [
      "Adaptation",
      "Institutions / Administrative Arrangements",
      "Research And Development",
      "Energy Supply",
      "Energy Demand",
      "REDD+ And LULUCF",
      "Transport"
    ],
    "languages": [
      "English"
    ],
    "sectors": [
      "Economy-wide",
      "Health",
      "Transport"
    ],
    "topics": [
      "Adaptation",
      "Mitigation"
    ],
    "events": [
      {
        "name": "Law passed",
        "description": "",
        "created_ts": "2013-11-20T00:00:00"
      }
    ]
  },
  "document_slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
}
57 changes: 56 additions & 1 deletion
...gration_tests/data/pipeline_out/ingest_unit_test_parser_input/TESTCCLW.executive.4.4.json
@@ -1 +1,56 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://new.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug"}
{
  "document_name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
  "document_description": "description",
  "document_id": "TESTCCLW.executive.4.4",
  "document_source_url": "http://existing.com",
  "document_cdn_object": null,
  "document_content_type": null,
  "document_md5_sum": null,
  "document_metadata": {
    "name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
    "description": "description",
    "import_id": "TESTCCLW.executive.4.4",
    "slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017",
    "publication_ts": "2013-01-01T00:00:00",
    "source_url": "http://existing.com",
    "type": "EU Decision",
    "source": "CCLW",
    "category": "Law",
    "geography": "EUR",
    "frameworks": [],
    "instruments": [
      "Capacity building|Governance",
      "Education, training and knowledge dissemination|Information"
    ],
    "hazards": [],
    "keywords": [
      "Adaptation",
      "Institutions / Administrative Arrangements",
      "Research And Development",
      "Energy Supply",
      "Energy Demand",
      "REDD+ And LULUCF",
      "Transport"
    ],
    "languages": [
      "English"
    ],
    "sectors": [
      "Economy-wide",
      "Health",
      "Transport"
    ],
    "topics": [
      "Adaptation",
      "Mitigation"
    ],
    "events": [
      {
        "name": "Law passed",
        "description": "",
        "created_ts": "2013-11-20T00:00:00"
      }
    ]
  },
  "document_slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
}
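The two expected-output diffs above show the same pattern: the stored parser input had `document_source_url` set to `http://new.com` with a `text/html` content type, and the updated file carries `http://existing.com` with `document_content_type` reset to `null`. A minimal sketch of one way such an update could be applied follows. The actual parse function is not shown in this commit view, so `needs_redownload`, `apply_update`, and the choice of which fields to reset are assumptions, not the repository's code.

```python
from typing import Any, Dict

# Fields assumed (for this sketch) to describe a previously downloaded copy.
DOWNLOAD_FIELDS = (
    "document_cdn_object",
    "document_md5_sum",
    "document_content_type",
)


def needs_redownload(existing: Dict[str, Any], incoming: Dict[str, Any]) -> bool:
    """A changed source URL means the stored copy can no longer be trusted."""
    return existing.get("document_source_url") != incoming.get("document_source_url")


def apply_update(existing: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Merge incoming values over the existing document (hypothetical helper).

    When the source URL changed, clear the download-related fields so that a
    later pipeline stage fetches the document again.
    """
    merged = {**existing, **incoming}
    if needs_redownload(existing, incoming):
        for field in DOWNLOAD_FIELDS:
            merged[field] = None
    return merged


existing_doc = {
    "document_id": "TESTCCLW.executive.3.3",
    "document_source_url": "http://new.com",
    "document_cdn_object": None,
    "document_content_type": "text/html",
    "document_md5_sum": None,
}
incoming_doc = {
    "document_id": "TESTCCLW.executive.3.3",
    "document_source_url": "http://existing.com",
}

updated_doc = apply_update(existing_doc, incoming_doc)
```

With these inputs, `updated_doc` keeps the document id, takes the new source URL, and has its content type cleared, which matches the shape of the updated fixture files.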
34 changes: 18 additions & 16 deletions
integration_tests/data/pipeline_out/input/new_and_updated_documents.json_errors
Large diffs are not rendered by default.