Skip to content

Commit

Permalink
Bugfix/faciliate redownload on source url change (#39)
Browse files Browse the repository at this point in the history
Should a source url change we want to facilitate redownload from the
source url. This bugfix enables that.

This PR looks like alot of file changes but there is really only two
modules that have simple changes (the parse function itself and the test
for the function), the rest of the changes are to the expected test data
for the integration tests.

REVIEWERS CONFIRM THE BELOW

Storing documents in the cdn: 

Should a source url change for the document, we would treat it as a new
document. The new cdn key / path would be generated, and the document
uploaded to the cdn. The cdn key / path is a function of the doc title
and md5sum. Thus, we should only ever have the exact same cdn path if
the content of the document is exactly the same. Thus we would silently
overwrite the pdf document stored in the cdn.

---------

Co-authored-by: Mark <mark@climatepolicyradar.org>
THOR300 and Mark authored Apr 13, 2023

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
1 parent e070ce1 commit 1af1b05
Showing 22 changed files with 421 additions and 66 deletions.
22 changes: 18 additions & 4 deletions HOW_TO_UPDATE_TESTS.md
Original file line number Diff line number Diff line change
@@ -22,20 +22,34 @@ Build the docker image locally

make build_test


MAKE SURE YOU HAVE THE CORRECT AWS CREDENTIALS SET UP.

export AWS_PROFILE=${PROFILE_NAME}

Set up the test buckets

python setup_test_buckets ${document_bucket} ${pipeline_bucket} ${region}
python -m integration_tests.setup_test_buckets ${document_bucket} ${pipeline_bucket} ${region}

Sync the test data to the s3 bucket

aws s3 sync integration_tests/data/pipeline_in s3://${pipeline_bucket}

Run the docker image
Run the docker image. If you are trying to figure out what the variables are look in the env var section of the following file: .github/workflows/integration-tests.yml. Also note that the prefixes used must match the subdirectory names of the data/pipeline_in directory.

docker run -e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} -e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} -e API_HOST="" -e MACHINE_USER_EMAIL="" -e MACHINE_USER_PASSWORD="" navigator-data-ingest-test --pipeline-bucket ${PIPELINE_BUCKET} --document-bucket ${DOCUMENT_BUCKET} --input-file ${TEST_DATA_UPLOAD_PATH} --output-prefix ${OUTPUT_PREFIX} --embeddings-input-prefix ${EMBEDDINGS_INPUT_PREFIX} --indexer-input-prefix ${INDEXER_INPUT_PREFIX}

Assert that the output is correct and if so snyc the data locally to the pipeline_out directory
Example:

docker run -e AWS_ACCESS_KEY_ID=XXXX -e AWS_SECRET_ACCESS_KEY=XXXX -e API_HOST="" -e MACHINE_USER_EMAIL="" -e MACHINE_USER_PASSWORD="" navigator-data-ingest-test --pipeline-bucket pipbucket123123123 --document-bucket docbucket123123123 --input-file input/new_and_updated_documents.json --output-prefix ingest_unit_test_parser_input --embeddings-input-prefix ingest_unit_test_embeddings_input --indexer-input-prefix ingest_unit_test_indexer_input


Assert that the output is correct and if so manually delete all the files in the pipeline_out directory and sync the data locally to the pipeline_out directory

cd integration_tests/data/pipeline_out

aws s3 sync s3://${pipeline_bucket}/ .
aws s3 sync s3://${pipeline_bucket}/ .

Remove the test buckets

python -m integration_tests.remove_test_buckets ${document_bucket} ${pipeline_bucket} ${region}
Original file line number Diff line number Diff line change
@@ -838,6 +838,102 @@
}
],
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
},
{
"publication_ts": "2013-01-01T00:00:00",
"name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"description": "The Decision no 1386/2013/EU sets up the General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet'. It adopts the '7th Environment Action programme' or \u20187th EAP'. The priority objectives of the 7th EAP are: (a) to protect, conserve and enhance the Union's natural capital; (b) to turn the Union into a resource-efficient, green and competitive low-carbon economy; (c) to safeguard the Union's citizens from environment-related pressures and risks to health and well-being; (d) to maximise the benefits of Union environment legislation by improving implementation; (e) to improve the knowledge and evidence base for Union environment policy; (f) to secure investment for environment and climate policy and address environmental externalities; (g) to improve environmental integration and policy coherence; (h) to enhance the sustainability of the Union's cities; (i) to increase the Union's effectiveness in addressing inter\u00ad national environmental and climate-related challenges.",
"source_url": "http://existing.com",
"url": null,
"md5_sum": null,
"type": "EU Decision",
"source": "CCLW",
"import_id": "TESTCCLW.executive.3.3",
"category": "Law",
"frameworks": [],
"geography": "EUR",
"hazards": [],
"instruments": [
"Capacity building|Governance",
"Education, training and knowledge dissemination|Information"
],
"keywords": [
"Adaptation",
"Institutions / Administrative Arrangements",
"Research And Development",
"Energy Supply",
"Energy Demand",
"REDD+ And LULUCF",
"Transport"
],
"languages": [
"English"
],
"sectors": [
"Economy-wide",
"Health",
"Transport"
],
"topics": [
"Adaptation",
"Mitigation"
],
"events": [
{
"name": "Law passed",
"description": "",
"created_ts": "2013-11-20T00:00:00"
}
],
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
},
{
"publication_ts": "2013-01-01T00:00:00",
"name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"description": "description",
"source_url": "http://existing.com",
"url": null,
"md5_sum": null,
"type": "EU Decision",
"source": "CCLW",
"import_id": "TESTCCLW.executive.4.4",
"category": "Law",
"frameworks": [],
"geography": "EUR",
"hazards": [],
"instruments": [
"Capacity building|Governance",
"Education, training and knowledge dissemination|Information"
],
"keywords": [
"Adaptation",
"Institutions / Administrative Arrangements",
"Research And Development",
"Energy Supply",
"Energy Demand",
"REDD+ And LULUCF",
"Transport"
],
"languages": [
"English"
],
"sectors": [
"Economy-wide",
"Health",
"Transport"
],
"topics": [
"Adaptation",
"Mitigation"
],
"events": [
{
"name": "Law passed",
"description": "",
"created_ts": "2013-11-20T00:00:00"
}
],
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
}
],
"updated_documents": {

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
{
"document_name": "name",
"document_description": "description",
"document_id": "TESTCCLW.executive.3.3",
"document_source_url": "http://existing.com",
"document_cdn_object": null,
"document_content_type": "text/html",
"document_md5_sum": null,
"document_metadata": {},
"document_slug": "fake_slug",
"languages": [
"en"
],
"translated": false,
"html_data": {
"detected_title": "One Stop Shop Service",
"detected_date": null,
"has_valid_text": true,
"text_blocks": [
{
"text": [
"Why use a One Stop Shop"
],
"text_block_id": "b0",
"language": "en",
"type": "Text",
"type_confidence": 1.0
}
]
},
"pdf_data": null
}

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://existing.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug", "languages": ["en"], "translated": false, "html_data": {"detected_title": "One Stop Shop Service", "detected_date": null, "has_valid_text": true, "text_blocks": [{"text": ["Why use a One Stop Shop"], "text_block_id": "b0", "language": "en", "type": "Text", "type_confidence": 1.0}]}, "pdf_data": null}

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
{
"document_name": "name",
"document_description": "description",
"document_id": "TESTCCLW.executive.3.3",
"document_source_url": "http://existing.com",
"document_cdn_object": null,
"document_content_type": "text/html",
"document_md5_sum": null,
"document_metadata": {},
"document_slug": "fake_slug",
"languages": [
"en"
],
"translated": false,
"html_data": {
"detected_title": "One Stop Shop Service",
"detected_date": null,
"has_valid_text": true,
"text_blocks": [
{
"text": [
"Why use a One Stop Shop"
],
"text_block_id": "b0",
"language": "en",
"type": "Text",
"type_confidence": 1.0
}
]
},
"pdf_data": null
}

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://existing.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug", "languages": ["en"], "translated": false, "html_data": {"detected_title": "One Stop Shop Service", "detected_date": null, "has_valid_text": true, "text_blocks": [{"text": ["Why use a One Stop Shop"], "text_block_id": "b0", "language": "en", "type": "Text", "type_confidence": 1.0}]}, "pdf_data": null}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"document_name": "name",
"document_description": "description",
"document_id": "TESTCCLW.executive.3.3",
"document_source_url": "http://existing.com",
"document_cdn_object": null,
"document_content_type": "text/html",
"document_md5_sum": null,
"document_metadata": {},
"document_slug": "fake_slug"
}
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://existing.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug"}
Original file line number Diff line number Diff line change
@@ -1 +1,56 @@
{"document_name": "name", "document_description": "description", "document_id": "TESTCCLW.executive.3.3", "document_source_url": "http://new.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug"}
{
"document_name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"document_description": "The Decision no 1386/2013/EU sets up the General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet'. It adopts the '7th Environment Action programme' or \u20187th EAP'. The priority objectives of the 7th EAP are: (a) to protect, conserve and enhance the Union's natural capital; (b) to turn the Union into a resource-efficient, green and competitive low-carbon economy; (c) to safeguard the Union's citizens from environment-related pressures and risks to health and well-being; (d) to maximise the benefits of Union environment legislation by improving implementation; (e) to improve the knowledge and evidence base for Union environment policy; (f) to secure investment for environment and climate policy and address environmental externalities; (g) to improve environmental integration and policy coherence; (h) to enhance the sustainability of the Union's cities; (i) to increase the Union's effectiveness in addressing inter\u00ad national environmental and climate-related challenges.",
"document_id": "TESTCCLW.executive.3.3",
"document_source_url": "http://existing.com",
"document_cdn_object": null,
"document_content_type": null,
"document_md5_sum": null,
"document_metadata": {
"name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"description": "The Decision no 1386/2013/EU sets up the General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet'. It adopts the '7th Environment Action programme' or \u20187th EAP'. The priority objectives of the 7th EAP are: (a) to protect, conserve and enhance the Union's natural capital; (b) to turn the Union into a resource-efficient, green and competitive low-carbon economy; (c) to safeguard the Union's citizens from environment-related pressures and risks to health and well-being; (d) to maximise the benefits of Union environment legislation by improving implementation; (e) to improve the knowledge and evidence base for Union environment policy; (f) to secure investment for environment and climate policy and address environmental externalities; (g) to improve environmental integration and policy coherence; (h) to enhance the sustainability of the Union's cities; (i) to increase the Union's effectiveness in addressing inter\u00ad national environmental and climate-related challenges.",
"import_id": "TESTCCLW.executive.3.3",
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017",
"publication_ts": "2013-01-01T00:00:00",
"source_url": "http://existing.com",
"type": "EU Decision",
"source": "CCLW",
"category": "Law",
"geography": "EUR",
"frameworks": [],
"instruments": [
"Capacity building|Governance",
"Education, training and knowledge dissemination|Information"
],
"hazards": [],
"keywords": [
"Adaptation",
"Institutions / Administrative Arrangements",
"Research And Development",
"Energy Supply",
"Energy Demand",
"REDD+ And LULUCF",
"Transport"
],
"languages": [
"English"
],
"sectors": [
"Economy-wide",
"Health",
"Transport"
],
"topics": [
"Adaptation",
"Mitigation"
],
"events": [
{
"name": "Law passed",
"description": "",
"created_ts": "2013-11-20T00:00:00"
}
]
},
"document_slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
}
Original file line number Diff line number Diff line change
@@ -1 +1,56 @@
{"document_name": "name", "document_description": "new description", "document_id": "TESTCCLW.executive.4.4", "document_source_url": "http://new.com", "document_cdn_object": null, "document_content_type": "text/html", "document_md5_sum": null, "document_metadata": {}, "document_slug": "fake_slug"}
{
"document_name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"document_description": "description",
"document_id": "TESTCCLW.executive.4.4",
"document_source_url": "http://existing.com",
"document_cdn_object": null,
"document_content_type": null,
"document_md5_sum": null,
"document_metadata": {
"name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"description": "description",
"import_id": "TESTCCLW.executive.4.4",
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017",
"publication_ts": "2013-01-01T00:00:00",
"source_url": "http://existing.com",
"type": "EU Decision",
"source": "CCLW",
"category": "Law",
"geography": "EUR",
"frameworks": [],
"instruments": [
"Capacity building|Governance",
"Education, training and knowledge dissemination|Information"
],
"hazards": [],
"keywords": [
"Adaptation",
"Institutions / Administrative Arrangements",
"Research And Development",
"Energy Supply",
"Energy Demand",
"REDD+ And LULUCF",
"Transport"
],
"languages": [
"English"
],
"sectors": [
"Economy-wide",
"Health",
"Transport"
],
"topics": [
"Adaptation",
"Mitigation"
],
"events": [
{
"name": "Law passed",
"description": "",
"created_ts": "2013-11-20T00:00:00"
}
]
},
"document_slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
}
Original file line number Diff line number Diff line change
@@ -838,6 +838,102 @@
}
],
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
},
{
"publication_ts": "2013-01-01T00:00:00",
"name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"description": "The Decision no 1386/2013/EU sets up the General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet'. It adopts the '7th Environment Action programme' or \u20187th EAP'. The priority objectives of the 7th EAP are: (a) to protect, conserve and enhance the Union's natural capital; (b) to turn the Union into a resource-efficient, green and competitive low-carbon economy; (c) to safeguard the Union's citizens from environment-related pressures and risks to health and well-being; (d) to maximise the benefits of Union environment legislation by improving implementation; (e) to improve the knowledge and evidence base for Union environment policy; (f) to secure investment for environment and climate policy and address environmental externalities; (g) to improve environmental integration and policy coherence; (h) to enhance the sustainability of the Union's cities; (i) to increase the Union's effectiveness in addressing inter\u00ad national environmental and climate-related challenges.",
"source_url": "http://existing.com",
"url": null,
"md5_sum": null,
"type": "EU Decision",
"source": "CCLW",
"import_id": "TESTCCLW.executive.3.3",
"category": "Law",
"frameworks": [],
"geography": "EUR",
"hazards": [],
"instruments": [
"Capacity building|Governance",
"Education, training and knowledge dissemination|Information"
],
"keywords": [
"Adaptation",
"Institutions / Administrative Arrangements",
"Research And Development",
"Energy Supply",
"Energy Demand",
"REDD+ And LULUCF",
"Transport"
],
"languages": [
"English"
],
"sectors": [
"Economy-wide",
"Health",
"Transport"
],
"topics": [
"Adaptation",
"Mitigation"
],
"events": [
{
"name": "Law passed",
"description": "",
"created_ts": "2013-11-20T00:00:00"
}
],
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
},
{
"publication_ts": "2013-01-01T00:00:00",
"name": "DECISION No 1386/2013/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 20 November 2013 on a General Union Environment Action Programme to 2020 \u2018Living well, within the limits of our planet\u2019",
"description": "description",
"source_url": "http://existing.com",
"url": null,
"md5_sum": null,
"type": "EU Decision",
"source": "CCLW",
"import_id": "TESTCCLW.executive.4.4",
"category": "Law",
"frameworks": [],
"geography": "EUR",
"hazards": [],
"instruments": [
"Capacity building|Governance",
"Education, training and knowledge dissemination|Information"
],
"keywords": [
"Adaptation",
"Institutions / Administrative Arrangements",
"Research And Development",
"Energy Supply",
"Energy Demand",
"REDD+ And LULUCF",
"Transport"
],
"languages": [
"English"
],
"sectors": [
"Economy-wide",
"Health",
"Transport"
],
"topics": [
"Adaptation",
"Mitigation"
],
"events": [
{
"name": "Law passed",
"description": "",
"created_ts": "2013-11-20T00:00:00"
}
],
"slug": "european-union_2013_decision-no-13862013eu-of-the-european-parliament-and-of-the-council-of-20-november-2013-on-a-general-union-environment-action-programme-to-2020-living-well-within-the-limits-of-our-planet_8570_3017"
}
],
"updated_documents": {

Large diffs are not rendered by default.

33 changes: 2 additions & 31 deletions src/navigator_data_ingest/base/updated_document_actions.py
Original file line number Diff line number Diff line change
@@ -203,47 +203,18 @@ def parse(
"""
document_id, document_update = update
_LOGGER.info(
"Updating document so as to parse during the next run.",
"Archiving document so as to re-download from source and parse during the next run.",
extra={
"props": {
"document_id": document_id,
}
},
)
errors = []
for prefix_path in [
S3Path(
os.path.join(
"s3://", update_config.pipeline_bucket, update_config.parser_input
)
),
S3Path(
os.path.join(
"s3://", update_config.pipeline_bucket, update_config.embeddings_input
)
),
S3Path(
os.path.join(
"s3://", update_config.pipeline_bucket, update_config.indexer_input
)
),
]:
# Might be translated and non-translated json objects
document_files = get_document_files(
prefix_path, document_id, suffix_filter="json"
)
for document_file in document_files:
errors.append(
update_file_field(
document_path=document_file,
field=str(document_update.type.value),
new_value=document_update.csv_value,
existing_value=document_update.db_value,
)
)

timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
for prefix in [
update_config.parser_input,
update_config.embeddings_input,
update_config.indexer_input,
]:
10 changes: 1 addition & 9 deletions src/navigator_data_ingest/tests/test_update_actions.py
Original file line number Diff line number Diff line change
@@ -184,16 +184,8 @@ def test_parse(
for s3_key in s3_document_keys
]

assert parser_input_doc.exists()
assert not parser_input_doc.exists()
assert not embeddings_input_doc.exists()
assert not embeddings_input_translated_doc.exists()
assert not indexer_input_doc_json.exists()
assert not indexer_input_doc_npy.exists()

parser_input_doc_data = json.loads(parser_input_doc.read_text())
assert (
parser_input_doc_data[
PipelineFieldMapping[UpdateTypes(update_to_source_url.type)]
]
== update_to_source_url.csv_value
)

0 comments on commit 1af1b05

Please sign in to comment.