Adding the reprocess update type. #154

Merged
47 changes: 37 additions & 10 deletions src/cpr_sdk/pipeline_general_models.py
@@ -1,6 +1,6 @@
from datetime import datetime
from enum import Enum
-from typing import Mapping, Any, List, Optional, Sequence, Union
+from typing import Any, List, Mapping, Optional, Sequence, Union

from pydantic import BaseModel, field_validator

@@ -68,22 +68,39 @@ class InputData(BaseModel):


class UpdateTypes(str, Enum):
-    """Document types supported by the backend API."""
+    """
+    UpdateTypes that are recognised and have resulting actions in the pipeline.
+
+    A mapping of the update type to the action can be found in the ingest repo:
+    https://github.com/climatepolicyradar/navigator-data-ingest/blob/main/src/
+    navigator_data_ingest/base/updated_document_actions.py#L490
+
+    Attributes:
+        NAME (str): Represents the name of the document, causes embeddings generation to
+            be re-triggered for a document.
+        DESCRIPTION (str): Represents the description of the document, causes embeddings
+            generation to be re-triggered for a document.
+        SLUG (str): Represents the slug (a URL-friendly version of the name) of the
+            document, triggers an update of the field in the related S3 objects such
+            that the new data is reflected in Vespa.
+        SOURCE_URL (str): Represents the source URL of the document and triggers full
+            reprocessing and download from source of the document.
+        METADATA (str): Represents the metadata associated with the document and
+            indicates that the metadata of the objects in S3 relating to the document
+            should be updated.
+        REPARSE (str): Indicates that the document should be reparsed, including full
+            reprocessing but not redownload from source.
+        REPROCESS (str): Indicates that the document should be reprocessed, including
+            redownload from source and reparse.
+    """

     NAME = "name"
     DESCRIPTION = "description"
-    # IMPORT_ID = "import_id"
     SLUG = "slug"
-    # PUBLICATION_TS = "publication_ts"
     SOURCE_URL = "source_url"
-    # TYPE = "type"
-    # SOURCE = "source"
-    # CATEGORY = "category"
-    # GEOGRAPHY = "geography"
-    # LANGUAGES = "languages"
-    # DOCUMENT_STATUS = "document_status"
     METADATA = "metadata"
     REPARSE = "reparse"
+    REPROCESS = "reprocess"
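
A reading aid for the enum above: because `UpdateTypes` subclasses both `str` and `Enum`, its members compare equal to their raw string values and can be looked up from strings. The sketch below re-declares the enum locally so that it runs without `cpr_sdk` installed. The `ACTIONS` table and the `actions_for` helper are hypothetical paraphrases of the docstring, not the real mapping, which lives in navigator-data-ingest and may differ.

```python
from enum import Enum


class UpdateTypes(str, Enum):
    """Local copy of the enum in this diff, so the sketch is self-contained."""

    NAME = "name"
    DESCRIPTION = "description"
    SLUG = "slug"
    SOURCE_URL = "source_url"
    METADATA = "metadata"
    REPARSE = "reparse"
    REPROCESS = "reprocess"


# Hypothetical paraphrase of the docstring; the real mapping of update type
# to action is defined in navigator-data-ingest.
ACTIONS: dict[UpdateTypes, str] = {
    UpdateTypes.NAME: "regenerate embeddings",
    UpdateTypes.DESCRIPTION: "regenerate embeddings",
    UpdateTypes.SLUG: "update field in S3 objects",
    UpdateTypes.SOURCE_URL: "redownload and reprocess",
    UpdateTypes.METADATA: "update metadata in S3 objects",
    UpdateTypes.REPARSE: "reparse without redownload",
    UpdateTypes.REPROCESS: "redownload and reparse",
}


def actions_for(raw_types: list[str]) -> list[str]:
    """Resolve raw strings to enum members, then to action descriptions."""
    # UpdateTypes(raw) looks a member up by value and raises ValueError
    # for strings that are not a recognised update type.
    return [ACTIONS[UpdateTypes(raw)] for raw in raw_types]


print(actions_for(["slug", "reprocess"]))
```

Looking members up by value means an unrecognised update type fails loudly with a `ValueError` instead of being silently ignored.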


class Update(BaseModel):
@@ -109,3 +126,13 @@ class ExecutionData(BaseModel):
     """Data unique to a step functions execution that is required at later stages."""

     input_dir_path: str
+
+
+class DocUpdateConfig(BaseModel):
+    """
+    Config for updates not defined as part of IdentifyUpdates.
+
+    reprocess_updates: list of document ids to reprocess.
+    """
+
+    reprocess_updates: list[str]
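
To illustrate how a `DocUpdateConfig` might be consumed, here is a minimal sketch that adds a `"reprocess"` update for each configured document id to a set of already identified updates. Everything beyond the `reprocess_updates` field is an assumption: the `merge_reprocess_updates` helper, the dict-of-lists shape for identified updates, and the document ids are hypothetical, and a stdlib dataclass stands in for the pydantic model so the sketch has no third-party dependencies.

```python
from dataclasses import dataclass


@dataclass
class DocUpdateConfig:
    """Stdlib stand-in for the pydantic model in this diff."""

    reprocess_updates: list[str]


def merge_reprocess_updates(
    identified: dict[str, list[str]], config: DocUpdateConfig
) -> dict[str, list[str]]:
    """Return a copy of `identified` with "reprocess" added for configured ids."""
    # Copy the lists so the caller's mapping is left untouched.
    merged = {doc_id: list(types) for doc_id, types in identified.items()}
    for doc_id in config.reprocess_updates:
        types = merged.setdefault(doc_id, [])
        # Avoid queuing the same update type twice for one document.
        if "reprocess" not in types:
            types.append("reprocess")
    return merged


# Hypothetical document ids, for illustration only.
identified = {"doc.1": ["name"]}
config = DocUpdateConfig(reprocess_updates=["doc.1", "doc.2"])
print(merge_reprocess_updates(identified, config))
```

Here a document that already has an identified update gains a second one, and a document with no identified updates gets a new entry.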
2 changes: 1 addition & 1 deletion src/cpr_sdk/version.py
@@ -1,6 +1,6 @@
_MAJOR = "1"
_MINOR = "9"
-_PATCH = "6"
+_PATCH = "7"
_SUFFIX = ""

VERSION_SHORT = "{0}.{1}".format(_MAJOR, _MINOR)
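
As a quick check of what the bump resolves to, the sketch below evaluates the visible lines with the new patch number. The `VERSION` line is an assumption: the rest of `version.py` is not shown in this diff, so the way the full version string is composed may differ.

```python
_MAJOR = "1"
_MINOR = "9"
_PATCH = "7"
_SUFFIX = ""

VERSION_SHORT = "{0}.{1}".format(_MAJOR, _MINOR)
# Assumption: the full version joins all four parts; this line is not
# visible in the diff.
VERSION = "{0}.{1}.{2}{3}".format(_MAJOR, _MINOR, _PATCH, _SUFFIX)

print(VERSION_SHORT, VERSION)
```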