Introduce gcp translation(V3) glossaries providers (#45085)
Operators:
- TranslateCreateGlossaryOperator
- TranslateUpdateGlossaryOperator
- TranslateListGlossariesOperator
- TranslateDeleteGlossaryOperator

This set of operators makes it possible to work with custom translation
dictionaries (glossaries) through the Google Cloud Translate V3 API.

Co-authored-by: Oleg Kachur <[email protected]>
olegkachur-e and Oleg Kachur authored Dec 20, 2024
1 parent 5d77395 commit 3af90fd
Showing 8 changed files with 1,099 additions and 1 deletion.
86 changes: 86 additions & 0 deletions docs/apache-airflow-providers-google/operators/cloud/translate.rst
@@ -289,6 +289,92 @@ Basic usage of the operator:
:end-before: [END howto_operator_translate_document_batch]


.. _howto/operator:TranslateCreateGlossaryOperator:

TranslateCreateGlossaryOperator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Create a translation glossary, using Cloud Translate API (Advanced V3).

For parameter definition, take a look at
:class:`~airflow.providers.google.cloud.operators.translate.TranslateCreateGlossaryOperator`

Using the operator
""""""""""""""""""

Basic usage of the operator:

.. exampleinclude:: /../../providers/tests/system/google/cloud/translate/example_translate_glossary.py
:language: python
:dedent: 4
:start-after: [START howto_operator_translate_create_glossary]
:end-before: [END howto_operator_translate_create_glossary]


.. _howto/operator:TranslateUpdateGlossaryOperator:

TranslateUpdateGlossaryOperator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Updates a translation glossary, using Cloud Translate API (Advanced V3).
Only the ``display_name`` and ``input_config`` fields can be updated.
Updating ``input_config`` replaces the glossary dictionary contents.

For parameter definition, take a look at
:class:`~airflow.providers.google.cloud.operators.translate.TranslateUpdateGlossaryOperator`

Using the operator
""""""""""""""""""

Basic usage of the operator:

.. exampleinclude:: /../../providers/tests/system/google/cloud/translate/example_translate_glossary.py
:language: python
:dedent: 4
:start-after: [START howto_operator_translate_update_glossary]
:end-before: [END howto_operator_translate_update_glossary]


.. _howto/operator:TranslateListGlossariesOperator:

TranslateListGlossariesOperator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
List all available translation glossaries on the project.

For parameter definition, take a look at
:class:`~airflow.providers.google.cloud.operators.translate.TranslateListGlossariesOperator`

Using the operator
""""""""""""""""""

Basic usage of the operator:

.. exampleinclude:: /../../providers/tests/system/google/cloud/translate/example_translate_glossary.py
:language: python
:dedent: 4
:start-after: [START howto_operator_translate_list_glossaries]
:end-before: [END howto_operator_translate_list_glossaries]


.. _howto/operator:TranslateDeleteGlossaryOperator:

TranslateDeleteGlossaryOperator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Delete the translation glossary resource.

For parameter definition, take a look at
:class:`~airflow.providers.google.cloud.operators.translate.TranslateDeleteGlossaryOperator`

Using the operator
""""""""""""""""""

Basic usage of the operator:

.. exampleinclude:: /../../providers/tests/system/google/cloud/translate/example_translate_glossary.py
:language: python
:dedent: 4
:start-after: [START howto_operator_translate_delete_glossary]
:end-before: [END howto_operator_translate_delete_glossary]


More information
""""""""""""""""""
See:
1 change: 1 addition & 0 deletions docs/spelling_wordlist.txt
@@ -125,6 +125,7 @@ autoscale
autoscaled
autoscaler
autoscaling
available
avp
Avro
avro
231 changes: 231 additions & 0 deletions providers/src/airflow/providers/google/cloud/hooks/translate.py
@@ -30,6 +30,7 @@
from google.api_core.retry import Retry
from google.cloud.translate_v2 import Client
from google.cloud.translate_v3 import TranslationServiceClient
from google.cloud.translate_v3.types.translation_service import GlossaryInputConfig

from airflow.exceptions import AirflowException
from airflow.providers.google.common.consts import CLIENT_INFO
@@ -51,6 +52,7 @@
TransliterationConfig,
automl_translation,
)
from google.cloud.translate_v3.types.translation_service import Glossary
from proto import Message


@@ -915,3 +917,232 @@ def batch_translate_document(
retry=retry,
metadata=metadata,
)

def create_glossary(
self,
project_id: str,
location: str,
glossary_id: str,
input_config: GlossaryInputConfig | dict,
language_pair: Glossary.LanguageCodePair | dict | None = None,
language_codes_set: Glossary.LanguageCodesSet | MutableSequence[str] | None = None,
retry: Retry | _MethodDefault = DEFAULT,
timeout: float | None = None,
metadata: Sequence[tuple[str, str]] = (),
) -> Operation:
"""
Create the glossary resource from the input source file.
:param project_id: ID of the Google Cloud project where the glossary is located. If not provided,
the default project_id is used.
:param location: The location of the project.
:param glossary_id: User-specified ID used to build the glossary resource name.
:param input_config: The input configuration of examples to build the glossary from.
The total glossary must not exceed 10M Unicode codepoints.
Headers should not be included in the input file table, because the languages are specified with
the ``language_pair`` or ``language_codes_set`` params.
:param language_pair: Pair of language codes to be used for glossary creation.
Used to build a unidirectional glossary. If specified, ``language_codes_set`` should be empty.
:param language_codes_set: Set of language codes used to create an equivalent term sets glossary,
that is, a mapping across multiple languages. If specified, ``language_pair`` should be empty.
:param retry: A retry object used to retry requests. If `None` is specified, requests will not be
retried.
:param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if
`retry` is specified, the timeout applies to each individual attempt.
:param metadata: Additional metadata that is provided to the method.
:return: `Operation` object with the glossary creation results.
"""
client = self.get_client()
parent = f"projects/{project_id}/locations/{location}"
name = f"projects/{project_id}/locations/{location}/glossaries/{glossary_id}"

result = client.create_glossary(
request={
"parent": parent,
"glossary": {
"name": name,
"input_config": input_config,
"language_pair": language_pair,
"language_codes_set": language_codes_set,
},
},
retry=retry,
timeout=timeout,
metadata=metadata,
)
return result
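
The ``create_glossary`` docstring treats ``language_pair`` and ``language_codes_set`` as mutually exclusive. A minimal sketch of that rule, assuming plain dicts in place of the protobuf types; ``build_glossary_payload`` is a hypothetical helper and is not part of this commit:

```python
from __future__ import annotations


def build_glossary_payload(
    project_id: str,
    location: str,
    glossary_id: str,
    input_config: dict,
    language_pair: dict | None = None,
    language_codes_set: dict | None = None,
) -> dict:
    """Hypothetical helper: assemble a create-glossary request from plain dicts."""
    # Exactly one of the two language settings should be supplied.
    if (language_pair is None) == (language_codes_set is None):
        raise ValueError("Specify exactly one of language_pair or language_codes_set.")
    parent = f"projects/{project_id}/locations/{location}"
    glossary = {
        "name": f"{parent}/glossaries/{glossary_id}",
        "input_config": input_config,
    }
    # Only the chosen language setting is added to the payload.
    if language_pair is not None:
        glossary["language_pair"] = language_pair
    else:
        glossary["language_codes_set"] = language_codes_set
    return {"parent": parent, "glossary": glossary}


# Illustrative values only; the bucket and file names are made up.
payload = build_glossary_payload(
    project_id="my-project",
    location="us-central1",
    glossary_id="my-glossary",
    input_config={"gcs_source": {"input_uri": "gs://my-bucket/glossary.csv"}},
    language_pair={"source_language_code": "en", "target_language_code": "es"},
)
```

Unlike the hook method, this sketch leaves the unused language field out of the payload instead of passing ``None``; that is a design choice of the sketch, not a statement about what the API requires.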

def get_glossary(
self,
project_id: str,
location: str,
glossary_id: str,
retry: Retry | _MethodDefault = DEFAULT,
timeout: float | None = None,
metadata: Sequence[tuple[str, str]] = (),
) -> Glossary:
"""
Fetch glossary item data by the given id.
The ``glossary_id`` is the last segment of the glossary resource name, which has the format:
``projects/{project-number-or-id}/locations/{location-id}/glossaries/{glossary-id}``
:param project_id: ID of the Google Cloud project where the glossary is located. If not provided,
the default project_id is used.
:param location: The location of the project.
:param glossary_id: User-specified ID used to build the glossary resource name.
:param retry: A retry object used to retry requests. If `None` is specified, requests will not be
retried.
:param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if
`retry` is specified, the timeout applies to each individual attempt.
:param metadata: Additional metadata that is provided to the method.
:return: Fetched glossary item.
"""
client = self.get_client()
name = f"projects/{project_id}/locations/{location}/glossaries/{glossary_id}"
result = client.get_glossary(
name=name,
retry=retry,
timeout=timeout,
metadata=metadata,
)
if not result:
raise AirflowException(f"Failed to get glossary {name}! Please check if it exists.")
return result
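
The relationship between ``glossary_id`` and the full resource name described in the ``get_glossary`` docstring can be sketched with two small functions. Both names (``glossary_name``, ``glossary_id_from_name``) are hypothetical and only mirror the f-string used in the hook:

```python
def glossary_name(project_id: str, location: str, glossary_id: str) -> str:
    """Build the resource name with the same template as the hook's f-string."""
    return f"projects/{project_id}/locations/{location}/glossaries/{glossary_id}"


def glossary_id_from_name(name: str) -> str:
    """Return the last path segment of a glossary resource name."""
    return name.rsplit("/", 1)[-1]


name = glossary_name("my-project", "us-central1", "my-glossary")
```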

def update_glossary(
self,
glossary: Glossary,
new_display_name: str | None = None,
new_input_config: GlossaryInputConfig | dict | None = None,
retry: Retry | _MethodDefault = DEFAULT,
timeout: float | None = None,
metadata: Sequence[tuple[str, str]] = (),
) -> Operation:
"""
Update glossary item with values provided.
Only ``display_name`` and ``input_config`` fields are allowed for update.
:param glossary: Glossary item to update.
:param new_display_name: New value of the ``display_name`` to be updated.
:param new_input_config: New value of the ``input_config`` to be updated.
:param retry: A retry object used to retry requests. If `None` is specified, requests will not be
retried.
:param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if
`retry` is specified, the timeout applies to each individual attempt.
:param metadata: Additional metadata that is provided to the method.
:return: `Operation` with glossary update results.
"""
client = self.get_client()
updated_fields = []
if new_display_name:
glossary.display_name = new_display_name
updated_fields.append("display_name")
if new_input_config is not None:
if isinstance(new_input_config, dict):
new_input_config = GlossaryInputConfig(**new_input_config)
glossary.input_config = new_input_config
updated_fields.append("input_config")
result = client.update_glossary(
request={"glossary": glossary, "update_mask": {"paths": updated_fields}},
retry=retry,
timeout=timeout,
metadata=metadata,
)
return result
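
The update mask in ``update_glossary`` is derived from which arguments were supplied. A minimal sketch of that selection logic in isolation, using a hypothetical ``collect_update_paths`` function and a plain dict for the input config:

```python
from __future__ import annotations


def collect_update_paths(
    new_display_name: str | None = None,
    new_input_config: dict | None = None,
) -> list[str]:
    """Hypothetical helper: list the field paths that would go into the update mask."""
    paths = []
    # An empty or missing display name is treated as "no change", as in the hook.
    if new_display_name:
        paths.append("display_name")
    # Any non-None input config counts as a change.
    if new_input_config is not None:
        paths.append("input_config")
    return paths


both = collect_update_paths("Renamed glossary", {"gcs_source": {"input_uri": "gs://my-bucket/new.csv"}})
nothing = collect_update_paths()
```

When neither argument is given the list is empty. A caller may want to check for that case before sending a request; how the service treats an empty mask is not covered here.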

def list_glossaries(
self,
project_id: str,
location: str,
page_size: int | None = None,
page_token: str | None = None,
filter_str: str | None = None,
retry: Retry | _MethodDefault = DEFAULT,
timeout: float | None = None,
metadata: Sequence[tuple[str, str]] = (),
) -> pagers.ListGlossariesPager:
"""
Get the list of glossaries available.
:param project_id: ID of the Google Cloud project where the glossaries are located. If not provided,
the default project_id is used.
:param location: The location of the project.
:param page_size: Requested page size. If not set, the server uses an appropriate default.
:param page_token: A token identifying a page of results the server should return.
The first page is returned if ``page_token`` is empty or missing.
:param filter_str: Filter specifying constraints of a list operation. Specify each constraint in
the format "key=value", where key must be ``src`` or ``tgt``, and the value must be a valid
language code.
To apply multiple restrictions, join them with "AND" (uppercase only), for example:
``src=en-US AND tgt=zh-CN``. Matching is exact, so ``en-US`` and ``en`` can give different
results depending on the language code used when the glossary was created.
For unidirectional glossaries, ``src`` and ``tgt`` restrict the source and target language
codes respectively.
For equivalent term set glossaries, ``src`` and/or ``tgt`` restrict the term set.
For example, ``src=en-US AND tgt=zh-CN`` selects only the unidirectional glossaries whose source
language code is exactly ``en-US`` and whose target language code is exactly ``zh-CN``, but it
selects all equivalent term set glossaries that contain both ``en-US`` and ``zh-CN`` in their
language set.
If missing, no filtering is performed.
:param retry: A retry object used to retry requests. If `None` is specified, requests will not be
retried.
:param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if
`retry` is specified, the timeout applies to each individual attempt.
:param metadata: Additional metadata that is provided to the method.
:return: Glossaries list pager object.
"""
client = self.get_client()
parent = f"projects/{project_id}/locations/{location}"
result = client.list_glossaries(
request={
"parent": parent,
"page_size": page_size,
"page_token": page_token,
"filter": filter_str,
},
retry=retry,
timeout=timeout,
metadata=metadata,
)
return result
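
The ``filter_str`` format described in the ``list_glossaries`` docstring ("key=value" pairs joined by an uppercase "AND") can be assembled with a small function. This is a sketch only; ``build_glossary_filter`` is a hypothetical name and does not validate language codes:

```python
from __future__ import annotations


def build_glossary_filter(src: str | None = None, tgt: str | None = None) -> str | None:
    """Hypothetical helper: build a filter such as 'src=en-US AND tgt=zh-CN'."""
    parts = []
    if src:
        parts.append(f"src={src}")
    if tgt:
        parts.append(f"tgt={tgt}")
    # Return None when no constraint is given, which means "no filtering".
    return " AND ".join(parts) or None


pair_filter = build_glossary_filter(src="en-US", tgt="zh-CN")
```

The result could then be passed as the ``filter_str`` argument of ``list_glossaries``.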

def delete_glossary(
self,
project_id: str,
location: str,
glossary_id: str,
retry: Retry | _MethodDefault = DEFAULT,
timeout: float | None = None,
metadata: Sequence[tuple[str, str]] = (),
) -> Operation:
"""
Delete the glossary item by the given id.
:param project_id: ID of the Google Cloud project where the glossary is located. If not provided,
the default project_id is used.
:param location: The location of the project.
:param glossary_id: Glossary id to be deleted.
:param retry: A retry object used to retry requests. If `None` is specified, requests will not be
retried.
:param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if
`retry` is specified, the timeout applies to each individual attempt.
:param metadata: Additional metadata that is provided to the method.
:return: `Operation` with glossary deletion results.
"""
client = self.get_client()
name = f"projects/{project_id}/locations/{location}/glossaries/{glossary_id}"
result = client.delete_glossary(
name=name,
retry=retry,
timeout=timeout,
metadata=metadata,
)
return result
28 changes: 28 additions & 0 deletions providers/src/airflow/providers/google/cloud/links/translate.py
@@ -56,6 +56,8 @@
)
TRANSLATION_MODELS_LIST_LINK = TRANSLATION_BASE_LINK + "/models/list?project={project_id}"

TRANSLATION_HUB_RESOURCES_LIST_LINK = TRANSLATION_BASE_LINK + "/hub/resources?project={project_id}"


class TranslationLegacyDatasetLink(BaseGoogleLink):
"""
@@ -368,3 +370,29 @@ def persist(
),
},
)


class TranslationGlossariesListLink(BaseGoogleLink):
"""
Helper class for constructing Translation Glossaries List link.
Link for the list of available glossaries.
"""

name = "Translation Glossaries List"
key = "translation_glossaries_list"
format_str = TRANSLATION_HUB_RESOURCES_LIST_LINK

@staticmethod
def persist(
context: Context,
task_instance,
project_id: str,
):
task_instance.xcom_push(
context,
key=TranslationGlossariesListLink.key,
value={
"project_id": project_id,
},
)
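
The value pushed by ``persist`` carries only ``project_id``, which matches the single placeholder in ``format_str``. A sketch of how the final URL would presumably be rendered, assuming the link is produced by plain ``str.format`` and assuming a value for ``TRANSLATION_BASE_LINK``, which is defined outside this hunk:

```python
# Assumed value: TRANSLATION_BASE_LINK is defined elsewhere in the module and is not shown in this diff.
TRANSLATION_BASE_LINK = "https://console.cloud.google.com/translation"
TRANSLATION_HUB_RESOURCES_LIST_LINK = TRANSLATION_BASE_LINK + "/hub/resources?project={project_id}"


def glossaries_list_url(xcom_value: dict) -> str:
    """Hypothetical helper: fill the link template from the persisted XCom value."""
    return TRANSLATION_HUB_RESOURCES_LIST_LINK.format(**xcom_value)


url = glossaries_list_url({"project_id": "my-project"})
```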