Do not wait for reindex completion on filtered index creation #2980

Open
AetherUnbound opened this issue Sep 4, 2023 · 3 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

Comments

@AetherUnbound
Collaborator

Problem

Filtered index creation was recently throttled because of its effect on production API performance (#2975). The throttling has lengthened the create_and_populate_filtered_index step, specifically the reindex call here:

self.es.reindex(
    body={
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "must_not": [
                        # Use `terms` query for exact matching against
                        # unanalyzed raw fields
                        {"terms": {f"{field}.raw": sensitive_terms}}
                        for field in ["tags.name", "title", "description"]
                    ]
                }
            },
        },
        "dest": {"index": destination_index},
    },
    slices="auto",
    wait_for_completion=True,
)

The step appears to have a default timeout of 43200 seconds (12 hours) per a recent exception:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
                       ^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 1378, in getresponse
    response.begin()
  File "/usr/local/lib/python3.11/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 468, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 357, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='openverse-es-8-8-2-elasticsearch-production.private', port=9200): Read timed out. (read timeout=43200)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/elasticsearch/connection/http_requests.py", line 166, in perform_request
    response = self.session.send(prepared_request, **send_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='openverse-es-8-8-2-elasticsearch-production.private', port=9200): Read timed out. (read timeout=43200)

Description

We should remove the wait_for_completion=True parameter of reindex and instead wait on the task using Elasticsearch's task management API (or using existing alternative mechanisms the ingestion server might have at its disposal to do so). This will require adding steps in the create filtered media index DAG in order to wait on the step to complete before issuing the refresh command (which ensures replicas exist). We may also need to add a REFRESH action to the ingestion server API which can be called by Airflow once the reindex step is complete.
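
A minimal sketch of what this could look like, assuming the elasticsearch-py client already used above. The function names, the poll interval, and the injectable sleep argument are hypothetical and only for illustration. The response shapes used here (a "task" key from reindex, and "completed", "error", and "response" keys from the task API) should be confirmed against the Elasticsearch version in use.

```python
import time


def build_filtered_reindex_body(source_index, destination_index, sensitive_terms):
    """Build the same request body as the existing reindex call."""
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "must_not": [
                        {"terms": {f"{field}.raw": sensitive_terms}}
                        for field in ["tags.name", "title", "description"]
                    ]
                }
            },
        },
        "dest": {"index": destination_index},
    }


def start_filtered_reindex(es, source_index, destination_index, sensitive_terms):
    """Start the reindex without blocking and return the Elasticsearch task ID."""
    response = es.reindex(
        body=build_filtered_reindex_body(
            source_index, destination_index, sensitive_terms
        ),
        slices="auto",
        # Return immediately with a task ID instead of holding the HTTP
        # connection open until the reindex finishes.
        wait_for_completion=False,
    )
    return response["task"]


def wait_for_task(es, task_id, poll_interval=60, timeout=None, sleep=time.sleep):
    """Poll the task management API until the task reports completion.

    `poll_interval` and `timeout` are in seconds and are hypothetical defaults.
    """
    waited = 0
    while True:
        status = es.tasks.get(task_id=task_id)
        if status.get("completed"):
            # Assumption: a failed task carries an "error" key, and a reindex
            # response lists per-document problems under "failures".
            failures = status.get("response", {}).get("failures")
            if "error" in status or failures:
                raise RuntimeError(f"Reindex task {task_id} failed: {status}")
            return status
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"Task {task_id} not complete after {waited}s")
        sleep(poll_interval)
        waited += poll_interval
```

Each poll is a short request, so no single HTTP read has to outlive the reindex. In Airflow the polling loop would more likely become a sensor, with the refresh issued only after the task reports completion.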

Alternatives

We could alternatively override the request_timeout parameter available to all elasticsearch-py methods to a value greater than 43200. This could be a short-term workaround.
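
As a rough sketch of that workaround, the per-request request_timeout could be passed alongside the existing arguments. The function name and the 24-hour value are hypothetical; the right value depends on how long the throttled reindex actually takes.

```python
# Illustrative value only: 24 hours, double the 43200 seconds seen in the traceback.
EXTENDED_TIMEOUT_SECONDS = 24 * 60 * 60


def reindex_with_longer_timeout(es, body, request_timeout=EXTENDED_TIMEOUT_SECONDS):
    """Run the blocking reindex, but allow the HTTP read to wait longer."""
    return es.reindex(
        body=body,
        slices="auto",
        wait_for_completion=True,
        # Per-request override of the client's read timeout, in seconds.
        request_timeout=request_timeout,
    )
```

This keeps a single long-lived connection open for the whole reindex, so it only moves the limit rather than removing it.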

Additional context

@AetherUnbound AetherUnbound added ✨ goal: improvement Improvement to an existing user-facing feature 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Sep 4, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Sep 4, 2023
@AetherUnbound AetherUnbound added the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Sep 4, 2023
@krysal
Member

krysal commented Sep 5, 2023

I'd refrain from creating more actions in the Ingestion Server if we can skip it, making the Airflow DAG interact directly with Elasticsearch (ES). If we want the quick fix to continue running the data refresh, then increasing the request_timeout sounds like the best option now.

@AetherUnbound
Collaborator Author

That's true - the operations for the create_and_populate_filtered_index function are simple enough that we could bring them wholesale into Airflow and manage them there entirely!

@AetherUnbound AetherUnbound added 🟨 priority: medium Not blocking but should be addressed soon ⛔ status: blocked Blocked & therefore, not ready for work and removed 🟧 priority: high Stalls work on the project or its dependents labels Sep 26, 2023
@AetherUnbound
Collaborator Author

Since we're planning on moving the logic itself into Airflow, this is blocked by #2370

@krysal krysal removed the ⛔ status: blocked Blocked & therefore, not ready for work label Jan 8, 2024