Only clean up ES indices when single indexer running #658

Merged: 4 commits into master, Nov 21, 2024

Conversation

jsharkey13 (Member)

If more than one indexing process is happening, cleaning up unaliased indices is not a safe operation. There are several different ways the threads can conflict!
Much of the problem is caused by the fact that each "version" is a collection of 5 indices {content, content_error, metadata, published_unit, unit}, each of which must get an alias. Each of these indices is created, and then aliased, separately.

  1. Job A is in the process of creating a new version's collection of ES indices; none of them have aliases yet. Job B concurrently reaches the cleanup stage, finds the indices Job A is adding data to, sees they are unaliased and deletes them. This causes Job A to fail, but this failure should be noticed and fail safe(ish).

  2. Job A has created all of the collection of indices for a new version, and is starting to move the various alias pointers. Job B reaches the cleanup stage, finds some mixture of the two versions' indices without aliases and deletes them. This may cause Job A to fail, but it will also leave the not-yet-aliased-as-previous version in an invalid state; this version is still being used by the API, so the live site is broken.

This code now uses a thread-safe counter to skip running the deletion if another indexing job is running. However, this check is not itself thread safe: some time may pass between the check and the response from ElasticSearch listing the indices and their aliases, during which a new indexing job may have started creating indices. So, once the index-list response arrives, we check again that there are still no other concurrent jobs. This should catch all plausible cases.
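Roughly, the check-recheck shape is something like the sketch below (illustrative only: the counter field and the two index-listing/deleting helper names are placeholders rather than the actual ETLManager code; assumes java.util.concurrent.atomic.AtomicInteger and java.util.List are imported):

    // Incremented when an indexing job starts, decremented when it finishes.
    private final AtomicInteger runningIndexJobs = new AtomicInteger(0);

    void indexThenMaybeCleanUp(final String alias, final String version) throws Exception {
        runningIndexJobs.incrementAndGet();
        try {
            indexer.loadAndIndexContent(version);
            indexer.setNamedVersion(alias, version);
        } finally {
            runningIndexJobs.decrementAndGet();
        }

        // First check: skip the cleanup entirely if another job is still indexing.
        if (runningIndexJobs.get() > 0) {
            return;
        }
        // Listing the indices and their aliases is a round-trip to ElasticSearch and may be slow...
        List<String> unaliasedIndices = indexer.listUnaliasedIndices();
        // ...so re-check before deleting, in case a new job started in the meantime.
        if (runningIndexJobs.get() == 0) {
            indexer.deleteIndices(unaliasedIndices);
        }
    }

There is still a small window between the second check and the deletion, which is exactly the residual imperfection acknowledged below.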

Without truly serialising the whole ETL mechanics, I do not think we can make it perfect - but I believe this is a net improvement.


The first commit contains all the logic changes, later commits clean up some logging issues spotted during the work.


codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.

Project coverage is 34.42%. Comparing base (cdbb748) to head (37cbf43).
Report is 21 commits behind head on master.

Files with missing lines | Patch % | Lines
...in/java/uk/ac/cam/cl/dtg/segue/etl/ETLManager.java | 0.00% | 18 Missing ⚠️
.../ac/cam/cl/dtg/segue/etl/ElasticSearchIndexer.java | 0.00% | 11 Missing ⚠️
...ava/uk/ac/cam/cl/dtg/segue/etl/ContentIndexer.java | 0.00% | 2 Missing ⚠️
...ain/java/uk/ac/cam/cl/dtg/segue/etl/ETLFacade.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #658      +/-   ##
==========================================
+ Coverage   34.39%   34.42%   +0.02%     
==========================================
  Files         521      521              
  Lines       23378    23426      +48     
  Branches     2863     2870       +7     
==========================================
+ Hits         8042     8065      +23     
- Misses      14528    14549      +21     
- Partials      808      812       +4     

jsharkey13 requested a review from mwtrew on November 18, 2024 at 09:49
mwtrew (Contributor) left a comment


If I'm understanding this, we don't mind whether the indexing step happens concurrently across multiple threads, but we want to stop any deletion step happening while others are indexing. If so, I think we can (mis?)use ReadWriteLock for this and it will be a bit simpler. Read locks are not mutually exclusive, so can be used for the indexing step. Write locks can't be acquired while any read or write locks are held, so can be used for the deletion step.

tryLock() on the write lock means the thread won't wait if the lock can't be acquired; it will just continue.

For example:

    private static ReadWriteLock indexLock = new ReentrantReadWriteLock();

(...)

    void setNamedVersion(final String alias, final String version) throws Exception {
        log.info("Requested new aliased version: {} - {}", alias, version);

        // Acquire the read lock before the try, so the finally only releases a lock we actually hold.
        indexLock.readLock().lock();
        try {
            indexer.loadAndIndexContent(version);
            log.info("Indexed version {}. Setting alias '{}'.", version, alias);
            indexer.setNamedVersion(alias, version);
            // Store the alias to file so that we can recover after wiping ElasticSearch.
            this.contentIndicesStore.saveProperty(alias, version);
        } finally {
            indexLock.readLock().unlock();
        }

        if (indexLock.writeLock().tryLock()) {
            indexer.deleteAllUnaliasedIndices();
            indexLock.writeLock().unlock();
        }
    }

jsharkey13 (Member, Author)

At first glance, this does look more like what I wanted: a primitive that maps to the use case! It hadn't occurred to me that the indexing (writing data) here maps to reading, and the deletion maps to writing, in a ReadWriteLock.

The most dangerous case is when we are inside deleteAllUnaliasedIndices(): no new thread must be able to start indexing whilst we're in there, and it looks like this achieves that.

I guess we also need to release the write lock in a try-finally too, right?

mwtrew (Contributor) commented Nov 21, 2024

I guess we also need to release the write lock in a try-finally too, right?

On reflection, yes: there's no harm in doing so, and we wouldn't want the write lock to be held forever if something went wrong!
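For concreteness, a minimal sketch of the cleanup step with the write lock released in a try/finally, reusing the names from the example above (illustrative only, not necessarily the exact merged code):

        // Skip the cleanup entirely if any indexing job currently holds the read lock.
        if (indexLock.writeLock().tryLock()) {
            try {
                indexer.deleteAllUnaliasedIndices();
            } finally {
                // Always release the write lock, even if the deletion throws.
                indexLock.writeLock().unlock();
            }
        }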

This is basically a move from an optimistic try-and-abort-if-necessary to a pessimistic block-or-skip-if-necessary. But the use of a primitive feels nicer than the previous DIY job!
jsharkey13 requested a review from mwtrew on November 21, 2024 at 10:56
mwtrew merged commit b915f20 into master on Nov 21, 2024
3 checks passed
mwtrew deleted the hotfix/improve-etl-threadsafety branch on November 21, 2024 at 13:22