Fix problems in GeoIPv2 code #71598

Merged
merged 6 commits into from
Apr 13, 2021

probakowski
Contributor

This change fixes a number of problems in the GeoIPv2 code:

  • Closes the streams returned by Files.list in GeoIpCli, which should fix the tests on Windows.
  • Makes sure that the total download time in the GeoIP stats is non-negative. We serialize it as a vInt, which can cause problems with negative numbers, and a negative value can occur when the clock is changed during the operation.
  • Fixes the handling of failed and simultaneous downloads. "Use OpType.CREATE in GeoIpDownloader" (#69951) was meant to prevent two persistent tasks from indexing chunks at the same time, but it would also prevent any update if a single download failed mid-indexing. This change uses the timestamp (lastUpdate) as a sort of UUID. That should still prevent two tasks from stepping on each other's toes (overwriting chunks), and in the end only a single task should be able to update the task state (this is handled by the persistent tasks framework).
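
For readers who want to see the first two fixes in isolation, here is a minimal Java sketch. It is not the code from this PR: the class `GeoIpFixesSketch` and both method names are hypothetical, and clamping with `Math.max` is only one plausible way to keep the duration non-negative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; not taken from the Elasticsearch code base.
final class GeoIpFixesSketch {

    // Files.list returns a Stream that holds an open directory handle, so it
    // must be closed. try-with-resources closes it even if mapping throws.
    static List<String> listFileNames(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The difference of two wall-clock readings is negative if the clock was
    // moved back in between; clamp it so the result is never below zero.
    static long elapsedMillis(long startMillis, long endMillis) {
        return Math.max(0L, endMillis - startMillis);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("geoip-sketch");
        Files.createFile(dir.resolve("b.mmdb"));
        Files.createFile(dir.resolve("a.mmdb"));
        System.out.println(listFileNames(dir));
        // Simulates a clock that jumped back by 500 ms during the download.
        System.out.println(elapsedMillis(2_000L, 1_500L));
    }
}
```

An unclosed directory stream keeps the directory handle open, which is the kind of leak that tends to surface on Windows when a test later tries to delete the directory.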

Closes #71145

@probakowski probakowski added >non-issue :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 v7.13.0 labels Apr 12, 2021
@probakowski probakowski requested a review from martijnvg April 12, 2021 21:49
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Apr 12, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@probakowski
Contributor Author

@elasticmachine update branch

Member

@martijnvg martijnvg left a comment

I left two small comments. Otherwise LGTM

@@ -219,6 +219,9 @@ private static XContentBuilder mappings() {
                .startObject("chunk")
                    .field("type", "integer")
                .endObject()
                .startObject("timestamp")
                    .field("type", "long")
Member

Maybe use type date? It still accepts time in milliseconds since the epoch and treats the values as dates.

Contributor Author

As the timestamp is now part of the id, it doesn't need to be indexed separately. I've removed it from the mapping.

MessageDigest md = MessageDigests.md5();
for (byte[] buf = getChunk(is); buf.length != 0; buf = getChunk(is)) {
    md.update(buf);
    client.prepareIndex(DATABASES_INDEX).setId(name + "_" + chunk)
Member

Maybe keep the _id but with timestamp? That way the _id has meaning and if due to some issue we index a document with the same _id then we fail with an error (b/c create=true).

Contributor Author

I switched back to using _id with added timestamp as you suggested
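
The idea discussed in this thread, an _id that includes the timestamp combined with create-only writes, can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: `ChunkStoreSketch` is a hypothetical stand-in for the index, the `name_chunk_timestamp` id layout is a guess at the format, and no Elasticsearch client API is used.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for an index with create-only semantics.
// All names are hypothetical; this is not the Elasticsearch client API.
final class ChunkStoreSketch {
    private final ConcurrentMap<String, byte[]> docs = new ConcurrentHashMap<>();

    // Assumed id layout: database name, chunk number, download timestamp.
    static String chunkId(String name, int chunk, long timestamp) {
        return name + "_" + chunk + "_" + timestamp;
    }

    // Create-only write: rejects the document if the id is already taken,
    // so one download attempt can never overwrite another attempt's chunk.
    void create(String id, byte[] data) {
        if (docs.putIfAbsent(id, data) != null) {
            throw new IllegalStateException("document already exists: " + id);
        }
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ChunkStoreSketch store = new ChunkStoreSketch();
        // A download that failed after indexing chunk 0 leaves this behind.
        store.create(chunkId("GeoLite2-City.mmdb", 0, 1_000L), new byte[] {1});
        // The retry carries a new timestamp, so its id does not collide.
        store.create(chunkId("GeoLite2-City.mmdb", 0, 2_000L), new byte[] {2});
        System.out.println(store.size());
    }
}
```

With ids of the form `name_chunk` only, the retry in `main` would be rejected by the leftover chunk from the failed attempt, which matches the failure mode described in the PR description.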

@probakowski probakowski merged commit 46efa6a into elastic:master Apr 13, 2021
@probakowski probakowski deleted the geoip-bugs branch April 13, 2021 15:10
probakowski added a commit to probakowski/elasticsearch that referenced this pull request Apr 13, 2021
# Conflicts:
#	modules/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloader.java
probakowski added a commit that referenced this pull request Apr 13, 2021
* Fix problems in GeoIPv2 code (#71598)

# Conflicts:
#	modules/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloader.java

* fix compilation
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >non-issue Team:Data Management Meta label for data/management team v7.13.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] GeoIpCliTests classMethod failing on Windows
4 participants