Add NOAA benchmark #30

martijnvg · 2017-07-03T14:51:50Z

This now benchmarks range fields specifically, but it can also be used to benchmark other numeric query/agg operations.

jpountz

I left some comments but I'm glad we are getting a benchmark that has range fields.

jpountz · 2017-07-03T17:16:24Z

noaa/challenges/default.json

+      "description": "Indexes the whole document corpus using Elasticsearch default settings. We only adjust the number of replicas as we benchmark a single node cluster and Rally will only start the benchmark if the cluster turns green and we want to ensure that we don't use the query cache. Document ids are unique so all index operations are append only. After that a couple of queries are run.",
+      "default": true,
+      "index-settings": {
+        "index.number_of_shards": 1,


jpountz · 2017-07-03T17:18:57Z

noaa/operations/default.json

+                    "ASN00003105",
+                    "ASN00003100",
+                    "ASN00004083"
+                  ]


can we have a simple term query like the disjunction has? Otherwise, if there is a change in performance in that query, it might not be obvious whether it is related to the terms query or to the range.

if we want to benchmark both the point and the doc values query, it might also help to have one conjunction with a range that matches most documents and a term query that matches between 0.1 and 1% of the index, and another conjunction where the range matches 2x fewer documents than the range.
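The two conjunctions suggested above could look roughly like the following request bodies, written here as Python dictionaries. This is only a sketch: the field names `station_id` and `temperature_range` and the numeric bounds are hypothetical and not taken from the track. Only the `bool`, `term`, and `range` structure is standard Elasticsearch query DSL.

```python
def conjunction(term_field, term_value, range_field, gte, lte):
    """Build a bool query that requires both a term match and a range match."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {term_field: term_value}},
                    {"range": {range_field: {"gte": gte, "lte": lte}}},
                ]
            }
        }
    }


# Case 1: a wide range (intended to match most documents) combined with a
# selective term. Field names and bounds are hypothetical.
wide_range = conjunction("station_id", "ASN00003105", "temperature_range", -100, 100)

# Case 2: a narrow range (intended to match fewer documents) combined with
# the same term.
narrow_range = conjunction("station_id", "ASN00003105", "temperature_range", 20, 21)
```

Keeping the term clause identical in both bodies means that a difference in latency between the two can be attributed to the range clause.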

> can we have a simple term query like the disjunction has?

I think you missed the range_query_range_field_in_conjunction_with_term_query query above this one?

> it might also help to have one conjunction with a range that matches most documents and a term query that matches between 0.1 and 1% of the index,

A weather station in this data set has at most 366 documents, which is 0.014% of the total number of documents. So I think the 0.1 and 1% case is covered.

What query could be used for matching most of the docs, that on its own doesn't have a lot of overhead that could interfere with the benchmark? A term range? match_all?

> A weather station in this data set has at most 366 documents, which is 0.014% of the total number of documents. So I think the 0.1 and 1% case is covered.

Arg, I made a mistake. A simple term query for a weather station matches 0.003% of the documents. The terms query matches 5856 documents, which is 0.05%. So what I'll do is increase the number of terms in the terms query to get to at least 0.1%.
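The corrected percentages can be checked with a few lines of arithmetic. The sketch below uses the numbers quoted in this thread (366 documents per station at most, 5856 hits for the terms query, 10914068 documents in total) and assumes, for the last step, that every station added to the terms query contributes the full 366 documents.

```python
import math

TOTAL_DOCS = 10914068     # document count quoted later in this thread
DOCS_PER_STATION = 366    # upper bound per station quoted above
TERMS_QUERY_HITS = 5856   # hits reported for the current terms query


def percent_of_index(hits, total=TOTAL_DOCS):
    """Return the share of the index matched by `hits`, in percent."""
    return 100.0 * hits / total


single_station = percent_of_index(DOCS_PER_STATION)   # about 0.003%
terms_query = percent_of_index(TERMS_QUERY_HITS)      # about 0.05%

# Number of stations needed in the terms query to match at least 0.1% of the
# index, assuming each station contributes DOCS_PER_STATION documents.
target_hits = 0.001 * TOTAL_DOCS
stations_needed = math.ceil(target_hits / DOCS_PER_STATION)   # 30
```

Since 366 is an upper bound, 30 stations is a lower bound on the number of terms required; stations with fewer documents push it higher.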

I'm a bit worried that the overhead of merging postings of multiple terms will add noise. Maybe we could cross this dataset with stations (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt) in order to be able to index more metadata with all documents such as geo coordinates, state and elevation of the station. Then I believe we could find some states that have significant numbers of records?
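A minimal sketch of what crossing the measurements with the station list might look like. The fixed-width column offsets below follow my reading of the GHCN-Daily readme and should be verified against it before use. The `station_id` key on the measurement document and the shape of the added `station` object are hypothetical, and this is not the script that was used for the track.

```python
def parse_station_line(line):
    """Parse one fixed-width line of ghcnd-stations.txt into a dict.

    Assumed columns (1-based): id 1-11, latitude 13-20, longitude 22-30,
    elevation 32-37, state 39-40, name 42-71.
    """
    state = line[38:40].strip()
    return {
        "id": line[0:11].strip(),
        "latitude": float(line[12:20]),
        "longitude": float(line[21:30]),
        "elevation": float(line[31:37]),
        "state": state or None,  # blank for stations without a state code
        "name": line[41:71].strip(),
    }


def load_stations(lines):
    """Build a lookup table from station id to parsed station metadata."""
    stations = {}
    for line in lines:
        if line.strip():
            station = parse_station_line(line)
            stations[station["id"]] = station
    return stations


def enrich(doc, stations):
    """Return a copy of a measurement document with station metadata attached."""
    enriched = dict(doc)
    station = stations.get(doc["station_id"])  # hypothetical key name
    if station is not None:
        enriched["station"] = {
            "name": station["name"],
            "state": station["state"],
            "elevation": station["elevation"],
            "location": {"lat": station["latitude"], "lon": station["longitude"]},
        }
    return enriched
```

With a `state` field on every document, a single term query on a well-populated state could reach the desired selectivity without merging the postings of many station terms.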

👍 I'll add more metadata to the documents.

++ @martijnvg I can update my python script to do this if you want?

@colings86 Thanks, that would be great. Note that for creating this track I made some modifications to your script, mainly because the output needs to be converted to a JSON file. This is what I have now: https://gist.github.com/martijnvg/72a3711cb26fd84f196e9a1c4a41d038

jpountz · 2017-07-03T17:19:41Z

noaa/track.json

+
+{
+  "short-description": "Daily weather measurement summaries from around the globe.",
+  "description": "Indexes 10M+ weather measurement summaries from NOAA.",


maybe document where the data was retrieved?

I added a link in the README, I think that is sufficient?

oh right, I missed it!

The comment says 10M+ weather measurements but it's actually only 2.5M.

@danielmitterdorfer The doc count is actually 10914068, so I'll just update it to that. I would have expected Rally to fail with an error, because the document-count in track.json was incorrect.

Oh, Rally does not count the documents again, but I may add this feature. I've just raised elastic/rally#296.
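Until such a check exists in Rally, the count can be verified by hand before updating track.json. The sketch below is not Rally code. It assumes the corpus is a newline-delimited JSON file with one document per line, either plain or bz2-compressed, and the function names are made up for illustration.

```python
import bz2


def count_documents(path):
    """Count non-empty lines in a newline-delimited JSON corpus (plain or .bz2)."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_document_count(path, expected):
    """Raise ValueError if the corpus does not contain `expected` documents."""
    actual = count_documents(path)
    if actual != expected:
        raise ValueError(
            "expected {} documents but found {} in {}".format(expected, actual, path)
        )
    return actual
```

For this track the call would be something like `check_document_count(corpus_path, 10914068)`, where `corpus_path` points at the local copy of the document file.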

danielmitterdorfer

Thanks for contributing the track! I left a few minor comments.

danielmitterdorfer · 2017-07-04T12:48:57Z

noaa/README.txt

+Dataset containing daily weather measurement from NOAA:
+ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
+
+The dataset has been processed by: https://gist.github.com/colings86/078e85a1131324471f4f10c73570d678


Can you just compress the zip file from the gist and dump it here so it is self-contained? Also, the gist contains instructions, especially:

Sort files using something like sort --field-separator=',' --key=1,2 -o ~/Downloads/2017-sorted.csv ~/Downloads/2017.csv

And I think you should document how you sorted the files.
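For documentation purposes, the sort step could also be written down as a small Python function. This is an approximation of the `sort` command quoted above, not a drop-in replacement: GNU `sort` compares according to the current locale and breaks ties by comparing whole lines, whereas Python's `sorted` is stable and compares strings by code point. The function names are hypothetical.

```python
def sort_csv_lines(lines):
    """Sort CSV lines by their first two comma-separated fields."""
    def key(line):
        fields = line.rstrip("\n").split(",")
        second = fields[1] if len(fields) > 1 else ""
        return (fields[0], second)

    return sorted(lines, key=key)


def sort_csv_file(src, dst):
    """Read `src` fully into memory, sort it, and write the result to `dst`."""
    with open(src, "r", encoding="utf-8") as f:
        lines = f.readlines()
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(sort_csv_lines(lines))
```

Reading the whole file into memory is a simplification; for the full yearly files the external `sort` command is likely the more practical choice.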

danielmitterdorfer · 2017-07-04T12:51:09Z

noaa/challenges/default.json

+        {
+          "operation": "index",
+          "#COMMENT": "This is an incredibly short warmup time period but it is necessary to get also measurement samples. As this benchmark is rather about search than indexing this is ok.",
+          "warmup-time-period": 10,


Is this short warmup time period warranted here? I think this is only necessary for percolator (where indexing throughput is not interesting anyway). Ideally we'd have at least 240 seconds here.

danielmitterdorfer · 2017-07-04T12:52:10Z

noaa/challenges/default.json

+        {
+          "operation": "index",
+          "#COMMENT": "This is an incredibly short warmup time period but it is necessary to get also measurement samples. As this benchmark is rather about search than indexing this is ok.",
+          "warmup-time-period": 10,


Same as above for the warmup time period. If possible this should be at least 240 seconds.

danielmitterdorfer · 2017-07-04T12:53:54Z

noaa/track.json

+
+{
+  "short-description": "Daily weather measurement summaries from around the globe.",
+  "description": "Indexes 10M+ weather measurement summaries from NOAA.",


The comment says 10M+ weather measurements but it's actually only 2.5M.

danielmitterdorfer · 2017-07-04T12:54:16Z

noaa/challenges/default.json

+          "clients": 8
+        }
+      ]
+    }


Nit: Missing new line

danielmitterdorfer · 2017-07-04T12:54:27Z

noaa/operations/default.json

+          }
+        }
+      }
+    }


Nit: Missing new line

martijnvg · 2017-07-04T14:51:02Z

I've updated the PR.

danielmitterdorfer

LGTM

danielmitterdorfer · 2017-07-04T15:04:26Z

noaa/track.json

+
+{
+  "short-description": "Daily weather measurement summaries from around the globe.",
+  "description": "Indexes 10M+ weather measurement summaries from NOAA.",


Oh, Rally does not count the documents again, but I may add this feature. I've just raised elastic/rally#296.

jpountz · 2017-07-04T16:19:23Z

@martijnvg could you map the station code as a keyword so that it does not get the text/keyword dual mapping? In general, I think it'd be better to map all fields explicitly and disable dynamic mappings.
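As an illustration of the request above, an explicit mapping with dynamic mappings disabled might look like the following, written as a Python dictionary. The field names are hypothetical and not taken from the track. `keyword`, `date`, and `float_range` are existing Elasticsearch field types, and `"dynamic": "strict"` makes Elasticsearch reject documents that contain unmapped fields.

```python
# Hypothetical mapping body for one document type; field names are made up.
MAPPING = {
    "dynamic": "strict",
    "properties": {
        "station_id": {"type": "keyword"},  # no text/keyword dual mapping
        "date": {"type": "date"},
        "temperature_range": {"type": "float_range"},
    },
}


def unmapped_fields(doc, mapping=MAPPING):
    """Return the top-level fields of `doc` that the mapping does not declare."""
    return sorted(set(doc) - set(mapping["properties"]))
```

Running `unmapped_fields` over a sample of the corpus before indexing is one way to find fields that would be rejected under a strict mapping.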

martijnvg · 2017-07-04T18:03:31Z

@jpountz yes: 659d697

martijnvg added enhancement Review labels Jul 3, 2017

martijnvg requested review from colings86, jpountz and danielmitterdorfer July 3, 2017 14:51

jpountz approved these changes Jul 3, 2017

danielmitterdorfer requested changes Jul 4, 2017

danielmitterdorfer approved these changes Jul 4, 2017

Added noaa benchmark

bf5b21b

martijnvg force-pushed the noaa branch from fec7b18 to bf5b21b Compare July 4, 2017 15:37

martijnvg merged commit bf5b21b into elastic:master Jul 4, 2017

martijnvg mentioned this pull request Jul 5, 2017

Query range fields by doc values when they are expected to be more efficient than points elastic/elasticsearch#24823

Merged
