Add NOAA benchmark #30
I left some comments but I'm glad we are getting a benchmark that has range fields.
"description": "Indexes the whole document corpus using Elasticsearch default settings. We only adjust the number of replicas as we benchmark a single node cluster and Rally will only start the benchmark if the cluster turns green and we want to ensure that we don't use the query cache. Document ids are unique so all index operations are append only. After that a couple of queries are run.",
"default": true,
"index-settings": {
"index.number_of_shards": 1,
👍
noaa/operations/default.json
Outdated
"ASN00003105",
"ASN00003100",
"ASN00004083"
]
can we have a simple term query like the disjunction has? Otherwise, if the performance of that query changes, it might not be obvious whether the change is related to the terms query or to the range.
if we want to benchmark both the point and the doc values query, it might also help to have one conjunction with a range that matches most documents and a term query that matches between 0.1 and 1% of the index, and another conjunction where the range matches 2x fewer documents than the term query.
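The two conjunctions suggested here could be sketched as follows. This is only an illustration: the field names `TRANGE` and `station.country_code`, the term value and the range bounds are hypothetical and are not taken from the track.

```python
import json


def conjunction(range_field, gte, lte, term_field, term_value):
    """Build a bool query in which a range clause and a term clause must both match."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {range_field: {"gte": gte, "lte": lte}}},
                    {"term": {term_field: term_value}},
                ]
            }
        }
    }


# Hypothetical field names and bounds, chosen only to show the two shapes.
# Case 1: a wide range that matches most documents, combined with a selective term.
wide_range = conjunction("TRANGE", -100, 100, "station.country_code", "AS")
# Case 2: a narrow range that matches fewer documents than the term does.
narrow_range = conjunction("TRANGE", 40, 45, "station.country_code", "AS")

print(json.dumps(wide_range, indent=2))
```

Which bounds actually produce the intended match ratios would have to be measured against the indexed corpus.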
> can we have a simple term query like the disjunction has?
I think you missed the range_query_range_field_in_conjunction_with_term_query query above this one?
> it might also help to have one conjunction with a range that matches most documents and a term query that matches between 0.1 and 1% of the index,
A weather station in this data set has at most 366 documents, which is 0.014% of the total number of documents. So I think the 0.1 to 1% case is covered.
What query could be used to match most of the docs without adding much overhead of its own that could interfere with the benchmark? A term range? match_all?
> A weather station in this data set has at most 366 documents, which is 0.014% of the total number of documents. So I think the 0.1 to 1% case is covered.
Arg, I made a mistake. A simple term query for a weather station matches 0.003% of the documents. The terms query matches 5856 documents, which is 0.05%. So what I'll do is increase the number of terms in the terms query to get to at least 0.1%.
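The percentages in this exchange can be rechecked with a few lines of arithmetic. The sketch below uses only numbers quoted in the thread (366 documents per station at most, 5856 documents for the terms query, 10914068 documents in total); the helper name is made up for illustration.

```python
import math

TOTAL_DOCS = 10_914_068  # document count quoted later in the thread
DOCS_PER_STATION = 366   # upper bound per weather station, from the thread


def selectivity_percent(matching_docs, total_docs=TOTAL_DOCS):
    """Share of the index matched by a query, in percent."""
    return 100.0 * matching_docs / total_docs


single_station = selectivity_percent(DOCS_PER_STATION)
terms_query = selectivity_percent(5856)

# Documents needed to reach 0.1% of the index, and the minimum number of
# stations that requires if every station contributed the full 366 documents.
needed_docs = math.ceil(TOTAL_DOCS / 1000)
needed_stations = math.ceil(needed_docs / DOCS_PER_STATION)

print(f"single station: {single_station:.4f}%")
print(f"terms query:    {terms_query:.4f}%")
print(f"docs for 0.1%:  {needed_docs}, stations: at least {needed_stations}")
```

Because 366 is an upper bound per station, the station count is a lower bound; stations with gaps in their records would push the required number of terms higher.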
I'm a bit worried that the overhead of merging postings of multiple terms will add noise. Maybe we could cross this dataset with stations (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt) in order to be able to index more metadata with all documents such as geo coordinates, state and elevation of the station. Then I believe we could find some states that have significant numbers of records?
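Crossing the measurements with the stations file might look roughly like the sketch below. It assumes the fixed-width layout described in the GHCN-Daily readme (ID in columns 1-11, latitude 13-20, longitude 22-30, elevation 32-37, state 39-40, name 42-71), which should be verified against the current file. The sample station, the measurement document and the output field names are hypothetical.

```python
def parse_station(line):
    """Parse one fixed-width station line (assumed GHCN-Daily column layout)."""
    return {
        "id": line[0:11].strip(),
        "latitude": float(line[12:20]),
        "longitude": float(line[21:30]),
        "elevation": float(line[31:37]),
        "state": line[38:40].strip() or None,
        "name": line[41:71].strip(),
    }


def enrich(measurement, stations):
    """Return a copy of the measurement with station metadata attached, if known."""
    station = stations.get(measurement["station_id"])
    if station is None:
        return dict(measurement)
    enriched = dict(measurement)
    enriched["station"] = station
    return enriched


# Hypothetical sample line, padded to the assumed column widths.
SAMPLE = "{:<11} {:>8} {:>9} {:>6} {:<2} {:<30}".format(
    "XX000000001", "-12.3456", "123.4567", "12.0", "NT", "EXAMPLE STATION"
)

stations = {s["id"]: s for s in [parse_station(SAMPLE)]}
doc = enrich({"station_id": "XX000000001", "date": "2016-01-01", "TMAX": 31.2}, stations)
print(doc)
```

With the state attached to every document, a term query on the state could then replace the multi-term station query.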
👍 I'll add more metadata to the documents.
++ @martijnvg I can update my python script to do this if you want?
@colings86 Thanks, that would be great. Note that for creating this track I made some modifications to your script, mainly because the data needs to be converted to a JSON file. This is what I have now: https://gist.github.com/martijnvg/72a3711cb26fd84f196e9a1c4a41d038
{
"short-description": "Daily weather measurement summaries from around the globe.",
"description": "Indexes 10M+ weather measurement summaries from NOAA.",
maybe document where the data was retrieved?
I added a link in the README, I think that is sufficient?
oh right, I missed it!
totally
The comment says 10M+ weather measurements but it's actually only 2.5M.
@danielmitterdorfer The doc count is actually 10914068, so I'll just update it to that. I would have expected Rally to fail with an error, because the document-count in track.json was incorrect.
Oh Rally does not count the documents again but I may add this feature. I've just raised elastic/rally#296.
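Until such a check exists in Rally, the count could be verified by hand before publishing a track. The sketch below assumes the corpus is a newline-delimited JSON file with one document per line; the function names are made up for illustration and are not part of Rally.

```python
import os
import tempfile


def count_documents(path):
    """Count non-empty lines, assuming one JSON document per line."""
    with open(path, "rb") as corpus:
        return sum(1 for line in corpus if line.strip())


def verify_document_count(path, expected):
    """Raise if the corpus does not hold the number of documents the track declares."""
    actual = count_documents(path)
    if actual != expected:
        raise ValueError(f"expected {expected} documents but found {actual} in {path}")
    return actual


# Small self-contained demonstration with a throwaway two-document corpus.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"station": "A"}\n{"station": "B"}\n')
try:
    print(verify_document_count(tmp.name, expected=2))
finally:
    os.remove(tmp.name)
```

For the real corpus, `expected` would be the document-count value declared in track.json.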
Thanks for contributing the track! I left a few minor comments.
noaa/README.txt
Outdated
Dataset containing daily weather measurement from NOAA:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/

The dataset has been processed by: https://gist.github.com/colings86/078e85a1131324471f4f10c73570d678
Can you just compress the zip file from the gist and dump it here so it is self-contained? Also, the gist contains instructions, especially:
Sort files using something like sort --field-separator=',' --key=1,2 -o ~/Downloads/2017-sorted.csv ~/Downloads/2017.csv
And I think you should document how you sorted the files.
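For documentation purposes, the effect of that sort command can be approximated in Python as below. This is a sketch rather than an exact equivalent: GNU sort compares the key according to the current locale, while this version compares the first two fields as plain strings. The sample rows are hypothetical and assume the station id in the first column and the date in the second.

```python
import csv


def sort_by_station_and_date(lines):
    """Sort CSV rows by their first two fields (assumed: station id, then date)."""
    rows = list(csv.reader(lines))
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


# Hypothetical rows in the assumed by_year layout: id, date, element, value.
sample = [
    "XX000000002,20170102,TMAX,250",
    "XX000000001,20170103,TMIN,101",
    "XX000000001,20170101,TMAX,312",
]

for row in sort_by_station_and_date(sample):
    print(",".join(row))
```

Sorting this way keeps all measurements of one station on one day adjacent, which is presumably what the processing script relies on when it merges them into a single document.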
noaa/challenges/default.json
Outdated
{
"operation": "index",
"#COMMENT": "This is an incredibly short warmup time period but it is necessary to get also measurement samples. As this benchmark is rather about search than indexing this is ok.",
"warmup-time-period": 10,
Is this short warmup time period warranted here? I think this is only necessary for percolator (where indexing throughput is not interesting anyway). Ideally we'd have at least 240 seconds here.
noaa/challenges/default.json
Outdated
{
"operation": "index",
"#COMMENT": "This is an incredibly short warmup time period but it is necessary to get also measurement samples. As this benchmark is rather about search than indexing this is ok.",
"warmup-time-period": 10,
Same as above for the warmup time period. If possible this should be at least 240 seconds.
noaa/challenges/default.json
Outdated
"clients": 8
}
]
}
Nit: Missing new line
}
}
}
}
Nit: Missing new line
I've updated the PR.
LGTM
@martijnvg could you map the station code as a keyword so that it does not get the text/keyword dual mapping? In general, I think it'd be better to map all fields explicitly and disable dynamic mappings.
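An explicit mapping along those lines might look like the sketch below. The field names (`station.id`, `date`, `TRANGE`) are hypothetical and do not come from the track; `"dynamic": "strict"` makes Elasticsearch reject documents with unmapped fields, while `"dynamic": false` would silently leave them unindexed instead.

```python
import json

# Hypothetical field names; only the mapping techniques are the point here.
mapping = {
    "dynamic": "strict",  # no dynamically added fields
    "properties": {
        "station": {
            "properties": {
                # keyword only, so no text/keyword dual mapping is created
                "id": {"type": "keyword"}
            }
        },
        "date": {"type": "date"},
        "TRANGE": {"type": "float_range"},
    },
}

print(json.dumps(mapping, indent=2))
```

Mapping every field up front also keeps the benchmark stable if the source data later gains extra columns.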
This now benchmarks range fields specifically, but it can also be used to benchmark other numeric query/agg operations.