-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data set mismatch: empty queries results and reproducibility issues #11
Comments
@Bouncner - Happy to incorporate any changes to make this better/easier. |
Hey, I also met this problem in Postgres! Firstly, thanks for your great work! I run JOB in PostgreSQL 15 and found queries I am also curious about if this is done by intention. If not, will you try to replace such queries with empty results? |
@HennyNile : are you using the static data set that is listed in the initial paper? Also, take a look at this pull request. We contacted Viktor Leis some time ago and he confirmed that there are empty results and they are on purpose. But I think |
@Bouncner For convenience, I just used a dumped dataset in Harvard website which even has fewer rows than the initial dataset you mentioned. Maybe I should change to use the initial dataset. There is no problem
In PG, it is
So, I think the empty results in Postgres are not caused by this problem. For 2c, there are only 2 filter predicates and I found both predicates only fetch one row. This may lead to an empty result. There are 5 empty results in my experiments. In my opinion, the empty results of my experiments are intentional or caused by the wrong dumped dataset. When I test the initial dataset, I will share the result with you. |
Viktor Leis wrote to us "If I remember correctly, there are around 10 queries with empty results sets.". He strongly suggested to use only the data set used in the paper. At least for us, that removed a number of empty results. |
I see. I will use the initial dataset used in the paper in the later work. Many thanks! |
Sorry, that took a while. :( |
We execute the join order benchmark as one of our "default" benchmarks in Hyrise. We recently found that several queries yield empty results and wondered if this might be a problem with Hyrise, the data set, or if it is part of the JOB by intention.
One example is query 32a, which selects
k.keyword ='10,000-mile-club'
. For our data set (we generated the data using the "frozen data set"), there is no such keyword but rather10000-mile-club
(see also here). Interestingly, this keyword with the comma exists in the CWI data set.As a query's performance can thus differ vastly between the two data sets due to the empty results, we would consider it a bug in either the CWI data set or the query's SQL. Should this be fixed by adjusting the dataset generation from frozen data or by adjusting the query's SQL by removing the , from 10,000...?
For other queries, we found that inner joins following several filters lead to empty result sets (e.g., query 5a). We are again curious if this is done by intention? As far as we can tell, this is the case for both data sets (CWI and frozen).
Pinging @Bensk1 here as well as he did the data generation test.
The text was updated successfully, but these errors were encountered: