This repository has been archived by the owner on Jul 23, 2024. It is now read-only.

HAWQ-1660. Optimize parquet scan when bloom filter enabled. #1397

Open. kuien wants to merge 2 commits into base: master

Conversation

kuien (Contributor) commented Sep 20, 2018

No description provided.

linwen commented Sep 21, 2018

This is a good optimization. When many columns are projected, we can fetch only the join key first and run the Bloom filter check; if the key does not match, there is no need to fetch the other columns.

However, in this PR, when the Bloom filter is not enabled, the scan still fetches the join key in a first loop and the other columns in a second loop. This needs some further refinement.
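
The idea discussed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's code: `ToyBloom`, `read_column`, and `scan_row` are hypothetical names, the row is a plain array rather than Parquet column data, and passing a NULL filter stands in for the Bloom filter being disabled. The sketch reads the join key first, probes the filter, and reads the remaining columns only on a possible match. With no filter it falls back to a single pass over all columns, which is one way the suggested refinement could look.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_COLS     4      /* hypothetical number of projected columns */
#define JOIN_KEY_COL 0      /* hypothetical position of the join key */
#define FILTER_BITS  1024

/* Toy Bloom filter with two probes; a stand-in for the real filter. */
typedef struct {
    uint8_t bits[FILTER_BITS / 8];
} ToyBloom;

static uint32_t hash1(uint32_t k) { return (k * 2654435761u) % FILTER_BITS; }
static uint32_t hash2(uint32_t k) { return ((k ^ (k >> 15)) * 40503u + 17u) % FILTER_BITS; }

static void bloom_add(ToyBloom *bf, uint32_t key) {
    uint32_t a = hash1(key), b = hash2(key);
    bf->bits[a / 8] |= (uint8_t)(1u << (a % 8));
    bf->bits[b / 8] |= (uint8_t)(1u << (b % 8));
}

/* May return a false positive, never a false negative. */
static bool bloom_may_contain(const ToyBloom *bf, uint32_t key) {
    uint32_t a = hash1(key), b = hash2(key);
    return (bf->bits[a / 8] & (1u << (a % 8))) &&
           (bf->bits[b / 8] & (1u << (b % 8)));
}

/* Counts column reads so the saving is observable. */
static int column_reads = 0;

static uint32_t read_column(const uint32_t row[NUM_COLS], int col) {
    column_reads++;
    return row[col];
}

/* Returns true if the row should be passed on to the join. */
static bool scan_row(const uint32_t row[NUM_COLS], const ToyBloom *bf,
                     uint32_t out[NUM_COLS]) {
    assert(row != NULL && out != NULL);

    if (bf == NULL) {
        /* Filter disabled: one pass over every column, no second loop. */
        for (int c = 0; c < NUM_COLS; c++)
            out[c] = read_column(row, c);
        return true;
    }

    /* Filter enabled: read only the join key and probe the filter. */
    out[JOIN_KEY_COL] = read_column(row, JOIN_KEY_COL);
    if (!bloom_may_contain(bf, out[JOIN_KEY_COL]))
        return false;   /* definitely no match: skip the other columns */

    /* Possible match: now read the remaining columns. */
    for (int c = 0; c < NUM_COLS; c++)
        if (c != JOIN_KEY_COL)
            out[c] = read_column(row, c);
    return true;
}
```

In this sketch a rejected row costs one column read instead of `NUM_COLS`, while the disabled path keeps its original single loop.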

}

/* skip those attributes not in given list */
if (attsList != NIL && list_find_int(attsList, i) >= 0)
kuien (Contributor, Author) commented:

Typo: the comparison should be < 0.
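
To make the intended condition concrete, here is a small sketch of the corrected skip test. It assumes, as the comment above implies, that the lookup returns a non-negative position when the value is present and a negative value when it is absent. `find_int`, `should_skip`, and `count_fetched` are hypothetical stand-ins written for this illustration; they are not HAWQ functions, and a NULL array plays the role of `NIL`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the list lookup: index of value, or -1. */
static int find_int(const int *items, size_t n, int value) {
    for (size_t i = 0; i < n; i++)
        if (items[i] == value)
            return (int)i;
    return -1;
}

/*
 * Skip an attribute only when a list was given AND the attribute is
 * absent from it. With ">= 0" the test would instead skip exactly the
 * attributes that were requested.
 */
static bool should_skip(const int *atts, size_t n, int attno) {
    return atts != NULL && find_int(atts, n, attno) < 0;
}

/* Counts how many of natts attributes would be fetched. */
static int count_fetched(const int *atts, size_t n, int natts) {
    int fetched = 0;
    for (int i = 0; i < natts; i++) {
        if (should_skip(atts, n, i))
            continue;   /* attribute not in the given list */
        fetched++;
    }
    return fetched;
}
```

With a list of `{0, 2}` and five attributes, only attributes 0 and 2 are fetched; with no list, all five are.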

interma (Member) commented Oct 2, 2018

@kuien I ran a performance test on your PR and found two issues:

  1. Incorrect query result
  2. Performance regression

See the details below. Please check the code, thanks.

TPC-H 1 GB data on my Mac, master code:

tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
  6088
(1 row)

Time: 3150.873 ms
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 2.903 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
  6088
(1 row)

Time: 1512.782 ms

Your code (this PR):

tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
  6088
(1 row)

Time: 49466.999 ms #<-- result ok, but bad performance
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 13.106 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
     0 #<-- result error
(1 row)

Time: 1888.176 ms 
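
For a quick comparison, the timings above can be turned into ratios. The sketch below only does arithmetic on the four single-run numbers quoted in this comment; the constant and function names are made up for illustration, and single runs on a laptop are of course noisy.

```c
#include <stdio.h>

/* Timings in milliseconds, copied from the psql output above. */
static const double MASTER_FILTER_OFF_MS = 3150.873;
static const double MASTER_FILTER_ON_MS  = 1512.782;
static const double PR_FILTER_OFF_MS     = 49466.999;
static const double PR_FILTER_ON_MS      = 1888.176;

/* How many times larger the first timing is than the second. */
static double ratio(double numerator_ms, double denominator_ms) {
    return numerator_ms / denominator_ms;
}

static void print_summary(void) {
    /* Master: gain from turning the Bloom filter on. */
    printf("master, off vs on:      %.2fx\n",
           ratio(MASTER_FILTER_OFF_MS, MASTER_FILTER_ON_MS));
    /* PR vs master with the filter off. */
    printf("filter off, PR vs master: %.2fx\n",
           ratio(PR_FILTER_OFF_MS, MASTER_FILTER_OFF_MS));
    /* PR vs master with the filter on (PR result is wrong here). */
    printf("filter on, PR vs master:  %.2fx\n",
           ratio(PR_FILTER_ON_MS, MASTER_FILTER_ON_MS));
}
```

On these numbers the Bloom filter makes master roughly 2x faster, while the PR is roughly 15.7x slower than master with the filter off and about 1.25x slower with it on.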

interma (Member) left a comment:

Please fix these issues.

interma (Member) commented Oct 2, 2018

@kuien
By the way, if you also test on a Mac, you can generate the TPC-H data with my dbgen tools:
https://github.com/interma/misc/tree/master/hawq/tpch_mac
