Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor: reduce memory usage and GC overhead for hash join. #2957

Merged
merged 3 commits into from
Mar 30, 2017

Conversation

coocood
Copy link
Member

@coocood coocood commented Mar 29, 2017

Implemented and use MVMap to reduce GC overhead and memory usage for hash join.

Implemented and use MVMap to reduce GC overhead and memory usage for hash join.
@coocood
Copy link
Member Author

coocood commented Mar 29, 2017

benchmark result will be added later.

e.hashTable[string(hashcode)] = []*Row{row}
} else {
e.hashTable[string(hashcode)] = append(rows, row)
buffer = buffer[:0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reuse the buffer memory.

executor/join.go Outdated
if err != nil {
return nil, errors.Trace(err)
}
return b, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return b, errors.Trace(err)

executor/join.go Outdated

func (e *HashJoinExec) encodeRowKey(b []byte, rowKey *RowKeyEntry) []byte {
b = codec.EncodeVarint(b, rowKey.Handle)
for i, tn := range e.tableNames {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every row's table name is the same. Need we encode it?

m.hashFunc = fnv.New64()
m.entryStore.slices = [][]entry{make([]entry, 0, 64)}
// append first empty entry so zero entry pointer an represent null.
m.entryStore.put(entry{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put nullEntryAddr

func (ds *dataStore) put(key, value []byte) dataAddr {
dataLen := uint32(len(key) + len(value))
if ds.sliceLen != 0 && ds.sliceLen+dataLen > maxDataSliceLen {
ds.slices = append(ds.slices, make([]byte, 0, maxDataSliceLen))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cap should be max(maxDataSliceLen, dataLen) ?

hashKey := m.hash(key)
entryAddr := m.hashTable[hashKey]
for entryAddr != nullEntryAddr {
he := m.entryStore.get(entryAddr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why name he? It makes me confused.

for entryAddr != nullEntryAddr {
he := m.entryStore.get(entryAddr)
entryAddr = he.next
k, v := m.dataStore.get(he)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we implement compare and get ? If key is different with k, don't fetch the v.

// MVMap stores multiple value for a given key with minimum GC overhead.
// A given key can store multiple values.
// It is not thread-safe, should only be used in one goroutine.
type MVMap struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct is complex, what's the benefit versus this?

type KeyType []byte
type ValueType [][]byte
map[KeyType]ValueType

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the key type and value type in the map don't have pointer, Go garbage collector simply skip the map.

if err != nil {
return nil, errors.Trace(err)
}
row.Data = values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused with Row struct, it has a Data and a RowKeys fields, what's the difference?

@tiancaiamao
Copy link
Contributor

Does the encode/decode overhead cover the GC cost after change Row to []byte ?

@coocood
Copy link
Member Author

coocood commented Mar 30, 2017

Tested join two tables with 1 million rows, the TiDB process memory has increased to 4GB, a little bit smaller than master, this probably due to the GC is slower than memory allocation.

./benchdb -run="create|insert:0_1000000"
./benchdb -table="benchdb2" -run="create|insert:0_1000000"
select * from benchdb, benchdb2 where benchdb.id = benchdb2.id and benchdb.exp + benchdb2.exp < 0;

The performance is about 10% faster than master when hash table size is 1 million, but slower for smaller hash table size.

I'll try to implement lazy decode on hash join, and see the difference.

@hanfei1991
Copy link
Member

LGTM

@hanfei1991 hanfei1991 added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 30, 2017
@coocood
Copy link
Member Author

coocood commented Mar 30, 2017

After remove the extra memory allocation.
The benchmark result is:

memory:

master: 3.3GB
encode-row: 3.0GB

time:

master: 7.6s            
encode-row: 6.8s

@tiancaiamao
Copy link
Contributor

LGTM

@coocood coocood merged commit a168f41 into master Mar 30, 2017
@coocood coocood deleted the coocood/hashjoin-encode-row branch March 30, 2017 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/LGT1 Indicates that a PR has LGTM 1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants