
importccl: Remove pre-buffering stage from direct ingest IMPORT #39271

Merged: 1 commit merged into cockroachdb:master on Aug 4, 2019

Conversation

adityamaru27
Contributor

This change removes the pre-buffering step in the direct ingest
IMPORT code path. Previously, we would create separate buckets
for each table's primary data and, when a bucket filled up, flush
it to the BulkAdder. Running a tpcc 1K import on 3 nodes OOM'ed
as a result of this buffering.

Two big wins we got from this pre-buffering stage were:

  1. We avoided worst-case overlapping behavior in the AddSSTable
    calls by flushing keys with the same TableIDIndexID prefix
    together.

  2. Secondary index KVs, which were few and filled the bucket
    infrequently, were flushed only a few times, resulting in fewer
    L0 (and total) files.

To resolve this OOM, we decided to take advantage of the split
keys we insert across AllIndexSpans of each table during IMPORT.
Since the BulkAdder is split-aware and does not allow SSTables to
span splits, we already achieve the non-overlapping property we
strive for (as mentioned above). The downside is that we lose the
second win, as the KVs fed to the BulkAdder are now ungrouped.
This results in a larger number of smaller SSTs being flushed,
causing a spike in L0 and total file counts, but lower overall
memory usage.

This change also ENABLES the `import/experimental-direct-ingestion`
roachtest.

TODO: Currently experimenting using two adders, one for primary
indexes and one for secondary indexes. This helps us achieve the
second win as well. Will have a follow up PR once the roachtest
stabilizes.

Release note: None
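
For readers unfamiliar with the code path, here is a minimal, hypothetical sketch of the difference between the two approaches. The `kv` and `bulkAdder` types and the function names below are illustrative stand-ins, not the actual `importccl` API.

```go
// Illustrative only: contrasts the removed per-TableIDIndexID pre-buffering
// with feeding a split-aware adder directly. Not the real importccl code.
package ingestsketch

type kv struct {
	key   []byte // TableID/IndexID-prefixed key
	value []byte
}

// bulkAdder stands in for the split-aware adder: it never builds an SST
// that spans one of the split keys inserted across each table's index spans.
type bulkAdder interface {
	Add(key, value []byte) error
	Flush() error
}

// ingestWithPreBuffering (the old approach) groups KVs by their
// TableIDIndexID prefix and only hands a bucket to the adder once it is
// full. SSTs come out well-grouped, but the buckets themselves hold a lot
// of memory; this is what OOM'ed the 3-node tpcc 1K import.
func ingestWithPreBuffering(kvs []kv, prefix func([]byte) string, bucketLimit int, adder bulkAdder) error {
	buckets := make(map[string][]kv)
	for _, e := range kvs {
		p := prefix(e.key)
		buckets[p] = append(buckets[p], e)
		if len(buckets[p]) >= bucketLimit {
			for _, b := range buckets[p] {
				if err := adder.Add(b.key, b.value); err != nil {
					return err
				}
			}
			buckets[p] = buckets[p][:0]
		}
	}
	// Drain whatever is left in partially filled buckets.
	for _, bucket := range buckets {
		for _, b := range bucket {
			if err := adder.Add(b.key, b.value); err != nil {
				return err
			}
		}
	}
	return adder.Flush()
}

// ingestDirect (this change) feeds every KV straight to the adder. Because
// the adder is split-aware, SSTs still never overlap across index spans;
// the cost is more, smaller SSTs (and more L0 files), but far less memory.
func ingestDirect(kvs []kv, adder bulkAdder) error {
	for _, e := range kvs {
		if err := adder.Add(e.key, e.value); err != nil {
			return err
		}
	}
	return adder.Flush()
}
```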

@adityamaru27 adityamaru27 requested review from dt and a team August 2, 2019 18:43
@cockroach-teamcity
Member

This change is Reviewable

@adityamaru27
Contributor Author

The `import/experimental-direct-ingestion` roachtest did not OOM following this change, and the impact on time taken was also negligible. There is an issue of under-replicated ranges when using direct ingest, but that is a parallel issue being tracked in #37014.

@dt
Member

dt commented Aug 2, 2019

Nice!

Can you include before/after stats for the tpcc import in the commit? Might be nice to have total time taken to import, peak mem usage, and maybe total compaction GB if you still have those handy.

Just locally, on a single node, I saw pretty comparable tpcc1k fixtures import timings with and without pre-buffering, but MBP I/O is pretty good compared to roachprod machines.

@adityamaru27
Contributor Author

> Nice!
>
> Can you include before/after stats for the tpcc import in the commit? Might be nice to have total time taken to import, peak mem usage, and maybe total compaction GB if you still have those handy.
>
> Just locally, on a single node, I saw pretty comparable tpcc1k fixtures import timings with and without pre-buffering, but MBP I/O is pretty good compared to roachprod machines.

Done.

@petermattis
Collaborator

> Just locally, on a single node, I saw pretty comparable tpcc1k fixtures import timings with and without pre-buffering, but MBP I/O is pretty good compared to roachprod machines.

Interestingly, MBP I/O is horrific in comparison to prod machines. MBP I/O appears faster, but it actually isn't if you want real durability. The fsync call on Darwin lies and doesn't actually sync to disk. Go 1.12 fixed this (see golang/go#26650), but we only use fsync via RocksDB, which is still using the very fast and very unsafe fsync, not fcntl(..., F_FULLFSYNC).

The take-away remains the same. We should test on prod machines and not rely on numbers from MBP laptops.
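
As an aside, for anyone who wants a truly durable flush on macOS, here is a minimal darwin-only Go sketch of the `F_FULLFSYNC` call mentioned above, using `golang.org/x/sys/unix`. Since Go 1.12, `(*os.File).Sync` does the equivalent internally on Darwin; the temp-file usage below is just an example.

```go
//go:build darwin

// Sketch: on macOS, fsync(2) does not guarantee data reaches stable
// storage; a durable flush requires fcntl(fd, F_FULLFSYNC). Go 1.12+
// issues F_FULLFSYNC from (*os.File).Sync, but code that calls fsync
// directly (e.g. RocksDB via C) does not get that behavior.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// fullSync asks the OS *and* the drive to flush their write caches.
func fullSync(f *os.File) error {
	_, err := unix.FcntlInt(f.Fd(), unix.F_FULLFSYNC, 0)
	return err
}

func main() {
	f, err := os.CreateTemp("", "durability-*")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString("hello\n"); err != nil {
		log.Fatal(err)
	}
	if err := fullSync(f); err != nil {
		log.Fatal(err)
	}
}
```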

This change removes the pre-buffering step in the direct ingest
IMPORT code path. Previously, we would create separate buckets
for each table's primary data and, when a bucket filled up, flush
it to the BulkAdder. Running a tpcc 1K import on 3 nodes OOM'ed
as a result of this buffering.

Two big wins we got from this pre-buffering stage were:
1. We avoided worst-case overlapping behavior in the AddSSTable
calls by flushing keys with the same TableIDIndexID prefix
together.

2. Secondary index KVs, which were few and filled the bucket
infrequently, were flushed only a few times, resulting in fewer
L0 (and total) files.

To resolve this OOM, we decided to take advantage of the split
keys we insert across AllIndexSpans of each table during IMPORT.
Since the BulkAdder is split-aware and does not allow SSTables to
span splits, we already achieve the non-overlapping property we
strive for (as mentioned above). The downside is that we lose the
second win, as the KVs fed to the BulkAdder are now ungrouped.
This results in a larger number of smaller SSTs being flushed,
causing a spike in L0 and total file counts, but lower overall
memory usage.

Some statistics for IMPORT tpcc 1k on a 1-node setup with and
without pre-buffering:

Without pre-buffering:
Time: 1h19m
Peak mem: 6.2 GiB
Cumulative compaction: 52.43 GB

With pre-buffering:
Time: 1h13m
Peak mem: 12.2 GiB
Cumulative compaction: 24.54 GiB

This change also ENABLES the `import/experimental-direct-ingestion`
roachtest.

TODO: Currently experimenting using two adders, one for primary
indexes and one for secondary indexes. This helps us achieve the
second win as well. Will have a follow up PR once the roachtest
stabilizes.

Release note: None
@adityamaru27
Contributor Author

TFTR
bors r+

@craig
Contributor

craig bot commented Aug 4, 2019

Build failed

@dt
Member

dt commented Aug 4, 2019

Looks like the same flake that I saw earlier.

bors r=dt

craig bot pushed a commit that referenced this pull request Aug 4, 2019
39271: importccl: Remove pre-buffering stage from direct ingest IMPORT r=dt a=adityamaru27


Co-authored-by: Aditya Maru <[email protected]>
@craig
Contributor

craig bot commented Aug 4, 2019

Build succeeded

@craig craig bot merged commit c689774 into cockroachdb:master Aug 4, 2019
@adityamaru27 adityamaru27 deleted the blazing-fast-import branch August 4, 2019 03:08
adityamaru27 added a commit to adityamaru27/cockroach that referenced this pull request Aug 7, 2019
This is another change to stabilize direct ingest import before
it is made the default.
As a consequence of cockroachdb#39271, the number of files (L0 and
total), along with the cumulative compaction size, increased
drastically. A consequence of no longer creating buckets of
TableIDIndexID before flushing is that the single bulk adder would
receive a mix of primary and secondary index entries. Since SSTs
cannot span across the splits we inserted between index spans, it
would create numerous small secondary index SSTs along with the
bigger primary index SSTs, and flush on reaching its limit (which
would be often).

By introducing two adders, one for ingesting primary index data
and the other for ingesting secondary index data, we regain the
ability to make fewer, bigger secondary index SSTs and to flush
less often. The peak mem is lower than what prebuffering used to
hit, while the number of files (L0 and total) and the cumulative
compaction size return to prebuffering levels.

Some stats below for a tpcc 1k import on a 1-node cluster.

With prebuffering:
Total files: 7670
L0 files: 1848
Cumulative compaction: 24.54 GiB

Without prebuffering, one adder:
Total files: 22420
L0 files: 16900
Cumulative compaction: 52.43 GiB

Without prebuffering, two adders:
Total files: 6805
L0 files: 1078
Cumulative compaction: 18.89 GiB

Release note: None
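
A simplified, hypothetical sketch of the two-adder routing this commit describes; the `kv` and `bulkAdder` stand-ins mirror the earlier sketch, and `isPrimaryIndexKey` is a made-up predicate rather than the actual importccl code.

```go
// Illustrative only: routes primary and secondary index KVs to separate
// adders so the (fewer) secondary index KVs accumulate into larger, less
// frequent flushes instead of being interleaved with primary index data.
package twoaddersketch

type kv struct{ key, value []byte }

type bulkAdder interface {
	Add(key, value []byte) error
	Flush() error
}

// keyClassifier stands in for decoding a key's TableID/IndexID prefix and
// reporting whether it belongs to the table's primary index.
type keyClassifier func(key []byte) bool

func ingestTwoAdders(kvs []kv, isPrimaryIndexKey keyClassifier, primary, secondary bulkAdder) error {
	for _, e := range kvs {
		adder := secondary
		if isPrimaryIndexKey(e.key) {
			adder = primary
		}
		if err := adder.Add(e.key, e.value); err != nil {
			return err
		}
	}
	if err := primary.Flush(); err != nil {
		return err
	}
	return secondary.Flush()
}
```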
craig bot pushed a commit that referenced this pull request Aug 8, 2019
39418: stats: truncate large datums when sampling for histogram r=rytaft a=rytaft

This commit adds logic to truncate long bit arrays, byte strings,
strings, and collated strings during sampling for histogram creation.
We do this to avoid using excessive memory or disk space during
sampling and storage of the final histogram.

Release note: None
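
A tiny, hypothetical Go sketch of the idea in that change: cap how many bytes of a sampled value are retained before it enters the histogram sample. The constant value and helper name are illustrative, not the actual sql/stats code.

```go
// Illustrative only: bound the size of sampled values so the sampler and
// the stored histogram don't balloon on very wide strings or byte arrays.
package samplesketch

// maxSampledValueSize is an assumed per-datum cap; the real limit used by
// CockroachDB may differ.
const maxSampledValueSize = 400

// truncateForSample returns the (possibly shortened) value to retain in
// the sample reservoir.
func truncateForSample(v []byte) []byte {
	if len(v) <= maxSampledValueSize {
		return v
	}
	truncated := make([]byte, maxSampledValueSize)
	copy(truncated, v)
	return truncated
}
```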

39424: importccl: Direct-ingest uses two bulk adders instead of one. r=adityamaru27 a=adityamaru27


Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>