Make better use of available cpu on larger VMs #57

Merged on Nov 18, 2024 (2 commits)

istreeter
Collaborator

These changes allow the loader to make better use of all available CPU on a larger instance.

1. CPU-intensive parsing/transforming is now parallelized. Parallelism is configured by a new config parameter `cpuParallelismFraction`. The actual parallelism is chosen dynamically based on the number of available CPUs, so the default value should be appropriate for VMs of all sizes.

2. We now open a new Snowflake ingest client per channel. Note that the Snowflake SDK recommends reusing a single Client per VM and opening multiple Channels on that Client, so here we are going against the recommendation. We justify it because it gives the loader better visibility of when the client's Future completes, signifying a complete write to Snowflake.

3. Upload parallelism is now chosen dynamically. Larger VMs benefit from higher upload parallelism, in order to keep up with the faster rate of batches produced by the CPU-intensive tasks. Parallelism is configured by a new parameter `uploadParallelismFactor`, which gets multiplied by the number of available CPUs. The default value should be appropriate for VMs of all sizes.

These new settings have been tested on pods ranging from 0.6 to 8 available CPUs.
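
For readers who want a feel for how a parallelism value can be derived from the CPU count, here is a minimal sketch. It is not taken from the loader's code: the helper name `chooseParallelism`, the round-up behaviour, the floor of 1, and the example values `0.75` and `2.5` are all assumptions made for illustration. It also assumes `cpuParallelismFraction` is applied the same way the description says `uploadParallelismFactor` is, i.e. multiplied by the number of available CPUs.

```scala
import scala.math.BigDecimal.RoundingMode

object ParallelismSketch {

  /** Scale a configured factor by the CPU count.
    * Assumption: round up, and never return less than 1.
    */
  def chooseParallelism(availableCpus: Int, factor: BigDecimal): Int =
    (BigDecimal(availableCpus) * factor)
      .setScale(0, RoundingMode.UP)
      .toInt
      .max(1)

  def main(args: Array[String]): Unit = {
    // The JVM reports the CPU count as a whole number
    val cpus = Runtime.getRuntime.availableProcessors()

    // Hypothetical settings, NOT the loader's real defaults
    val cpuParallelism    = chooseParallelism(cpus, BigDecimal("0.75"))
    val uploadParallelism = chooseParallelism(cpus, BigDecimal("2.5"))

    println(s"cpus=$cpus cpuParallelism=$cpuParallelism uploadParallelism=$uploadParallelism")
  }
}
```

With these made-up values, an 8-CPU VM would get a CPU parallelism of 6 and an upload parallelism of 20, while a 1-CPU pod would still get at least 1 of each.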

@benjben left a comment

Looks great 👍

# -- name to use for the snowflake channel.
# -- Prefix to use for the snowflake channels.
# -- The full name will be suffixed with a number, e.g. `snowplow-1`
# -- The prefix be unique per loader VM

Suggested change
# -- The prefix be unique per loader VM
# -- The prefix must be unique per loader VM

maybe ?

@@ -75,10 +77,17 @@
# - Events are emitted to Snowflake for a maximum of this duration, even if the `maxBytes` size has not been reached
"maxDelay": "1 second"

# - How many batches can we send simultaneously over the network to Snowflake.
"uploadConcurrency": 1
# - Controls ow many batches can we send simultaneously over the network to Snowflake.

Suggested change
# - Controls ow many batches can we send simultaneously over the network to Snowflake.
# - Controls how many batches can we send simultaneously over the network to Snowflake.

@@ -152,8 +144,8 @@ object Processing {
}

/** Parse raw bytes into Event using analytics sdk */

Suggested change
/** Parse raw bytes into Event using analytics sdk */

@istreeter istreeter merged commit c462da4 into develop Nov 18, 2024
2 checks passed
@istreeter istreeter deleted the cpu-parallelism branch November 18, 2024 16:55
istreeter added a commit that referenced this pull request Nov 26, 2024
istreeter added a commit to snowplow-incubator/common-streams that referenced this pull request Jan 17, 2025
In snowplow-incubator/snowflake-loader#57 we added code to the snowflake
loader so it checkpoints once every 10 seconds instead of once per
batch. This meant we could decrease the write-throughput requirements of
the DynamoDB table.

This commit moves the logic over here into common-streams so that all
loaders get the benefit of this improvement.
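
The idea of checkpointing on a timer instead of once per batch can be sketched roughly as below. This is an illustration only, not the code from either repository: the class `ThrottledCheckpointer`, its methods, and the choice to keep only the most recent token are hypothetical, and the sketch assumes that checkpointing the latest token also covers all earlier batches.

```scala
import scala.concurrent.duration._

/** Hypothetical helper: remembers the latest acknowledged token and
  * releases it for checkpointing at most once per `interval`.
  */
final class ThrottledCheckpointer[T](interval: FiniteDuration, startMillis: Long) {
  private var lastCheckpointMillis: Long = startMillis
  private var pending: Option[T]         = None

  /** Record the latest token; return it only if the interval has elapsed. */
  def offer(token: T, nowMillis: Long): Option[T] = {
    pending = Some(token)
    if (nowMillis - lastCheckpointMillis >= interval.toMillis) flush(nowMillis)
    else None
  }

  /** Release whatever is pending, e.g. on shutdown, so progress is not lost. */
  def flush(nowMillis: Long): Option[T] = {
    val out = pending
    pending = None
    if (out.isDefined) lastCheckpointMillis = nowMillis
    out
  }
}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val cp = new ThrottledCheckpointer[String](10.seconds, startMillis = 0L)
    println(cp.offer("batch-1", 1000L))  // None: only 1s since start
    println(cp.offer("batch-2", 10000L)) // Some(batch-2): 10s elapsed
    println(cp.offer("batch-3", 12000L)) // None: only 2s since last checkpoint
    println(cp.flush(12500L))            // Some(batch-3): final flush
  }
}
```

In this sketch four batches produce only two checkpoint writes, which is the kind of reduction that would lower the write throughput needed on the checkpoint table.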
istreeter added a commit to snowplow-incubator/common-streams that referenced this pull request Jan 20, 2025