Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1369280 Close channels asynchronously #841

Merged

Conversation

sfc-gh-lshcharbaty
Copy link
Contributor

@sfc-gh-lshcharbaty sfc-gh-lshcharbaty commented May 8, 2024

Overview

SNOW-1369280
By closing channels in parallel, speeds up task rebalancing. As in #839, SFException is ignored and doesn't fail a connector task.

Pre-review checklist

  • This change should be part of a Behavior Change Release. See go/behavior-change.
  • This change has passed Merge gate tests
  • Snowpipe Changes
  • Snowpipe Streaming Changes
  • This change is TEST-ONLY
  • This change is README/Javadocs only
  • This change is protected by a config parameter snowflake.streaming.closeChannelsInParallel.enabled.
    • Yes - Added end to end and Unit Tests.
    • No - Suggest why it is not param protected
  • Is his change protected by parameter <PARAMETER_NAME> on the server side?
    • The parameter/feature is not yet active in production (partial rollout or PrPr, see Changes for Unreleased Features and Fixes).
    • If there is an issue, it can be safely mitigated by turning the parameter off. This is also verified by a test (See go/ppp).

@sfc-gh-lshcharbaty sfc-gh-lshcharbaty force-pushed the lshcharbaty/SNOW-1369280_Close_channels_in_parallel branch from 37db638 to b10f038 Compare May 8, 2024 16:11
Comment on lines +128 to +131
// Whether to close streaming channels in parallel.
public static final String SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL =
"snowflake.streaming.closeChannelsInParallel.enabled";
public static final boolean SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL_DEFAULT = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't feel like a configuration should be exposed to customer. As a customer, I don't know whether I should close the channels in parallel or not. In fact, I don't really care, I just want my system to run without any failure. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed but I think we want to try this out for customers who has a lot of partitions in their single task. so enabling by default will not benefit a lot of customers

  • also its good to have param protection.
  • We should stress test this and slow roll out in couple of releases.

Copy link
Collaborator

@sfc-gh-xhuang sfc-gh-xhuang May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the risk of having it default true and then getting rid of the parameter in a future release?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-tzhang
This is not a publicly announced parameter. It's going to behave as a knob that's going to allow tho customer to easily roll the change back in case something goes wrong. Exactly what @sfc-gh-japatel wrote.

@sfc-gh-xhuang
Imo it's better to be safe than sorry. I'm not sure how the parallelization is going to work on the customer environments as channel closing is assigned to a shared ForkJoinPool that seems to exist as a single instance and shared across all tasks on the same host.

I suggest we approach the "change rollout" parameters the following way:

  1. Protect the change with a parameter and disable it by default. If some customers experience problems that can be solved by the change, they can turn it on.
  2. Enable the change by default. If some customers experience problems that cause by the change, they can turn it off and communicate to us.
  3. Get rid of the older implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood but I was just thinking that there is not much difference/value between steps 1 and 2 but yes we can be a bit safer this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this still be a behavior change after we finish step 3? This configuration is gone and customer's KC won't be able to start, then they will need to get the logs to understand why. Assuming they know how to look into the log, then they realize that the configuration is being removed and they will probably need to reach out to Snowflake to understand why it's being removed?

It seems to me a better way is to do a beta release with the fix so customer can try it (or just provide a private jar)? Other customers won't use this version since it's a beta version.

Copy link
Contributor

@sfc-gh-akowalczyk sfc-gh-akowalczyk May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-tzhang I don't see any schema validation for configuration properties. We will just stop respecting the snowflake.streaming.closeChannelsInParallel.enabled config key, but no errors should be observed. So, I suppose there will be no behavioral change afterwards.

Or do I miss anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, then that sounds better, thanks for confirming.

Copy link
Collaborator

@sfc-gh-japatel sfc-gh-japatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added couple of comments but overall LGTM!

Base automatically changed from lshcharbaty/SNOW-1369271_Ignore_SFException to master May 9, 2024 09:13
@sfc-gh-lshcharbaty sfc-gh-lshcharbaty force-pushed the lshcharbaty/SNOW-1369280_Close_channels_in_parallel branch from b10f038 to 48922a6 Compare May 9, 2024 09:15
Copy link
Contributor

@sfc-gh-dseweryn sfc-gh-dseweryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +128 to +131
// Whether to close streaming channels in parallel.
public static final String SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL =
"snowflake.streaming.closeChannelsInParallel.enabled";
public static final boolean SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL_DEFAULT = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this still be a behavior change after we finish step 3? This configuration is gone and customer's KC won't be able to start, then they will need to get the logs to understand why. Assuming they know how to look into the log, then they realize that the configuration is being removed and they will probably need to reach out to Snowflake to understand why it's being removed?

It seems to me a better way is to do a beta release with the fix so customer can try it (or just provide a private jar)? Other customers won't use this version since it's a beta version.


StreamingClientProvider.getStreamingClientProviderInstance()
.closeClient(this.connectorConfig, this.streamingIngestClient);
private void closeAllInParallel() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about more I'm not sure this approach would help a lot. If you look into the close function in the SDK, it will flush everything in the client and then call get status on this particular channel. If we're doing it in parallel, the same number of flush will be called, every flush will still be run in serial and block other flushes. I thought most of time is spent on waiting for server side to commit, which will stay the same since everything will be flush at once. Have you verified locally that it reduces the close time dramatically with a large number of channels?

Copy link
Contributor

@sfc-gh-akowalczyk sfc-gh-akowalczyk May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sfc-gh-tzhang, as Lex is OOO, I prepared a short test for that. I am not sure whether the IT is sufficient to give meaningful information, but I ran such a test closing 50 channels. It shows the parallelism speeded the execution up.

13513.7 ms of mean sequential execution time vs 1508.0 of parallel
image

But the scenario does not cover the case where we have rows present in net.snowflake.ingest.streaming.internal.SnowflakeStreamingIngestChannelInternal buffer (the scenario described by you) - I cannot reproduce it even though I played with the buffering parameters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, do you mean that closing 50 empty channels in serial takes 13.5s? That doesn't sound right, any idea where it spends most of the time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log level was set to DEBUG so it added a few additional seconds to the measured time. I changed it to INFO and attached the profiler. It takes ~11 seconds and spends most of its time on CompletableFuture.get().
image

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved to unblock the PR, but I think this change is a risky change since I'm not sure if something weird would happen if we close a few hundreds of channels in parallel, since every close will spin up a background flush thread. Also another concern is calling get status in parallel would potentially overload DS node. I think longer term, the better solution is to add a close function in the SDK which doesn't call flush or get status so it will be very fast. Could we create a JIRA to track this? Thanks!

Comment on lines +128 to +131
// Whether to close streaming channels in parallel.
public static final String SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL =
"snowflake.streaming.closeChannelsInParallel.enabled";
public static final boolean SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL_DEFAULT = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, then that sounds better, thanks for confirming.


StreamingClientProvider.getStreamingClientProviderInstance()
.closeClient(this.connectorConfig, this.streamingIngestClient);
private void closeAllInParallel() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, do you mean that closing 50 empty channels in serial takes 13.5s? That doesn't sound right, any idea where it spends most of the time?

@ebuildy
Copy link

ebuildy commented May 16, 2024

As this is for better performance, is it a way to monitor it please?

This could help debugging crash/rebalance issues.

Thanks you, cant wait to test it!

})
.toArray(CompletableFuture[]::new);

CompletableFuture.allOf(futures).join();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my small comprehension, CompletableFuture uses the ForkJoinPool.commonPool() as its default thread pool with (Runtime.getRuntime().availableProcessors() - 1) threads. We can cheat by setting JVM cpu count but I rather suggest to create a thread pools like this:

import java.util.concurrent.*;

private void closeAllInParallel() {
    ExecutorService executor = Executors.newFixedThreadPool(CONFIG_THREAD_POOL_SIZE);
    CompletableFuture<?>[] futures =
        partitionsToChannel.entrySet().stream()
            .map(
                entry -> {
                  String channelKey = entry.getKey();
                  TopicPartitionChannel topicPartitionChannel = entry.getValue();

                  LOGGER.info("Closing partition channel:{}", channelKey);
                  return CompletableFuture.runAsync(topicPartitionChannel::closeChannelAsync, executor);
                })
            .toArray(CompletableFuture[]::new);

    CompletableFuture.allOf(futures).join();
    executor.shutdown(); // Don't forget to shutdown the executor
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @ebuildy, for providing an example. Let me merge the PR in its current form, I'll check with the SDK team on a possible built-in solution. If the timeline is not optimistic, I'll propose this change to the rest of the team.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fantastic! Any plan to do the same for open channel operations?

@ebuildy
Copy link

ebuildy commented May 16, 2024

Approved to unblock the PR, but I think this change is a risky change since I'm not sure if something weird would happen if we close a few hundreds of channels in parallel, since every close will spin up a background flush thread. Also another concern is calling get status in parallel would potentially overload DS node. I think longer term, the better solution is to add a close function in the SDK which doesn't call flush or get status so it will be very fast. Could we create a JIRA to track this? Thanks!

The default thread pool used by CompletableFuture according doc at https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/CompletableFuture.html

All async methods without an explicit Executor argument are performed using the [ForkJoinPool.commonPool()](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/ForkJoinPool.html#commonPool())

Would be better to create a static thread pool

@sfc-gh-akowalczyk sfc-gh-akowalczyk merged commit 22ccebd into master May 20, 2024
81 checks passed
@sfc-gh-akowalczyk sfc-gh-akowalczyk deleted the lshcharbaty/SNOW-1369280_Close_channels_in_parallel branch May 20, 2024 12:38
@sfc-gh-akowalczyk
Copy link
Contributor

Approved to unblock the PR, but I think this change is a risky change since I'm not sure if something weird would happen if we close a few hundreds of channels in parallel, since every close will spin up a background flush thread. Also another concern is calling get status in parallel would potentially overload DS node. I think longer term, the better solution is to add a close function in the SDK which doesn't call flush or get status so it will be very fast. Could we create a JIRA to track this? Thanks!

Thanks, created SNOW-1437461

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants