Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading data using the executeSelect API is slow #2764

Closed
o-shevchenko opened this issue Nov 11, 2024 · 18 comments · Fixed by googleapis/java-bigquery#3625 or googleapis/java-bigquery#3624
Assignees
Labels
api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API.

Comments

@o-shevchenko
Copy link

o-shevchenko commented Nov 11, 2024

We use executeSelect API to run SQL query and read results from BigQuery. We expected a good speed based on this article

Reading data using executeSelectAPI is extremely slow.
Reading of 100_000 rows takes 23930 ms.
The profiling showed no prominent places where we spent most of the time.

Are there any recent changes that might cause performance degradation for such an API?
Do you have a benchmark to understand what performance we should expect?
Thanks!

Environment details

  1. com.google.cloud:google-cloud-bigquery:2.43.3
  2. Mac OS Sonoma M1
  3. Java version: 17

Code example

Mono.fromCallable { bigQueryOptionsBuilder.build().service }
            .flatMap { context ->
                val connectionSettings = ConnectionSettings.newBuilder()
                    .setRequestTimeout(10L)
                    .setUseReadAPI(true)
                    .setMaxResults(1000)
                    .setNumBufferedRows(1000)
                    .setUseQueryCache(true)
                    .build();
                val connection = context.createConnection(connectionSettings)
                val bqResult = connection.executeSelect(sql)
                val result = Flux.usingWhen(
                    Mono.just(bqResult.resultSet),
                    { resultSet -> resultSet.toFlux(bqResult.schema) },
                    { _ -> Mono.fromRunnable<Unit> { connection.close() } }
                )
                Mono.just(Data(result, bqResult.schema.toSchema()))
            }
            ...
            
fun ResultSet.toFlux(schema:Schema): Flux<DataRecord> {
    return Flux.generate<DataRecord> { sink ->
        if (next()) {
            sink.next(toDataRecord(schema))
        } else {
            sink.complete()
        }
    }
}
@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. label Nov 11, 2024
@o-shevchenko o-shevchenko changed the title Read data via executeSelect API is slow Read data with executeSelect API is slow Nov 11, 2024
@o-shevchenko o-shevchenko changed the title Read data with executeSelect API is slow Reading data using the executeSelect API is slow Nov 11, 2024
@o-shevchenko
Copy link
Author

o-shevchenko commented Nov 14, 2024

I've created a simplified test to show performance:

@Test
    fun `test read`() {
        val sql =
            """
            SELECT *
            FROM `pr`
            """.trimIndent().replace("\n", " ")
        val connectionSettings = ConnectionSettings.newBuilder()
            .setRequestTimeout(300)
            .setUseReadAPI(true)
            .setMaxResults(5000)
            .setUseQueryCache(true)
            .build()
        val connection = bigQueryOptionsBuilder.build().service.createConnection(connectionSettings)
        val bqResult = connection.executeSelect(sql)
        val resultSet = bqResult.resultSet

        var n = 1
        var lastTime = Instant.now()
        while (++n < 1_000_000 && resultSet.next()) {
            if (n % 30_000 == 0) {
                val now = Instant.now()
                val duration = Duration.between(lastTime, now)
                println("ROW $n Time: ${duration.toMillis()} ms ${DateTimeFormatter.ISO_INSTANT.format(now)}")
                lastTime = now
            }
        }
   }     
ROW 30000 Time: 5516 ms 2024-11-14T12:35:54.354169Z
ROW 60000 Time: 11230 ms 2024-11-14T12:36:05.585005Z
ROW 90000 Time: 5645 ms 2024-11-14T12:36:11.230378Z
ROW 120000 Time: 5331 ms 2024-11-14T12:36:16.561915Z
ROW 150000 Time: 5458 ms 2024-11-14T12:36:22.019994Z
ROW 180000 Time: 5391 ms 2024-11-14T12:36:27.411807Z

~5sec to read 30000 rows

@o-shevchenko
Copy link
Author

Related issue with benchmark: googleapis/java-bigquery#3574

@o-shevchenko
Copy link
Author

o-shevchenko commented Nov 15, 2024

After fixing the test I've got the following results.

Benchmark                                            (rowLimit)  Mode  Cnt       Score       Error  Units
ConnImplBenchmark.iterateRecordsUsingReadAPI             500000  avgt    3   76549.893 ± 14496.839  ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI            1000000  avgt    3  154957.127 ± 25916.110  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query       500000  avgt    3   82508.807 ± 17930.275  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query      1000000  avgt    3  165717.219 ± 86960.648  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI      500000  avgt    3   84504.175 ± 36823.590  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI     1000000  avgt    3  165142.367 ± 99899.991  ms/op

That's not what we expected after reading the doc: https://cloud.google.com/blog/topics/developers-practitioners/introducing-executeselect-client-library-method-and-how-use-it/
image

Comparison with Chart Estimates
From the chart:
1,000,000 rows:
Read Storage API: The speed on the chart is ~50,000 rows/sec, but I have 6,453 rows/sec.
tabledata.list API: Estimated at 5,000 rows/sec and I've got similar result 5,917 rows/sec.

Is there anything I missed?

@o-shevchenko
Copy link
Author

@Neenu1995 Could you please help address this issue?
Thanks!

@o-shevchenko
Copy link
Author

@leahecole, do you have any ideas on how to proceed here?
Thanks!

@PhongChuong PhongChuong self-assigned this Dec 2, 2024
@PhongChuong
Copy link
Contributor

Thanks for fixing the benchmark tests. The result is definitely unexpected. Did you see better results for the Read API in a previous version of com.google.cloud:google-cloud-bigquery?

AFAIK, the changes to executeSelect/Connection in the com.google.cloud:google-cloud-bigquery repo should not have caused any performance degradation.

@yirutang , do you know if there are any recent changes/configuration changes that might cause the performance slowdown?

@o-shevchenko
Copy link
Author

Thanks for the reply @PhongChuong !
We are trying to build a fast and reliable integration with BigQuery to read and write data using 2.44.0, so I can't say when performance degradation started.
I've just tested 2.40.0 and 2.30.0 and see the same result in terms of speed.
Let me know if I need to test any other version.

@o-shevchenko
Copy link
Author

o-shevchenko commented Dec 2, 2024

I've noticed that the first run usually is much faster.
Are there any BigQuery configurations or limits we have to configure?
I've checked all docs that I was able to find and don't see anything else besides enabling BigQuery Storage Read API
https://cloud.google.com/blog/topics/developers-practitioners/introducing-executeselect-client-library-method-and-how-use-it/
https://cloud.google.com/bigquery/docs/reference/storage
I don't see any limits or quotas that were exceeded either.
Do you see the same result when you run the perf test?

@o-shevchenko
Copy link
Author

@PhongChuong Do you have any insights on that?
Were you able to reproduce such a performance problem?

@yirutang
Copy link
Contributor

yirutang commented Dec 9, 2024

@kmjung

@kmjung
Copy link
Contributor

kmjung commented Dec 9, 2024

@o-shevchenko do you have the query job IDs and read session IDs involved in these benchmark runs? @PhongChuong do you know if it's possible to have the client log these?

Also, what is the benchmark configuration that you're using here? Where (e.g in what GCP region or multi-region) is your BigQuery data, where are you running the benchmark, etc?

@o-shevchenko
Copy link
Author

Hi @kmjung
I use this BQ benchmark with public dataset bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017.
I fixed the test in java-bigquery repo.
It always reproduces. You can just run it as is to see if you have the same read speed.
I just use my creds in GOOGLE_APPLICATION_CREDENTIALS=path/to/service_account.json
googleapis/java-bigquery#3574

Let me know if you need any other details

@kmjung
Copy link
Contributor

kmjung commented Dec 9, 2024

Where (e.g in what GCP region or multi-region) is your BigQuery data, where are you running the benchmark, etc?

Can you provide this information also? e.g. are you using a GCE VM in the same region as your BQ data? Running in AWS?

@o-shevchenko
Copy link
Author

image

My service is deployed in AWS, in the same region. I don't think that network latency is involved that much. The read is very slow.

@kmjung
Copy link
Contributor

kmjung commented Dec 9, 2024

I am less convinced that this is not related to AWS <--> GCE network latency. I will reach out offline to collect more data from you.

@o-shevchenko
Copy link
Author

Thanks @kmjung . I've created a support case.
I also run benchmarks from my local machine where I directly read data from BQ, and it's 10x slower than the Snowflake JDBC connector for the same setup.
So, I assumed something was wrong with BQ API or our project configs.

@o-shevchenko
Copy link
Author

o-shevchenko commented Jan 7, 2025

Hi
I want to provide an update. BQ support team helped to figure out the problem.
SDK still uses the old API and just skips ReadAPI even if it's enabled.
This happens because of this check: https://github.com/googleapis/java-bigquery/blob/08c483cb726bb36951680687260dd2d0371a9dc0/google-cloud-bigquery/src/main/java/com/google/cloud/bigquery/ConnectionImpl.java#L1201

In order to use ReadAPI, we need to add any of these configurations, for example, setJobTimeoutMs.
After this, we will see Not Using Fast Query Path, using jobs.insert and Using BigQuery Read API instead of Using Fast Query Path.
Such an implicit behavior still should be fixed, but at least we have a workaround.

val connectionSettings = ConnectionSettings.newBuilder()
                    .setJobTimeoutMs(Long.MAX_VALUE) // workaround to force ReadAPI usage
                    .setRequestTimeout(DEFAULT_QUERY_TIMEOUT_MS)
                    .setUseReadAPI(true)
                    .setMaxResults(DEFAULT_MAX_RESULT)
                    .setUseQueryCache(true)
                    .build()
                val connection = context.createConnection(connectionSettings)
                val bqResult = connection.executeSelect(code)

After enabling ReadAPI we've got 6-10x speed improvement:

Benchmark                                     (rowLimit)  Mode  Cnt       Score   Error  Units
ConnImplBenchmark.iterateRecordsUsingReadAPI      500000  avgt        16439.884          ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI     1000000  avgt        16005.065          ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI    10000000  avgt        74471.796          ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI    50000000  avgt       484732.516          ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI   100000000  avgt        48608.101          ms/op

Thank you all for the help! I appreciate your help, @kmjung, with testing and escalating this issue.
I hope it will help improve the library and provide better performance to other clients.

@o-shevchenko
Copy link
Author

I've opened PR to enable ReadAPI for the benchmark
googleapis/java-bigquery#3625

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API.
Projects
None yet
4 participants