Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlstats: unskip tests hitting combinedstmts and stmts endpoints #109381

Merged
merged 1 commit into from
Aug 29, 2023

Conversation

gtr
Copy link
Contributor

@gtr gtr commented Aug 23, 2023

Fixes: #109184.

Previously, tests which hit the combinedstmts and statements
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@gtr gtr requested a review from a team August 24, 2023 13:12
@gtr gtr force-pushed the unskip-sql-stats-tests branch 4 times, most recently from 33c7250 to 096560e Compare August 24, 2023 18:39
@j82w
Copy link
Contributor

j82w commented Aug 25, 2023

pkg/server/application_api/sql_stats_test.go line 513 at r1 (raw file):

	// Increase the timeout for the http client if under stress.
	additionalTimeout := 0 * time.Second

On the test that is failing can you add a custom message to the require to output the full respStatement? I think it will help understand what is causing the failure, and will be helpful in the future if the test fails.

@gtr gtr force-pushed the unskip-sql-stats-tests branch 2 times, most recently from 4427642 to 1eddec9 Compare August 25, 2023 15:34
Copy link
Contributor Author

@gtr gtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w)


pkg/server/application_api/sql_stats_test.go line 513 at r1 (raw file):

Previously, j82w (Jake) wrote…

On the test that is failing can you add a custom message to the require to output the full respStatement? I think it will help understand what is causing the failure, and will be helpful in the future if the test fails.

Done. Full response is printed.

@gtr gtr force-pushed the unskip-sql-stats-tests branch 8 times, most recently from 580abd7 to 96106c0 Compare August 28, 2023 13:12
@j82w
Copy link
Contributor

j82w commented Aug 28, 2023

pkg/server/combined_statement_stats.go line 801 at r3 (raw file):

		distSQLCountStr := string(*row[4].(*tree.DString))
		distSQLCountInt, err := strconv.ParseInt(distSQLCountStr, 10, 64)

You could cast the value in the query COALESCE(merged_metadata ->> 'distSQLCount'::INT,0) AS distSQLCount, to avoid the parse logic and if the row doesn't have any value.

array_agg(distinct transaction_fingerprint_id) AS txn_fingerprints,
app_name,
max(aggregated_ts) AS aggregated_ts,
crdb_internal.merge_stats_metadata(array_agg(metadata)) AS merged_metadata,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this query is for the activity table, it needs to use crdb_internal.merge_aggregated_stmt_metadata because the metadata column in the activity table is already aggregated from the full stmts system table.

@gtr gtr force-pushed the unskip-sql-stats-tests branch from 96106c0 to fda4a9e Compare August 28, 2023 13:37
Copy link
Contributor Author

@gtr gtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)


pkg/server/combined_statement_stats.go line 698 at r3 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

If this query is for the activity table, it needs to use crdb_internal.merge_aggregated_stmt_metadata because the metadata column in the activity table is already aggregated from the full stmts system table.

Good catch! I forgot to change the function call for this table when re-writing the query.


pkg/server/combined_statement_stats.go line 801 at r3 (raw file):

Previously, j82w (Jake) wrote…

You could cast the value in the query COALESCE(merged_metadata ->> 'distSQLCount'::INT,0) AS distSQLCount, to avoid the parse logic and if the row doesn't have any value.

Makes sense, done.

@j82w
Copy link
Contributor

j82w commented Aug 28, 2023

pkg/server/combined_statement_stats.go line 654 at r4 (raw file):

	const expectedNumDatums = 11
	const queryFormat = `
WITH Aggregated AS (

What is the reason for converting this use a CTE? Does it impact the query plan?

@gtr gtr force-pushed the unskip-sql-stats-tests branch from fda4a9e to c5fac81 Compare August 28, 2023 15:06
Copy link
Contributor Author

@gtr gtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumped additional timeout to 30s.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)


pkg/server/combined_statement_stats.go line 654 at r4 (raw file):

Previously, j82w (Jake) wrote…

What is the reason for converting this use a CTE? Does it impact the query plan?

I used the CTE for readability. AFAIK, using a CTE doesn't impact the query plan since its result isn't stored but rather integrated into the main query. I've revised the query, eliminating the CTE and opting for a subquery, and made some minor formatting adjustments for consistency.

@gtr gtr force-pushed the unskip-sql-stats-tests branch from c5fac81 to b62647f Compare August 28, 2023 18:43
Copy link
Contributor Author

@gtr gtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes:

  • Increased timeout for TestStatusAPIStatements to 30s when under stress.
  • Improved clarity by renaming variables and helper functions.
  • Addressed transient 40001 error by filtering for statements associated with the test app name that also executed successfully (since statements which failed and then succeeded have different fingerprints). We also check that each executed statement is found in the response and if it is not, it's an error.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)


pkg/server/application_api/sql_stats_test.go line 642 at r5 (raw file):

			actualFailed := respStatement.Key.KeyData.Failed

			stmtJSONString := responseToJSON(respStatement)

I noticed that I'm currently marshalling to JSON (an expensive operation) irrespective of whether the expected data matches the actual data. Ideally, we should only marshal when there's a discrepancy between the expected and actual data. Outside of using a verbose if expectedData.fullScan != actualFullScan || expectedData.FullScan != ..., I'm uncertain about alternative approaches. I believe the current method should be fine, but I thought it was worth highlighting.

@gtr gtr force-pushed the unskip-sql-stats-tests branch from b62647f to 92b4279 Compare August 28, 2023 19:06
Copy link
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @gtr and @xinhaoz)


pkg/server/combined_statement_stats.go line 798 at r5 (raw file):

		query := string(tree.MustBeDString(row[7]))
		querySummary := string(tree.MustBeDString(row[8]))
		databases := string(tree.MustBeDString(row[9]))

Is this still formatted correctly? Not sure how a JSON string array is converted to a a string.


pkg/server/application_api/sql_stats_test.go line 624 at r5 (raw file):

		// statement statistics received from the server response.
		actualResponseStatsMap := make(map[string]serverpb.StatementsResponse_CollectedStatementStatistics)
    for _, respStatement := range resp.Statements {

nit: fix indents


pkg/server/application_api/sql_stats_test.go line 626 at r5 (raw file):

    for _, respStatement := range resp.Statements {
			// Only consider statements that were successfully executed by the test app.
			if respStatement.Key.KeyData.App == testAppName && !respStatement.Key.KeyData.Failed {

Please add a note of why the success check is necessary. That way if someone in the future is looking at the test they can see automatic retries are the reason.


pkg/server/application_api/sql_stats_test.go line 633 at r5 (raw file):

		for respQuery, expectedData := range expectedStatementStatsMap {
			respStatement, exists := actualResponseStatsMap[respQuery]
			if !exists {

nit: require.True


pkg/server/application_api/sql_stats_test.go line 642 at r5 (raw file):

Previously, gtr (Gerardo Torres Castro) wrote…

I noticed that I'm currently marshalling to JSON (an expensive operation) irrespective of whether the expected data matches the actual data. Ideally, we should only marshal when there's a discrepancy between the expected and actual data. Outside of using a verbose if expectedData.fullScan != actualFullScan || expectedData.FullScan != ..., I'm uncertain about alternative approaches. I believe the current method should be fine, but I thought it was worth highlighting.

This is a test so deserializing a single JSON object shouldn't be an issue. If you wanted to avoid it you could use a func to make it lazy.

getStmtJsonString := func(){
    return responseToJSON(respStatement)
}

@j82w
Copy link
Contributor

j82w commented Aug 28, 2023

pkg/server/application_api/sql_stats_test.go line 642 at r5 (raw file):

Previously, j82w (Jake) wrote…

This is a test so deserializing a single JSON object shouldn't be an issue. If you wanted to avoid it you could use a func to make it lazy.

getStmtJsonString := func(){
    return responseToJSON(respStatement)
}

Ignore my func suggestion. The require doesn't take a func, so it will just get evaluated for each call instead of just once.

@gtr gtr force-pushed the unskip-sql-stats-tests branch from 92b4279 to cbc28cd Compare August 28, 2023 22:22
Copy link
Contributor Author

@gtr gtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)


pkg/server/combined_statement_stats.go line 798 at r5 (raw file):

Previously, j82w (Jake) wrote…

Is this still formatted correctly? Not sure how a JSON string array is converted to a a string.

Fixed. Previously this was resulting in a value like ["db1", "db2"] now it should just be db1,db2 which is what was returned before.

I was wondering if it was possible to do this string array to string conversion in the SQL query above so we could skip the JSON unmarshalling but I wasn't able to find a reliable query. Using -> we get a JSON datum so I don't think we could have avoided that but let me know what you think.


pkg/server/application_api/sql_stats_test.go line 624 at r5 (raw file):

Previously, j82w (Jake) wrote…

nit: fix indents

Done.


pkg/server/application_api/sql_stats_test.go line 626 at r5 (raw file):

Previously, j82w (Jake) wrote…

Please add a note of why the success check is necessary. That way if someone in the future is looking at the test they can see automatic retries are the reason.

Done.


pkg/server/application_api/sql_stats_test.go line 633 at r5 (raw file):

Previously, j82w (Jake) wrote…

nit: require.True

Done.


pkg/server/application_api/sql_stats_test.go line 642 at r5 (raw file):

Previously, j82w (Jake) wrote…

Ignore my func suggestion. The require doesn't take a func, so it will just get evaluated for each call instead of just once.

Ah yes I also wanted to use the lazy func ☹️. But SGTM.

@gtr gtr marked this pull request as ready for review August 28, 2023 22:25
@gtr gtr requested review from a team and ericharmeling and removed request for a team August 28, 2023 22:25
@gtr gtr force-pushed the unskip-sql-stats-tests branch from cbc28cd to 3088301 Compare August 28, 2023 23:07
Copy link
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @gtr, and @xinhaoz)


-- commits line 12 at r7:
This fixes a bug customers have reported so there should be a release note.


pkg/server/application_api/sql_stats_test.go line 510 at r7 (raw file):

	additionalTimeout := 0 * time.Second
	if skip.Stress() {
		additionalTimeout = 30 * time.Second

Can there be a const for the stress timeout value? I noticed some tests seem to be using 30 and others 45 seconds.

@gtr gtr force-pushed the unskip-sql-stats-tests branch from 3088301 to ab025e7 Compare August 29, 2023 13:47
Copy link
Contributor Author

@gtr gtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @j82w, and @xinhaoz)


-- commits line 12 at r7:

Previously, j82w (Jake) wrote…

This fixes a bug customers have reported so there should be a release note.

Done.


pkg/server/combined_statement_stats.go line 798 at r5 (raw file):

Previously, gtr (Gerardo Torres Castro) wrote…

Fixed. Previously this was resulting in a value like ["db1", "db2"] now it should just be db1,db2 which is what was returned before.

I was wondering if it was possible to do this string array to string conversion in the SQL query above so we could skip the JSON unmarshalling but I wasn't able to find a reliable query. Using -> we get a JSON datum so I don't think we could have avoided that but let me know what you think.

Following up on this: the following query worked for me:

defaultdb> SELECT string_agg(elem::text, ',') AS databases
           FROM json_array_elements_text(('{"db": ["mo\"vr","tb2"], "db2": ["movr","tb2"]}'::jsonb)->'db') AS elem;

  databases
-------------
  mo"vr,tb2
(1 row)

The queries have been updated.


pkg/server/application_api/sql_stats_test.go line 510 at r7 (raw file):

Previously, j82w (Jake) wrote…

Can there be a const for the stress timeout value? I noticed some tests seem to be using 30 and others 45 seconds.

Done. Also for that 45 second timeout, the test for which I did that times out under stress it seems to be very expensive so I skipped it.

Copy link
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @gtr, and @xinhaoz)


-- commits line 12 at r7:

Previously, gtr (Gerardo Torres Castro) wrote…

Done.

Looking at the PR it doesn't actually fix anything for users so the release note should be none like you originally had it. Can you revert it?


pkg/server/application_api/sql_stats_test.go line 692 at r8 (raw file):

	// Resource-intensive test, times out under stress.
	skip.UnderStressRace(t, "expensive tests")

Why is this still skipped for stress race?

Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 30s.

Release note: None
@gtr gtr force-pushed the unskip-sql-stats-tests branch from ab025e7 to 27ed761 Compare August 29, 2023 16:02
Copy link
Contributor Author

@gtr gtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @j82w, and @xinhaoz)


-- commits line 12 at r7:

Previously, j82w (Jake) wrote…

Looking at the PR it doesn't actually fix anything for users so the release note should be none like you originally had it. Can you revert it?

Done.


pkg/server/application_api/sql_stats_test.go line 692 at r8 (raw file):

Previously, j82w (Jake) wrote…

Why is this still skipped for stress race?

This specific test continues to time out regardless of any timeout extensions for the http client. A related test, TestStatusAPIStatementDetails, has also been skipped and interestingly, it queries the stmtdetails endpoint 10 times. This test queries the combinedStmts endpoint 9 times. It seems the sheer number of endpoint hits may contribute to the timeouts, as other non-skipped tests only access it between 1-4 times.

I'm open to increasing the timeout for this test specifically to 45s but let me know what you think.

Copy link
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ericharmeling and @xinhaoz)

@gtr
Copy link
Contributor Author

gtr commented Aug 29, 2023

TFTR!
bors r+

@gtr gtr added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 labels Aug 29, 2023
@craig
Copy link
Contributor

craig bot commented Aug 29, 2023

Build succeeded:

@craig craig bot merged commit 72636bd into cockroachdb:master Aug 29, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 29, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 27ed761 to blathers/backport-release-22.2-109381: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from 27ed761 to blathers/backport-release-23.1-109381: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 27ed761 to blathers/backport-release-23.1.9-rc-109381: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.9-rc failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql stats: extend timeout for tests for combinedstmts endpoints
4 participants