[DocDB] Running ANALYZE on a large table in PgRegressPartitions.java test fails with "Operation failed. Try again.: Transaction aborted" #10989
Labels: area/docdb (YugabyteDB core features)
Comments
deeps1991 added a commit that referenced this issue on Jan 19, 2022:
Summary: The test fails with the following error:

```
*** /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/expected/yb_pg_partition_aggregate.out  2021-12-12 14:47:10.730866113 +0000
--- /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/results/yb_pg_partition_aggregate.out   2021-12-12 14:47:10.603854965 +0000
***************
*** 710,715 ****
--- 710,716 ----
  ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO (30);
  INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 29999) i;
  ANALYZE pagg_tab_ml;
+ ERROR: Operation failed. Try again: Transaction aborted: b69768c7-0be1-435b-aa86-ce3989f27948
  -- For Parallel Append
  SET max_parallel_workers_per_gather TO 2;
  -- Full aggregation at level 1 as GROUP BY clause matches with PARTITION KEY
```

This is caused by the large number of rows (30,000) in pagg_tab_ml. The same failure occurs for the table pagg_tab_para, which also contains 30,000 rows. GHI #10989 tracks why the large number of rows leads to a transaction abort, but as part of this fix the number of rows in this table is reduced.

Further flakiness comes from the test timing out at 1800 seconds (irrespective of the value of getTestMethodTimeout()), which is the timeout for process_tree_supervisor set in common-test-env.sh. Since the Partitions test is now extremely long-running, this patch breaks the partitions schedule up into smaller, more manageable schedules. Running multiple schedules from the same PgRegressTest caused a conflict while creating the pg_regress output dir, so this patch creates a new subdirectory for the output dir using the schedule's name.

Test Plan: ybd --scb --sj --java-test org.yb.pgsql.TestPgRegressPartitions'

Reviewers: alex

Reviewed By: alex

Subscribers: kannan, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D14613
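To picture the schedule split described in this commit: pg_regress runs tests from serial-schedule files with one `test:` line per regress test. A minimal sketch of the split is below; the schedule names and the grouping of tests are illustrative assumptions, not the actual files from the patch (those are in D14613).

```
# Before: one long-running schedule driven by TestPgRegressPartitions.
# File: yb_pg_partitions_schedule (name illustrative)
test: yb_pg_partition_join
test: yb_pg_partition_prune
test: yb_pg_partition_aggregate

# After: several smaller schedules, each run as its own pg_regress invocation,
# each writing into an output subdirectory named after the schedule.
# File: yb_pg_partitions_1_schedule (name illustrative)
test: yb_pg_partition_join
# File: yb_pg_partitions_2_schedule (name illustrative)
test: yb_pg_partition_prune
test: yb_pg_partition_aggregate
```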
deeps1991 added a commit that referenced this issue on Feb 5, 2022:
…ggregate

Summary: identical to the summary of the Jan 19 commit above.

Original diff: https://phabricator.dev.yugabyte.com/D14613
Original commit: ccabdf7

Test Plan: ybd --scb --sj --java-test org.yb.pgsql.TestPgRegressPartitions'

Reviewers: alex, myang

Reviewed By: myang

Subscribers: myang, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D15257
deeps1991 added a commit that referenced this issue on Mar 16, 2022:
…gregate

Summary: identical to the summary of the Jan 19 commit above.

Original diff: https://phabricator.dev.yugabyte.com/D14613
Original commit: ccabdf7

Test Plan:
Jenkins: rebase: 2.8
ybd --scb --sj --java-test org.yb.pgsql.TestPgRegressPartitions'

Reviewers: alex

Reviewed By: alex

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D15821
Description
The PgRegressPartitions test has been failing for a long time in the pg_partition_aggregate test with an "Operation failed. Try again: Transaction aborted" error.
This always occurs for two tables, pagg_tab_ml and pagg_tab_para, and specifically when ANALYZE is run on them. These are the only tables in the test with a comparatively large number of rows for a unit test (30,000 each).
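From the pg_regress diff quoted in the commit messages above, the failing sequence boils down to the statements below (table and partition definitions elided):

```
-- Excerpted from yb_pg_partition_aggregate; pagg_tab_para hits the same error.
INSERT INTO pagg_tab_ml
  SELECT i % 30, i % 10, to_char(i % 4, 'FM0000')
  FROM generate_series(0, 29999) i;

ANALYZE pagg_tab_ml;
-- ERROR: Operation failed. Try again: Transaction aborted: <transaction id>
```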
Analysis so far:
The transaction was aborted because no heartbeat was sent by the transaction_coordinator for longer than 5 seconds:
However, the postgres process was indeed trying to send the heartbeat multiple times over this period; each attempt failed with an error stating that the tablet server (TS) it was reaching out to was not the leader.
This seems to occur because of ping-ponging leadership changes for the tablet 46b21bb306144b82b906ab8aa0ba8718. Tracing back its leadership changes:
7) Seeing these logs from other nodes:
There were no more errors after this point, and all subsequent operations in the test went through successfully with the expected results. The issue is also not seen after reducing the size of these two tables from 30,000 rows to, say, 3,000 (see the sketch after this section).
This does look like a cluster overload scenario, but it would be good to understand why cluster overload manifests as ping-ponging leadership changes for the transaction status tablet. Note that this is a PgRegress test, which means there was just this single long-running operation on one connection at the time of the error.
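For reference, the workaround mentioned above amounts to loading an order of magnitude fewer rows before the ANALYZE; the 3,000-row figure matches the experiment described here and is not necessarily what the final patch uses:

```
-- Same INSERT as in the failing test, but with ~3,000 rows instead of 30,000;
-- at this volume ANALYZE completes without the "Transaction aborted" error.
INSERT INTO pagg_tab_ml
  SELECT i % 30, i % 10, to_char(i % 4, 'FM0000')
  FROM generate_series(0, 2999) i;

ANALYZE pagg_tab_ml;
```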
Complete logs can be found here:
org.yb.pgsql.TestPgRegressPartitions-output-3.txt