[fix](routine-load) fix auto resume invalid when FE leader change or restart #37876

Merged 1 commit on Jul 17, 2024

@sollhui sollhui commented Jul 16, 2024

We hit a case where a routine load job paused and was never auto-resumed, even though it met the conditions for auto resume.

                  Id: 134305
                Name: lineitem_balance_dup_persistent_weekly_persistent_flow_weekly
          CreateTime: 2024-06-27 19:54:13
           PauseTime: 2024-06-28 23:02:46
             EndTime: NULL
              DbName: regression_test_stress_load_long_duration_load
           TableName: lineitem_balance_dup_persistent_weekly
        IsMultiTable: false
               State: PAUSED
      DataSourceType: KAFKA
      CurrentTaskNum: 0
       JobProperties: {"max_batch_rows":"550000","timezone":"Asia/Shanghai","send_batch_parallelism":"1","load_to_single_tablet":"false","column_separator":"','","line_delimiter":"\n","current_concurrent_number":"0","delete":"*","partial_columns":"false","merge_type":"APPEND","exec_mem_limit":"2147483648","strict_mode":"false","jsonpaths":"","max_batch_interval":"10","max_batch_size":"409715200","fuzzy_parse":"false","partitions":"*","columnToColumnExpr":"","whereExpr":"*","desired_concurrent_number":"100","precedingFilter":"*","format":"csv","max_error_number":"0","max_filter_ratio":"1.0","json_root":"","strip_outer_array":"false","num_as_string":"false"}
DataSourceProperties: {"topic":"test-topic-persistent-weekly-new","currentKafkaPartitions":"","brokerList":"xxx"}
    CustomProperties: {"kafka_default_offsets":"OFFSET_BEGINNING","group.id":"test-consumer-group","client.id":"test-client-id"}
           Statistic: {"receivedBytes":2234836231654,"runningTxns":[],"errorRows":0,"committedTaskNum":1019074,"loadedRows":11693905636,"loadRowsRate":119675,"abortedTaskNum":13556,"errorRowsAfterResumed":0,"totalRows":11693905636,"unselectedRows":0,"receivedBytesRate":22871277,"taskExecuteTimeMs":97713660}
            Progress: {"0":"81666390","1":"81605244","2":"80934894","3":"81531594","4":"81866067","5":"80841194","6":"81229045","7":"80854534","8":"81305844","9":"81384530","10":"81016926","11":"81018762","12":"81586996","13":"81028852","14":"80836728","15":"81536307","16":"81191324","17":"80790892","18":"81518108","19":"80853947","20":"80944134","21":"81567859","22":"80967795","23":"80962887","24":"81444757","25":"81182803","26":"81081053","27":"81374984","28":"81089548","29":"81161297","30":"81981195","31":"80943196","32":"80979608","33":"81580092","34":"81596130","35":"80926873","36":"81569105","37":"81364000","38":"80947256","39":"81352057","40":"80864511","41":"81287226","42":"81579790","43":"80902247","44":"81059042","45":"81543945","46":"81137005","47":"80790072","48":"81365538","49":"81025127","50":"80887759","51":"81568479","52":"81013907","53":"80947134","54":"81569820","55":"81073842","56":"80873173","57":"81417107","58":"81120060","59":"81216134","60":"81336754","61":"81187291","62":"80989208","63":"81818417","64":"81038338","65":"80761949","66":"81466270","67":"80989322","68":"80962711","69":"81586888","70":"81073447","71":"80885426"}
                 Lag: {"0":-1,"1":-1,"2":-1,"3":-1,"4":-1,"5":-1,"6":-1,"7":-1,"8":-1,"9":-1,"10":-1,"11":-1,"12":-1,"13":-1,"14":-1,"15":-1,"16":-1,"17":-1,"18":-1,"19":-1,"20":-1,"21":-1,"22":-1,"23":-1,"24":-1,"25":-1,"26":-1,"27":-1,"28":-1,"29":-1,"30":-1,"31":-1,"32":-1,"33":-1,"34":-1,"35":-1,"36":-1,"37":-1,"38":-1,"39":-1,"40":-1,"41":-1,"42":-1,"43":-1,"44":-1,"45":-1,"46":-1,"47":-1,"48":-1,"49":-1,"50":-1,"51":-1,"52":-1,"53":-1,"54":-1,"55":-1,"56":-1,"57":-1,"58":-1,"59":-1,"60":-1,"61":-1,"62":-1,"63":-1,"64":-1,"65":-1,"66":-1,"67":-1,"68":-1,"69":-1,"70":-1,"71":-1}
ReasonOfStateChanged: 
        ErrorLogUrls: 
            OtherMsg: 
                User: root
             Comment:

If a routine load job pauses and the FE leader changes at the same time, pauseReason is null after the leader change, so the auto resume logic is never triggered:

if (jobRoutine.pauseReason != null
        && jobRoutine.pauseReason.getCode() != InternalErrorCode.MANUAL_PAUSE_ERR
        && jobRoutine.pauseReason.getCode() != InternalErrorCode.TOO_MANY_FAILURE_ROWS_ERR
        && jobRoutine.pauseReason.getCode() != InternalErrorCode.CANNOT_RESUME_ERR) {
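
The effect of the null check can be reproduced in isolation. The following is a minimal sketch, not the Doris implementation: `AutoResumeSketch`, `ErrorCode`, `PauseReason`, `OTHER_ERR`, and `eligibleForAutoResume` are hypothetical stand-ins, and only the shape of the condition mirrors the check quoted above.

```java
class AutoResumeSketch {
    // Hypothetical stand-in for InternalErrorCode: the three codes named in the
    // quoted check, plus one generic code (OTHER_ERR) invented for illustration.
    enum ErrorCode { MANUAL_PAUSE_ERR, TOO_MANY_FAILURE_ROWS_ERR, CANNOT_RESUME_ERR, OTHER_ERR }

    // Hypothetical stand-in for the pause reason attached to a job.
    static final class PauseReason {
        private final ErrorCode code;

        PauseReason(ErrorCode code) {
            this.code = code;
        }

        ErrorCode getCode() {
            return code;
        }
    }

    // Same shape as the quoted condition: a null reason short-circuits to false,
    // so a paused job whose reason was lost is never considered for auto resume.
    static boolean eligibleForAutoResume(PauseReason pauseReason) {
        return pauseReason != null
                && pauseReason.getCode() != ErrorCode.MANUAL_PAUSE_ERR
                && pauseReason.getCode() != ErrorCode.TOO_MANY_FAILURE_ROWS_ERR
                && pauseReason.getCode() != ErrorCode.CANNOT_RESUME_ERR;
    }

    public static void main(String[] args) {
        // Paused for a recoverable reason: eligible.
        System.out.println(eligibleForAutoResume(new PauseReason(ErrorCode.OTHER_ERR)));        // true
        // Paused manually: not eligible, by design.
        System.out.println(eligibleForAutoResume(new PauseReason(ErrorCode.MANUAL_PAUSE_ERR))); // false
        // Reason lost (null): not eligible, although the job may be recoverable.
        System.out.println(eligibleForAutoResume(null));                                        // false
    }
}
```

The third case corresponds to the job shown above: it is PAUSED with an empty ReasonOfStateChanged, so the check rejects it on every pass.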

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@sollhui
sollhui commented Jul 16, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 39805 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit edb01a3bcdda5481e6a49dfeef6ce7d838c30f3b, data reload: false

------ Round 1 ----------------------------------
q1	18074	4479	4373	4373
q2	2848	191	194	191
q3	12039	1166	1024	1024
q4	10692	835	844	835
q5	7606	2721	2695	2695
q6	237	144	138	138
q7	972	592	593	592
q8	9222	2062	2089	2062
q9	8781	6596	6503	6503
q10	8679	3780	3839	3780
q11	445	234	237	234
q12	396	224	220	220
q13	17766	2984	2961	2961
q14	268	242	232	232
q15	534	487	500	487
q16	489	375	375	375
q17	970	607	712	607
q18	8074	7579	7344	7344
q19	4297	1397	1355	1355
q20	674	322	337	322
q21	4860	3194	3228	3194
q22	348	281	285	281
Total cold run time: 118271 ms
Total hot run time: 39805 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4395	4344	4255	4255
q2	373	273	268	268
q3	2999	2720	2708	2708
q4	1866	1549	1563	1549
q5	5270	5307	5286	5286
q6	225	129	129	129
q7	2147	1654	1705	1654
q8	3186	3358	3305	3305
q9	8355	8361	8348	8348
q10	3911	3688	3666	3666
q11	569	513	478	478
q12	787	629	615	615
q13	16424	2985	3024	2985
q14	300	274	267	267
q15	521	465	482	465
q16	466	405	425	405
q17	1748	1454	1473	1454
q18	7662	7669	7528	7528
q19	2154	1663	1440	1440
q20	2016	1815	1796	1796
q21	4848	4566	4697	4566
q22	590	481	490	481
Total cold run time: 70812 ms
Total hot run time: 53648 ms

@doris-robot

TPC-DS: Total hot run time: 173454 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit edb01a3bcdda5481e6a49dfeef6ce7d838c30f3b, data reload: false

query1	940	362	365	362
query2	6454	1899	1828	1828
query3	6653	207	225	207
query4	28574	17579	17444	17444
query5	4187	485	488	485
query6	311	192	167	167
query7	4595	303	292	292
query8	246	201	197	197
query9	8517	2419	2417	2417
query10	473	289	271	271
query11	11106	10223	10018	10018
query12	140	85	83	83
query13	1633	370	368	368
query14	10248	7940	7731	7731
query15	214	167	169	167
query16	7900	312	301	301
query17	1829	567	520	520
query18	1989	285	278	278
query19	191	150	143	143
query20	89	83	79	79
query21	202	130	121	121
query22	4320	4036	4074	4036
query23	33705	33130	33183	33130
query24	11915	2899	2829	2829
query25	656	366	365	365
query26	1842	145	144	144
query27	2873	272	269	269
query28	7546	2034	2036	2034
query29	1134	615	620	615
query30	282	149	147	147
query31	936	756	740	740
query32	96	51	56	51
query33	788	304	280	280
query34	954	484	493	484
query35	678	582	578	578
query36	1084	946	936	936
query37	209	79	77	77
query38	2886	2753	2777	2753
query39	864	807	827	807
query40	276	122	119	119
query41	49	52	48	48
query42	128	100	107	100
query43	512	450	471	450
query44	1254	747	725	725
query45	194	158	165	158
query46	1086	729	718	718
query47	1870	1770	1791	1770
query48	369	287	292	287
query49	1182	442	409	409
query50	772	389	396	389
query51	6819	6844	6737	6737
query52	111	88	100	88
query53	361	291	288	288
query54	1016	456	444	444
query55	78	75	74	74
query56	282	271	291	271
query57	1151	1043	1069	1043
query58	246	247	248	247
query59	2986	2784	2659	2659
query60	302	275	273	273
query61	100	95	95	95
query62	805	647	660	647
query63	354	297	290	290
query64	10438	2268	6364	2268
query65	3187	3128	3134	3128
query66	1372	338	331	331
query67	15382	15221	14943	14943
query68	4690	552	542	542
query69	458	329	320	320
query70	1168	1156	1169	1156
query71	401	283	279	279
query72	7221	5561	5478	5478
query73	759	323	320	320
query74	6176	5709	5621	5621
query75	3444	2723	2718	2718
query76	2815	912	915	912
query77	501	375	312	312
query78	9762	9568	8871	8871
query79	2306	516	532	516
query80	2425	489	468	468
query81	586	223	222	222
query82	743	138	131	131
query83	293	169	167	167
query84	267	89	88	88
query85	2251	315	303	303
query86	485	328	307	307
query87	3348	3133	3152	3133
query88	4137	2412	2350	2350
query89	481	383	387	383
query90	1806	196	190	190
query91	129	100	100	100
query92	66	48	51	48
query93	2338	514	516	514
query94	1209	215	213	213
query95	403	319	315	315
query96	607	274	267	267
query97	3189	3023	3001	3001
query98	219	202	192	192
query99	1550	1264	1276	1264
Total cold run time: 288174 ms
Total hot run time: 173454 ms

@doris-robot

ClickBench: Total hot run time: 29.9 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit edb01a3bcdda5481e6a49dfeef6ce7d838c30f3b, data reload: false

query1	0.04	0.04	0.03
query2	0.08	0.04	0.05
query3	0.23	0.04	0.05
query4	1.68	0.07	0.07
query5	0.49	0.49	0.50
query6	1.12	0.74	0.72
query7	0.02	0.02	0.02
query8	0.05	0.04	0.04
query9	0.55	0.50	0.48
query10	0.54	0.54	0.55
query11	0.15	0.12	0.12
query12	0.15	0.12	0.12
query13	0.59	0.58	0.58
query14	0.78	0.77	0.82
query15	0.84	0.82	0.82
query16	0.37	0.35	0.35
query17	0.99	1.02	0.97
query18	0.22	0.22	0.21
query19	1.88	1.74	1.80
query20	0.01	0.01	0.01
query21	15.42	0.77	0.64
query22	4.89	7.57	1.20
query23	18.24	1.40	1.25
query24	2.13	0.23	0.24
query25	0.15	0.09	0.09
query26	0.32	0.22	0.21
query27	0.45	0.24	0.24
query28	13.22	1.01	1.01
query29	12.62	3.31	3.28
query30	0.25	0.06	0.05
query31	2.90	0.40	0.38
query32	3.24	0.46	0.47
query33	2.91	2.90	2.97
query34	17.16	4.32	4.34
query35	4.44	4.42	4.41
query36	0.65	0.47	0.49
query37	0.19	0.16	0.16
query38	0.16	0.15	0.15
query39	0.04	0.04	0.03
query40	0.15	0.12	0.12
query41	0.10	0.05	0.05
query42	0.06	0.05	0.05
query43	0.05	0.04	0.05
Total cold run time: 110.52 s
Total hot run time: 29.9 s

@sollhui
sollhui commented Jul 16, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 39672 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e42aef0a67b25f25fb6938a6e01a124274ba5658, data reload: false

------ Round 1 ----------------------------------
q1	17629	4386	4273	4273
q2	2014	192	188	188
q3	10460	1168	1010	1010
q4	10193	789	838	789
q5	7528	2712	2661	2661
q6	219	137	137	137
q7	963	600	609	600
q8	9232	2055	2087	2055
q9	8866	6565	6522	6522
q10	8840	3839	3767	3767
q11	465	242	244	242
q12	407	230	232	230
q13	18494	3007	2975	2975
q14	267	233	250	233
q15	529	480	489	480
q16	499	390	379	379
q17	959	749	710	710
q18	7944	7480	7339	7339
q19	3506	1417	1289	1289
q20	676	322	300	300
q21	5031	3274	3206	3206
q22	353	288	287	287
Total cold run time: 115074 ms
Total hot run time: 39672 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4367	4343	4262	4262
q2	371	272	268	268
q3	2973	2786	2896	2786
q4	1966	1770	1702	1702
q5	5599	5616	5537	5537
q6	227	140	136	136
q7	2227	1833	1860	1833
q8	3313	3432	3401	3401
q9	8848	8779	8860	8779
q10	4130	3913	3877	3877
q11	591	505	499	499
q12	778	611	643	611
q13	16192	3169	3170	3169
q14	331	314	294	294
q15	544	501	489	489
q16	493	429	441	429
q17	1813	1546	1532	1532
q18	8018	7983	7893	7893
q19	1714	1611	1514	1514
q20	2113	1907	1872	1872
q21	6165	4881	4948	4881
q22	575	506	518	506
Total cold run time: 73348 ms
Total hot run time: 56270 ms

@doris-robot

TPC-DS: Total hot run time: 173949 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e42aef0a67b25f25fb6938a6e01a124274ba5658, data reload: false

query1	921	383	371	371
query2	6468	1928	1832	1832
query3	6634	208	222	208
query4	28121	17767	17487	17487
query5	3718	490	495	490
query6	279	159	160	159
query7	4576	289	290	289
query8	240	199	197	197
query9	8401	2397	2388	2388
query10	433	286	273	273
query11	12088	10185	10185	10185
query12	115	85	88	85
query13	1633	370	362	362
query14	10309	7157	7902	7157
query15	221	174	169	169
query16	7633	327	312	312
query17	1347	545	525	525
query18	1929	277	277	277
query19	197	148	150	148
query20	90	80	80	80
query21	208	131	128	128
query22	4364	4053	4071	4053
query23	34212	34338	33617	33617
query24	11503	2912	2921	2912
query25	612	383	400	383
query26	1183	159	156	156
query27	2969	280	290	280
query28	7648	2077	2071	2071
query29	906	638	645	638
query30	252	157	152	152
query31	956	769	780	769
query32	92	55	56	55
query33	764	303	308	303
query34	1015	502	528	502
query35	687	613	617	613
query36	1157	1019	1000	1000
query37	164	89	94	89
query38	2913	2850	2890	2850
query39	944	822	892	822
query40	205	125	125	125
query41	47	46	49	46
query42	116	101	96	96
query43	501	467	463	463
query44	1204	719	723	719
query45	194	163	164	163
query46	1111	738	750	738
query47	1844	1757	1744	1744
query48	358	298	296	296
query49	853	407	423	407
query50	783	396	389	389
query51	6873	6949	6846	6846
query52	117	96	99	96
query53	363	294	287	287
query54	883	449	453	449
query55	74	75	75	75
query56	294	281	275	275
query57	1118	1060	1080	1060
query58	245	266	271	266
query59	2836	2635	2486	2486
query60	311	281	311	281
query61	98	98	95	95
query62	769	661	636	636
query63	333	299	295	295
query64	9500	2241	1677	1677
query65	3209	3143	3136	3136
query66	678	328	334	328
query67	15359	14787	14936	14787
query68	4837	586	550	550
query69	609	433	350	350
query70	1191	1169	1153	1153
query71	439	290	289	289
query72	7910	5968	5772	5772
query73	771	327	321	321
query74	6071	5669	5663	5663
query75	3400	2714	2708	2708
query76	3573	1009	959	959
query77	654	320	319	319
query78	9686	9735	8970	8970
query79	2674	539	527	527
query80	2066	488	485	485
query81	608	224	219	219
query82	1080	143	139	139
query83	303	172	166	166
query84	254	99	86	86
query85	1433	338	310	310
query86	360	344	317	317
query87	3269	3129	3069	3069
query88	3627	2380	2376	2376
query89	479	396	404	396
query90	1686	197	203	197
query91	152	114	119	114
query92	66	54	52	52
query93	2617	512	506	506
query94	772	230	299	230
query95	418	318	319	318
query96	602	269	273	269
query97	3240	3049	3018	3018
query98	218	201	201	201
query99	1550	1252	1268	1252
Total cold run time: 284793 ms
Total hot run time: 173949 ms

@doris-robot

ClickBench: Total hot run time: 30.53 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e42aef0a67b25f25fb6938a6e01a124274ba5658, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.23	0.05	0.06
query4	1.65	0.10	0.09
query5	0.49	0.48	0.49
query6	1.13	0.72	0.72
query7	0.02	0.01	0.01
query8	0.05	0.04	0.05
query9	0.56	0.48	0.48
query10	0.55	0.54	0.52
query11	0.16	0.12	0.12
query12	0.15	0.11	0.12
query13	0.60	0.59	0.59
query14	0.76	0.77	0.81
query15	0.85	0.82	0.81
query16	0.36	0.35	0.37
query17	0.98	0.97	1.00
query18	0.22	0.21	0.22
query19	1.78	1.71	1.72
query20	0.02	0.01	0.01
query21	15.41	0.76	0.66
query22	4.45	7.22	1.73
query23	18.25	1.42	1.28
query24	2.11	0.22	0.23
query25	0.15	0.09	0.08
query26	0.31	0.22	0.21
query27	0.46	0.22	0.22
query28	13.32	1.03	1.00
query29	12.59	3.36	3.35
query30	0.25	0.06	0.05
query31	2.86	0.40	0.39
query32	3.26	0.49	0.47
query33	2.87	2.97	2.96
query34	17.10	4.32	4.36
query35	4.47	4.45	4.51
query36	0.65	0.45	0.48
query37	0.19	0.16	0.16
query38	0.16	0.15	0.14
query39	0.05	0.04	0.03
query40	0.14	0.12	0.13
query41	0.09	0.04	0.05
query42	0.06	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 109.92 s
Total hot run time: 30.53 s

@sollhui sollhui changed the title [fix](routine-load) fix auto resume invalid when FE leader change or single FE restart [fix](routine-load) fix auto resume invalid when FE leader change or FE restart Jul 16, 2024
@sollhui sollhui changed the title [fix](routine-load) fix auto resume invalid when FE leader change or FE restart [fix](routine-load) fix auto resume invalid when FE leader change or restart Jul 16, 2024

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 16, 2024

PR approved by at least one committer and no changes requested.


PR approved by anyone and no changes requested.

@XuJianxu

LGTM

@liaoxin01

run cloud_p0

@liaoxin01 liaoxin01 merged commit bec5ea8 into apache:master Jul 17, 2024
29 of 31 checks passed
Labels
approved Indicates a PR has been approved by one committer. dev/3.0.1-merged reviewed