[fix](csv reader) fix csv parser incorrect if enclosing line_delimiter (#38347) #38446

Merged 1 commit into apache:branch-2.0 on Jul 29, 2024

@sollhui sollhui commented Jul 29, 2024

pick #38347

The CSV reader parses data incorrectly when an enclosed field contains the line_delimiter. For example, with line_delimiter set to \n and enclose set to ', consider the following data:

'aaaaaaaaaaaa
bbbb'

It is parsed as two columns, 'aaaaaaaaaaaa and bbbb', rather than as one column:

'aaaaaaaaaaaa
bbbb'

This happens because the CSV reader does not reset its result when the enclose character is not matched within the current output_buf_read, so the data is truncated at the wrong position.
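
The behavior described above can be illustrated with a minimal sketch. This is not the Doris implementation: `naive_split` and `split_records` are hypothetical names, the reader is modeled as a list of text chunks standing in for successive buffer reads, and both line_delimiter and enclose are assumed to be single characters with no escape handling. The point is only that a line_delimiter should end a record when it appears outside an enclosed field, and that the "inside enclose" state has to survive from one read to the next.

```python
def naive_split(text, line_delimiter="\n"):
    # Splits on every line_delimiter and ignores enclose entirely.
    # This reproduces the incorrect result from the description.
    return text.split(line_delimiter)


def split_records(chunks, line_delimiter="\n", enclose="'"):
    """Return complete records from a sequence of text chunks.

    Assumption: line_delimiter and enclose are single characters and
    enclose characters are never escaped.
    """
    records = []
    pending = []        # characters of the record currently being built
    in_enclose = False  # carried across chunk boundaries on purpose
    for chunk in chunks:
        for ch in chunk:
            if ch == enclose:
                in_enclose = not in_enclose
                pending.append(ch)
            elif ch == line_delimiter and not in_enclose:
                # Delimiter outside an enclosed field: the record is complete.
                records.append("".join(pending))
                pending = []
            else:
                # Ordinary character, or a delimiter inside an enclosed field.
                pending.append(ch)
    if pending:
        # Trailing data without a final line_delimiter.
        records.append("".join(pending))
    return records
```

Calling `split_records(["'aaaaaaaaaaaa\n", "bbbb'\n"])` keeps the example data together as one record even though the enclosed field spans two chunks, whereas `naive_split` cuts it at the embedded \n.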

Co-authored-by: Xin Liao <[email protected]>
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@sollhui
sollhui commented Jul 29, 2024

run buildall

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 49520 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 15ef3f802931d8cf7dc9ebae165bb264f3b09998, data reload: false

------ Round 1 ----------------------------------
q1	17793	4360	4371	4360
q2	2131	153	144	144
q3	10463	1876	1905	1876
q4	10311	1271	1334	1271
q5	8597	3957	3942	3942
q6	258	127	127	127
q7	2027	1610	1613	1610
q8	9520	2722	2713	2713
q9	13664	10355	10096	10096
q10	8649	3499	3514	3499
q11	425	248	240	240
q12	467	311	306	306
q13	18328	3919	4030	3919
q14	347	324	333	324
q15	499	471	462	462
q16	663	580	570	570
q17	1130	913	937	913
q18	7247	6702	6797	6702
q19	1783	1627	1720	1627
q20	520	306	292	292
q21	4408	4143	4086	4086
q22	527	450	441	441
Total cold run time: 119757 ms
Total hot run time: 49520 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4323	4290	4343	4290
q2	321	221	227	221
q3	4164	4171	4117	4117
q4	2737	2770	2744	2744
q5	7153	7143	7127	7127
q6	237	120	126	120
q7	3281	2871	2865	2865
q8	4363	4516	4493	4493
q9	16830	16679	16752	16679
q10	4233	4282	4221	4221
q11	758	676	674	674
q12	1032	853	854	853
q13	6913	3733	3715	3715
q14	459	417	432	417
q15	492	454	457	454
q16	727	679	684	679
q17	3752	3909	3807	3807
q18	8716	8751	8920	8751
q19	1742	1643	1680	1643
q20	2353	2091	2108	2091
q21	8552	8373	8450	8373
q22	1062	998	1003	998
Total cold run time: 84200 ms
Total hot run time: 79332 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.82% (8120/21471)
Line Coverage: 29.48% (66582/225868)
Region Coverage: 28.97% (34330/118510)
Branch Coverage: 24.84% (17638/71002)
Coverage Report: http://coverage.selectdb-in.cc/coverage/15ef3f802931d8cf7dc9ebae165bb264f3b09998_15ef3f802931d8cf7dc9ebae165bb264f3b09998/report/index.html

@doris-robot

TPC-DS: Total hot run time: 202838 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 15ef3f802931d8cf7dc9ebae165bb264f3b09998, data reload: false

query1	922	418	375	375
query2	6526	2929	2548	2548
query3	6915	211	200	200
query4	20847	17946	17917	17917
query5	19739	6541	6459	6459
query6	281	219	241	219
query7	4171	297	302	297
query8	414	415	410	410
query9	3119	2696	2637	2637
query10	410	316	305	305
query11	11309	10828	10793	10793
query12	125	75	75	75
query13	5603	685	686	685
query14	17757	13099	13357	13099
query15	369	240	243	240
query16	6453	275	257	257
query17	1760	1451	868	868
query18	2333	401	408	401
query19	212	150	150	150
query20	79	82	80	80
query21	191	95	90	90
query22	5340	5070	5041	5041
query23	32600	31801	31894	31801
query24	7023	6417	6577	6417
query25	521	416	424	416
query26	528	164	157	157
query27	1886	290	295	290
query28	6092	2383	2361	2361
query29	2848	2883	2793	2793
query30	247	167	167	167
query31	896	726	751	726
query32	69	66	60	60
query33	402	261	251	251
query34	852	469	480	469
query35	1133	921	924	921
query36	1647	1161	1168	1161
query37	92	59	60	59
query38	3096	2926	2929	2926
query39	1372	1333	1347	1333
query40	206	96	98	96
query41	47	45	51	45
query42	79	82	78	78
query43	665	742	876	742
query44	1108	717	715	715
query45	246	232	239	232
query46	1244	963	972	963
query47	1738	1885	1771	1771
query48	1002	708	718	708
query49	625	369	377	369
query50	859	628	604	604
query51	4800	4654	4672	4654
query52	90	81	83	81
query53	449	313	323	313
query54	2648	2427	2492	2427
query55	100	73	84	73
query56	222	224	209	209
query57	1179	1085	1110	1085
query58	222	195	195	195
query59	4235	3663	3673	3663
query60	212	194	209	194
query61	95	94	99	94
query62	826	519	459	459
query63	486	343	336	336
query64	4224	1599	1516	1516
query65	3636	3577	3539	3539
query66	683	379	371	371
query67	15787	15139	15711	15139
query68	8785	660	637	637
query69	576	338	344	338
query70	1765	1620	1453	1453
query71	408	318	317	317
query72	6545	3506	3500	3500
query73	734	325	320	320
query74	6338	6036	5865	5865
query75	5256	3730	3684	3684
query76	5428	1147	1192	1147
query77	838	271	254	254
query78	12665	11608	11692	11608
query79	8613	681	663	663
query80	1379	395	405	395
query81	498	234	239	234
query82	1407	99	100	99
query83	180	128	133	128
query84	258	69	69	69
query85	887	327	323	323
query86	336	290	283	283
query87	3275	3061	3014	3014
query88	4690	2297	2319	2297
query89	418	289	285	285
query90	1918	220	203	203
query91	168	140	139	139
query92	58	54	53	53
query93	5777	562	581	562
query94	702	207	211	207
query95	1111	1066	1049	1049
query96	634	328	333	328
query97	6457	6234	6325	6234
query98	177	175	184	175
query99	2959	850	979	850
Total cold run time: 315132 ms
Total hot run time: 202838 ms

@doris-robot

ClickBench: Total hot run time: 30.84 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 15ef3f802931d8cf7dc9ebae165bb264f3b09998, data reload: false

query1	0.03	0.02	0.02
query2	0.06	0.03	0.02
query3	0.26	0.05	0.04
query4	1.80	0.06	0.06
query5	0.54	0.52	0.53
query6	1.23	0.61	0.60
query7	0.02	0.01	0.00
query8	0.03	0.02	0.02
query9	0.53	0.50	0.48
query10	0.54	0.52	0.54
query11	0.13	0.08	0.09
query12	0.11	0.09	0.09
query13	0.61	0.61	0.62
query14	0.79	0.77	0.78
query15	0.78	0.76	0.76
query16	0.39	0.37	0.36
query17	0.99	1.04	1.00
query18	0.22	0.26	0.24
query19	1.91	1.86	1.86
query20	0.02	0.01	0.01
query21	15.46	0.56	0.56
query22	2.22	2.30	1.33
query23	17.22	1.05	1.03
query24	5.89	1.48	1.34
query25	0.36	0.09	0.06
query26	0.62	0.15	0.16
query27	0.04	0.04	0.04
query28	7.01	0.77	0.73
query29	12.64	2.30	2.32
query30	0.62	0.52	0.49
query31	2.81	0.38	0.36
query32	3.41	0.50	0.50
query33	3.04	3.06	3.04
query34	15.25	4.81	4.78
query35	4.86	4.82	4.85
query36	1.06	1.01	1.01
query37	0.06	0.04	0.05
query38	0.03	0.02	0.03
query39	0.02	0.02	0.01
query40	0.15	0.14	0.14
query41	0.07	0.02	0.01
query42	0.02	0.01	0.01
query43	0.03	0.02	0.01
Total cold run time: 103.88 s
Total hot run time: 30.84 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit 15ef3f802931d8cf7dc9ebae165bb264f3b09998 with default session variables
Stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       21.5 seconds inserted 10000000 Rows, about 465K ops/s
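
As a sanity check on the units, the reported stream load rates are reproduced if "MB" is read as 1024 * 1024 bytes and the result is rounded down. That reading is an inference from the numbers above, not something the bot states, and the helper names below are hypothetical.

```python
MIB = 1024 * 1024  # assumption: the report's "MB" means 2**20 bytes


def mib_per_second(total_bytes, seconds):
    # Whole MiB per second, rounded down.
    return total_bytes // (seconds * MIB)


def k_rows_per_second(rows, seconds):
    # Thousands of rows per second, rounded down.
    return int(rows / seconds / 1000)


print(mib_per_second(2358488459, 20))       # json
print(mib_per_second(1101869774, 59))       # orc
print(mib_per_second(861443392, 31))        # parquet
print(k_rows_per_second(10_000_000, 21.5))  # insert into select
```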

@liaoxin01 liaoxin01 merged commit 5b8dda4 into apache:branch-2.0 Jul 29, 2024
23 of 27 checks passed

3 participants