Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

global sort IMPORT INTO deadlock due to large tasks #52884

Closed
lance6716 opened this issue Apr 25, 2024 · 7 comments · Fixed by #52965
Closed

global sort IMPORT INTO deadlock due to large tasks #52884

lance6716 opened this issue Apr 25, 2024 · 7 comments · Fixed by #52965
Labels
affects-8.1 This bug affects the 8.1.x(LTS) versions. component/ddl This issue is related to DDL of TiDB. severity/major type/bug The issue is confirmed as a bug.

Comments

@lance6716
Copy link
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

found by QA automatic test

2. What did you expect to see? (Required)

not deadlock

3. What did you see instead (Required)

[2024/04/25 12:24:43.369 +08:00] [INFO] [local.go:1318] ["start import engine"] [category=ddl-ingest] [uuid=a55d695a-5464-5ab9-bde5-ef29d87c2f79] ["region ranges"=1061] [count=0] [size=107374182400]
...
[2024/04/25 12:28:42.762 +08:00] [WARN] [local.go:1022] ["There is no pairs in range"] [category=ddl-ingest] [start=7480000000000001505F72800000000024155B] [end=7480000000000001505F72800000000022F0F9]
...
[2024/04/25 12:28:42.762 +08:00] [INFO] [reader.go:42] ["read all data start"]
...
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/a0510df6-1444-4c98-a1e6-6702b82afe91_stat/3] [startOffset=110431299] [endOffset=1345667994] [expectedConc=148] [concurrency=148]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/36c7dc76-b8ca-48ef-bb96-4387a8f47a7f_stat/2] [startOffset=1103174523] [endOffset=1608653871] [expectedConc=61] [concurrency=61]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/36c7dc76-b8ca-48ef-bb96-4387a8f47a7f_stat/3] [startOffset=0] [endOffset=522556353] [expectedConc=63] [concurrency=63]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/b8b9688f-5f10-4a06-9916-e354ba88edbc_stat/2] [startOffset=74000355] [endOffset=1309237050] [expectedConc=148] [concurrency=148]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/d7a98445-a276-4efb-a5f3-185f59cd9dc9_stat/2] [startOffset=280062882] [endOffset=1514161110] [expectedConc=148] [concurrency=148]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/de504840-c3cc-4fd0-a5be-7984fbb45cb4_stat/2] [startOffset=144585309] [endOffset=1514264607] [expectedConc=164] [concurrency=164]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/77a72fd2-34a7-496a-8d25-861f7905d239_stat/2] [startOffset=74000355] [endOffset=1309237050] [expectedConc=148] [concurrency=148]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/ef3ac2e5-9070-4102-8dc7-f9269cf0b3da_stat/3] [startOffset=110431299] [endOffset=1345667994] [expectedConc=148] [concurrency=148]
[2024/04/25 12:28:43.059 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/46a2d628-8d0c-4699-8c3b-159aab328f65_stat/2] [startOffset=278924415] [endOffset=1353637263] [expectedConc=129] [concurrency=129]

in global sort, there should not be "There is no pairs in range" case. And "read all data start" shows we are going to read very large data, which is also not expected

4. What is your TiDB version? (Required)

@lance6716 lance6716 added the type/bug The issue is confirmed as a bug. label Apr 25, 2024
@lance6716
Copy link
Contributor Author

maybe related to #51734

@lance6716
Copy link
Contributor Author

these are normal size task logs


[2024/04/25 12:28:38.732 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/a0510df6-1444-4c98-a1e6-6702b82afe91_stat/3] [startOffset=934681407] [endOffset=1139605467] [expectedConc=25] [concurrency=25]
[2024/04/25 12:28:38.732 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/b8b9688f-5f10-4a06-9916-e354ba88edbc_stat/2] [startOffset=897111996] [endOffset=1103174523] [expectedConc=25] [concurrency=25]
[2024/04/25 12:28:38.732 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/d7a98445-a276-4efb-a5f3-185f59cd9dc9_stat/2] [startOffset=1103174523] [endOffset=1309237050] [expectedConc=25] [concurrency=25]
[2024/04/25 12:28:38.732 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/de504840-c3cc-4fd0-a5be-7984fbb45cb4_stat/2] [startOffset=918742869] [endOffset=1304683182] [expectedConc=47] [concurrency=47]
[2024/04/25 12:28:38.732 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/77a72fd2-34a7-496a-8d25-861f7905d239_stat/2] [startOffset=897111996] [endOffset=1103174523] [expectedConc=25] [concurrency=25]
[2024/04/25 12:28:38.732 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/ef3ac2e5-9070-4102-8dc7-f9269cf0b3da_stat/3] [startOffset=934681407] [endOffset=1140743934] [expectedConc=25] [concurrency=25]
[2024/04/25 12:28:38.732 +08:00] [INFO] [engine.go:248] ["found hotspot file in getFilesReadConcurrency"] [filename=90002/30032/data/46a2d628-8d0c-4699-8c3b-159aab328f65_stat/2] [startOffset=897111996] [endOffset=1103174523] [expectedConc=25] [concurrency=25]

@lance6716
Copy link
Contributor Author

we can see that in some "read all data", start key is larger than end key

[2024/04/25 12:28:41.080 +08:00] [INFO] [reader.go:43] [arguments] [data-file-count=79] [stat-file-count=79] [start-key="dIAAAAAAAAFQX3KAAAAAACQVWw=="] [end-key="dIAAAAAAAAFQX3KAAAAAACLw+Q=="]
[2024/04/25 12:28:41.080 +08:00] [INFO] [util.go:52] ["seek props offsets start"]
[2024/04/25 12:28:41.170 +08:00] [INFO] [util.go:54] ["seek props offsets completed"] [takeTime=90.499466ms] []
[2024/04/25 12:28:41.621 +08:00] [INFO] [reader.go:63] ["read all data completed"] [takeTime=541.202346ms] []
[2024/04/25 12:28:41.621 +08:00] [INFO] [engine.go:77] ["building memKVsAndBuffers"] [sumKVCnt=0] [droppedSize=21213605]

@lance6716
Copy link
Contributor Author

select * from tidb_background_subtask where task_key = 90002 and id = 30033 limit 1\G

range-split-keys is not ascending

image

image

@lance6716
Copy link
Contributor Author

Write an unit test to reproduce it, found one stat file has wrong content

image

func TestName(t *testing.T) {
	url := ...
	backend, err := storage.ParseBackend(url, nil)
	require.NoError(t, err)
	ctx := context.Background()
	s, err := storage.New(ctx, backend, nil)
	require.NoError(t, err)

	m := &SortedKVMeta{}
	j := ...
	err = json.Unmarshal([]byte(j), m)
	require.NoError(t, err)

	m2 := &SortedKVMeta{}
	j2 := ...
	err = json.Unmarshal([]byte(j2), m2)
	require.NoError(t, err)

	multiFileStats := make([]MultipleFilesStat, 0)
	multiFileStats = append(multiFileStats, m.MultipleFilesStats...)
	multiFileStats = append(multiFileStats, m2.MultipleFilesStats...)

	for _, tt := range multiFileStats {
		for _, p := range tt.Filenames {
			statPath := p[1]
			rd, err3 := newStatsReader(ctx, s, statPath, 250*1024)
			require.NoError(t, err3)
			lastKey := []byte{}
			pp, err4 := rd.nextProp()
			for err4 != io.EOF {
				require.NoError(t, err4)
				require.True(t, bytes.Compare(lastKey, pp.firstKey) < 0,
					"statPath: %s, lastKey: %v, key: %v",
					statPath,
					base64.StdEncoding.EncodeToString(lastKey),
					base64.StdEncoding.EncodeToString(pp.firstKey))
				lastKey = slices.Clone(pp.firstKey)
				pp, err4 = rd.nextProp()
			}
			require.NoError(t, rd.Close())
		}
	}
	...
}

@lance6716
Copy link
Contributor Author

lance6716 commented Apr 28, 2024

print every "property" in stat file, found the problem.

...
&external.rangeProperty{firstKey:[]uint8{0x74, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x50, 0x5f, 0x72, 0x80, 0x0, 0x0, 0x0, 0x0, 0x24, 0x4d, 0x11}, lastKey:[]uint8{0x74, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x50, 0x5f, 0x72, 0x80, 0x0, 0x0, 0x0, 0x0, 0x24, 0x4d, 0x1b}, offset:0x4fde67eb, size:0x115e73, keys:0xb}
&external.rangeProperty{firstKey:[]uint8{0x74, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x50, 0x5f, 0x72, 0x80, 0x0, 0x0, 0x0, 0x0, 0x24, 0x4d, 0x1c}, lastKey:[]uint8{0x74, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x50, 0x5f, 0x72, 0x80, 0x0, 0x0, 0x0, 0x0, 0x22, 0x76, 0xf9}, offset:0x4fefc70e, size:0x115e73, keys:0xb}
&external.rangeProperty{firstKey:[]uint8{0x74, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x50, 0x5f, 0x72, 0x80, 0x0, 0x0, 0x0, 0x0, 0x22, 0x76, 0xfa}, lastKey:[]uint8{0x74, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x50, 0x5f, 0x72, 0x80, 0x0, 0x0, 0x0, 0x0, 0x22, 0x77, 0x4}, offset:0x19449, size:0x115e73, keys:0xb}

And its data file has correct content

@lance6716
Copy link
Contributor Author

seems caused by incorrect retry


[2024/04/25 12:09:18.820 +08:00] [WARN] [s3.go:1183] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Put \"https://nfss.ks3-cn-beijing-internal.ksyuncs.com/90002/30032/data/de504840-c3cc-4fd0-a5be-7984fbb45cb4/2?partNumber=81&uploadId=2c14832e064d46458cb2574aa2fdc737\": write tcp 10.200.52.145:52540->198.18.96.170:443: use of closed network connection"] [backoff=1.381972695s]
[2024/04/25 12:09:36.133 +08:00] [WARN] [s3.go:1092] ["upload to s3 failed"] [filename=90002/30032/data/de504840-c3cc-4fd0-a5be-7984fbb45cb4/2] [error="MultipartUpload: upload multipart failed\n\tupload id: 2c14832e064d46458cb2574aa2fdc737\ncaused by: InvalidArgument: Invalid Argument\n\tstatus code: 400, request id: fualm020kcmobs7lvt8qvma8lneim0cf, host id: "]
[2024/04/25 12:09:36.198 +08:00] [WARN] [writer.go:469] ["flush sorted kv failed"] [writer-id=data/de504840-c3cc-4fd0-a5be-7984fbb45cb4] [sequence-number=2] [error="MultipartUpload: upload multipart failed\n\tupload id: 2c14832e064d46458cb2574aa2fdc737\ncaused by: InvalidArgument: Invalid Argument\n\tstatus code: 400, request id: fualm020kcmobs7lvt8qvma8lneim0cf, host id: "] [retry-count=0]
[2024/04/25 12:11:53.486 +08:00] [INFO] [writer.go:570] ["flush sorted kv"] [writer-id=data/de504840-c3cc-4fd0-a5be-7984fbb45cb4] [sequence-number=2] [bytes=1609585344] [stat-size=191882] [write-time=2m17.287505007s] [write-speed(bytes/s)=11.18MiB]
[2024/04/25 12:11:53.486 +08:00] [INFO] [writer.go:479] ["flush kv"] [writer-id=data/de504840-c3cc-4fd0-a5be-7984fbb45cb4] [sequence-number=2] [bytes=1609585344] [kv-cnt=15552] [sort-time=8.071583ms] [write-time=4m29.374068546s] [sort-speed(kv/s)=1926759.5960] [writer-id=data/de504840-c3cc-4fd0-a5be-7984fbb45cb4]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-8.1 This bug affects the 8.1.x(LTS) versions. component/ddl This issue is related to DDL of TiDB. severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants