Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: cn crashed by FATAL "update incr record returns invalid affected rows" during stability test on distributed mode #16982

Closed
1 task done
aressu1985 opened this issue Jun 19, 2024 · 26 comments
Assignees
Milestone

Comments

@aressu1985
Copy link
Contributor

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

1.2-dev

Commit ID

20f9de6

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
3*PROXY 3C 7G
- OS type:
- Others:

Actual Behavior

cn crashed by FATAL "update incr record returns invalid affected rows" during stability test on distributed mode.
[github@mo-srv-128 9567375395_nightly-20f9de6]$ kubectl -n mo-nightly-20f9de6-202406182258 logs -p stability-regression-dis-tp-cn-b4978 | grep FATAL
{"level":"FATAL","time":"2024/06/18 18:26:37.398981 +0000","name":"cn-service.incrservice","caller":"incrservice/store_sql.go:225","msg":"BUG: update incr record returns invalid affected rows","uuid":"32653035-3834-3564-3430-656339346131","update-sql":"update mo_increment_columns set offset = 38135697 \n\t\t\t\twhere table_id = 272500 and col_name = '__mo_fake_pk_col' and offset = 38125697","account":0,"table":272500,"col":"__mo_fake_pk_col","affected-rows":0,"select-all":"Cannot find tableID 272500 in table mo_increment_columns, accountid 0, txn: 01902c90ac257aafaf915b7d4e37989a/Active/S:1718734593997874275-1\tcol_name: __mo_fake_pk_col, table_id: 272499\n\tcol_name: __mo_fake_pk_col, table_id: 272412\n\tcol_name: __mo_fake_pk_col, table_id: 272500\n\tcol_name: task_id, table_id: 272394\n\tcol_name: id, table_id: 272402\n\tcol_name: id, table_id: 272403\n\tcol_name: id, table_id: 272407\n\tcol_name: id, table_id: 272408\n\tcol_name: id, table_id: 272409\n\tcol_name: __mo_fake_pk_col, table_id: 272411\n\tcol_name: user_id, table_id: 272415\n\tcol_name: account_id, table_id: 272417\n\tcol_name: version, table_id: 272417\n\tcol_name: role_id, table_id: 272419\n\tcol_name: function_id, table_id: 272424\n\tcol_name: configuration_id, table_id: 272426\n\tcol_name: proc_id, table_id: 272430\n\tcol_name: stage_id, table_id: 272432\n\tcol_name: __mo_fake_pk_col, table_id: 272450\n\tcol_name: __mo_fake_pk_col, table_id: 272452\n\tcol_name: __mo_fake_pk_col, table_id: 272454\n\tcol_name: __mo_fake_pk_col, table_id: 272456\n\tcol_name: __mo_fake_pk_col, table_id: 272457\n\tcol_name: __mo_fake_pk_col, table_id: 272463\n\tcol_name: __mo_fake_pk_col, table_id: 272464\n\tcol_name: __mo_fake_pk_col, table_id: 272465\n\tcol_name: __mo_fake_pk_col, table_id: 272466\n\tcol_name: __mo_fake_pk_col, table_id: 272467\n\tcol_name: __mo_fake_pk_col, table_id: 272468\n\tcol_name: __mo_fake_pk_col, table_id: 272469\n\tcol_name: __mo_fake_pk_col, table_id: 272470\n\tcol_name: __mo_fake_pk_col, table_id: 272471\n\tcol_name: __mo_fake_pk_col, table_id: 272472\n\tcol_name: __mo_fake_pk_col, table_id: 272473\n\tcol_name: __mo_fake_pk_col, table_id: 272405\n\tcol_name: __mo_fake_pk_col, table_id: 272406\n\tcol_name: __mo_fake_pk_col, table_id: 272410\n\tcol_name: cron_task_id, table_id: 272392\n\tcol_name: __mo_fake_pk_col, table_id: 272476\n\tcol_name: task_id, table_id: 272390\n\tcol_name: __mo_fake_pk_col, table_id: 272475\n","cost":"10m0.121435844s","ctx-done":true,"stacktrace":"github.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate.func2\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:225\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:136\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:160\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:164\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:151\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}

mo log link:
https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22mY0%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-20f9de6-202406182258%5C%22%7D%20%7C%3D%20%60update%20incr%20record%20returns%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718733455244%22,%22to%22:%221718761993073%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1. run a mo cluster with config in this issue
2. run tpch 10G loop test processes in one independant tenant
3. run tpcc 10 warehouse and 10 ternimals longrunnig test processes in one independant tenant, prepare mode
4. run sysbench mixed cases(insert/delete/update/select) longrunnig test processes with 75 terminals in one independant tenant,non-prepare mode
5. run another sysbench mixed cases(insert/delete/update/select) longrunnig test processe with  75 terminals in one independant tenant,non-prepare mode

Additional information

No response

@ouyuanning
Copy link
Contributor

https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22bg_%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-20f9de6-202406182258%5C%22%7D%20%7C%3D%20%60where%20table_id%20%3D%20272500%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718734199000%22,%22to%22:%221718734634000%22%7D%7D%7D&schemaVersion=1&orgId=1

image

从这里表面上看是:
1、01902c9066a377b183214f3e745d6c17 这个事务,在 【2024-06-19 02:16:19.590】先拿到了 38135697 这个offset。并且在【2024-06-19 02:16:26.715】先完成了update
01902c90ac257aafaf915b7d4e37989a 这个事务,在【2024-06-19 02:16:37.277】又拿到了 38135697 这个offset,但是update失败报错了

@zhangxu19830126
Copy link
Contributor

这个事务的select for update的锁被孤儿事务检测给释放了,所以update返回的行数是0,因为数据被其他事务修改了。如果遇到孤儿事务,需要重试。

@zhangxu19830126
Copy link
Contributor

fixed

@zhangxu19830126
Copy link
Contributor

这个不是孤儿事务的问题了,是update返回影响行数是0

@ouyuanning
Copy link
Contributor

还没时间看

@ouyuanning
Copy link
Contributor

辛苦俊洪和罗飞帮忙看看

@iamlinjunhong
Copy link
Contributor

在看其他 s-1

3 similar comments
@iamlinjunhong
Copy link
Contributor

在看其他 s-1

@iamlinjunhong
Copy link
Contributor

在看其他 s-1

@iamlinjunhong
Copy link
Contributor

在看其他 s-1

@aressu1985
Copy link
Contributor Author

update on 7.04:
[github@mo-srv-128 mo-tpcc]$ kubectl -n mo-e4d45ce-202407032138 logs -p stability-regression-dis-tp-cn-kd47b | grep FATAL
{"level":"FATAL","time":"2024/07/04 04:55:46.106871 +0000","name":"cn-service.incrservice","caller":"incrservice/store_sql.go:193","msg":"BUG: update incr record returns invalid affected rows","uuid":"31343732-3665-6432-3538-383965363232","update-sql":"update mo_increment_columns set offset = 174042324 \n\t\t\t\twhere table_id = 272500 and col_name = '__mo_fake_pk_col' and offset = 174032324","account":0,"table":272500,"col":"__mo_fake_pk_col","affected-rows":0,"cost":"1.910313124s","ctx-done":false,"stacktrace":"github.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate.func2\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:193\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:136\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:127\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:164\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:151\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}

https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22KrG%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-e4d45ce-202407032138%5C%22%7D%20%7C%3D%20%60update%20incr%20record%20returns%20invalid%20affected%20rows%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221720068567652%22,%22to%22:%221720069408320%22%7D%7D%7D&schemaVersion=1&orgId=1

@aressu1985 aressu1985 modified the milestones: 1.2.2, 1.3.0 Jul 7, 2024
@iamlinjunhong
Copy link
Contributor

加了日志,等复现

7 similar comments
@iamlinjunhong
Copy link
Contributor

加了日志,等复现

@iamlinjunhong
Copy link
Contributor

加了日志,等复现

@iamlinjunhong
Copy link
Contributor

加了日志,等复现

@iamlinjunhong
Copy link
Contributor

加了日志,等复现

@iamlinjunhong
Copy link
Contributor

加了日志,等复现

@iamlinjunhong
Copy link
Contributor

加了日志,等复现

@iamlinjunhong
Copy link
Contributor

加了日志,等复现

@iamlinjunhong
Copy link
Contributor

原因是 lock table bind changed 了,commit 会报错, 不应该在事务里面对结果做判断,应该重试

@aressu1985
Copy link
Contributor Author

testing

@aressu1985
Copy link
Contributor Author

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants