FATAL: partition_key not in partition hash_split #6890
I was testing replaying SPLIT_OP during bootstrap with the following scenario:
After that, I tried to pause and then resume tservers 1 and 2, trying to make tserver 3 become the leader for most of the tablets. First, I paused tserver 1, so it lost leadership. Then I resumed tserver 1 and paused tserver 2, so it lost leadership. Then I resumed tserver 2. After that I did some random stop/resume (and possibly kill) of yb-tserver, and finally got the crash.
The node that crashed was tserver 3 (but other nodes might have crashed too; I don't have the full logs now).
Easier reproducer:
Summary:

Problem description:
- To resolve a partition_key to a tablet_id, `MetaCache` was using `YBTable::FindPartitionStart` and then translating `partition_start_key` to `tablet_id` based on `MetaCache::TableData::tablets_by_partition`. Because the key of the `tablets_by_partition` map is the first (lowest) key of the partition, the same partition start key can be mapped to different tablet ids before and after splitting.
- A `YBTable::partitions` update also invalidates `MetaCache` for this table. This would be sufficient if we only had a single `YBTable` instance per table, because in that case `YBTable::partitions` would never be older than the partitions version used to fill `MetaCache::TableData::tablets_by_partition`.
- But it turned out we can have more than one `YBTable` instance for the same table and use them concurrently to send requests to the tserver. This can cause the following scenario:
  1) `yb_table_1` and `yb_table_2` are initialized and get the initial table partitions containing tablet `T`, which serves keys `10..20`.
  2) Some data is written into the table; `MetaCache::TableData::tablets_by_partition` maps `partition_start_key = 10` to tablet `T`.
  3) Tablet `T` is split into `T1` (keys `10..14`) and `T2` (keys `15..20`).
  4) `yb_table_1` is used to send a request for key `15`, which now belongs to `T2`. First, the request goes to tablet `T` due to the outdated `yb_table_1::partitions` and `MetaCache::TableData::tablets_by_partition`. Tablet `T` returns that it has been split or deleted (if it was deleted, `TabletInvoker` tries to get the new tablet `T` locations from the master using `YBClient::LookupTabletById` and gets the new table_partitions_version). In both cases, this leads to refreshing the `yb_table_1` table partitions and invalidating `MetaCache`, and the request finally reaches tablet `T2`.
  5) `yb_table_2` is used to send a request for key `15`. `yb_table_2::partitions` is still old, so `yb_table_2::FindPartitionStart` returns `10`, which is translated by the updated `MetaCache::TableData::tablets_by_partition` into `T1`, and `Batcher` tries to route the request to `T1` instead of `T2`.

Solution (see the sketch after this list):
- Add `TableData::partitions`, which is a versioned partitions list.
- Maintain `TableData::tablets_by_partition` and `TableData::tablet_lookups_by_group` so that they correspond to the version of `TableData::partitions`.
- Use `TableData::partitions` instead of `YBTable::partitions` for getting the `partition_start_key` of the key we need to route to the correct tablet.
- Make sure the same table partitions version is used during a single cycle of resolving a key to a tablet; if this condition is violated, return an error that triggers invalidation and update of the table partitions and `MetaCache::TableData`.
- Add `RemoteTablet::last_known_partition_version`, updated on each GetTable(t)Locations response from the master; it is the latest table partitions version for which we know the tablet was serving data for the table.
- If in step 4) of the problem description we get a "tablet not found" error from the master, we compare `RemoteTablet::last_known_partition_version` with the table partitions_version from the master response and trigger a table partitions refresh on the `YBClient` side (which also invalidates `MetaCache` for this table).
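Below is a minimal, self-contained C++ sketch of the versioned-partitions idea from the solution above. The type and function names (`VersionedTablePartitionList`, `TableDataSketch`, `LookupTabletByKey`) are illustrative placeholders, not the actual `yb::client` code: a lookup yields a tablet only if the partitions version it started with still matches the version `tablets_by_partition` was built for; otherwise the caller should refresh partitions and retry.

```cpp
// Sketch only: hypothetical types illustrating versioned partition lookup.
#include <algorithm>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct VersionedTablePartitionList {
  std::vector<std::string> keys;  // Sorted partition start keys.
  int64_t version = 0;            // Table partitions version from the master.
};

struct TableDataSketch {
  std::shared_ptr<VersionedTablePartitionList> partitions;
  // Keyed by partition start key; valid only for partitions->version.
  std::map<std::string, std::string> tablets_by_partition;
};

// Returns the tablet id, or std::nullopt if the cached state is stale and the
// caller should refresh the table partitions and retry the lookup.
std::optional<std::string> LookupTabletByKey(
    const TableDataSketch& table_data, const std::string& partition_key,
    int64_t expected_partitions_version) {
  if (!table_data.partitions ||
      table_data.partitions->version != expected_partitions_version) {
    return std::nullopt;  // Version changed mid-lookup: force a refresh.
  }
  // FindPartitionStart equivalent: greatest start key <= partition_key.
  const auto& keys = table_data.partitions->keys;
  auto it = std::upper_bound(keys.begin(), keys.end(), partition_key);
  if (it == keys.begin()) {
    return std::nullopt;
  }
  const std::string& partition_start = *std::prev(it);
  auto tablet_it = table_data.tablets_by_partition.find(partition_start);
  if (tablet_it == table_data.tablets_by_partition.end()) {
    return std::nullopt;
  }
  return tablet_it->second;
}
```

Because the partitions list and the tablet map are versioned together, the stale `yb_table_2` in step 5 can no longer resolve key `15` against a newer map: the version mismatch surfaces as a retryable error instead of a misrouted request.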
Problem #2:
- It can happen that between the time we send a `LookupByKeyRpc` for a specific `partition_group_start` key and the time we receive the response, `MetaCache` for this table is updated with a new partitions list and therefore the partition groups change, so we can't use the information from the response to this RPC to provide data to lookups interested in the partition group starting with the `partition_group_start` key.
- To resolve this, `LookupByKeyRpc::partition_group_start_` was defined as a `VersionedPartitionGroupStartKey` type that includes the partition list version, and checks were implemented so that all 3 versions match (from `LookupByKeyRpc`, from `MetaCache::TableData`, and from the response to the RPC); see the sketch below.

Test Plan:
```
./bin/yb-ctl --rf=1 create --num_shards_per_tserver=1 --ysql_num_shards_per_tserver=1 --master_flags '"tablet_split_size_threshold_bytes=300000","enable_tracing=true"' --tserver_flags '"db_write_buffer_size=100000"'
java -jar ~/code/yb-sample-apps/target/yb-sample-apps.jar --workload CassandraSecondaryIndex --nodes 127.0.0.1:9042 --num_threads_read 2 --num_threads_write 2 --num_unique_keys 10000000 --nouuid
```

Reviewers: sergei, bogdan, rsami, mbautin

Reviewed By: mbautin

Subscribers: mbautin, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10424
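A minimal sketch of the three-way version check described in Problem #2 above. The field and function names are assumptions for illustration (only the `VersionedPartitionGroupStartKey` type name comes from the commit message): the version captured when the RPC was sent, the current `MetaCache::TableData` version, and the version reported in the RPC response must all agree before the response is used to complete lookups for that partition group.

```cpp
// Sketch only: hypothetical version-consistency check for LookupByKeyRpc.
#include <cstdint>
#include <string>

struct VersionedPartitionGroupStartKey {
  std::string partition_group_start;   // First key of the partition group.
  int64_t partition_list_version = 0;  // Version the group was computed from.
};

// Returns true only if the RPC, the cached TableData, and the response all
// refer to the same partition list version; otherwise the response is stale
// and the lookup must be retried against refreshed partitions.
bool CanApplyLookupByKeyResponse(
    const VersionedPartitionGroupStartKey& rpc_key,
    int64_t table_data_partition_list_version,
    int64_t response_partition_list_version) {
  return rpc_key.partition_list_version == table_data_partition_list_version &&
         rpc_key.partition_list_version == response_partition_list_version;
}
```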
…itions list for co-located tables

Summary: As a result of the fix for #6890, a potential perf issue was introduced for the first lookup of a tablet by key for colocated tables. Instead of sending 1 RPC on the first lookup for a colocated table and then reusing the result for all tables co-located with it, MetaCache sends 1 more RPC each time another table co-located with the first one is queried to resolve a tablet by key. Since all colocated tables share the same tablet, we can cache the locations on the first RPC to any co-located table and then reuse the result for `MetaCache::LookupTabletByKey` calls for any other table co-located with the one already queried.

Suppose we have colocated tables `Table1` and `Table2` sharing `Tablet0`; then the behavior without the fix is the following:
1. Someone calls `MetaCache::LookupTabletByKey` for `Table1` and `partition_key=p`
2. `MetaCache` checks that it doesn't have `TableData` for `Table1`, initializes `TableData` for `Table1` with the list of partitions for `Table1`, and sends an RPC to the master
3. The master returns tablet locations that contain tablet locations for both `Table1` and `Table2`, because they are colocated and share the same tablet set
4. `MetaCache` updates `TableData::tablets_by_partition` for `Table1`
5. The caller gets `Tablet0` as a response to `MetaCache::LookupTabletByKey`
6. Someone calls `MetaCache::LookupTabletByKey` for `Table2` and `partition_key=p`
7. `MetaCache` checks that it doesn't have `TableData` for `Table2` and sends an RPC to the master

With the fix, at step 4 `MetaCache` also initializes `TableData` for `Table2` using the same partitions list that was used for `Table1` and updates `TableData::tablets_by_partition` for both tables. So at step 7, `MetaCache` already has `TableData` for `Table2` and responds with the tablet without an RPC to the master (see the sketch below).

- Fixed `MetaCache::ProcessTabletLocations` to reuse the partitions list for co-located tables
- Added ClientTest.ColocatedTablesLookupTablet
- Moved the most frequent VLOGs from level 4 to level 5 for `MetaCache`

Test Plan: For ASAN/TSAN/release/debug:
```
ybd --gtest_filter ClientTest.ColocatedTablesLookupTablet -n 100 -- -p 1
```

Reviewers: mbautin, bogdan

Reviewed By: mbautin, bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10755
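A minimal sketch of the reuse described above, using hypothetical structures rather than the actual `MetaCache::ProcessTabletLocations` code: since a tablet locations response names every co-located table hosted by the tablet, one response can populate a cache entry for each of those tables, so later lookups for the other tables need no extra RPC.

```cpp
// Sketch only: hypothetical types illustrating reuse of one locations
// response for all co-located tables sharing a tablet.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TabletLocationSketch {
  std::string tablet_id;
  std::string partition_start;
  std::vector<std::string> hosted_table_ids;  // All co-located tables on this tablet.
};

struct ColocatedTableDataSketch {
  std::shared_ptr<std::vector<std::string>> partitions;  // Shared partitions list.
  std::map<std::string, std::string> tablets_by_partition;
};

void ProcessTabletLocationsSketch(
    const std::vector<TabletLocationSketch>& locations,
    const std::shared_ptr<std::vector<std::string>>& queried_table_partitions,
    std::map<std::string, ColocatedTableDataSketch>* tables_by_id) {
  for (const auto& location : locations) {
    for (const auto& table_id : location.hosted_table_ids) {
      // Create TableData for every co-located table named in the response,
      // reusing the partitions list of the table that was actually queried,
      // so later LookupTabletByKey calls for those tables hit the cache.
      auto& table_data = (*tables_by_id)[table_id];
      if (!table_data.partitions) {
        table_data.partitions = queried_table_partitions;
      }
      table_data.tablets_by_partition[location.partition_start] = location.tablet_id;
    }
  }
}
```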
yb-tserver crashed with the following FATAL log: