
[Bug]: negative bitmapset member not allowed on query with sort #7384

Closed
dx034 opened this issue Oct 23, 2024 · 24 comments · Fixed by #7399

Comments

@dx034

dx034 commented Oct 23, 2024

What type of bug is this?

Incorrect result

What subsystems and features are affected?

Query executor, Query planner

What happened?

Since upgrading to PostgreSQL 17 with timescaledb 2.17, I encounter the following error when planning or executing certain queries with sorting: SQL Error [XX000]: ERROR: negative bitmapset member not allowed.

Sample query:
SELECT ts, col1, col2, col3 FROM ts_rr WHERE ident=66091 order by ts

The query does not fail if the order by clause is missing. ts is a TIMESTAMP column, ident is int. The table is partitioned by time with a chunk_time_interval of 24 hours. The error also occurs if a filter is applied to ts that would restrict the lookup to one chunk, and it also happens if the query only filters on uncompressed chunks.

The table contains several billion rows, but with the filter the query shouldn't return more than 50 rows.
VACUUM ANALYZE ts_rr was performed before executing the query. The error also occurs on other tables with similar structure.

This might be related to the query planner as the error already occurs when I execute the query with explain.
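For context, a rough sketch of this kind of setup (the extra column names here are illustrative, not the exact production DDL, and this is not a guaranteed reproducer; the real schema and compression settings appear further down in this thread):

CREATE TABLE ts_rr (ident int4 NOT NULL, ts timestamp(0) NOT NULL, col1 int4, col2 int4, col3 int4);
SELECT create_hypertable('ts_rr', 'ts', chunk_time_interval => interval '24 hours');
ALTER TABLE ts_rr SET (timescaledb.compress, timescaledb.compress_segmentby = 'ident', timescaledb.compress_orderby = 'ts');
-- insert data spanning several days, then compress all chunks:
SELECT compress_chunk(c) FROM show_chunks('ts_rr') AS c;
-- planning alone is enough to hit the error:
EXPLAIN SELECT ts, col1, col2, col3 FROM ts_rr WHERE ident = 66091 ORDER BY ts;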

TimescaleDB version affected

2.17.1

PostgreSQL version used

17.0

What operating system did you use?

debian bookworm

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

SQL Error [XX000]: ERROR: negative bitmapset member not allowed

How can we reproduce the bug?

I'm not really sure how to debug this further, but I'm happy to provide more information. I created a similar table and added a subset of the data, but wasn't able to reproduce the error. On the actual tables, the error occurs on every query that sorts the results, so I believe the number of rows in the table causes this error.

I believe that this is connected to Postgres 17, but as the upgrade to timescaledb 2.17 was performed shortly before upgrading Postgres, I'm not able to say for sure after which upgrade this started. However, this also occurs on chunks that were added after the upgrade, so I don't believe it's caused by upgrading.
@dx034 dx034 added the bug label Oct 23, 2024
@antekresic
Contributor

I'm afraid version 2.7.1 doesn't have support for pg17, which was only introduced in 2.17.

Is this a typo, i.e. did you write 2.7 instead of 2.17? If not, I would suggest upgrading to 2.17.
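In case it helps, the upgrade itself is usually just the following, run as the first statement in a fresh session once the new packages are installed:

ALTER EXTENSION timescaledb UPDATE;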

Thanks.

@dx034
Author

dx034 commented Oct 23, 2024

Sorry, I meant 2.17.1. I initially upgraded with 2.17.0, but upgrading to 2.17.1 doesn't solve the problem.

@antekresic
Contributor

Can you try this query on individual chunks? For instance, doing:

SELECT ts, col1, col2, col3 FROM _timescaledb_internal._hyper_x_x_chunk WHERE ident=66091 order by ts

If it succeeds, can you post the explain plan here? I suggest trying it on a compressed and uncompressed chunk as well.
This is just to pinpoint the chunk that might be causing this.
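If it helps to enumerate candidates, something like this (assuming the hypertable is public.ts_rr) lists the chunks and whether they are compressed:

SELECT chunk_schema, chunk_name, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'ts_rr'
ORDER BY range_start;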

Thanks.

@dx034
Author

dx034 commented Oct 23, 2024

Querying the chunk directly doesn't cause the error and returns results as expected.

QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Only Scan using "6896_3571_ts_rr_pkey" on _hyper_8_6896_chunk  (cost=0.57..38.23 rows=224 width=12)
  Index Cond: ((ident = 66091) AND (ts >= (now() - '04:00:00'::interval)) AND (ts <= now()))

The query I ran uses a time restriction that limits it to one chunk, but the same query fails on the main table.

I'll try it on all chunks and will report back.

@dx034
Author

dx034 commented Oct 23, 2024

I tested all chunks and the error occurs whenever I execute it on a compressed chunk. Executing on an uncompressed chunk directly does not lead to an error.

EDIT: It only affects chunks that were compressed since the update, so I guess there might be data corruption in those chunks? Chunks compressed prior to upgrading are not affected.

@antekresic
Contributor

Can you send the compression settings you are using on this hypertable?

Thanks.

@dx034
Author

dx034 commented Oct 23, 2024

From select * from timescaledb_information.compression_settings:

hypertable_schema | hypertable_name | attname | segmentby_column_index | orderby_column_index | orderby_asc | orderby_nullsfirst
------------------+-----------------+---------+------------------------+----------------------+-------------+--------------------
public            | ts_rr           | ident   |                      1 |                      |             |
public            | ts_rr           | ts      |                        |                    1 | true        | false

Please let me know if you need another output format or info from another table.

@antekresic
Contributor

Have you changed these settings on the hypertable? What I'm wondering is whether you have compressed chunks with different compression settings, and whether that might be causing the problem.

Can you check that with SELECT * FROM timescaledb_information.chunk_compression_settings;?

I understand you probably have a lot of chunks; I'm just trying to get to the bottom of the problem.

FWIW, I don't think this is data corruption; it looks more like a planning bug, given that the data is present when the chunk is queried directly.

@dx034
Author

dx034 commented Oct 23, 2024

I haven't changed the settings in a long time. Most of the affected tables don't have that many chunks, as old chunks are deleted. Here's the output of the query; the rows look the same for all chunks:

ts_rr  _timescaledb_internal._hyper_8_6842_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6751_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6765_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6771_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6702_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6782_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6855_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6794_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6802_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6695_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6812_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6825_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6673_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6833_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6683_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6723_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6714_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6733_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6863_chunk  ident  ts
ts_rr  _timescaledb_internal._hyper_8_6744_chunk  ident  ts

@antekresic
Contributor

If there aren't a lot of chunks, you could potentially fix this by recompressing the old chunks, but let's not do that until we get to the bottom of why this is happening in the first place.

I'm trying to reproduce this locally. Any chance you can share the schema of your hypertable?

@dx034
Author

dx034 commented Oct 23, 2024

Here's the schema. Please note that this seems to happen to all tables with a similar schema. In one of them, the ident column also has a different data type. I could try to recompress one chunk and see if manual compression fixes it?

As mentioned before, chunks compressed before the upgrade to pg17 (and timescale 2.17) work fine; it's just the newly compressed chunks.

CREATE TABLE public.ts_rr (
    ident int4 NOT NULL,
    ts timestamp(0) NOT NULL,
    cs text NULL,
    a int4 NULL,
    s int2 NULL,
    vs int2 NULL,
    hea int2 NULL,
    la int4 NULL,
    lo7 int4 NULL,
    CONSTRAINT ts_fr_pkey PRIMARY KEY (hex, ts)
) TABLESPACE spillover;

@antekresic
Contributor

Can you manually decompress the new chunks that are causing issues and run the query again to get the explain plan?

I don't think recompressing a single chunk would help; you would have to do it for all the chunks on the hypertable. Since this is happening on multiple tables with the same structure, you could try it and see if it helps.
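For a single chunk, that would look roughly like this (using one of the chunk names listed above purely as an example):

SELECT decompress_chunk('_timescaledb_internal._hyper_8_6863_chunk');
-- run EXPLAIN on the failing query here
SELECT compress_chunk('_timescaledb_internal._hyper_8_6863_chunk');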

@dx034
Author

dx034 commented Oct 23, 2024

Decompressing all affected compressed chunks of one table will take a while; I'd also have to check whether disk space is sufficient for that.

For now, I decompressed one of the affected chunks. After decompressing, the query worked (it doesn't return any results as the time is out of range for the chunk):
Index Scan Backward using _hyper_8_6863_chunk_ts_rr_ts_idx on _hyper_8_6863_chunk  (cost=0.58..2.80 rows=1 width=12)
  Index Cond: ((ts >= (now() - '04:00:00'::interval)) AND (ts <= now()))
  Filter: (ident = 66091)

I then compressed it again and ran the same query. The error occurred again: SQL Error [XX000]: ERROR: negative bitmapset member not allowed.

Prior to this, I performed VACUUM on the whole database. No errors were logged during compression.

So I guess decompressing would make the error go away, but that isn't feasible due to the available disk space.

@dx034
Author

dx034 commented Oct 23, 2024

One more observation: the error does not seem to occur when I don't filter by ident. Since this is the column the table is segmented by, I guess this filter causes the issue, not the filter on time (which I can remove and still get the same error).
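In other words, roughly:

-- fails with "negative bitmapset member not allowed":
SELECT ts, col1, col2, col3 FROM ts_rr WHERE ident = 66091 ORDER BY ts;
-- runs without the error:
SELECT ts, col1, col2, col3 FROM ts_rr ORDER BY ts;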

@antekresic
Contributor

antekresic commented Oct 23, 2024

Can you show what the plan looks like for older chunks which aren't causing this problem? Or actually, the whole plan would be good.

@dx034
Author

dx034 commented Oct 23, 2024

Can you show what the plan looks like for older chunks which aren't causing this problem? Or actually, the whole plan would be good.

Sure:
Custom Scan (DecompressChunk) on _hyper_8_6733_chunk  (cost=1.57..3.14 rows=2000 width=12)
  Vectorized Filter: ((ts <= "timestamp"(now())) AND (ts >= "timestamp"((now() - '04:00:00'::interval))))
  ->  Index Scan using compress_hyper_10_6768_chunk__compressed_hypertable_10_ident__ts_ on compress_hyper_10_6768_chunk  (cost=0.42..3.14 rows=2 width=60)
        Index Cond: (ident = 66091)

@antekresic
Contributor

Would you mind doing:

\d+ _timescaledb_internal.compress_hyper_10_6768_chunk

Sorry for the long back and forth, I'm still trying to figure out what's going on here.

Thanks!

@dx034
Author

dx034 commented Oct 23, 2024

No worries, thanks for your extensive help! Here's the output:

Table "_timescaledb_internal.compress_hyper_10_6768_chunk" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description -----------------------+---------------------------------------+-----------+----------+---------+----------+-------------+--------------+------------- ident | integer | | | | plain | | 1000 | ts | _timescaledb_internal.compressed_data | | | | external | | 0 | cs | _timescaledb_internal.compressed_data | | | | extended | | 0 | a | _timescaledb_internal.compressed_data | | | | external | | 0 | s | _timescaledb_internal.compressed_data | | | | external | | 0 | vs | _timescaledb_internal.compressed_data | | | | external | | 0 | hea | _timescaledb_internal.compressed_data | | | | external | | 0 | la | _timescaledb_internal.compressed_data | | | | external | | 0 | lo7 | _timescaledb_internal.compressed_data | | | | external | | 0 | _ts_meta_count | integer | | | | plain | | 1000 | _ts_meta_sequence_num | integer | | | | plain | | 1000 | _ts_meta_min_1 | timestamp without time zone | | | | plain | | 1000 | _ts_meta_max_1 | timestamp without time zone | | | | plain | | 1000 | Indexes: "compress_hyper_10_6768_chunk__compressed_hypertable_10_hex__ts_" btree (ident, _ts_meta_sequence_num), tablespace "spillover" Tablespace: "spillover" Access method: heap Options: toast_tuple_target=128

Some further stats to give you an impression of the size of each chunk: _hyper_8_6863_chunk contains ~123m entries with 63148 distinct values of ident.

@dx034
Author

dx034 commented Oct 23, 2024

The details you requested were for a chunk that worked. Was that intended? Here are the details for the (recompressed) chunk that doesn't work; it looks different:

\d+ _timescaledb_internal._hyper_8_6863_chunk
Table "_timescaledb_internal._hyper_8_6863_chunk"
 Column |              Type              | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+--------------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 ident  | integer                        |           | not null |         | plain    |             |              |
 ts     | timestamp(0) without time zone |           | not null |         | plain    |             |              |
 cs     | text                           |           |          |         | extended |             |              |
 a      | integer                        |           |          |         | plain    |             |              |
 s      | smallint                       |           |          |         | plain    |             |              |
 vs     | smallint                       |           |          |         | plain    |             |              |
 hea    | smallint                       |           |          |         | plain    |             |              |
 la     | integer                        |           |          |         | plain    |             |              |
 lo7    | integer                        |           |          |         | plain    |             |              |
Indexes:
    "6863_3553_ts_rr_pkey" PRIMARY KEY, btree (ident, ts)
    "_hyper_8_6863_chunk_ts_rr_ts_idx" btree (ts DESC), tablespace "spillover"
Check constraints:
    "constraint_3553" CHECK (ts >= '2024-10-20 00:00:00'::timestamp without time zone AND ts < '2024-10-21 00:00:00'::timestamp without time zone)
Inherits: ts_rr
Tablespace: "spillover"
Access method: heap

@antekresic
Contributor

Could you check the max OID in your database?

SELECT MAX(oid) FROM pg_class;

Thanks.

@dx034
Author

dx034 commented Oct 23, 2024

All the way at the top: 4294091789. I used to get messages about vacuuming to prevent wraparound, but since vacuuming more often, those have disappeared. That's also why I ran VACUUM on the database before reproducing it here; I hoped that this way it wouldn't play a role.

If it matters, the relation with the highest oid belongs to timescaledb compression: relname: compress_hyper_6_2664_chunk__compressed_hypertable_6_ident__ts_me
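(In case it's useful, a query along these lines shows which relation holds that maximum OID:)

SELECT oid, relname FROM pg_class ORDER BY oid DESC LIMIT 1;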

@antekresic
Contributor

Yeah, that's the problem. The bug is triggered by OIDs being used improperly in certain queries, which is why this happens.
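To make the arithmetic concrete: OIDs are unsigned 32-bit values, so once one exceeds 2^31 - 1 it becomes negative when treated as a signed int bitmapset member. For the maximum OID reported above, roughly:

-- 4294091789 reinterpreted as a signed 32-bit integer (2^32 = 4294967296):
SELECT 4294091789 - 4294967296 AS as_signed_int32;   -- -875507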

I'm gonna work on a bugfix for this, but in the meantime, as a workaround, you might want to consider migrating to a fresh instance to reset the OID values. I don't know how big of a problem that is for you.

Hope this helps.

@dx034
Author

dx034 commented Oct 23, 2024

Migrating to a fresh instance wouldn't really be feasible. But I've stopped sorting by ts and now sort in the calling application instead. That seems to fix it for now, and the performance penalty isn't too big.

Many thanks for all your help, looking forward to the fix!

@vitabaks

vitabaks commented Nov 2, 2024

We also encountered the error negative bitmapset member not allowed (SQLSTATE XX000) after upgrading to version 2.17.1. Unfortunately, we were unable to reproduce this error in staging environments and only experienced it in production.

@antekresic, I noticed that the PR has already been merged, but a new release has not yet been published. Could you please confirm whether building from the source code of the 2.17.x branch would be sufficient?

UPD: Yes, it helped.

git clone --branch 2.17.x --single-branch https://github.com/timescale/timescaledb.git
cd timescaledb
./bootstrap
cd build && make
make install
# restart postgres
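Afterwards, a quick way to confirm which extension version is installed (not part of the original steps, just a sanity check):

SELECT default_version, installed_version FROM pg_available_extensions WHERE name = 'timescaledb';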

@akuzm akuzm added this to the TimescaleDB 2.17.2 milestone Nov 5, 2024
akuzm added a commit to akuzm/timescaledb that referenced this issue Nov 6, 2024
This release contains performance improvements and bug fixes since
the 2.17.1 release. We recommend that you upgrade at the next
available opportunity.

**Features**

**Bugfixes**
* timescale#7384 Fix using OIDs with bitmapsets
* timescale#7388 Use-after-free in vectorized grouping by segmentby columns.

**Thanks**
* @dx034 for reporting an issue with negative bitmapset members due to large OIDs
akuzm added a commit that referenced this issue Nov 6, 2024
This release contains performance improvements and bug fixes since the
2.17.1 release. We recommend that you upgrade at the next available
opportunity.

**Features**

**Bugfixes**
* #7384 Fix using OIDs with bitmapsets
* #7388 Use-after-free in vectorized grouping by segmentby columns

**Thanks**
* @dx034 for reporting an issue with negative bitmapset members due to
large OIDs