ClickHouse exporter produces duplicates and poor compression without sorting attributes #33634
Comments
Interesting information. We actually have a PR open to update the table (#33611). Considering the complexity of the materialized view required, it might be best to do this in the exporter code. Map iteration order is not deterministic in Go, so we would need to convert the map to a slice and sort it. Any thoughts on this approach?
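To make the "convert to a slice and sort" idea concrete, here is a minimal sketch in Go. The `kv` type and the `sortedAttributes` helper are hypothetical names used only for illustration; they are not taken from the exporter code.

```go
package main

import (
	"fmt"
	"sort"
)

// kv is a hypothetical key/value pair used only for this illustration.
type kv struct {
	Key, Value string
}

// sortedAttributes copies a map into a slice ordered by key, so that the
// same set of attributes always yields the same sequence regardless of
// Go's randomized map iteration order.
func sortedAttributes(attrs map[string]string) []kv {
	out := make([]kv, 0, len(attrs))
	for k, v := range attrs {
		out = append(out, kv{Key: k, Value: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	attrs := map[string]string{
		"service.name": "api",
		"host.name":    "node-1",
		"env":          "prod",
	}
	for _, p := range sortedAttributes(attrs) {
		fmt.Printf("%s=%s\n", p.Key, p.Value)
	}
}
```

The cost is one extra slice allocation and an O(n log n) sort per attribute map, which is the trade-off against doing the sort inside ClickHouse.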
Yes, map sort can improve compression, clickhouse-go sdk support |
still valid
This can be closed once #35725 is merged.
Is this an optimization? Here is my table definition:
Here is my insert:
It seems there is no difference in the compressed bytes!
ClickHouse server version:
@zdyj3170101136 this is both an optimization and a correctness issue. Your example is incomplete without knowing the layout the Besides compression, it is also a question of query efficiency, e.g.
…3634 (#35725)

#### Description

Our attributes are stored as Map(String, String) in CH. By default the order of keys is undefined and, as described in #33634, this leads to worse compression and duplicates in `group by` (unless carefully accounted for).

This PR uses the `column.IterableOrderedMap` facility from clickhouse-go to ensure a fixed attribute key order. It is a reimplementation of #34598 that uses fewer allocations and is (arguably) somewhat more straightforward.

I'm **opening this as a draft**, because this PR (and #34598) are blocked by ClickHouse/clickhouse-go#1365 (fixed in ClickHouse/clickhouse-go#1418).

In addition, I'm trying to add the implementation of `column.IterableOrderedMap` used here to clickhouse-go upstream: ClickHouse/clickhouse-go#1417. If it is accepted, I will amend this PR accordingly.

#### Link to tracking issue

Fixes #33634

#### Testing

The IOM implementation was used in production independently. I'm planning to build otelcollector with this PR and cut over my production to it in the next few days.
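As a rough sketch of what an ordered-map wrapper of this kind could look like: the `orderedMap` and `mapIterator` interfaces below are local stand-ins, modeled on what I understand the shape of `column.IterableOrderedMap` to be (a `Put` method plus an iterator with `Next`, `Key`, `Value`). That shape is an assumption and should be checked against the clickhouse-go source. The `sortedAttrs` type is hypothetical and is not the implementation from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// mapIterator and orderedMap are local stand-ins (assumed shape, not
// imported from clickhouse-go) so that this example is self-contained.
type mapIterator interface {
	Next() bool
	Key() any
	Value() any
}

type orderedMap interface {
	Put(key any, value any)
	Iterator() mapIterator
}

// sortedAttrs keeps its keys in a sorted slice so iteration order is fixed.
type sortedAttrs struct {
	keys   []string
	values map[string]string
}

func newSortedAttrs() *sortedAttrs {
	return &sortedAttrs{values: map[string]string{}}
}

// Put inserts or overwrites a string pair; it panics on non-string input.
func (m *sortedAttrs) Put(key any, value any) {
	k, v := key.(string), value.(string)
	if _, seen := m.values[k]; !seen {
		// Binary-search the insertion point and shift the tail right.
		i := sort.SearchStrings(m.keys, k)
		m.keys = append(m.keys, "")
		copy(m.keys[i+1:], m.keys[i:])
		m.keys[i] = k
	}
	m.values[k] = v
}

func (m *sortedAttrs) Iterator() mapIterator { return &attrIter{m: m, pos: -1} }

type attrIter struct {
	m   *sortedAttrs
	pos int
}

func (it *attrIter) Next() bool { it.pos++; return it.pos < len(it.m.keys) }
func (it *attrIter) Key() any   { return it.m.keys[it.pos] }
func (it *attrIter) Value() any { return it.m.values[it.m.keys[it.pos]] }

func main() {
	var m orderedMap = newSortedAttrs()
	m.Put("service.name", "api")
	m.Put("env", "prod")
	m.Put("host.name", "node-1")
	for it := m.Iterator(); it.Next(); {
		fmt.Printf("%v=%v\n", it.Key(), it.Value())
	}
}
```

Keeping the keys sorted at insert time means the iterator itself does no sorting, so each row is walked in key order without a per-row sort call.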
Component(s)
exporter/clickhouse
Is your feature request related to a problem? Please describe.
The default table created by the exporter is not a good pattern for optimizing compression or removing duplicates. ClickHouse does not sort map keys, so two otherwise identical records may store their attributes in different orders. ClickHouse then treats them as distinct records for storage and in merge trees. This also affects ClickHouse's compression, so the same data takes up much more disk space.
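The effect can be illustrated with a small Go sketch. The `encode` function is a deliberately simplified stand-in for writing a map column in key arrival order; it is an assumption made for illustration and not ClickHouse's actual storage format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pair is a hypothetical attribute key/value used only for this example.
type pair struct{ Key, Value string }

// encode joins pairs in the order given, standing in for a writer that
// stores map entries in whatever order they arrive.
func encode(pairs []pair) string {
	parts := make([]string, len(pairs))
	for i, p := range pairs {
		parts[i] = p.Key + "=" + p.Value
	}
	return strings.Join(parts, ",")
}

// sortedByKey returns a copy of pairs ordered by key.
func sortedByKey(pairs []pair) []pair {
	out := append([]pair(nil), pairs...)
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	a := []pair{{"env", "prod"}, {"host", "node-1"}}
	b := []pair{{"host", "node-1"}, {"env", "prod"}}

	// Same attributes, different order: the encodings differ.
	fmt.Println(encode(a) == encode(b)) // false

	// After sorting by key, the encodings are identical.
	fmt.Println(encode(sortedByKey(a)) == encode(sortedByKey(b))) // true
}
```

Identical byte sequences are what allow equal rows to compare as equal and repeated values to compress well, which is why a fixed key order matters here.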
Describe the solution you'd like
We identified this issue, and our solution was to use a Null engine for the primary table the exporter writes to, then explicitly sort the attributes in a materialized view before insert.
mapSort(Attributes) as Attributes,
After this, the compression ratio for billions of rows was greater than 250, which greatly reduced the storage needed. It also eliminated duplicates and helped streamline the increase functions so we could avoid extra processing.
This makes the initial table creation a bit trickier, but it is critical in my experience.
Describe alternatives you've considered
No response
Additional context
No response