Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Opt](TabletSchema) reuse TabletColumn info to reduce mem #42448

Merged
merged 2 commits into from
Nov 6, 2024

Conversation

eldenmoon
Copy link
Member

@eldenmoon eldenmoon commented Oct 24, 2024

  1. When there are a large number of identical TabletColumns in the cluster, which usually occurs when VARIANT type columns are modified and added, each Rowset has an individual TabletSchema. Excessive TabletSchemas can lead to significant memory overhead. Reusing memory for identical TabletColumns would greatly reduce this memory consumption.
  2. Serialized TabletSchema as LRU cache key could also increase memusage when large sets of schemas are in LRU cache, so inorder to reduce the memory footprint we just record the key signature caculated by generating an UUID by hash algorithm, and lookup the key signature in LRU cache, and check the key in case of hash collision

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@eldenmoon
Copy link
Member Author

run buildall

@eldenmoon eldenmoon marked this pull request as draft October 24, 2024 15:16
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@eldenmoon
Copy link
Member Author

run buildall

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -942,7 +943,8 @@ void TabletSchema::clear_columns() {
_cols.clear();
}

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns) {
void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'init_from_pb' exceeds recommended size/complexity thresholds [readability-function-size]

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
                   ^
Additional context

be/src/olap/tablet_schema.cpp:945: 85 lines including whitespace and comments (threshold 80)

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
                   ^

@eldenmoon eldenmoon force-pushed the be_column_mem branch 3 times, most recently from 101870d to a98b6c0 Compare October 25, 2024 02:15
@eldenmoon
Copy link
Member Author

run buildall

@eldenmoon eldenmoon changed the title Be column mem [Opt](TabletSchema) reuse column info to reduce mem Oct 29, 2024
@eldenmoon eldenmoon changed the title [Opt](TabletSchema) reuse column info to reduce mem [Opt](TabletSchema) reuse TabletColumn info to reduce mem Oct 29, 2024
@eldenmoon eldenmoon force-pushed the be_column_mem branch 2 times, most recently from a6ae48d to 8e31fa5 Compare October 29, 2024 03:12
@eldenmoon eldenmoon marked this pull request as ready for review October 29, 2024 03:13
@eldenmoon
Copy link
Member Author

run buildall

@eldenmoon
Copy link
Member Author

run buildall

1 similar comment
@eldenmoon
Copy link
Member Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.49% (9727/25946)
Line Coverage: 28.74% (80657/280661)
Region Coverage: 28.15% (41685/148059)
Branch Coverage: 24.72% (21178/85672)
Coverage Report: http://coverage.selectdb-in.cc/coverage/a4fb4efc304993000cafa4eac9ab68dc8fc56c7c_a4fb4efc304993000cafa4eac9ab68dc8fc56c7c/report/index.html

@eldenmoon
Copy link
Member Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.96% (9853/25957)
Line Coverage: 29.23% (82034/280673)
Region Coverage: 28.57% (42299/148070)
Branch Coverage: 25.10% (21499/85666)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6df5f7abaaaab13594aad54e96f4994291637217_6df5f7abaaaab13594aad54e96f4994291637217/report/index.html

@eldenmoon eldenmoon force-pushed the be_column_mem branch 2 times, most recently from d0f8a69 to 0761331 Compare October 31, 2024 07:25
@eldenmoon
Copy link
Member Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.96% (9857/25966)
Line Coverage: 29.20% (82098/281158)
Region Coverage: 28.45% (42358/148876)
Branch Coverage: 25.03% (21522/85990)
Coverage Report: http://coverage.selectdb-in.cc/coverage/076133182948d3e29e1df91e1938fd3c3b894996_076133182948d3e29e1df91e1938fd3c3b894996/report/index.html

@eldenmoon
Copy link
Member Author

run buildall

1. When there are a large number of identical TabletColumns in the cluster, which usually occurs when VARIANT type columns are modified and added, each Rowset has an individual TabletSchema. Excessive TabletSchemas can lead to significant memory overhead. Reusing memory for identical TabletColumns would greatly reduce this memory consumption.
2. Serialized TabletSchema as LRU cache key could also increase memusage when large sets of schemas are in LRU cache, so inorder to reduce the memory footprint we just record the key signature caculated by generating an UUID by hash algorithm, and lookup the key signature in LRU cache, and check the key in case of hash collision
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

}

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns) {
void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'init_from_pb' exceeds recommended size/complexity thresholds [readability-function-size]

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
                   ^
Additional context

be/src/olap/tablet_schema.cpp:956: 88 lines including whitespace and comments (threshold 80)

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
                   ^

@eldenmoon
Copy link
Member Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.84% (9832/25984)
Line Coverage: 29.00% (81742/281883)
Region Coverage: 28.24% (42135/149221)
Branch Coverage: 24.82% (21378/86148)
Coverage Report: http://coverage.selectdb-in.cc/coverage/49747e9db3aa093b06c3f00ab923da751fa46d3f_49747e9db3aa093b06c3f00ab923da751fa46d3f/report/index.html

@eldenmoon
Copy link
Member Author

run buildall

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

}

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns) {
void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'init_from_pb' exceeds recommended size/complexity thresholds [readability-function-size]

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
                   ^
Additional context

be/src/olap/tablet_schema.cpp:963: 88 lines including whitespace and comments (threshold 80)

void TabletSchema::init_from_pb(const TabletSchemaPB& schema, bool ignore_extracted_columns,
                   ^

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.89% (9847/25991)
Line Coverage: 29.05% (81907/281915)
Region Coverage: 28.28% (42185/149182)
Branch Coverage: 24.86% (21408/86104)
Coverage Report: http://coverage.selectdb-in.cc/coverage/7fcc7d58f1c335b94adc0584bc83d2ebd65bd90e_7fcc7d58f1c335b94adc0584bc83d2ebd65bd90e/report/index.html

FieldType::OLAP_FIELD_TYPE_ARRAY,
FieldType::OLAP_FIELD_TYPE_FLOAT,
};
if (column.is_extracted_column() && (invalid_types.contains(column.type()))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why check is_extracted_column?

static std::set<FieldType> invalid_types = {
FieldType::OLAP_FIELD_TYPE_DOUBLE,
FieldType::OLAP_FIELD_TYPE_JSONB,
FieldType::OLAP_FIELD_TYPE_ARRAY,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array is supported by inverted index

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 6, 2024
Copy link
Contributor

github-actions bot commented Nov 6, 2024

PR approved by at least one committer and no changes requested.

Copy link
Contributor

github-actions bot commented Nov 6, 2024

PR approved by anyone and no changes requested.

Copy link
Contributor

@qidaye qidaye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eldenmoon eldenmoon merged commit 743097a into apache:master Nov 6, 2024
25 of 28 checks passed
@eldenmoon eldenmoon deleted the be_column_mem branch November 6, 2024 06:10
eldenmoon added a commit to eldenmoon/incubator-doris that referenced this pull request Nov 6, 2024
1. When there are a large number of identical TabletColumns in the
cluster, which usually occurs when VARIANT type columns are modified and
added, each Rowset has an individual TabletSchema. Excessive
TabletSchemas can lead to significant memory overhead. Reusing memory
for identical TabletColumns would greatly reduce this memory
consumption.
2. Serialized TabletSchema as LRU cache key could also increase memusage
when large sets of schemas are in LRU cache, so inorder to reduce the
memory footprint we just record the key signature caculated by
generating an UUID by hash algorithm, and lookup the key signature in
LRU cache, and check the key in case of hash collision
eldenmoon added a commit to eldenmoon/incubator-doris that referenced this pull request Nov 6, 2024
1. When there are a large number of identical TabletColumns in the
cluster, which usually occurs when VARIANT type columns are modified and
added, each Rowset has an individual TabletSchema. Excessive
TabletSchemas can lead to significant memory overhead. Reusing memory
for identical TabletColumns would greatly reduce this memory
consumption.
2. Serialized TabletSchema as LRU cache key could also increase memusage
when large sets of schemas are in LRU cache, so inorder to reduce the
memory footprint we just record the key signature caculated by
generating an UUID by hash algorithm, and lookup the key signature in
LRU cache, and check the key in case of hash collision
eldenmoon added a commit to eldenmoon/incubator-doris that referenced this pull request Nov 6, 2024
1. When there are a large number of identical TabletColumns in the
cluster, which usually occurs when VARIANT type columns are modified and
added, each Rowset has an individual TabletSchema. Excessive
TabletSchemas can lead to significant memory overhead. Reusing memory
for identical TabletColumns would greatly reduce this memory
consumption.
2. Serialized TabletSchema as LRU cache key could also increase memusage
when large sets of schemas are in LRU cache, so inorder to reduce the
memory footprint we just record the key signature caculated by
generating an UUID by hash algorithm, and lookup the key signature in
LRU cache, and check the key in case of hash collision
eldenmoon added a commit to eldenmoon/incubator-doris that referenced this pull request Nov 6, 2024
1. When there are a large number of identical TabletColumns in the
cluster, which usually occurs when VARIANT type columns are modified and
added, each Rowset has an individual TabletSchema. Excessive
TabletSchemas can lead to significant memory overhead. Reusing memory
for identical TabletColumns would greatly reduce this memory
consumption.
2. Serialized TabletSchema as LRU cache key could also increase memusage
when large sets of schemas are in LRU cache, so inorder to reduce the
memory footprint we just record the key signature caculated by
generating an UUID by hash algorithm, and lookup the key signature in
LRU cache, and check the key in case of hash collision
eldenmoon added a commit to eldenmoon/incubator-doris that referenced this pull request Nov 6, 2024
1. When there are a large number of identical TabletColumns in the
cluster, which usually occurs when VARIANT type columns are modified and
added, each Rowset has an individual TabletSchema. Excessive
TabletSchemas can lead to significant memory overhead. Reusing memory
for identical TabletColumns would greatly reduce this memory
consumption.
2. Serialized TabletSchema as LRU cache key could also increase memusage
when large sets of schemas are in LRU cache, so inorder to reduce the
memory footprint we just record the key signature caculated by
generating an UUID by hash algorithm, and lookup the key signature in
LRU cache, and check the key in case of hash collision
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants