
[native] SystemConnector to query system.runtime.tasks table #21416

Merged · 1 commit · Apr 26, 2024

Conversation

@aditi-pandit (Contributor) commented Nov 18, 2023

Description

SystemConnector is the Presto connector for system tables. System tables include the runtime schema tables (system.runtime.{nodes|tasks|queries|transactions}), the properties tables (table properties, schema properties, column properties, analyze properties), and the Hive and Iceberg metadata tables.

SystemConnector tables are unique in that all of them are populated from metadata structures on the coordinator (and optionally from the workers). This metadata can be internal process metadata, for the runtime tables, or metadata obtained from the HMS/Iceberg catalog.

The distribution of a SystemTable can be ALL_NODES, ALL_COORDINATORS, or SINGLE_COORDINATOR (from https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/SystemTable.java#L24).

Only one table, system.runtime.tasks, is populated on ALL_NODES, so this table gets results from the coordinator as well as the workers.
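As a sketch, the distribution modes above drive where a system table produces rows. The enum below mirrors the Java SPI linked above; the `producesRows` helper is a hypothetical illustration of the placement decision, not code from this PR:

```cpp
#include <stdexcept>

// Mirrors com.facebook.presto.spi.SystemTable.Distribution from the Java SPI.
enum class Distribution {
  ALL_NODES,          // populated on every node, e.g. system.runtime.tasks
  ALL_COORDINATORS,   // populated on every coordinator
  SINGLE_COORDINATOR  // populated on one coordinator only
};

// Hypothetical helper: does this node produce rows for a table with the
// given distribution?
bool producesRows(Distribution d, bool isCoordinator) {
  switch (d) {
    case Distribution::ALL_NODES:
      return true;
    case Distribution::ALL_COORDINATORS:
    case Distribution::SINGLE_COORDINATOR:
      return isCoordinator;
  }
  throw std::logic_error("unknown distribution");
}
```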

In the past, querying this table was broken in Prestissimo since there was no system catalog/connector on the workers. This PR enhances the native workers with a System connector/catalog that is used to populate the tasks table. The SystemConnector uses the Presto TaskManager task map to populate this table.
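In spirit, the worker-side population is a walk over the task map, emitting one row per task. Below is a much-simplified sketch; `TaskInfo` and `tasksTableRows` are hypothetical stand-ins, while the real connector reads the Presto TaskManager's task map and builds Velox vectors with many more columns:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, much-simplified stand-in for one entry of the worker's task
// map; the real tasks table has many more columns (node_id, stage, timing,
// memory, ...).
struct TaskInfo {
  std::string taskId;
  std::string state;
};

// Emit one row per task known to this node, in task-id order.
std::vector<TaskInfo> tasksTableRows(
    const std::map<std::string, TaskInfo>& taskMap) {
  std::vector<TaskInfo> rows;
  rows.reserve(taskMap.size());
  for (const auto& entry : taskMap) {
    rows.push_back(entry.second);
  }
  return rows;
}
```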

There is one more design point. The Java coordinator is not fully compatible with native workers: they use different hash functions and different intermediate results for aggregations, so some changes were needed for running the system connector on the coordinator.

The changes are to run the system connector on the coordinator only and to introduce an exchange before partial aggregations (#21725 and #21285). Both of these planning rules are applicable to the tasks table as well, since it generates data at both the coordinator and the workers.

Motivation and Context

The system.runtime.tasks table is very frequently used in deployment scripts, and querying this table was broken in Prestissimo (#21413).

Test Plan

Added e2e tests.

== RELEASE NOTES ==

General Changes
* Add support for querying the system.runtime.tasks table in native clusters

@majetideepak (Collaborator) left a comment:

@aditi-pandit Nice change! Some comments.

@mbasmanova (Contributor) left a comment:

@aditi-pandit Is there a design document about this connector? I assume SystemConnector requires access to Metastore and we don't have that on the worker. Hence, wondering how this will work?

@mbasmanova (Contributor):

CC: @spershin

@aditi-pandit (Author):

> @aditi-pandit Is there a design document about this connector? I assume SystemConnector requires access to Metastore and we don't have that on the worker. Hence, wondering how this will work?

@mbasmanova: This SystemConnector code is for querying the tasks table, which seemed to be the only part of the SystemConnector needed at the worker for Prestissimo.

Other tables, like nodes and queries, are populated from in-memory structures on the coordinator itself. Any code accessing the Metastore (like TablePropertiesSystemTable, say) seemed to be required only in the coordinator part of the connector.

I just spent a day on this prototype to wire the pieces together. I haven't put together a design doc.

@mbasmanova (Contributor) left a comment:

@aditi-pandit Aditi, thank you for clarifying. It is interesting that tasks table is populated on the workers. I wonder why. All the information is available on the coordinator. CC: @tdcmeehan

@tdcmeehan (Contributor) commented Feb 22, 2024:

> @aditi-pandit Aditi, thank you for clarifying. It is interesting that tasks table is populated on the workers. I wonder why. All the information is available on the coordinator. CC: @tdcmeehan

I think the reason for this is because historically you could always deploy Presto in a mode where many or all of the workers also functioned as coordinators. In this mode, any single coordinator would only know of the tasks whose queries are local to that coordinator.

@mbasmanova (Contributor):

> I think the reason for this is because historically you could always deploy Presto in a mode where many or all of the workers also functioned as coordinators. In this mode, any single coordinator would only know of the tasks whose queries are local to that coordinator.

@tdcmeehan Tim, thank you for clarifying. I didn't know about this deployment scheme. I'm not sure I understand how this works though. When there are multiple coordinators, wouldn't query results depend on which coordinator is being asked to process the query? Are you saying that in this setup a query can be routed to any coordinator and the results are expected to be the same? I guess in this case it is necessary to ask all the workers to report their tasks since as you pointed out a single coordinator knows about a subset of tasks only.

@mbasmanova (Contributor):

Generally speaking, Java workers are not compatible with native workers. They use different hash functions and different intermediate results for aggregations. Hence, we had to make a change to run the system connector only on the coordinator and to introduce an exchange before the partial aggregation. These changes may get in the way of making this PR work.

See #21725 and #21285

@tdcmeehan (Contributor):

> > I think the reason for this is because historically you could always deploy Presto in a mode where many or all of the workers also functioned as coordinators. In this mode, any single coordinator would only know of the tasks whose queries are local to that coordinator.
>
> @tdcmeehan Tim, thank you for clarifying. I didn't know about this deployment scheme. I'm not sure I understand how this works though. When there are multiple coordinators, wouldn't query results depend on which coordinator is being asked to process the query? Are you saying that in this setup a query can be routed to any coordinator and the results are expected to be the same? I guess in this case it is necessary to ask all the workers to report their tasks since as you pointed out a single coordinator knows about a subset of tasks only.

In this scheme, queries are sticky to a single coordinator (after you POST a query, each nextUri always returns the host of the local coordinator). It's presumed there's something in front of the cluster to distribute the query creation, like a load balancer.

@mbasmanova (Contributor):

@tdcmeehan Tim, thank you for clarifying. One more follow-up question. In a multi-coordinator deployment, do all workers report themselves to all coordinators or a given worker is fixed-assigned to just one coordinator? In other words, do we have N coordinators managing a shared pool of workers or we have just N "mini" clusters that are independent of each other?

@tdcmeehan (Contributor) commented Feb 28, 2024:

> @tdcmeehan Tim, thank you for clarifying. One more follow-up question. In a multi-coordinator deployment, do all workers report themselves to all coordinators or a given worker is fixed-assigned to just one coordinator? In other words, do we have N coordinators managing a shared pool of workers or we have just N "mini" clusters that are independent of each other?

Workers report themselves to a single discovery service, which is either replicated to other coordinators in an eventually consistent manner, or the discovery service is a single process which is separate from the coordinators. Originally, when this system connector was written, there was no concept of shared resources (e.g. resource groups, global memory management, etc.) and it relied purely on individual backpressure from workers, although there are now tools to help make that work.

@mbasmanova (Contributor):

@tdcmeehan Tim, I wonder if it still makes sense to support this deployment model. What do you think? Does it make sense to consider it when thinking about native workers?

@tdcmeehan (Contributor):

Tactically and short term, I think it would be great to support this if there was an easy and not hacky way to get it to work with #21725 and #21285. But given that most people would be deploying Presto for their large to medium size data lakes, I don't think an Impala-style deployment model makes sense for Presto's future, and personally I feel comfortable saying we can deprecate it in the future.

That being said, system tables in the coordinator present a challenge for what I feel is one of the end goals of moving to native, which is simplifying our Java code. I'd like to think about a way to move this to C++ so it doesn't need to be present in the Java SPI (thinking way ahead in the future, if the only reason we retain page source providers is for system tables, I think it would be worthwhile to think about how to move system tables to C++). So I'd like to revisit the presumption at some point that system tables must be coordinator-provided tables, since even now that's not necessarily true.

@aditi-pandit (Author) commented Mar 1, 2024:

@mbasmanova, @tdcmeehan: Thanks for the discussion. It has been informative.

If we want to stay with this approach of getting the tasks table on the worker, we could modify #21725 and #21285 to not perform those rewrites for the system.runtime.tasks table specifically, as it is based on the worker.

#21725 could work unmodified as well. It would just mean that we don't allow partial agg over the tasks table, which might not be a big deal unless a massive number of queries is scheduled in the cluster.

wdyt?

@majetideepak (Collaborator):

The other fixable issue we are hitting internally in a large setup when querying system tables is that the native worker does not yet handle chunked HTTP responses. @tdcmeehan, do you know what causes a chunked HTTP response from the coordinator? I tried reproducing with a large system table (many entries) but could not.

https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/PrestoExchangeSource.cpp#L246

@mbasmanova (Contributor):

@majetideepak Chunked responses used to be produced by the task/async endpoint, which was removed in #21772. You should not be seeing issues if you update past that PR.

@majetideepak (Collaborator):

@mbasmanova thank you for the pointer!

@aditi-pandit aditi-pandit force-pushed the system_tables branch 4 times, most recently from 4aae242 to a79a770 Compare March 22, 2024 17:30
@mbasmanova (Contributor) left a comment:

Some comments.

@aditi-pandit aditi-pandit force-pushed the system_tables branch 2 times, most recently from 4e3f161 to 5f792c2 Compare March 23, 2024 02:00
@aditi-pandit aditi-pandit force-pushed the system_tables branch 2 times, most recently from 8d7e1a7 to 25cda8c Compare March 26, 2024 21:53
@aditi-pandit aditi-pandit marked this pull request as ready for review March 26, 2024 21:59
@aditi-pandit aditi-pandit force-pushed the system_tables branch 2 times, most recently from 38e1e4e to 7819d14 Compare March 26, 2024 22:08
@aditi-pandit (Author):

@mbasmanova, @majetideepak: Thanks for your previous input. The code is now ready for a full review. Looking forward to your comments.

@aditi-pandit (Author):

> @mbasmanova, @tdcmeehan: Thanks for the discussion. It has been informative.
>
> If we want to stay with this approach of getting the tasks table on the worker, we could modify #21725 and #21285 to not perform those rewrites for the system.runtime.tasks table specifically, as it is based on the worker.
>
> #21725 could work unmodified as well. It would just mean that we don't allow partial agg over the tasks table, which might not be a big deal unless a massive number of queries is scheduled in the cluster.
>
> wdyt?

The tasks table gets data from all nodes, so from both the coordinator and the workers. Since the coordinator generates data, both of the earlier planner rules also apply.

@arhimondr (Member) left a comment:

Mostly style related comments / questions. Otherwise looks good.

presto-native-execution/presto_cpp/main/SystemConnector.h (outdated review thread, resolved)
```cpp
class SystemTableHandle : public velox::connector::ConnectorTableHandle {
 public:
  explicit SystemTableHandle(
      std::string connectorId,
```
@arhimondr (Member) commented:

Do we usually prefer to pass by value and then move? Or pass by const reference and copy? When do we prefer one over the other?

@aditi-pandit (Author) replied Apr 5, 2024:

@arhimondr: Good question. I prefer pass by const reference and copy, to avoid use-after-move at the caller. But I've seen pass by value and move as a common pattern in Velox, especially in PlanNode construction.
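To make the trade-off concrete, here is a minimal sketch of the two patterns (`ByValue` and `ByConstRef` are illustrative names, not classes from this PR): pass-by-value-then-move costs a copy plus a move for lvalue arguments but only a move for rvalues, while pass-by-const-reference-and-copy always costs exactly one copy and never leaves the caller's argument in a moved-from state.

```cpp
#include <string>
#include <utility>

// Pattern 1: pass by value, then move into the member.
// Lvalue argument: one copy + one move. Rvalue argument: one move.
// If the caller writes std::move(arg), arg is left moved-from afterwards.
class ByValue {
 public:
  explicit ByValue(std::string id) : id_(std::move(id)) {}
  const std::string& id() const { return id_; }

 private:
  std::string id_;
};

// Pattern 2: pass by const reference, then copy into the member.
// Always exactly one copy; the caller's argument is never modified, which
// avoids accidental use-after-move at the call site.
class ByConstRef {
 public:
  explicit ByConstRef(const std::string& id) : id_(id) {}
  const std::string& id() const { return id_; }

 private:
  std::string id_;
};
```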

presto-native-execution/presto_cpp/main/SystemConnector.h (3 outdated review threads, resolved)
@aditi-pandit aditi-pandit force-pushed the system_tables branch 3 times, most recently from c64d4a5 to b98a57c Compare April 23, 2024 06:27
@majetideepak (Collaborator) left a comment:

@aditi-pandit some comments. Thanks!

presto-native-execution/presto_cpp/main/PrestoServer.cpp (outdated review thread, resolved)
presto-native-execution/presto_cpp/main/PrestoTask.cpp (2 outdated review threads, resolved)
```
@@ -804,6 +807,9 @@ folly::dynamic PrestoTask::toJson() const {
  obj["lastHeartbeatMs"] = lastHeartbeatMs;
  obj["lastTaskStatsUpdateMs"] = lastTaskStatsUpdateMs;
  obj["lastMemoryReservation"] = lastMemoryReservation;
  obj["createTime"] = createTime;
```
@majetideepak (Collaborator) commented:

Why are we updating these values in this PR?

@aditi-pandit (Author) replied:

@majetideepak: In the other createTime fields the values were changed to timestamps, so there was conversion back and forth. Hence these new fields were added.
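The back-and-forth conversion being avoided is roughly the following: once a creation time is stored only as a formatted timestamp string, recovering the raw epoch value means parsing the string again. A hedged illustration (`toUtcTimestamp` is a hypothetical helper, not code from this PR; it assumes a POSIX `gmtime_r`):

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper illustrating the conversion concern: format an
// epoch-milliseconds value as a UTC timestamp string. Keeping the raw
// millisecond field alongside the formatted one avoids parsing it back.
std::string toUtcTimestamp(int64_t epochMs) {
  std::time_t seconds = static_cast<std::time_t>(epochMs / 1000);
  std::tm tmBuf{};
  gmtime_r(&seconds, &tmBuf);  // POSIX; use gmtime_s on Windows
  char out[32];
  std::strftime(out, sizeof(out), "%Y-%m-%dT%H:%M:%SZ", &tmBuf);
  return std::string(out);
}
```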


ConnectorTableLayoutHandle:
super: JsonEncodedSubclass
subclasses:
- { name: HiveTableLayoutHandle, key: hive }
- { name: IcebergTableLayoutHandle, key: hive-iceberg }
- { name: TpchTableLayoutHandle, key: tpch }
- { name: SystemTableLayoutHandle, key: $system }
@majetideepak (Collaborator) commented:

Why don't we need system and $system@system here?

@aditi-pandit (Author) replied:

We need the protocol JSON classes to be generated only once in this script, so there isn't a need for all 3 catalog-name mappings here. The mapping of the protocol to the key/catalog name now happens in the PrestoToVeloxConnector code, so that's where the 3 catalog-name mappings live.
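A hedged sketch of that split, with hypothetical names: the protocol subclass is generated once under the `$system` key, while the connector-side lookup accepts all three catalog spellings mentioned above (`system`, `$system`, `$system@system`) and resolves them to the same connector key. The real registration lives in PrestoToVeloxConnector; this is only an illustration of the idea:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical alias table: all three catalog names for the system
// connector resolve to the single generated protocol key "$system".
const std::map<std::string, std::string>& systemCatalogAliases() {
  static const std::map<std::string, std::string> aliases{
      {"system", "$system"},
      {"$system", "$system"},
      {"$system@system", "$system"},
  };
  return aliases;
}

// Resolve a catalog name to its connector key, or fail loudly.
std::string resolveConnectorKey(const std::string& catalogName) {
  auto it = systemCatalogAliases().find(catalogName);
  if (it == systemCatalogAliases().end()) {
    throw std::runtime_error("unknown catalog: " + catalogName);
  }
  return it->second;
}
```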

@czentgr (Contributor) left a comment:

This is nice and a great tutorial on how to implement basic connectors!

@aditi-pandit aditi-pandit force-pushed the system_tables branch 3 times, most recently from 6672878 to faf9257 Compare April 25, 2024 00:39
@aditi-pandit (Author):

@majetideepak: I have addressed your review comments. Would appreciate another pass. Thanks!

@majetideepak (Collaborator) left a comment:

@aditi-pandit A few comments. Thanks!

@majetideepak (Collaborator) left a comment:

Thanks, @aditi-pandit

@aditi-pandit aditi-pandit merged commit aeaa0b7 into master Apr 26, 2024
59 checks passed
@aditi-pandit aditi-pandit deleted the system_tables branch April 27, 2024 01:29
@wanglinsong wanglinsong mentioned this pull request Jun 25, 2024