Limitation, regarding Spark #121

Closed
tianyin opened this issue Jun 16, 2022 · 4 comments

@tianyin
Member

tianyin commented Jun 16, 2022

First, when we discuss the limitations of a research project, we want to be careful to differentiate fundamental limitations from limitations of our own implementation. Take #120 as an example: not implementing Java code analysis (which, again, would be fun for someone who wants to learn it) is an implementation limitation, not a fundamental one. If we know how to do the analysis in Go, it's straightforward to do the same analysis in Java (which is even simpler [1]).

OTOH, the fundamental limitation is that some of the FP pruning requires static source-code analysis: (1) It requires the availability of source code, which may not hold for proprietary, closed-source operators; however, given that our target users are the developers of operators, this is not a concern. (2) It is not as widely applicable as a purely language-agnostic approach, because one needs to implement the same analysis for every language. (3) It's prohibitively difficult to implement precise static analysis, so it inevitably leads to soundness and completeness issues.

Back to the topic: I don't yet understand whether the short-lived operator issue is a fundamental research limitation or a limitation of the current Acto implementation. If it's the latter, it worries me less (though I understand that the engineering challenges are nontrivial). But if it is fundamental, we should find a time to discuss it in depth.

[1] Java is simpler to analyze because there are more mature tools.

@tylergu
Member

tylergu commented Jun 16, 2022

I think it is something fundamental.

When we designed Acto, we assumed the deployed application would reach some stable state. This is why we always wait for the system to converge and then collect the cluster state. However, this assumption breaks in the case of the spark-operator. Each CR in the spark-operator is a workload. When users submit a CR, the operator submits a Spark job and runs it. Some executor pods are spawned and then terminated once they finish running. @kevchentw Can you confirm that spark-operator deletes resources once the job finishes? I just tried it, and it seems some pods are being deleted.

The design of this spark application breaks our assumption that these systems have a stable state and makes it hard for Acto to capture the system state.
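To make the broken assumption concrete, here is a minimal sketch of a "wait until the state stops changing, then collect it" loop. It is not Acto's actual code: `wait_for_convergence`, `scripted_state`, and the pod names are hypothetical, and the cluster is simulated by a scripted sequence of pod sets.

```python
import time
from typing import Callable, FrozenSet, Iterable, Tuple


def wait_for_convergence(
    get_state: Callable[[], FrozenSet[str]],
    quiet_period: float = 0.05,
    timeout: float = 1.0,
    poll_interval: float = 0.01,
) -> Tuple[FrozenSet[str], bool]:
    """Poll until the state is unchanged for `quiet_period` seconds.

    Returns (last_observed_state, converged). Hypothetical helper,
    not part of Acto.
    """
    deadline = time.monotonic() + timeout
    last_state = get_state()
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll_interval)
        state = get_state()
        now = time.monotonic()
        if state != last_state:
            # Something changed, so restart the quiet-period clock.
            last_state, last_change = state, now
        elif now - last_change >= quiet_period:
            return last_state, True
    return last_state, False


def scripted_state(states: Iterable[FrozenSet[str]]) -> Callable[[], FrozenSet[str]]:
    """Simulated cluster: returns each scripted state once, then repeats the last."""
    it = iter(states)
    current = [frozenset()]

    def get_state() -> FrozenSet[str]:
        current[0] = next(it, current[0])
        return current[0]

    return get_state


# A job whose pods appear and are then deleted after it finishes.
job = scripted_state([
    frozenset({"driver", "exec-1"}),
    frozenset({"driver", "exec-1", "exec-2"}),
    frozenset({"driver"}),
    frozenset(),
])
snapshot, converged = wait_for_convergence(job)
print(converged, sorted(snapshot))
```

In this simulation the loop does converge, but only after every pod is gone, so the collected state is empty and says nothing about what the job did while it ran.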

In fact, such cases happen in other operators too. Sometimes there is one CRD for the system itself and another CRD for submitting the actual workload. We usually configure Acto to test only the CRD for the system.

In my opinion, we can argue that the workloads of these systems are not in the management plane, so they are out of scope.

@tianyin
Member Author

tianyin commented Jun 21, 2022

> In my opinion, we can argue that the workloads of these systems are not in the management plane, so they are out of scope.

It feels like a weak argument. But indeed, it's a low-priority problem. Let's discuss in person and close this issue, leaving it as is for now.

@tianyin
Member Author

tianyin commented Jun 22, 2022

Had a discussion with @tylergu offline.

  • Acto relies on the observability of states, while the Spark operator is used for running jobs, which are often ephemeral. What's needed is a hook at termination time to take a snapshot of the system state. Currently, Acto does not support that. We need to investigate how to do it and what support is needed from the lower level (even if we do not implement it).
  • The Spark operator actually does not follow the basic level-triggering principle of operators. It's arguable whether it's an ill fit or not.
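One possible shape for the termination-time hook in the first point, as a rough sketch only: instead of polling for a stable state, consume a stream of watch-style events and record each object's last known state when it reaches a terminal phase or is deleted. The function name `snapshot_on_termination` is hypothetical, and the event objects are flattened for brevity (real Kubernetes objects keep these fields under `metadata.name` and `status.phase`). Nothing here reflects what Acto implements.

```python
from typing import Any, Dict, Iterable, List

# Pod phases after which no further progress is expected.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def snapshot_on_termination(
    events: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Collect one snapshot per object, taken when it terminates or is deleted.

    Each event is assumed to look like
    {"type": "ADDED" | "MODIFIED" | "DELETED",
     "object": {"name": ..., "phase": ...}}.
    """
    last_seen: Dict[str, Dict[str, Any]] = {}
    captured = set()
    snapshots: List[Dict[str, Any]] = []

    for event in events:
        obj = event["object"]
        name = obj["name"]
        if event["type"] == "DELETED":
            if name not in captured:
                # Deleted before reaching a terminal phase:
                # fall back to the last state we observed.
                snapshots.append(last_seen.get(name, obj))
                captured.add(name)
            last_seen.pop(name, None)
            continue
        last_seen[name] = obj
        if obj.get("phase") in TERMINAL_PHASES and name not in captured:
            snapshots.append(obj)
            captured.add(name)
    return snapshots


events = [
    {"type": "ADDED", "object": {"name": "exec-1", "phase": "Pending"}},
    {"type": "MODIFIED", "object": {"name": "exec-1", "phase": "Running"}},
    {"type": "MODIFIED", "object": {"name": "exec-1", "phase": "Succeeded"}},
    {"type": "DELETED", "object": {"name": "exec-1", "phase": "Succeeded"}},
    {"type": "ADDED", "object": {"name": "exec-2", "phase": "Running"}},
    {"type": "DELETED", "object": {"name": "exec-2", "phase": "Running"}},
]
for snap in snapshot_on_termination(events):
    print(snap["name"], snap["phase"])
```

The open question from the discussion remains: this only works if the lower level can deliver such events reliably before the resources disappear.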

@tianyin
Member Author

tianyin commented Jun 22, 2022

dc9dc5f

@tianyin tianyin closed this as completed Jun 22, 2022