
[Improvement][Task] Improved way to collect yarn job's appIds #11262

Closed
3 tasks done
Radeity opened this issue Aug 2, 2022 · 13 comments · Fixed by #12197
Labels
backend improvement make more easy to user or prompt friendly

Comments

@Radeity
Member

Radeity commented Aug 2, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

The current way to collect appIds is to scan the log files and parse them. This is inefficient and can cause an OOM if the log file is large, as mentioned in issue #11214. The problem can only be permanently solved by adopting a new way to collect appIds that avoids reading log files.
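For context, the current log-scanning approach boils down to applying a regular expression over the task's log output. A minimal sketch in Python (illustrative only; the helper name is not DolphinScheduler's actual code):

```python
import re

# YARN application IDs have the form application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")

def collect_app_ids(log_text: str) -> list[str]:
    """Return the distinct application IDs found in the log, in order of appearance."""
    seen: set[str] = set()
    app_ids: list[str] = []
    for match in APP_ID_PATTERN.finditer(log_text):
        app_id = match.group(0)
        if app_id not in seen:
            seen.add(app_id)
            app_ids.append(app_id)
    return app_ids
```

Note that reading a whole log file into memory to do this is exactly what can OOM; streaming it line by line bounds the memory but still costs a full scan of the log.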

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Radeity added improvement and Waiting for reply labels Aug 2, 2022
@github-actions

github-actions bot commented Aug 2, 2022

Thank you for your feedback; we have received your issue. Please wait patiently for a reply.

  • In order for us to understand your request as soon as possible, please provide detailed information, version, or pictures.
  • If you haven't received a reply for a long time, you can join our slack and send your question to the channel #troubleshooting

Radeity changed the title [Improvement][Task] Improve way to collect yarn job's appIds → [Improvement][Task] Improved way to collect yarn job's appIds Aug 2, 2022
SbloodyS added backend and removed Waiting for reply labels Aug 3, 2022
@ruanwenjun
Member

Do you have any good ideas? AFAIK, we can use the xx task SDK to submit the task and get the appId from the SDK, so we don't need to parse it from the log. Or we can optimize the current parsing method to avoid the OOM.

@Radeity
Member Author

Radeity commented Aug 3, 2022

@ruanwenjun Yeah, that may be a practicable solution; let's talk it through briefly.

Before submitting a yarn job, the client first applies to the RM for an application context and gets the appId, which is then written into the NM's environment variables. We can use a java agent to read it before the yarn job's JAR file executes, and the agent program can also take the taskInstanceId as input. However, where to store this mapping relationship needs further consideration.

Please let me know if you have any good suggestions!

@ruanwenjun
Member

ruanwenjun commented Aug 3, 2022

> @ruanwenjun Yeh, maybe a practicable solution, we can simply talk about it. […]

In fact, there is already an issue (#4025) talking about using an agent to collect the appId, but I don't think it is a good way 😢: we would need to maintain an agent, possibly in several versions.

@Radeity
Member Author

Radeity commented Aug 4, 2022

> In fact, there is already a issue(#4025) talk about use agent to collect the appId, but I think it isn't a good way 😢 […]

I think there's no need to maintain different agent versions. For example, we can parse the appId from environment variables such as APPLICATION_WEB_PROXY_BASE. Every yarn job's AM maintains this environment variable; I've already verified it for Flink, Spark, Hive, MR, and Spark-SQL. The only difference is how to set the java options, which can be defined per task type.

So it seems that all yarn jobs submitted by a shell command can get the appId this way. There are still some other design problems, like where to store the mapping relationship, as mentioned in issue #4025. I'll think about that carefully.
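To make the environment-variable idea concrete, here is a minimal sketch (Python for illustration; DolphinScheduler itself is Java) of pulling the appId out of APPLICATION_WEB_PROXY_BASE, whose value in the AM container is assumed here to be a proxy path like /proxy/application_1660000000000_0001:

```python
import os
import re

def app_id_from_proxy_base(proxy_base):
    """Extract 'application_<ts>_<seq>' from an APPLICATION_WEB_PROXY_BASE value."""
    if not proxy_base:
        return None
    match = re.search(r"application_\d+_\d+", proxy_base)
    return match.group(0) if match else None

# Inside the AM process this reads the real container environment;
# elsewhere the variable is simply absent and None is returned.
app_id = app_id_from_proxy_base(os.environ.get("APPLICATION_WEB_PROXY_BASE"))
```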

@Radeity
Member Author

Radeity commented Aug 11, 2022

@ruanwenjun
Hi, I want to ask a possibly dumb question. When a worker fails over, the logic in the killYarnJob function is to send a view-log request to the worker and then parse the response. However, the worker is just the client that submits the yarn job, and a worker failover will not automatically kill the submitted yarn jobs. So if the worker has failed over, how can it respond with the log info?

Sorry that I don't have a production environment, so I'm not sure whether this is a bug or I have misunderstood it.

@ruanwenjun
Member

> Hi, i wanna ask some maybe dumb question. When worker failover, in the function of killYarnJob, the logic is send a view log request to worker and then parse it. […]

This is a historical issue; previously there was a LogServer deployed on the worker's machine.

@ruanwenjun
Member

> I think there's no need to maintain different version agent, for example, we can parse the appId from some environment variables such as APPLICATION_WEB_PROXY_BASE. […]

You need to make sure the agent can work for all yarn clients.

@Radeity
Member Author

Radeity commented Aug 11, 2022

> You need to make sure the agent can work for all yarn client.

@ruanwenjun I think most yarn clients can share the same agent, because in these clients AOP will intercept the submitApplication function. The exception is submitting a yarn job over a JDBC connection, like beeline (hive server2, as mentioned in issue #4025). However, beeline may create an external JDBC connection, and we cannot kill an external yarn job, right? So if we don't consider these special situations, we can use the same agent for all the other yarn clients.
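The interception idea can be sketched language-agnostically as a wrapper around the client's submit call. Here a Python decorator stands in for the Java agent's AOP advice; capture_app_id, report_fn, and the fake submit function are all illustrative names, not DolphinScheduler APIs:

```python
from functools import wraps

def capture_app_id(report_fn):
    """Wrap a yarn-client submit function so the returned appId is reported."""
    def decorator(submit_fn):
        @wraps(submit_fn)
        def wrapper(*args, **kwargs):
            app_id = submit_fn(*args, **kwargs)
            report_fn(app_id)  # e.g. append to the task's appInfo file
            return app_id
        return wrapper
    return decorator

# Illustrative usage with a fake client:
captured = []

@capture_app_id(captured.append)
def submit_application(job):
    return "application_1660000000000_0001"  # fake appId for the sketch

submit_application({"name": "demo"})
```

The caller's behavior is unchanged (the appId is still returned), which is the point of doing this via interception rather than editing each client.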

@rickchengx
Contributor

rickchengx commented Sep 23, 2022

Hi, @Radeity @ruanwenjun

I agree that the current way of getting the yarn application id from the log is not elegant.
Just for discussion, there is another way to get yarn application id as below:

  1. We can put some unique tags on tasks submitted from DS to yarn. E.g., for spark tasks, we can add the configuration --conf spark.yarn.tags=some_unique_tag.
  2. After the task is submitted, DS can query the corresponding yarn application id (or other info) through this unique tag.

What do you think? Any comments or discussions are welcome.
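A sketch of the tagging idea (Python for illustration; spark.yarn.tags is a real Spark-on-YARN property, but the tag format and helper names here are assumptions):

```python
import uuid

def build_unique_tag(task_instance_id: int) -> str:
    """Build a per-task-instance tag; the 'ds-task-' prefix is an assumption."""
    return f"ds-task-{task_instance_id}-{uuid.uuid4().hex[:8]}"

def spark_submit_args(base_args: list[str], tag: str) -> list[str]:
    """Inject the tag into a spark-submit command line."""
    return base_args + ["--conf", f"spark.yarn.tags={tag}"]
```

After submission, the application carrying the tag could then be looked up through a yarn client, e.g. a YarnClient call that filters applications by tag (the exact API depends on the Hadoop version).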

@Radeity
Member Author

Radeity commented Sep 23, 2022

> I agree that the current way of getting the yarn application id from the log is not elegant. Just for discussion, there is another way to get yarn application id as below: […]

Hi, @rickchengx

First, thanks for your idea!

However, I think this approach has two problems:

  1. Users may create a ShellTask and submit more than one yarn job via command lines, which makes it hard to add the configuration.
  2. The AOP way simply fetches the applicationId and writes it into the appInfo.log file, which I think may be more efficient than querying it through a unique tag. In fact, I don't fully understand how your idea works; would you like to explain it in more detail?

@rickchengx
Contributor

rickchengx commented Sep 23, 2022

@Radeity, thanks for the reply.

Here is more info about the tagging approach:

  1. DS can add some unique tags while building the command for yarn tasks (spark, flink, sqoop, mapreduce, etc.). ShellTask is not included because DS is not responsible for building the commands in a shell task. The tag is added automatically by DS, and the user is unaware of it.
  2. After the task is submitted, DS can query the corresponding yarn application id (or other info) through this unique tag, specifically through a yarn client.

In addition, as for the AOP way, in general I feel that an additional jar package is required, and outputting the application id to a separate file is kind of odd. Is there a more elegant way to implement it? (Ideally minimizing the extra things DS needs to maintain.)

After all, we just need to get the yarn application id. Although the current method may not be elegant, it works in most cases. If we introduce a more complicated method (more dependencies and an additional separate file) to avoid the current way of obtaining the app id, it may cause unpredictable stability problems.

@Radeity
Member Author

Radeity commented Sep 23, 2022

@rickchengx

Thanks for your detailed explanation.

Compared with the tag way, AOP can handle shell tasks and, in addition, does not invade the DS task-definition code. You're right that an additional jar package is required; however, the temporary appInfo log file is only used to fetch the applicationId in time, and when the task is done the appId is written into the TaskExecutionContext, the same as in the original way.

Moreover, extra maintenance is only needed when a compute engine changes how configuration like java-opts is added, or when the yarn client changes its submit function, which I really don't think is a big deal, since these have remained unchanged for many years. Think of WeChat Pay, for example: QR-code payment has been in wide use for years and will not suffer a sudden change. Anyway, yarn clients may update and new compute engines will come out, but the potential maintenance cost of this AOP approach in DS is small compared with other parts of the code, such as the generated command line for submitting a spark task.

On your last point, I agree: stability is worth considering. For a smooth transition, my opinion is to keep both the original way and the new AOP way, and provide an extra configuration option so users can choose how to fetch the applicationId. If the AOP way proves stable enough, we can then consider whether to replace the original way completely.

What do you think? Any more elegant idea would be appreciated!
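The appInfo-file hand-off described above could look roughly like this (file name, layout, and helper names are assumptions for illustration, not the actual PR):

```python
from pathlib import Path

def record_app_id(app_info_file: Path, app_id: str) -> None:
    """Agent side: append the captured appId, one per line."""
    with app_info_file.open("a", encoding="utf-8") as f:
        f.write(app_id + "\n")

def read_app_ids(app_info_file: Path) -> list[str]:
    """Worker side: read back all appIds when the task finishes or must be killed."""
    if not app_info_file.exists():
        return []
    lines = app_info_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

A per-task-instance file keeps the mapping trivial (the path encodes the taskInstanceId), at the cost of cleaning the files up after the appId has been copied into the TaskExecutionContext.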

Radeity added a commit to Radeity/dolphinscheduler that referenced this issue Sep 28, 2022
import aop way to collect yarn job's applicationId
add new environment configuration for each type of yarn tasks to support aop
add user property `appId.collect` for user to decide how to collect applicationId

This closes apache#11262