
[Improvement][Task] Improved way to collect yarn job's appIds #11262

Closed
3 tasks done
Radeity opened this issue Aug 2, 2022 · 13 comments · Fixed by #12197
Labels
backend improvement make more easy to user or prompt friendly

Comments

@Radeity
Member

Radeity commented Aug 2, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

The current way to collect appIds is to scan the log files and parse them. This is inefficient and can cause an OOM if the log file is large, as mentioned in issue #11214. The problem can only be permanently solved by adopting a new way to collect appIds that avoids reading log files.
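For context, the current log-scanning approach boils down to applying a regular expression over the task's log output. A minimal sketch in Python (illustrative only; the helper name is not DolphinScheduler's actual code):

```python
import re

# YARN application IDs have the form application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")

def collect_app_ids(log_text: str) -> list[str]:
    """Return the distinct application IDs found in the log, in order of appearance."""
    seen: set[str] = set()
    app_ids: list[str] = []
    for match in APP_ID_PATTERN.finditer(log_text):
        app_id = match.group(0)
        if app_id not in seen:
            seen.add(app_id)
            app_ids.append(app_id)
    return app_ids
```

Note that reading a whole log file into memory to do this is exactly what can OOM; streaming it line by line bounds the memory but still costs a full scan of the log.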

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Radeity added improvement and Waiting for reply labels Aug 2, 2022
@github-actions

github-actions bot commented Aug 2, 2022

Thank you for your feedback; we have received your issue. Please wait patiently for a reply.

  • In order for us to understand your request as soon as possible, please provide detailed information, version, or pictures.
  • If you haven't received a reply for a long time, you can join our slack and send your question to the channel #troubleshooting

Radeity changed the title [Improvement][Task] Improve way to collect yarn job's appIds → [Improvement][Task] Improved way to collect yarn job's appIds Aug 2, 2022
SbloodyS added backend and removed Waiting for reply labels Aug 3, 2022
@ruanwenjun
Member

Do you have any good ideas? AFAIK, we can use the xx task SDK to submit the task and get the appId from the SDK, so we don't need to parse it from the log. Or we can optimize the current parsing method to avoid the OOM.

@Radeity
Member Author

Radeity commented Aug 3, 2022

@ruanwenjun Yeah, that may be a practicable solution; let's talk it through briefly.

Before submitting a yarn job, the client first applies to the RM for an application context and gets the appId, which is then written into the NM's environment variables. We can use a java agent to read it before the yarn job's JAR file executes, and the agent program can also take the taskInstanceId as input. However, where to store this mapping relationship needs further consideration.

Please let me know if you have any good suggestions!

@ruanwenjun
Member

ruanwenjun commented Aug 3, 2022

> @ruanwenjun Yeh, maybe a practicable solution, we can simply talk about it. […]

In fact, there is already an issue (#4025) talking about using an agent to collect the appId, but I don't think it is a good way 😢: we would need to maintain an agent, possibly in several versions.

@Radeity
Member Author

Radeity commented Aug 4, 2022

> In fact, there is already a issue(#4025) talk about use agent to collect the appId, but I think it isn't a good way 😢 […]

I think there's no need to maintain different agent versions. For example, we can parse the appId from environment variables such as APPLICATION_WEB_PROXY_BASE. Every yarn job's AM maintains this environment variable; I've already verified it for Flink, Spark, Hive, MR, and Spark-SQL. The only difference is how to set the java options, which can be defined per task type.

So it seems that all yarn jobs submitted by a shell command can get the appId this way. There are still some other design problems, like where to store the mapping relationship, as mentioned in issue #4025. I'll think about that carefully.
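To make the environment-variable idea concrete, here is a minimal sketch (Python for illustration; DolphinScheduler itself is Java) of pulling the appId out of APPLICATION_WEB_PROXY_BASE, whose value in the AM container is assumed here to be a proxy path like /proxy/application_1660000000000_0001:

```python
import os
import re

def app_id_from_proxy_base(proxy_base):
    """Extract 'application_<ts>_<seq>' from an APPLICATION_WEB_PROXY_BASE value."""
    if not proxy_base:
        return None
    match = re.search(r"application_\d+_\d+", proxy_base)
    return match.group(0) if match else None

# Inside the AM process this reads the real container environment;
# elsewhere the variable is simply absent and None is returned.
app_id = app_id_from_proxy_base(os.environ.get("APPLICATION_WEB_PROXY_BASE"))
```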

@Radeity
Member Author

Radeity commented Aug 11, 2022

@ruanwenjun
Hi, I want to ask a possibly dumb question. When a worker fails over, the logic in the killYarnJob function is to send a view-log request to the worker and then parse the response. However, the worker is just the client that submits the yarn job, and a worker failover will not automatically kill the submitted yarn jobs. So if the worker has failed over, how can it respond with the log info?

Sorry that I don't have a production environment, so I'm not sure whether this is a bug or I have misunderstood it.

@ruanwenjun
Member

> Hi, i wanna ask some maybe dumb question. When worker failover, in the function of killYarnJob, the logic is send a view log request to worker and then parse it. […]

This is a historical issue; previously there was a LogServer deployed on the worker's machine.

@ruanwenjun
Member

> I think there's no need to maintain different version agent, for example, we can parse the appId from some environment variables such as APPLICATION_WEB_PROXY_BASE. […]

You need to make sure the agent can work for all yarn clients.

@Radeity
Member Author

Radeity commented Aug 11, 2022

> You need to make sure the agent can work for all yarn client.

@ruanwenjun I think most yarn clients can share the same agent, because in these clients AOP will intercept the submitApplication function. The exception is submitting a yarn job over a JDBC connection, like beeline (hive server2, as mentioned in issue #4025). However, beeline may create an external JDBC connection, and we cannot kill an external yarn job, right? So if we don't consider these special situations, we can use the same agent for all the other yarn clients.
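The interception idea can be sketched language-agnostically as a wrapper around the client's submit call. Here a Python decorator stands in for the Java agent's AOP advice; capture_app_id, report_fn, and the fake submit function are all illustrative names, not DolphinScheduler APIs:

```python
from functools import wraps

def capture_app_id(report_fn):
    """Wrap a yarn-client submit function so the returned appId is reported."""
    def decorator(submit_fn):
        @wraps(submit_fn)
        def wrapper(*args, **kwargs):
            app_id = submit_fn(*args, **kwargs)
            report_fn(app_id)  # e.g. append to the task's appInfo file
            return app_id
        return wrapper
    return decorator

# Illustrative usage with a fake client:
captured = []

@capture_app_id(captured.append)
def submit_application(job):
    return "application_1660000000000_0001"  # fake appId for the sketch

submit_application({"name": "demo"})
```

The caller's behavior is unchanged (the appId is still returned), which is the point of doing this via interception rather than editing each client.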

@rickchengx
Contributor

rickchengx commented Sep 23, 2022

Hi, @Radeity @ruanwenjun

I agree that the current way of getting the yarn application id from the log is not elegant.
Just for discussion, there is another way to get yarn application id as below:

  1. We can put some unique tags on tasks submitted from DS to yarn. E.g., for spark tasks, we can add the configuration --conf spark.yarn.tags=some_unique_tag.
  2. After the task is submitted, DS can query the corresponding yarn application id (or other info) through this unique tag.

What do you think? Any comments or discussions are welcome.
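A sketch of the tagging idea (Python for illustration; spark.yarn.tags is a real Spark-on-YARN property, but the tag format and helper names here are assumptions):

```python
import uuid

def build_unique_tag(task_instance_id: int) -> str:
    """Build a per-task-instance tag; the 'ds-task-' prefix is an assumption."""
    return f"ds-task-{task_instance_id}-{uuid.uuid4().hex[:8]}"

def spark_submit_args(base_args: list[str], tag: str) -> list[str]:
    """Inject the tag into a spark-submit command line."""
    return base_args + ["--conf", f"spark.yarn.tags={tag}"]
```

After submission, the application carrying the tag could then be looked up through a yarn client, e.g. a YarnClient call that filters applications by tag (the exact API depends on the Hadoop version).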

@Radeity
Member Author

Radeity commented Sep 23, 2022

> I agree that the current way of getting the yarn application id from the log is not elegant. Just for discussion, there is another way to get yarn application id as below: […]

Hi, @rickchengx

First, thanks for your idea!

However, I think this approach has two problems:

  1. Users may create a ShellTask and submit more than one yarn job via command lines, which makes it hard to add the configuration.
  2. The AOP way simply fetches the applicationId and writes it into the appInfo.log file, which I think may be more efficient than querying it through a unique tag. In fact, I don't fully understand how your idea works; would you like to explain it in more detail?

@rickchengx
Contributor

rickchengx commented Sep 23, 2022

@Radeity, thanks for the reply.

Here is more info about the tagging approach:

  1. DS can add some unique tags while building the command for yarn tasks (spark, flink, sqoop, mapreduce, etc.). ShellTask is not included because DS is not responsible for building the commands in a shell task. The tag is added automatically by DS, and the user is unaware of it.
  2. After the task is submitted, DS can query the corresponding yarn application id (or other info) through this unique tag, specifically through a yarn client.

In addition, as for the AOP way, in general I feel that an additional jar package is required, and outputting the application id to a separate file is kind of odd. Is there a more elegant way to implement it? (Ideally minimizing the extra things DS needs to maintain.)

After all, we just need to get the yarn application id. Although the current method may not be elegant, it works in most cases. If we introduce a more complicated method (more dependencies and an additional separate file) to avoid the current way of obtaining the app id, it may cause unpredictable stability problems.

@Radeity
Member Author

Radeity commented Sep 23, 2022

@rickchengx

Thanks for your detailed explanation.

Compared with the tag way, AOP can handle shell tasks and, in addition, does not invade the DS task-definition code. You're right that an additional jar package is required; however, the temporary appInfo log file is only used to fetch the applicationId in time, and when the task is done the appId is written into the TaskExecutionContext, the same as in the original way.

Moreover, extra maintenance is only needed when a compute engine changes how configuration like java-opts is added, or when the yarn client changes its submit function, which I really don't think is a big deal, since these have remained unchanged for many years. Think of WeChat Pay, for example: QR-code payment has been in wide use for years and will not suffer a sudden change. Anyway, yarn clients may update and new compute engines will come out, but the potential maintenance cost of this AOP approach in DS is small compared with other parts of the code, such as the generated command line for submitting a spark task.

On your last point, I agree: stability is worth considering. For a smooth transition, my opinion is to keep both the original way and the new AOP way, and provide an extra configuration option so users can choose how to fetch the applicationId. If the AOP way proves stable enough, we can then consider whether to replace the original way completely.

What do you think? Any more elegant idea would be appreciated!
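The appInfo-file hand-off described above could look roughly like this (file name, layout, and helper names are assumptions for illustration, not the actual PR):

```python
from pathlib import Path

def record_app_id(app_info_file: Path, app_id: str) -> None:
    """Agent side: append the captured appId, one per line."""
    with app_info_file.open("a", encoding="utf-8") as f:
        f.write(app_id + "\n")

def read_app_ids(app_info_file: Path) -> list[str]:
    """Worker side: read back all appIds when the task finishes or must be killed."""
    if not app_info_file.exists():
        return []
    lines = app_info_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

A per-task-instance file keeps the mapping trivial (the path encodes the taskInstanceId), at the cost of cleaning the files up after the appId has been copied into the TaskExecutionContext.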

Radeity added a commit to Radeity/dolphinscheduler that referenced this issue Sep 28, 2022
import aop way to collect yarn job's applicationId
add new environment configuration for each type of yarn tasks to support aop
add user property `appId.collect` for user to decide how to collect applicationId

This closes apache#11262