Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26196][SPARK-26281][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks and update duration metrics #23160

Closed
wants to merge 3 commits into from

Conversation

shahidki31
Copy link
Contributor

@shahidki31 shahidki31 commented Nov 28, 2018

What changes were proposed in this pull request?

This PR fixes 3 issues

  1. Total tasks message in the tasks table is incorrect, when there are failed or killed tasks
  2. Sorting of the "Duration" column is not correct
  3. Duration in the aggregated tasks summary table and the tasks table and not matching.

Total tasks = numCompleteTasks + numActiveTasks + numKilledTasks + numFailedTasks;

Corrected the duration metrics in the tasks table as executorRunTime based on the PR #23081

How was this patch tested?

test step:
1)

bin/spark-shell 
scala > sc.parallelize(1 to 100, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()

screenshot from 2018-11-28 07-26-00

After patch:
screenshot from 2018-11-28 07-24-31

  1. Duration metrics:
    Before patch:
    screenshot from 2018-12-06 03-25-14

After patch:
screenshot from 2018-12-06 03-23-14

@SparkQA
Copy link

SparkQA commented Nov 28, 2018

Test build #99346 has finished for PR 23160 at commit bbd745a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 28, 2018

Test build #99345 has finished for PR 23160 at commit 37bcd62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

cc @gengliangwang

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please add tag [WebUI] in the PR title

@gengliangwang
Copy link
Member

cc @tgravescs @pgandhi999

@pgandhi999
Copy link

What I meant was like in the screenshot posted above, whether the task table has 7 entries and they all show accurate information.

@shahidki31 shahidki31 changed the title [SPARK-26196]Total tasks title in the stage page is incorrect when there are failed or killed tasks [SPARK-26196][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks Nov 30, 2018
@tgravescs
Copy link
Contributor

reviewing, to clarify @pgandhi999 question, I assume the screenshots above are cut off and the task table is really showing 7 total failed tasks?

@shahidki31
Copy link
Contributor Author

@tgravescs Yes. Task table is showing all the tasks. Sorry, I didn't put the whole screen shot.

@@ -610,7 +610,8 @@ $(document).ready(function () {
$("#accumulator-table").DataTable(accumulatorConf);

// building tasks table that uses server side functionality
var totalTasksToShow = responseBody.numCompleteTasks + responseBody.numActiveTasks;
var totalTasksToShow = responseBody.numCompleteTasks + responseBody.numActiveTasks +

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that makes sense. Looks good to me, however could you also please verify whether the data in tasks table reflects information about the failed / killed tasks as well, as I am passing the value of that variable in the ajax request while fetching data for task tables. Thank you for catching that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @pgandhi999 . All the failed tasks are displaying in the tasks table. Thanks for your great work.

Copy link
Contributor

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @shahidki31 !

@shahidki31
Copy link
Contributor Author

Hi @pgandhi999 , There is a small nit there. Recently we fixed a bug related to the "duration" metrics in the tasks table (see #23081), but that hasn't reflected here. I have updated the code

Copy link

@pgandhi999 pgandhi999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that this might break the sorting on the Duration column in the tasks table as we are using the value of the name field in the column data to match with the appropriate name of the data field in the server data below in StagesResource:

withUI(_.store.taskList(stageId, stageAttemptId, pageStartIndex, pageLength, indexName(columnNameToSort), isAscendingStr.equalsIgnoreCase("asc"))) }
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L178.
So as far as I can think, I expect that if you sort the Duration column, it will perform sorting on row.duration instead of row.taskMetrics.executorRunTime, thus, not getting the desired results.

Can you please verify whether sorting on Duration column works as expected and fix if necessary? Thank you.

if (row.duration) {
return type === 'display' ? formatDuration(row.duration) : row.duration;
if (row.taskMetrics && row.taskMetrics.executorRunTime) {
return type === 'display' ? formatDuration(row.taskMetrics.executorRunTime) : row.taskMetrics.executorRunTime;
Copy link

@pgandhi999 pgandhi999 Dec 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense. However, you also need to update the search filter method to not search on duration in StagesResource.scala here:

|| containsValue(f.duration.getOrElse(defaultOptionString))

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L213.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks removed.

@SparkQA
Copy link

SparkQA commented Dec 1, 2018

Test build #99542 has finished for PR 23160 at commit 71aff34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2018

Test build #99543 has finished for PR 23160 at commit 661566b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Copy link
Contributor Author

shahidki31 commented Dec 1, 2018

So as far as I can think, I expect that if you sort the Duration column, it will perform sorting on row.duration instead of row.taskMetrics.executorRunTime, thus, not getting the desired results.

@pgandhi999 Actually we have mapped the "Duration" to "executorRunTime" for sorting, in the PR which I have mentioned above. So, after this PR the "Duration" is sorting correctly

// SPARK-26109: Duration of task as executorRunTime to make it consistent with the
// aggregated tasks summary metrics table and the previous versions of Spark.
HEADER_DURATION -> TaskIndexNames.EXEC_RUN_TIME,

@SparkQA
Copy link

SparkQA commented Dec 1, 2018

Test build #99549 has finished for PR 23160 at commit 5369a86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pgandhi999
Copy link

LGTM

@shahidki31 shahidki31 changed the title [SPARK-26196][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks [SPARK-26196][SPARK-26281][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks Dec 5, 2018
@shahidki31 shahidki31 changed the title [SPARK-26196][SPARK-26281][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks [SPARK-26196][SPARK-26281][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks adn update duration metrics Dec 5, 2018
@shahidki31 shahidki31 changed the title [SPARK-26196][SPARK-26281][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks adn update duration metrics [SPARK-26196][SPARK-26281][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks and update duration metrics Dec 5, 2018
@shahidki31
Copy link
Contributor Author

cc @tgravescs @srowen

@srowen
Copy link
Member

srowen commented Dec 7, 2018

Merged to master. @shahidki31 does this need to go in branch 2.4, 2.3? I tried back porting it, but looks like a lot of the affected code didn't exist in 2.4. If the fix can or should also be back-ported and you're willing, you're welcome to open another PR against 2.4.

@asfgit asfgit closed this in 3b8ae23 Dec 7, 2018
@shahidki31
Copy link
Contributor Author

shahidki31 commented Dec 8, 2018

Thanks @srowen. This issue happens only in the master branch.

Thank you all for the review and comments

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… is incorrect when there are failed or killed tasks and update duration metrics

## What changes were proposed in this pull request?
This PR fixes 3 issues
1) Total tasks message in the tasks table is incorrect, when there are failed or killed tasks
2) Sorting of the "Duration" column is not correct
3) Duration in the aggregated tasks summary table and the tasks table and not matching.

Total tasks  = numCompleteTasks +  numActiveTasks + numKilledTasks + numFailedTasks;

Corrected the duration metrics in the tasks table as executorRunTime based on the PR apache#23081

## How was this patch tested?
test step:
1)
```
bin/spark-shell
scala > sc.parallelize(1 to 100, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()
```
![screenshot from 2018-11-28 07-26-00](https://user-images.githubusercontent.com/23054875/49123523-e2691880-f2de-11e8-9c16-60d1865e6e77.png)

After patch:
![screenshot from 2018-11-28 07-24-31](https://user-images.githubusercontent.com/23054875/49123525-e432dc00-f2de-11e8-89ca-4a53e19c9c18.png)

2)  Duration metrics:
Before patch:
![screenshot from 2018-12-06 03-25-14](https://user-images.githubusercontent.com/23054875/49546591-9e8d9900-f906-11e8-8a0b-157742c47655.png)

After patch:
![screenshot from 2018-12-06 03-23-14](https://user-images.githubusercontent.com/23054875/49546589-9cc3d580-f906-11e8-827f-52ef8ffdeaec.png)

Closes apache#23160 from shahidki31/totalTasks.

Authored-by: Shahid <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants