[SPARK-26196][SPARK-26281][WebUI] Total tasks title in the stage page is incorrect when there are failed or killed tasks and update duration metrics #23160
Conversation
Force-pushed from 37bcd62 to bbd745a (compare)
Test build #99346 has finished for PR 23160 at commit.
Test build #99345 has finished for PR 23160 at commit.
LGTM. Please add the [WebUI] tag to the PR title.
What I meant was: in the screenshot posted above, does the task table have 7 entries, and do they all show accurate information?
Reviewing. To clarify @pgandhi999's question: I assume the screenshots above are cut off and the task table is really showing 7 total failed tasks?
@tgravescs Yes, the task table is showing all the tasks. Sorry, I didn't post the whole screenshot.
@@ -610,7 +610,8 @@ $(document).ready(function () {
       $("#accumulator-table").DataTable(accumulatorConf);

       // building tasks table that uses server side functionality
-      var totalTasksToShow = responseBody.numCompleteTasks + responseBody.numActiveTasks;
+      var totalTasksToShow = responseBody.numCompleteTasks + responseBody.numActiveTasks +
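For context, a minimal, self-contained sketch of the corrected total-count arithmetic discussed in this thread. The case class below is a hypothetical stand-in for the stage REST response (the real response object and field names live in Spark's status API), so treat it as an illustration rather than the actual change to stagepage.js:

```scala
// Hypothetical mirror of the task counters returned by the stages REST endpoint.
case class StageCounts(
    numCompleteTasks: Int,
    numActiveTasks: Int,
    numFailedTasks: Int,
    numKilledTasks: Int)

object TotalTasksToShow {
  // The tasks table should advertise every task the stage attempted:
  // completed + active + failed + killed, not just completed + active.
  def totalTasksToShow(c: StageCounts): Int =
    c.numCompleteTasks + c.numActiveTasks + c.numFailedTasks + c.numKilledTasks

  def main(args: Array[String]): Unit = {
    // Example from the discussion above: the stage shows 7 failed tasks and nothing else.
    val counts = StageCounts(numCompleteTasks = 0, numActiveTasks = 0,
      numFailedTasks = 7, numKilledTasks = 0)
    println(totalTasksToShow(counts)) // 7, matching the 7 rows in the task table
  }
}
```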
Ah, that makes sense. Looks good to me. However, could you also please verify whether the data in the tasks table reflects information about the failed/killed tasks as well, since I am passing the value of that variable in the AJAX request while fetching data for the task tables? Thank you for catching that.
Yes @pgandhi999. All the failed tasks are displayed in the tasks table. Thanks for your great work.
+1, thanks @shahidki31!
Hi @pgandhi999, there is a small nit here. Recently we fixed a bug related to the "duration" metric in the tasks table (see #23081), but that fix is not reflected here. I have updated the code.
Force-pushed from 71aff34 to 661566b (compare)
My guess is that this might break sorting on the Duration column in the tasks table, since we use the value of the name field in the column data to match against the corresponding field name in the server-side data in StagesResource:
withUI(_.store.taskList(stageId, stageAttemptId, pageStartIndex, pageLength, indexName(columnNameToSort), isAscendingStr.equalsIgnoreCase("asc"))) }
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L178
As far as I can tell, if you sort the Duration column, it will sort on row.duration instead of row.taskMetrics.executorRunTime and will not give the desired results.
Can you please verify whether sorting on the Duration column works as expected, and fix it if necessary? Thank you.
-          if (row.duration) {
-            return type === 'display' ? formatDuration(row.duration) : row.duration;
+          if (row.taskMetrics && row.taskMetrics.executorRunTime) {
+            return type === 'display' ? formatDuration(row.taskMetrics.executorRunTime) : row.taskMetrics.executorRunTime;
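To illustrate the intent of this change, a rough Scala sketch of the render logic, assuming each task row carries an optional taskMetrics.executorRunTime in milliseconds. The field names mirror the discussion above, but the row model and the formatDuration helper here are simplified placeholders, not Spark's actual UI code:

```scala
// Hypothetical simplified task row; the real JSON rows come from the stages REST API.
case class SketchTaskMetrics(executorRunTime: Long)
case class SketchTaskRow(duration: Option[Long], taskMetrics: Option[SketchTaskMetrics])

object DurationCell {
  // Rough stand-in for the JS formatDuration helper.
  def formatDuration(ms: Long): String =
    if (ms < 1000) s"$ms ms"
    else if (ms < 60 * 1000) f"${ms / 1000.0}%.1f s"
    else f"${ms / (60 * 1000.0)}%.1f min"

  // For display, format executorRunTime; for sorting/filtering, return the raw number.
  // Using executorRunTime keeps the tasks table consistent with the aggregated metrics table.
  def durationCell(row: SketchTaskRow, forDisplay: Boolean): String = {
    val millis = row.taskMetrics.map(_.executorRunTime).getOrElse(0L)
    if (forDisplay) formatDuration(millis) else millis.toString
  }

  def main(args: Array[String]): Unit = {
    val row = SketchTaskRow(duration = Some(1500L), taskMetrics = Some(SketchTaskMetrics(1200L)))
    println(durationCell(row, forDisplay = true))  // "1.2 s" (executorRunTime, not duration)
    println(durationCell(row, forDisplay = false)) // "1200"
  }
}
```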
OK, that makes sense. However, you also need to update the search filter method in StagesResource.scala so that it no longer searches on duration, here:
|| containsValue(f.duration.getOrElse(defaultOptionString))
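A rough sketch of what the adjusted filter looks like conceptually. The field set and the containsValue helper below are simplified, hypothetical placeholders; the real method in StagesResource.scala matches many more columns:

```scala
// Simplified, hypothetical task row used only for this illustration.
case class FilterTaskRow(taskId: Long, status: String, executorId: String, duration: Option[Long])

object TaskSearchFilter {
  // Case-insensitive substring match, in the spirit of the server-side search filter.
  def matches(row: FilterTaskRow, searchValue: String): Boolean = {
    def containsValue(v: Any): Boolean =
      v.toString.toLowerCase.contains(searchValue.toLowerCase)
    // The duration clause is dropped: the column now shows executorRunTime,
    // so searching the old duration field would no longer match what users see.
    containsValue(row.taskId) || containsValue(row.status) || containsValue(row.executorId)
  }

  def main(args: Array[String]): Unit = {
    val row = FilterTaskRow(taskId = 3L, status = "FAILED", executorId = "driver", duration = Some(1500L))
    println(matches(row, "failed")) // true
    println(matches(row, "1500"))   // false: duration is intentionally not searched
  }
}
```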
Thanks, removed.
Test build #99542 has finished for PR 23160 at commit.
Test build #99543 has finished for PR 23160 at commit.
@pgandhi999 Actually, we have already mapped "Duration" to "executorRunTime" for sorting in the PR I mentioned above, so after this PR the "Duration" column sorts correctly. See spark/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala, lines 813 to 815 in 327ac83.
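For reference, the mapping being described can be sketched as a simple lookup from the display column name to the field the REST layer sorts on. This is an illustrative stand-in for the code in StagePage.scala, not a copy of it; the cases other than "Duration" are hypothetical examples:

```scala
object SortColumnMapping {
  // Illustrative mapping from UI column names to server-side sort fields.
  // "Duration" deliberately sorts on executorRunTime so that the sort order matches
  // the value now displayed in the tasks table.
  def indexName(columnName: String): String = columnName match {
    case "Duration"    => "executorRunTime"
    case "Launch Time" => "launchTime"      // hypothetical example case
    case other         => other
  }

  def main(args: Array[String]): Unit = {
    println(indexName("Duration")) // executorRunTime
  }
}
```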
Test build #99549 has finished for PR 23160 at commit.
LGTM
Merged to master. @shahidki31, does this need to go into branch 2.4 or 2.3? I tried backporting it, but it looks like a lot of the affected code didn't exist in 2.4. If the fix can or should also be backported and you're willing, you're welcome to open another PR against 2.4.
Thanks @srowen. This issue happens only in the master branch. Thank you all for the review and comments.
Closes #23160 from shahidki31/totalTasks. Authored-by: Shahid <[email protected]>. Signed-off-by: Sean Owen <[email protected]>

What changes were proposed in this pull request?

This PR fixes three issues:
1) The total tasks message in the tasks table is incorrect when there are failed or killed tasks.
2) Sorting of the "Duration" column is not correct.
3) The duration in the aggregated task summary table and in the tasks table do not match.

Total tasks = numCompleteTasks + numActiveTasks + numKilledTasks + numFailedTasks.
The duration metric in the tasks table is corrected to executorRunTime, based on PR #23081.

How was this patch tested?

Test steps:
1)
```
bin/spark-shell
scala> sc.parallelize(1 to 100, 10).map { x => throw new RuntimeException("Bad executor") }.collect()
```
Before patch / after patch: screenshots showing the corrected total tasks title in the stage page.
2) Duration metrics:
Before patch / after patch: screenshots showing the corrected duration values in the tasks table.