Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][SPARK-27495][Core][YARN][k8s] Stage Level Scheduling code for reference #27053

Closed
wants to merge 183 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
183 commits
Select commit Hold shift + click to select a range
81a4499
Add ResourceProfile, ExecutorResourceRequest, and TaskResourceRequest
tgravescs Oct 9, 2019
1108d90
Minor updates
tgravescs Oct 9, 2019
178c387
Add in support for Executor launch and register tracking the
tgravescs Oct 9, 2019
73071c0
Refactor to allow java API to work
tgravescs Oct 9, 2019
5666cf6
TaskResourceRequest takes a Double for amount to allow fractional
tgravescs Oct 9, 2019
ac4ff9f
Change ExecutorResourceRequest to use "" instead of Option
tgravescs Oct 9, 2019
0d9e3d4
Update unit tests
tgravescs Oct 9, 2019
30083bf
Merge branch 'SPARK-29415' of github.com:tgravescs/spark into SPARK-2…
tgravescs Oct 9, 2019
6fffbb9
Update names of ResourceRequest functions
tgravescs Oct 10, 2019
fd5751c
Add more executor backend suite tests for resource profiles
tgravescs Oct 11, 2019
1b03773
Update executor monitor suite
tgravescs Oct 11, 2019
0346a03
checkpoint
tgravescs Oct 21, 2019
b73ceaa
revert scala maven plugin version
tgravescs Oct 23, 2019
6c88bf8
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Oct 23, 2019
94f27fd
make ExecutorResourceRequest private for now
tgravescs Oct 28, 2019
faabfc4
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Oct 28, 2019
20f9102
Fix formatting
tgravescs Oct 28, 2019
2349cb7
remove uneeded logging
tgravescs Oct 28, 2019
9d02420
Add newline to end of Suites
tgravescs Oct 28, 2019
455b2f6
Fix java line length style issue
tgravescs Oct 28, 2019
2bb8d7b
fix spacing
tgravescs Oct 28, 2019
23dd114
Fix javadoc malformed html with <=
tgravescs Oct 28, 2019
de5d6d7
Main functionality in allocation manager
tgravescs Oct 30, 2019
942a657
Working dynamic allocation
tgravescs Oct 30, 2019
7c0ee8f
Checkpoint tests
tgravescs Oct 31, 2019
55c820e
Change allowed resources to HashSet and remove Equals since not used yet
tgravescs Oct 31, 2019
30d04e5
fix spelling ExecutorResourceRequest
tgravescs Nov 1, 2019
ca429a9
Add more documentation to ExecutorResourceRequest
tgravescs Nov 1, 2019
a5c1fb2
Fix removeExecutors to work with resource profiles
tgravescs Nov 4, 2019
f9936cc
Start updating executorallocation manager suite
tgravescs Nov 4, 2019
d66b362
ExecutorAllocationManagerSuite all passing
tgravescs Nov 5, 2019
233d7d4
Some cleanup
tgravescs Nov 5, 2019
44974d6
remove localityAwareTasks variable
tgravescs Nov 5, 2019
f10176a
Change hostToLocal to be resourceprofile id to map[host, count] and s…
tgravescs Nov 6, 2019
c812d2c
Checkpoint - doesn't build, modifying CoarseGrainedSchedulerBackend to
tgravescs Nov 6, 2019
9b836d1
Finish updating CoasreGrainedSchedulerBackend - remove numPending tra…
Nov 7, 2019
9fa68c4
Fix YarnSchedulerBackend call to requestTotalExecutors on stop
tgravescs Nov 7, 2019
b6c9567
Fix hasPendingTasks
tgravescs Nov 7, 2019
127d7c1
Start to change to api to have REsourceRequests
tgravescs Nov 8, 2019
f12bb62
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Nov 8, 2019
2f6185f
Add ExecutorResourceRequests and TaskResourceRequests with test passing
tgravescs Nov 8, 2019
b50816b
Cleanup and documentation
tgravescs Nov 8, 2019
03cf91b
Add memory helper functions that just take a string of value and unit
tgravescs Nov 8, 2019
21b5858
Fix indentation in javadoc
tgravescs Nov 8, 2019
051be86
Fix a couple tests
tgravescs Nov 8, 2019
ba496d2
Fix test import
tgravescs Nov 8, 2019
8978e77
Update yarn scheduler suite
tgravescs Nov 11, 2019
8669911
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Nov 11, 2019
d4f766d
Fix ResourceProfileSuite for internal confs
tgravescs Nov 11, 2019
1353c5c
Fix testing for resource profile internal confs
tgravescs Nov 11, 2019
4f44661
Fix unit tests
tgravescs Nov 11, 2019
eb8add3
Merge branch 'SPARK-29306-based29415' of github.com:tgravescs/spark i…
tgravescs Nov 11, 2019
4a38e7b
make sure to remove stage from resourceProfileIdToStageAttempt
tgravescs Nov 11, 2019
7bc9358
Checkpoint allocator - not building
tgravescs Nov 11, 2019
7a0d3ca
Remove units from the ExecutorResourceRequest, all the memory configs
tgravescs Nov 12, 2019
6b25e55
Revert changes to SQLAppStatusListener
tgravescs Nov 12, 2019
341b8ef
Code done fixing tests
tgravescs Nov 13, 2019
ddd2ced
Fix javadocs for ExecutorAllocationClient
tgravescs Nov 14, 2019
c6632c2
Add utility functions for getOrUpdate for a bunch of resource profile id
tgravescs Nov 14, 2019
a04c972
Merge branch 'SPARK-29148-based-SPARK-29306' of github.com:tgravescs/…
tgravescs Nov 14, 2019
85671b9
Fix mesos unit tests
tgravescs Nov 14, 2019
e4adb83
Rename variables
tgravescs Nov 14, 2019
246de3c
Update comments and fix pyspark memory description
tgravescs Nov 15, 2019
fd2089a
Merge branch 'SPARK-29415' of github.com:tgravescs/spark into SPARK-2…
tgravescs Nov 25, 2019
5a24b1b
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Nov 25, 2019
bef4e1a
Update to new TaskResourceRequests api
tgravescs Nov 25, 2019
cc10bef
Update unit tests from merge
tgravescs Nov 25, 2019
d9a42c2
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Nov 25, 2019
99f2c0b
Remove the task confs being sent to executors
tgravescs Nov 25, 2019
b260813
Cleanup
tgravescs Nov 25, 2019
3ddae49
cleanup
tgravescs Nov 25, 2019
e373e4b
Fix the ResourceProfile internal confs format to not have double
tgravescs Nov 26, 2019
3d96258
Add logging
tgravescs Nov 26, 2019
c3a449f
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Nov 26, 2019
3e4a92b
revert scala maven plugin version change
tgravescs Nov 26, 2019
4d1df96
revert pom changes
tgravescs Nov 26, 2019
15cf911
Remove unused import
tgravescs Nov 26, 2019
7ddc19c
Fix unit tests due to last minute changes
tgravescs Nov 26, 2019
25afd7f
Fix the numExecutorsPending call in the preferred placement strategy
tgravescs Dec 2, 2019
b97ec85
Add tests for ResourceProfile.numTasksPerExecutor
tgravescs Dec 2, 2019
b7a9f14
start adding test
tgravescs Dec 2, 2019
ed64c09
Merge branch 'SPARK-29306' of github.com:tgravescs/spark into SPARK-2…
tgravescs Dec 2, 2019
e845ae1
Fix merge issues
tgravescs Dec 3, 2019
057b1f9
Merge branch 'SPARK-29148-based-SPARK-29306' of github.com:tgravescs/…
tgravescs Dec 3, 2019
fd6366d
Update for merge issues
tgravescs Dec 3, 2019
f27d409
Fix types and add unit test
tgravescs Dec 8, 2019
9fcb534
Add scheduler handling ResourceProfile, checkpoint
tgravescs Dec 8, 2019
1d0fe60
Merge branch 'SPARK-29149-based-SPARK-29148' of github.com:tgravescs/…
tgravescs Dec 8, 2019
e9e6ddd
Fix typo in test - compiles
tgravescs Dec 8, 2019
d22284f
Add another test
tgravescs Dec 10, 2019
4427e9e
Fix maxNumExecutorsNeededPerResourceProfile to account for
tgravescs Dec 10, 2019
bef4ced
Merge branch 'SPARK-29148-based-SPARK-29306' of github.com:tgravescs/…
tgravescs Dec 10, 2019
8b40fb2
fix merge issue
tgravescs Dec 10, 2019
e2a9d55
Add test with multiple Resourceprofiles
tgravescs Dec 11, 2019
60886e7
Merge branch 'SPARK-29148-based-SPARK-29306' of github.com:tgravescs/…
tgravescs Dec 11, 2019
c3674a4
Update scheduler, a bunch of places using the CPUs_PER_TASK, and add in
tgravescs Dec 11, 2019
15d26e3
Merge branch 'SPARK-29149-based-SPARK-29148' of github.com:tgravescs/…
tgravescs Dec 11, 2019
7ef0d3f
Checkpoint changes to pass resourceProfileManager around more - tests
tgravescs Dec 11, 2019
80c6bcd
Change executorAllocationManager to only use ResourceProfileId instead
tgravescs Dec 11, 2019
bab0288
Finish converting to resourceProfileManager and updating tests for
tgravescs Dec 12, 2019
fdc090a
Fixed some issues - have issues with resource.gpu vs gpu in global
tgravescs Dec 13, 2019
9e925d8
figure out immutability
tgravescs Dec 13, 2019
e7c21e5
Add REsourceProfileManager file
tgravescs Dec 14, 2019
83e378d
Revert "figure out immutability"
Dec 14, 2019
a301ab2
cleanup and revert back resourcerequest and profile
Dec 16, 2019
28b7608
Add ImmutableResourceProfile
tgravescs Dec 16, 2019
be2bb88
change default resource to not use resource. in name
tgravescs Dec 16, 2019
877508b
fix tests
tgravescs Dec 16, 2019
ff13402
reformat some things and remove comments
tgravescs Dec 16, 2019
a197d04
Add more sanity checks
tgravescs Dec 17, 2019
1e81043
Update tests
tgravescs Dec 17, 2019
e2d9be1
Add tests for resource profile manager and split immutable resource p…
tgravescs Dec 17, 2019
b819a43
More testing fixes
tgravescs Dec 17, 2019
54ded5d
fix more unit tests
tgravescs Dec 17, 2019
795b79c
Fix tests and logs
tgravescs Dec 17, 2019
ff9a6d6
Fix the resource warnings and slot calculation to properly handle
tgravescs Dec 18, 2019
f0276c7
Add unit tests for warning on resources
tgravescs Dec 18, 2019
5ae3d75
Add tests for barrier and calculating slots
tgravescs Dec 18, 2019
052f4f5
cleanup and start to add rdd.withResources tests
tgravescs Dec 18, 2019
ca3348f
Add more scheduler tests
tgravescs Dec 19, 2019
284e6c5
tests and change executorMonitor to use concurrenthashmap
tgravescs Dec 19, 2019
dfb99ee
Fix yarn resource creatieon for different resource profiles
tgravescs Dec 19, 2019
ba6838b
Fix resources meeting requirements missing executor resource
tgravescs Dec 20, 2019
1dfa041
Add more merge conflict logic and tests
tgravescs Dec 20, 2019
e241506
Fix issue with merging profiles and tests
tgravescs Dec 20, 2019
7798c7e
testing updates
tgravescs Dec 20, 2019
32a4e36
Fix wrapping
tgravescs Dec 20, 2019
12ed7d9
Update based on review comments
tgravescs Dec 30, 2019
10beec8
cleanup
tgravescs Dec 30, 2019
12087fa
cleanup
tgravescs Dec 30, 2019
062d4ae
Fix style
tgravescs Dec 30, 2019
456aafa
Clean up style in tests
tgravescs Dec 30, 2019
8adaf14
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Dec 30, 2019
4587753
Fix test from upmerge
tgravescs Dec 30, 2019
0097946
Use reflection for Yarn getResources
tgravescs Dec 30, 2019
e057bdb
Fix scaladoc param name
tgravescs Dec 30, 2019
cc23d6b
Fix unit tests
tgravescs Dec 30, 2019
6df2ac4
Fix resetting the default profile in the container placement suite
tgravescs Dec 31, 2019
7d55397
Add in pyspark api for ResourceProfile and *ResourceRequests
tgravescs Dec 31, 2019
5147868
Add python api and fix PipelinedRDD in python to account for resource
tgravescs Jan 2, 2020
398ab73
Adjust logging and spacing in DagScheduler
tgravescs Jan 2, 2020
4eadea3
Use empty string instead of None for defaults for discoveryScript and
tgravescs Jan 2, 2020
94a7f6c
add pyspark tests and support pyspark.memory. Also fix the
tgravescs Jan 3, 2020
456db2c
Updates for pyspark support
tgravescs Jan 3, 2020
16eafbb
issue with copying resourceprofile
tgravescs Jan 3, 2020
d32e994
Change to build ResourceProfileBuilder
tgravescs Jan 3, 2020
4c1ebeb
add in resourceprofilebuilder.py
tgravescs Jan 3, 2020
20ba9de
remove ImmutableResourceProfileSuite
tgravescs Jan 6, 2020
d03da6f
update comment
tgravescs Jan 6, 2020
f341953
fix import order
tgravescs Jan 6, 2020
7289c5d
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Jan 6, 2020
5017575
Fix style
tgravescs Jan 6, 2020
eece6ae
remove log statements
tgravescs Jan 6, 2020
8edba2d
fix python style
tgravescs Jan 6, 2020
47cd5a7
Fix python doc style issues
tgravescs Jan 6, 2020
15d6b5d
Add TaskContext.resourceProfileId to mima excludes
tgravescs Jan 7, 2020
eba244f
Update test for ResourceProfileBuilder
tgravescs Jan 7, 2020
5f6c115
Fix the resourceprofile test
tgravescs Jan 8, 2020
04a63cd
Fix the python style for test_rdd
tgravescs Jan 8, 2020
7699028
Change to use RetrieveSparkAppConfig to transfer the resource profile
tgravescs Jan 9, 2020
d0e3f03
checkpoint changing to pass serialized ResourceProfile to executor wi…
tgravescs Jan 9, 2020
971fc59
Final changes to use RetrieveSparkAppConfig
tgravescs Jan 9, 2020
ec1deab
pull in master changes
tgravescs Jan 10, 2020
4e4d899
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Jan 10, 2020
529bc96
Update based on executor pr changes
tgravescs Jan 10, 2020
17f9928
Add web UI changes for stage level scheduling. Environment page
tgravescs Jan 11, 2020
56140fb
Fix test from upmerge
tgravescs Jan 11, 2020
9c3dc4d
Pull in Executor PR changes
tgravescs Jan 11, 2020
a9b5ee0
revert python run-test changes
tgravescs Jan 11, 2020
d967232
Add Resource Profile info to the Environment Rest endpoint and add tests
tgravescs Jan 11, 2020
81e094a
Fix up unit tests from merge
tgravescs Jan 12, 2020
5d84427
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Jan 12, 2020
6011ec6
Fix test
tgravescs Jan 12, 2020
735de20
rat exclude new test file
tgravescs Jan 12, 2020
7b10429
Fix imports
tgravescs Jan 12, 2020
cacaa0a
Create different resource directories so tests between python versions
tgravescs Jan 13, 2020
8cbdadd
Add some formatting to the UI for resource profiles on env page
tgravescs Jan 13, 2020
f99f1cf
Update test to clear default profile before
tgravescs Jan 13, 2020
954ba00
Fix typo in test
tgravescs Jan 14, 2020
8f40a0c
Merge branch 'master' of https://github.com/apache/spark into SPARK-2…
tgravescs Jan 21, 2020
585df54
Fix merge issues
tgravescs Jan 22, 2020
976b912
Fix python test hang by clearing the default profile in SparkContext
tgravescs Jan 23, 2020
0738be0
Add () to calls to clearResourceProfile
tgravescs Jan 27, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,11 @@ public void onSpeculativeTaskSubmitted(SparkListenerSpeculativeTaskSubmitted spe
onEvent(speculativeTask);
}

@Override
public void onResourceProfileAdded(SparkListenerResourceProfileAdded event) {
onEvent(event);
}

@Override
public void onOtherEvent(SparkListenerEvent event) {
onEvent(event);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,7 @@ <h4 class="title-table">Executors</h4>
<th>Disk Used</th>
<th>Cores</th>
<th>Resources</th>
<th>Resource Profile ID</th>
<th><span data-toggle="tooltip" data-placement="top" title="Number of tasks currently executing. Darker shading highlights executors with more active tasks.">Active Tasks</span></th>
<th><span data-toggle="tooltip" data-placement="top" title="Number of tasks that have failed on this executor. Darker shading highlights executors with a high proportion of failed tasks.">Failed Tasks</span></th>
<th>Complete Tasks</th>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,7 @@ function totalDurationColor(totalGCTime, totalDuration) {
}

var sumOptionalColumns = [3, 4];
var execOptionalColumns = [5, 6, 9];
var execOptionalColumns = [5, 6, 9, 10];
var execDataTable;
var sumDataTable;

Expand Down Expand Up @@ -415,6 +415,7 @@ $(document).ready(function () {
{data: 'diskUsed', render: formatBytes},
{data: 'totalCores'},
{name: 'resourcesCol', data: 'resources', render: formatResourceCells, orderable: false},
{name: 'resourceProfileIdCol', data: 'resourceProfileId'},
{
data: 'activeTasks',
"fnCreatedCell": function (nTd, sData, oData, iRow, iCol) {
Expand Down Expand Up @@ -461,7 +462,8 @@ $(document).ready(function () {
"columnDefs": [
{"visible": false, "targets": 5},
{"visible": false, "targets": 6},
{"visible": false, "targets": 9}
{"visible": false, "targets": 9},
{"visible": false, "targets": 10}
],
"deferRender": true
};
Expand Down Expand Up @@ -570,6 +572,7 @@ $(document).ready(function () {
"<div id='on_heap_memory' class='on-heap-memory-checkbox-div'><input type='checkbox' class='toggle-vis' data-sum-col-idx='3' data-exec-col-idx='5'>On Heap Memory</div>" +
"<div id='off_heap_memory' class='off-heap-memory-checkbox-div'><input type='checkbox' class='toggle-vis' data-sum-col-idx='4' data-exec-col-idx='6'>Off Heap Memory</div>" +
"<div id='extra_resources' class='resources-checkbox-div'><input type='checkbox' class='toggle-vis' data-sum-col-idx='' data-exec-col-idx='9'>Resources</div>" +
"<div id='resource_prof_id' class='resource-prof-id-checkbox-div'><input type='checkbox' class='toggle-vis' data-sum-col-idx='' data-exec-col-idx='10'>Resource Prodile Id</div>" +
"</div>");

reselectCheckboxesBasedOnTaskTableState();
Expand Down
2 changes: 2 additions & 0 deletions core/src/main/scala/org/apache/spark/BarrierTaskContext.scala
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,8 @@ class BarrierTaskContext private[spark] (
resources().asJava
}

override def resourceProfileId(): Int = taskContext.resourceProfileId()

override private[spark] def killTaskIfInterrupted(): Unit = taskContext.killTaskIfInterrupted()

override private[spark] def getKillReason(): Option[String] = taskContext.getKillReason()
Expand Down
31 changes: 18 additions & 13 deletions core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala
Original file line number Diff line number Diff line change
Expand Up @@ -37,24 +37,29 @@ private[spark] trait ExecutorAllocationClient {
/**
* Update the cluster manager on our scheduling needs. Three bits of information are included
* to help it make decisions.
* @param numExecutors The total number of executors we'd like to have. The cluster manager
* shouldn't kill any running executor to reach this number, but,
* if all existing executors were to die, this is the number of executors
* we'd want to be allocated.
* @param localityAwareTasks The number of tasks in all active stages that have a locality
* preferences. This includes running, pending, and completed tasks.
* @param hostToLocalTaskCount A map of hosts to the number of tasks from all active stages
* that would like to like to run on that host.
* This includes running, pending, and completed tasks.
*
* @param numLocalityAwareTasksPerResourceProfileId The number of tasks in all active stages that
* have a locality preferences per
* ResourceProfile id. This includes running,
* pending, and completed tasks.
* @param hostToLocalTaskCount A map of ResourceProfile id to a map of hosts to the number of
* tasks from all active stages that would like to like to run on
* that host. This includes running, pending, and completed tasks.
* @param resourceProfileIdToNumExecutors The total number of executors we'd like to have per
* ResourceProfile id. The cluster manager shouldn't kill any
* running executor to reach this number, but, if all
* existing executors were to die, this is the number
* of executors we'd want to be allocated.
* @return whether the request is acknowledged by the cluster manager.
*/
private[spark] def requestTotalExecutors(
numExecutors: Int,
localityAwareTasks: Int,
hostToLocalTaskCount: Map[String, Int]): Boolean
numLocalityAwareTasksPerResourceProfileId: Map[Int, Int],
hostToLocalTaskCount: Map[Int, Map[String, Int]],
resourceProfileIdToNumExecutors: Map[Int, Int]): Boolean

/**
* Request an additional number of executors from the cluster manager.
* Request an additional number of executors from the cluster manager for the default
* ResourceProfile.
* @return whether the request is acknowledged by the cluster manager.
*/
def requestExecutors(numAdditionalExecutors: Int): Boolean
Expand Down
Loading