Update the Qual tool AutoTuner Heuristics against CPU event logs #1069
Conversation
Signed-off-by: Thomas Graves <[email protected]>
Should we add that file to the repo? Perhaps inside
Sure, I can add it. I also realized I wanted to add a few more tests to the Suite, so I'll do that and push some updates shortly.
Thanks @tgravescs
I see you filed #1078
Thanks @tgravescs
LGTM
Thanks @tgravescs!
fixes #1068
This enhances the heuristics around spark.executor.memory and handles cases where the memory-to-core ratio is too small. If the ratio is too small, the tool throws an exception and does not emit tunings. In the future we should just tag this and recommend appropriate sizes.
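Conceptually the check works like the sketch below; it is a minimal illustration only, and the object, method, and threshold value (`MinMemoryPerCoreMB`) are hypothetical placeholders rather than the actual AutoTuner code:

```scala
object MemoryPerCoreCheck {
  // Assumed minimum host memory per executor core, in MB (illustrative value).
  val MinMemoryPerCoreMB: Long = 2L * 1024

  // Throws when the worker's memory-to-core ratio is too small to generate tunings.
  def checkMemoryPerCore(workerMemoryMB: Long, workerCores: Int): Unit = {
    val memoryPerCoreMB = workerMemoryMB / workerCores
    if (memoryPerCoreMB < MinMemoryPerCoreMB) {
      throw new IllegalArgumentException(
        s"Memory per core ($memoryPerCoreMB MB) is below the assumed minimum of " +
          s"$MinMemoryPerCoreMB MB; skipping tuning recommendations.")
    }
  }
}
```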
This also adds extra memory overhead, since in the worst case we need space for both pinned memory and spill memory. It gets a little complicated because spill will use pinned memory first, but once the pinned pool is exhausted it falls back to regular off-heap memory. So here we size for the worst case, which is needing both.
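In other words, the recommended overhead is the base overhead plus both pool sizes; a simplified sketch (the function and parameter names are placeholders, not the tool's actual API):

```scala
object OverheadEstimate {
  // Worst case: spill falls back to regular off-heap once pinned is exhausted,
  // so budget for the base overhead plus both pools.
  def recommendedOverheadMB(baseOverheadMB: Long, pinnedPoolMB: Long,
      spillPoolMB: Long): Long = {
    baseOverheadMB + pinnedPoolMB + spillPoolMB
  }
}
```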
I also added heuristics for configuring the multithreaded readers (number of threads and some related sizes) as well as the shuffle reader/writer thread pools, based on the number of cores.
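As a rough sketch of the idea, the recommendations scale with the executor core count; the multipliers below are illustrative placeholders and not the actual heuristic values, though the config keys are existing RAPIDS plugin settings:

```scala
object ThreadPoolHeuristics {
  // Scale reader and shuffle thread pools with the executor core count.
  // Multipliers are illustrative only; the real AutoTuner values may differ.
  def recommend(executorCores: Int): Map[String, String] = Map(
    "spark.rapids.sql.multiThreadedRead.numThreads" -> (executorCores * 2).toString,
    "spark.rapids.shuffle.multiThreaded.reader.threads" -> executorCores.toString,
    "spark.rapids.shuffle.multiThreaded.writer.threads" -> executorCores.toString
  )
}
```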
Most of the heuristics are based on what we saw from real customer workloads and NDS results.
Most of this testing was on CSPs; I will try to apply more of it to on-prem clusters later.
Note that most of this functionality requires the worker information to be passed in:
--worker-info ./worker_info-demo-gpu-cluster.yaml
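For reference, a worker-info file describes the CPU, memory, and GPU resources of a worker node; an illustrative example is shown below (the values are placeholders for a demo cluster, not recommendations):

```yaml
# Illustrative worker-info file; adjust values to match the actual worker nodes.
system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  name: T4
  count: 4
  memory: 15109MiB
softwareProperties:
  spark.executor.cores: '16'
  spark.executor.instances: '2'
  spark.executor.memory: 47222m
```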
Example:
With the worker info:
Without the worker info: