Scripts, Code, etc. for Bin Packing Paper

Deployment

Master: 1x ssc.xlarge
Workers: 5x ssc.xlarge
Source: 1x ssc.small

Dataset

See 'box'.

268 x 2.6MB = 696.8MB

HIO

Docker Hub image: salmantoor/cellprofiler:3.1.9

Spark

SNIC 2019/10-33 (UPPMAX)

Worker: 130.238.28.97

Streaming Source: 130.238.28.96

The Python API can't be used: per https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html, "fileStream is not available in the Python API; only textFileStream is available." A Scala app was made instead.
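For context, a minimal hedged sketch of the Scala-only fileStream API (this is not the project's actual app, which lives under spark/spark-scala-cellprofiler; it assumes Spark 2.4.4, and TextInputFormat here is only a placeholder InputFormat):

// Sketch only -- not the project app. PySpark exposes only textFileStream;
// fileStream below is the Scala/Java-only API that lets you pick the Hadoop
// InputFormat, which is what non-text inputs such as the .tif images need.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filestream-sketch")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Watch the same source directory the copy commands below write into.
    // TextInputFormat is a placeholder; the real app handles the image files itself.
    val files = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "/mnt/images/Salman_Cell_profiler_data/Data/src")

    files.count().print()   // just confirms newly copied files are being picked up

    ssc.start()
    ssc.awaitTermination()
  }
}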

Sync the Clock: sudo ntpdate -v 0.se.pool.ntp.org

To run the benchmarking:

  1. Fix the hostname on the driver (master):

sudo hostname 192.168.1.15

(for reverse DNS 'connect back')

  2. Start Spark, as appropriate, on the different machines:

./spark-2.4.4-bin-hadoop2.7/sbin/start-master.sh
./spark-2.4.4-bin-hadoop2.7/sbin/start-shuffle-service.sh
./spark-2.4.4-bin-hadoop2.7/sbin/start-slave.sh spark://192.168.1.15:7077

Check in the web GUI that the cluster and all workers are up.

  3. Stop any existing app (kill via web UI).

  4. Clear source directory:

cd /mnt/images/Salman_Cell_profiler_data/Data ; mkdir -p src ; rm -f src/*

  5. Begin profiling:

# profile CPU (run this on each worker)
# gotcha, see: https://serverfault.com/questions/436446/top-showing-64-idle-on-first-screen-or-batch-run-while-there-is-no-idle-time-a
rm cpu.log ; while : ; do  echo "$(top -b -n2 -d 0.1 |grep "Cpu(s)"|tail -n 1) -  $(($(date +%s%N)/1000000))mS since epoch"; sleep 0.5; done >> cpu.log


# Poll the total number of cores (just do this on the master)
# Need to first get AppID from web interface
###rm cores.log ; while : ; do  echo "$(curl -s http://localhost:4040/api/v1/applications/app-20200218180401-0003/executors | jq 'reduce .[] as $item (0; . + $item.totalCores)') -  $(($(date +%s%N)/1000000))mS since epoch"; sleep 1; done >> cores.log
# OBSOLETE... do it like this instead to get a per-worker count:


# executors per machine
rm cores.log ; while : ; do  echo "$(curl -s http://localhost:8080 | grep -o -E '<td>8 \([0-9] Used\)</td>' | cut -c8 | tr '\n' ',') -  $(($(date +%s%N)/1000000))mS since epoch"; sleep 1; done >> cores.log

  6. Launch the app.

See: spark/spark-scala-cellprofiler/deploy_spark_app.bash

  7. On the source machine, copy in some files:

Copy 1 file

cd /mnt/images/Salman_Cell_profiler_data/Data ; rm -f src/* ; cp Nuclear\ images/011001-1-001001001.tif src/

Copy all files

cd /mnt/images/Salman_Cell_profiler_data/Data ; rm -f src/* ; cp Nuclear\ images/* src/

Copy all files with a small pause (this helps Spark create smaller batches, so it can scale in a timely fashion):

echo $(($(date +%s%N)/1000000))mS since epoch; cd /mnt/images/Salman_Cell_profiler_data/Data ; rm -f src/* ; find ./Nuclear\ images -name "*.tif" -type f | xargs -I {} sh -c 'cp "{}" ./src; sleep 0.1' ; echo $(($(date +%s%N)/1000000))mS since epoch

Copy 20 images:

cd /mnt/images/Salman_Cell_profiler_data/Data ; rm -f src/* ; find ./Nuclear\ images -name "*.tif" -type f | head -n 20 | xargs -I {} sh -c 'cp "{}" ./src; sleep 0.1'

  8. Let it finish. Let it scale down. Stop the app.

  9. Stop the profiling. Download the data.

Move the files in data/spark to a new directory.

Then:

mkdir spark/data/master
rsync -z 130.238.28.97:~/*.log spark/data/master
mkdir spark/data/worker1
rsync -z 130.238.28.106:~/*.log spark/data/worker1
mkdir spark/data/worker2
rsync -z 130.238.28.86:~/*.log spark/data/worker2
mkdir spark/data/worker3
rsync -z ben-spark-worker-1:~/*.log spark/data/worker3
mkdir spark/data/worker4
rsync -z 130.238.28.59:~/*.log spark/data/worker4
mkdir spark/data/worker5
# worker 5 doesn't have public IP
rsync -z ben-spark-worker-2-4:~/*.log spark/data/worker5

Runs

2019-11-20 -- trial run?
2019-11-21 -- run for CCGrid submission.
2020-02-18 -- upped concurrent jobs setting from 3 to 40.
2020-02-20 -- recording per-node executor count also.

2020-04-05 -- performance fix: only filenames in the RDD. Copy started 1588707496409 mS, ended 1588707586693 mS; last processing approx. 1588708133 s (since epoch).

MISC NOTES

To test CellProfiler (CP), run one image manually:

cellprofiler -p /mnt/images/Salman_Cell_profiler_data/Salman_CellProfiler_cell_counter_no_specified_folders.cpproj -o ~ --file-list filelist

TODO:

  1. Call the REST API to get the number of cores, e.g. http://localhost:4040/api/v1/applications

    gives the name of the app

http://localhost:4040/api/v1/applications/app-20191119153056-0033/executors -- lists the number of cores for each executor

  2. Enable autoscaling (see the config sketch below).

TODO: scaling policy in Spark -- by default it will try to 'spread out' the executors to maximize I/O throughput; this is the opposite of what HIO does.
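For the autoscaling TODO and the spread-out behaviour above, a hedged sketch of the relevant settings (key names are from the Spark 2.4.4 dynamic-allocation docs; the concrete values are illustrative only, not what the runs used):

// Sketch only: assumed Spark 2.4.4 configuration keys, illustrative values.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cellprofiler-streaming")
  // Dynamic allocation ("autoscaling") needs the external shuffle service that
  // start-shuffle-service.sh launches on each worker:
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "5")          // e.g. one per worker (assumption)
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

// spark.deploy.spreadOut is read by the standalone master (not the app), so to
// consolidate executors instead of spreading them out, set it on the master, e.g.
// in conf/spark-defaults.conf:  spark.deploy.spreadOut  false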
