Merge pull request #74 from bjarthur/bja/lsf
add support for LSF
amitmurthy authored Jan 7, 2019
2 parents b495738 + 4678e28 commit 0b54922
Showing 3 changed files with 58 additions and 0 deletions.
16 changes: 16 additions & 0 deletions README.md
100644 → 100755
@@ -6,7 +6,9 @@ Support for different job queue systems commonly used on compute clusters.

| Job queue system | Command to add processors |
| ---------------- | ------------------------- |
| Load Sharing Facility (LSF) | `addprocs_lsf(np::Integer, flags=``)` or `addprocs(LSFManager(np, flags))` |
| Sun Grid Engine | `addprocs_sge(np::Integer, queue="")` or `addprocs(SGEManager(np, queue))` |
| SGE via qrsh | `addprocs_qrsh(np::Integer, queue="")` or `addprocs(QRSHManager(np, queue))` |
| PBS | `addprocs_pbs(np::Integer, queue="")` or `addprocs(PBSManager(np, queue))` |
| Scyld | `addprocs_scyld(np::Integer)` or `addprocs(ScyldManager(np))` |
| HTCondor | `addprocs_htc(np::Integer)` or `addprocs(HTCManager(np))` |
@@ -98,6 +100,20 @@ julia> From worker 26: lum-7-2.local
From worker 25: cheech-207-16.local
```

### SGE via qrsh

`SGEManager` uses SGE's `qsub` command to launch workers, which communicate the
TCP/IP host:port info back to the master via the filesystem. On filesystems
that are tuned to make heavy use of caching to increase throughput, launching
Julia workers can frequently time out waiting for the standard output files to appear.
In this case, it's better to use the `QRSHManager`, which uses SGE's `qrsh`
command to bypass the filesystem and capture STDOUT directly.

### Load Sharing Facility (LSF)

`LSFManager` supports IBM's LSF scheduler. It is similar to `QRSHManager` in
that it passes the `-I` (interactive) flag to `bsub`, so each worker's output
is read directly from the `bsub` process rather than from files.

### Using `LocalAffinityManager` (for pinning local workers to specific cores)

- Linux only feature.
1 change: 1 addition & 0 deletions src/ClusterManagers.jl
@@ -17,5 +17,6 @@ include("condor.jl")
include("slurm.jl")
include("affinity.jl")
include("elastic.jl")
include("lsf.jl")

end
41 changes: 41 additions & 0 deletions src/lsf.jl
@@ -0,0 +1,41 @@
export LSFManager, addprocs_lsf

struct LSFManager <: ClusterManager
    np::Integer
    bsub_flags::Cmd
end

function launch(manager::LSFManager, params::Dict, launched::Array, c::Condition)
    try
        dir = params[:dir]
        exename = params[:exename]
        exeflags = params[:exeflags]

        np = manager.np

        jobname = `julia-$(getpid())`

        cmd = `$exename $exeflags $(worker_arg())`
        bsub_cmd = `bsub -I $(manager.bsub_flags) -cwd $dir -J $jobname "$cmd"`

        stream_proc = [open(bsub_cmd) for i in 1:np]

        for i in 1:np
            config = WorkerConfig()
            config.io = stream_proc[i]
            push!(launched, config)
            notify(c)
        end

    catch e
        println("Error launching workers")
        println(e)
    end
end

manage(manager::LSFManager, id::Int64, config::WorkerConfig, op::Symbol) = nothing

kill(manager::LSFManager, id::Int64, config::WorkerConfig) = kill(config.io)

addprocs_lsf(np::Integer, bsub_flags::Cmd=``; params...) =
    addprocs(LSFManager(np, bsub_flags); params...)
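
The `launch` method above builds its `bsub` invocation by interpolating `Cmd` objects into a command literal. As a rough illustration of how that splicing works, the sketch below assembles a similar command without submitting anything. The helper name `build_bsub_cmd`, the queue name `short`, the `/tmp` directory, and the placeholder worker command are all assumptions made for the example and are not part of the commit; unlike the committed code, the worker command is interpolated unquoted here.

```julia
# Hypothetical helper (not part of ClusterManagers): builds the same kind of
# `bsub` command that `launch` assembles, but never runs it.
function build_bsub_cmd(bsub_flags::Cmd, dir::AbstractString, worker_cmd::Cmd)
    # A single argument of the form "julia-<pid>", as in the commit.
    jobname = `julia-$(getpid())`
    # Interpolating a Cmd splices its arguments in place; an empty Cmd adds none.
    return `bsub -I $bsub_flags -cwd $dir -J $jobname $worker_cmd`
end

# Assumed example values: queue "short" and a placeholder worker command.
cmd = build_bsub_cmd(`-q short`, "/tmp", `julia --worker`)

# Inspect the argument vector that would be handed to the operating system.
println(cmd.exec)
```

On a cluster with LSF available, the corresponding call through the new API would presumably look like ``addprocs_lsf(4, `-q short`)``, with the queue name replaced by one that exists at your site.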
