Support for provisioning nvidia gpu agents #211

Open
wants to merge 23 commits into
base: master
Changes from 21 commits
Commits
be3ad7c
Implement agent_gpu profile extending agent
j-rivero Jan 25, 2019
780f9c4
Fix jenkins username
j-rivero Jan 25, 2019
7908d42
Fixes to get real support on AWS machines
j-rivero Jan 25, 2019
d81a613
Be sure that xorg.conf is not overwritten
j-rivero Jan 31, 2019
f49be15
Fix syntax typo
j-rivero Feb 5, 2019
95add6e
Do not install ubuntu-drivers-common
j-rivero Feb 11, 2019
5840a7a
Use require in lightdm service
j-rivero Feb 11, 2019
b8a27f6
Need to restart the service
j-rivero Feb 11, 2019
cf8f873
Explicit order for lightdm
j-rivero Feb 11, 2019
d092226
Add debug to the script
j-rivero Feb 25, 2019
2c94220
Update nvidia-docker2 package configuration.
nuclearsandwich Apr 3, 2019
cc7d29e
Avoid use of legacy 'agent_files' module.
nuclearsandwich Apr 3, 2019
067f994
Remove unneeded ordering arrows.
nuclearsandwich Apr 3, 2019
1ce7cb7
Notify service directly rather than via exec.
nuclearsandwich Apr 3, 2019
715f343
Only install linux-aws when on an EC2 instance.
nuclearsandwich Apr 3, 2019
03d45d9
Remove double-declared dependency.
nuclearsandwich Apr 3, 2019
d17584c
Prefer using require to before.
nuclearsandwich Apr 3, 2019
8724994
Transform xorg.conf into a template to get busid value from nvidia-xc…
j-rivero Apr 5, 2019
18e0640
Fix facter implementation
j-rivero Apr 8, 2019
927037d
value needs quotes
j-rivero Apr 8, 2019
3f32d61
Change busid by gpu_device_bus_id
j-rivero Apr 26, 2019
b92729a
Revert the use of facter to get the nvidia-xconfig value
j-rivero May 3, 2019
220a30f
remove debug touch command
j-rivero Sep 3, 2019
48 changes: 48 additions & 0 deletions modules/agent_files/templates/xorg.conf.erb
@@ -0,0 +1,48 @@
Section "ServerLayout"
    Identifier "Layout0"
    InputDevice "Keyboard0" "CoreKeyboard"
    InputDevice "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier "Mouse0"
    Driver "mouse"
    Option "Protocol" "auto"
    Option "Device" "/dev/psaux"
    Option "Emulate3Buttons" "no"
    Option "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier "Keyboard0"
    Driver "kbd"
EndSection

Section "Monitor"
    Identifier "Monitor0"
    VendorName "Unknown"
    ModelName "Unknown"
    HorizSync 28.0 - 33.0
    VertRefresh 43.0 - 72.0
    Option "DPMS"
EndSection

Section "Device"
    Identifier "Device0"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "GRID K520"
    BusID "<%= @facts['gpu_device_bus_id'] %>"
EndSection

Section "Screen"
    Identifier "Default Screen"
    Device "Device0"
    Monitor "Monitor0"
    Option "AllowEmptyInitialConfiguration" "True"
EndSection
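The `BusID` line is the only dynamic part of this template. As a rough illustration of how it is filled in, here is a minimal Ruby sketch that uses plain ERB outside of Puppet. `TemplateScope`, `render_device_line` and the sample value `PCI:0:3:0` are hypothetical names and data chosen for the example; Puppet's real template evaluation is more involved, and this only assumes that facts are reachable through an `@facts` hash as the template does.

```ruby
require 'erb'

# Hypothetical stand-in for the scope a template is evaluated in:
# it only exposes a @facts hash, which is all this template line needs.
class TemplateScope
  def initialize(facts)
    @facts = facts
  end

  # Evaluate the ERB source with this object's instance variables visible.
  def render(template)
    ERB.new(template).result(binding)
  end
end

# The single dynamic line from xorg.conf.erb.
DEVICE_LINE = %q(BusID "<%= @facts['gpu_device_bus_id'] %>")

# Render the line for a given bus id (hypothetical helper for illustration).
def render_device_line(bus_id)
  TemplateScope.new('gpu_device_bus_id' => bus_id).render(DEVICE_LINE)
end

puts render_device_line('PCI:0:3:0')
```

In this sketch a fact that has not resolved (`nil`) renders as an empty `BusID ""`, which may be worth keeping in mind on a first run where `nvidia-xconfig` is not yet installed.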
5 changes: 5 additions & 0 deletions modules/facts/lib/facter/busid.rb
@@ -0,0 +1,5 @@
Facter.add(:gpu_device_bus_id) do
  setcode do
    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info | grep BusID | sed "s/.*PCI:/PCI:/g"')
  end
end
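The `grep | sed` pipeline above keeps only the `PCI:...` part of the BusID line. A possible alternative, shown only as a sketch, is to do the extraction in Ruby so it can be tested without a GPU. `extract_bus_id` is a hypothetical helper, and `SAMPLE_OUTPUT` is an assumed approximation of what `nvidia-xconfig --query-gpu-info` prints, not output captured from a real machine.

```ruby
# Hypothetical helper: returns the first "PCI:..." token found on a line
# mentioning BusID, or nil when no such line exists.
def extract_bus_id(query_output)
  bus_line = query_output.each_line.find { |line| line.include?('BusID') }
  return nil if bus_line.nil?

  # "PCI BusID : PCI:0:3:0" -> "PCI:0:3:0" (the leading "PCI " has no colon)
  bus_line[/PCI:\S+/]
end

# Assumed shape of the command output, for illustration only.
SAMPLE_OUTPUT = <<~OUTPUT
  Number of GPUs: 1

  GPU #0:
    Name      : GRID K520
    PCI BusID : PCI:0:3:0
OUTPUT

puts extract_bus_id(SAMPLE_OUTPUT)
```

Unlike the shell pipeline, this sketch returns only the first GPU's bus id when several are listed, and `nil` rather than an empty string when none is found.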
2 changes: 2 additions & 0 deletions modules/profile/files/jenkins/agent_gpu/etc/lightdm/lightdm.conf
@@ -0,0 +1,2 @@
[SeatDefaults]
display-setup-script=/etc/lightdm/xhost.sh
3 changes: 3 additions & 0 deletions modules/profile/files/jenkins/agent_gpu/etc/lightdm/xhost.sh
@@ -0,0 +1,3 @@
#!/bin/sh
xhost +si:localuser:jenkins-agent
touch /tmp/xhost_`date +"%T"`
3 changes: 3 additions & 0 deletions modules/profile/files/jenkins/agent_gpu/nvidia-docker.list
@@ -0,0 +1,3 @@
deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/$(ARCH) /
101 changes: 101 additions & 0 deletions modules/profile/manifests/jenkins/agent_gpu.pp
@@ -0,0 +1,101 @@
# Jenkins GPU Agent Profile
#
# Profile class for a node configured to act as a swarm agent for Jenkins
# with NVIDIA GPU support.
# This profile should only ever be declared with an include into a role or site manifest.
# Parameter overloading should be done using hiera automatic parameter lookup.
#
# @example
#   include profile::jenkins::agent_gpu
#
class profile::jenkins::agent_gpu {

  include apt

  # needed for xhost
  package { 'x11-xserver-utils':
    ensure => installed,
  }

  if $facts['ec2_instance_id'] {
    package { 'linux-aws':
      ensure => installed,
      # When running in EC2 the AWS kernel needs to be installed before
      # compiling the nvidia driver.
      # TODO(nuclearsandwich) Does the xorg.conf really depend on the kernel or
      # is it implicit based on drivers?
      before => [ File['/etc/X11/xorg.conf'], Package['nvidia-375'] ],
    }
  }

  package { 'xserver-xorg-dev':
    ensure => installed,
  }

  # The kernel and headers must be updated before
  # compiling the nvidia driver.
  package { 'nvidia-375':
    ensure => installed,
  }

  file { '/etc/X11/xorg.conf':
    content => template('agent_files/xorg.conf.erb'),
    mode    => '0744',
    require => [
      Package['lightdm'],
      Package['nvidia-375'],
      Package['x11-xserver-utils'],
      Package['xserver-xorg-dev'],
    ],
  }

  apt::key { 'nvidia_docker_key':
    source => 'https://nvidia.github.io/nvidia-docker/gpgkey',
    id     => 'C95B321B61E88C1809C4F759DDCAE044F796ECB0',
  }

  file { '/etc/apt/sources.list.d/nvidia-docker.list':
    source  => 'puppet:///modules/profile/jenkins/agent_gpu/nvidia-docker.list',
    require => Apt::Key['nvidia_docker_key'],
    notify  => Exec['apt_update'],
  }

  package { 'nvidia-docker2':
    ensure  => installed,
    require => File['/etc/apt/sources.list.d/nvidia-docker.list'],
  }

  package { 'lightdm':
    ensure => installed,
  }

  file { '/etc/lightdm/xhost.sh':
    source  => 'puppet:///modules/profile/jenkins/agent_gpu/etc/lightdm/xhost.sh',
    mode    => '0744',
    require => [ Package['lightdm'], Package['x11-xserver-utils'] ],
  }

  # These two resources create lightdm.conf if it does not already exist
  # and ensure that display-setup-script is set.
  file { '/etc/lightdm/lightdm.conf':
    ensure  => 'present',
    source  => 'puppet:///modules/profile/jenkins/agent_gpu/etc/lightdm/lightdm.conf',
    replace => 'no', # important: never overwrite an existing lightdm.conf
    require => [ File['/etc/lightdm/xhost.sh'], File['/etc/X11/xorg.conf'] ],
  }

  file_line { '/etc/lightdm/lightdm.conf':
    ensure  => present,
    require => File['/etc/lightdm/lightdm.conf'],
    line    => 'display-setup-script=/etc/lightdm/xhost.sh',
    notify  => Service['lightdm'],
    path    => '/etc/lightdm/lightdm.conf',
  }

  service { 'lightdm':
    ensure     => running,
    enable     => true,
    hasrestart => true,
  }
}
2 changes: 1 addition & 1 deletion modules/profile/manifests/ros/base.pp
@@ -42,7 +42,7 @@
$defaults = {
'ensure' => 'present',
}
- create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)
+ # create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)
Review comment (PR author): This needs to be reverted; it is only here so the code can be launched without keys.

}
else{
notice("No ssh_keys defined. You should probably have at least one.")
6 changes: 6 additions & 0 deletions modules/role/manifests/buildfarm/agent_gpu.pp
@@ -0,0 +1,6 @@
class role::buildfarm::agent_gpu {
  # Find the other instances
  include profile::ros::base
  include profile::jenkins::agent
  include profile::jenkins::agent_gpu
}