
Support for provisioning nvidia gpu agents #211

Open · wants to merge 23 commits into master
Conversation

j-rivero
Contributor

@j-rivero j-rivero commented Mar 14, 2019

This PR implements support for provisioning Nvidia GPU agents, particularly on AWS nodes. Some details:

  • Add a new role called agent_gpu based on agent
  • Use lightdm to manage the launch of the X server
  • Use the lightdm display-setup-script feature to run a script that sets the proper permissions
  • Use a custom xorg.conf tailored to GRID K520 cards in headless mode (more generic configurations did not work in my tests)
  • Install nvidia-docker2
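For illustration, the lightdm pieces above could look roughly like the following sketch. This is not the PR's actual manifest; the file paths, the `[Seat:*]` section name, and the resource names are assumptions:

```puppet
# Sketch (assumption): ship a display-setup-script that fixes X server
# permissions, point lightdm at it, and restart lightdm when either changes.
file { '/etc/lightdm/xhost.sh':
  mode   => '0755',
  source => 'puppet:///modules/agent_files/xhost.sh',
}

file { '/etc/lightdm/lightdm.conf':
  content => "[Seat:*]\ndisplay-setup-script=/etc/lightdm/xhost.sh\n",
  require => File['/etc/lightdm/xhost.sh'],
  notify  => Service['lightdm'],
}
```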

Steps to test it on AWS:

  • Launch an Ubuntu 16.04 AMI on a GPU instance (g2.2xlarge)
  • ssh -X ...
  • sudo apt-get update && sudo apt-get install -y librarian-puppet git mesa-utils && git clone https://github.com/j-rivero/buildfarm_deployment -b agent_gpu && git clone https://github.com/j-rivero/buildfarm_deployment_config -b agent_gpu && sed -i '51d' buildfarm_deployment_config/reconfigure.bash && sudo mv buildfarm_* /root/ && sudo su -
  • cd buildfarm_deployment_config
  • ./reconfigure.bash agent-gpu

TODO: for some reason, after running the whole puppet code, the X server does not seem to have started with the right configuration loaded. It is necessary to restart the service to get it working correctly. I've tried to define the require clause explicitly to end with the service_lightdm_restart, but no luck. Need your help here @nuclearsandwich

@@ -42,7 +42,7 @@
   $defaults = {
     'ensure' => 'present',
   }
-  create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)
+  # create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)
Contributor Author

This needs to be reverted; it is only here to be able to run the code without keys.

@nuclearsandwich nuclearsandwich self-requested a review March 14, 2019 20:05
* Use a local copy of the nvidia sources file.

Since http/https sources weren't introduced until Puppet 4.4, we can't
use them. And the wget approach is a decent fallback but I'd rather just
vendor the sources file and deal with linkrot of the apt repositories
rather than linkrot of the sources and the apt repositories.

* Notify Exec['apt_update'] from the apt package after the source has
  been added.

This should resolve race conditions between the apt source, apt key, and
apt update execution and means that the nvidia-docker2 package need only
depend on the proper source being configured.
Going forward we should store files in the module where they're used.
These ordering arrows are redundant with the explicit `require`
directive in each resource. In general I prefer to use ordering arrows
only in one-liners rather than verbose resources as they get quite hard
to follow, especially when comments are interspersed.
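As a small illustration of that preference (resource names assumed for the example), the arrow form stays readable when kept to a single line:

```puppet
# Compact chain: install the package, then manage the config file,
# then refresh the service when the file changes.
Package['lightdm'] -> File['/etc/lightdm/lightdm.conf'] ~> Service['lightdm']
```

Spread across verbose multi-attribute resources, the same arrows become much harder to trace than an explicit `require` in each resource.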
Notifying a service triggers a refresh or restart depending on the
service's capabilities. This should be sufficient to trigger a restart.
The linux-aws kernel is only necessary when using an EC2 instance.
On other systems that kernel may not be appropriate.

I'm not sure if other virtual systems will have specific kernel
requirements as I don't have any nvidia GPU + virtualizable systems to
test on. But that can be work for future PRs.
This dependency is established using both before and require on the
respective resources. Where both resources are unconditionally declared,
require is preferred.
When two resources are both installed unconditionally, I arbitrarily
prefer declaring the dependency with `require` rather than `before`.

Using `before` is still needed in cases where a dependency needs to be
declared on a resource that is only installed conditionally. An example
is the linux-aws kernel needing to be available before nvidia-375 when
on an EC2 instance.
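A minimal sketch of the two cases described above (package and fact names are assumptions based on this thread, not the PR's exact code):

```puppet
# Unconditional dependency: prefer `require` on the dependent resource.
package { 'nvidia-375':
  ensure  => present,
  require => Apt::Source['nvidia'],
}

# Conditional dependency: the conditionally-declared resource uses `before`,
# since nvidia-375 cannot `require` a resource that may not exist.
# ($::ec2_metadata as the EC2 check is an assumption.)
if $::ec2_metadata {
  package { 'linux-aws':
    ensure => present,
    before => Package['nvidia-375'],
  }
}
```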
Contributor

@nuclearsandwich nuclearsandwich left a comment


I've made a few changes that constitute base feedback. The commit messages include the rationale, and I'm happy to explain any of them further or get countervailing feedback.

for some reason after running the whole puppet code the X server started does not seem to have started with the right configuration loaded

@j-rivero what's the best way to check that the right config is loaded? I can spin up test instances with the latest state but don't have a good way to test that the started service is in the expected state.

@j-rivero
Contributor Author

j-rivero commented Apr 4, 2019

@j-rivero what's the best way to check that the right config is loaded? I can spin up test instances with the latest state but don't have a good way to test that the started service is in the expected state.

I usually install mesa-utils and run glxinfo. If it runs normally, displaying the expected information, that tells me the right config is in place. If it displays clear error messages, bad luck.

@j-rivero
Contributor Author

j-rivero commented Apr 8, 2019

I usually install mesa-utils and run glxinfo. If it runs normally, displaying the expected information, that tells me the right config is in place. If it displays clear error messages, bad luck.

Please note that given the permission setup by xhost.sh (called by lightdm) you need to run glxinfo from the jenkins-agent account.
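The check described above could be scripted roughly as follows. This is a sketch: the renderer-string pattern and the sample glxinfo line are assumptions, and the sample is piped in so the check itself can be exercised without a GPU (the real check would run `glxinfo` as jenkins-agent against the lightdm display instead):

```shell
# Succeed only if the renderer string reports the NVIDIA/GRID card rather
# than a software renderer such as llvmpipe.
check_renderer() {
  grep -Eqi 'OpenGL renderer string: .*(NVIDIA|GRID)'
}

# Sample line (assumption) standing in for:
#   sudo -u jenkins-agent env DISPLAY=:0 glxinfo
printf 'OpenGL renderer string: GRID K520/PCIe/SSE2\n' | check_renderer \
  && echo "GPU config OK" || echo "GPU config NOT loaded"
```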

@j-rivero
Contributor Author

When trying to run the code today on a new instance, the use of nvidia-xconfig inside the xorg template is making the installation fail: the fact seems to be evaluated before the nvidia packages are installed (even though the template resource requires the nvidia package):

Could not retrieve fact='gpu_device_bus_id', resolution='<anonymous>': Could not execute 'nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"': command not found

@j-rivero
Contributor Author

I found more problems using the system Puppet 3.x on Xenial. @nuclearsandwich are we targeting Puppet 4.x or Puppet 3.x in general?

diff --git a/modules/agent_files/templates/xorg.conf.erb b/modules/agent_files/templates/xorg.conf.erb
index a53865d..8daaac3 100644
--- a/modules/agent_files/templates/xorg.conf.erb
+++ b/modules/agent_files/templates/xorg.conf.erb
@@ -37,7 +37,7 @@ Section "Device"
     Driver         "nvidia"
     VendorName     "NVIDIA Corporation"
     BoardName      "GRID K520"
-    BusID          "<%= @facts['gpu_device_bus_id'] %>"
+    BusID          "<%= @gpu_device_bus_id %>"
 EndSection
 
 Section "Screen"
diff --git a/modules/facts/lib/facter/busid.rb b/modules/facts/lib/facter/busid.rb
deleted file mode 100644
index e05ca3a..0000000
--- a/modules/facts/lib/facter/busid.rb
+++ /dev/null
@@ -1,5 +0,0 @@
-Facter.add(:gpu_device_bus_id) do
-  setcode do
-    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"')
-  end
-end
diff --git a/modules/facts/lib/facter/gpu_device_bus_id.rb b/modules/facts/lib/facter/gpu_device_bus_id.rb
new file mode 100644
index 0000000..e05ca3a
--- /dev/null
+++ b/modules/facts/lib/facter/gpu_device_bus_id.rb
@@ -0,0 +1,5 @@
+Facter.add(:gpu_device_bus_id) do
+  setcode do
+    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"')
+  end
+end
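The fact's shell pipeline (`grep BusID | sed "s/.*PCI:/PCI:/g"`) can be sketched in plain Ruby for illustration; the sample nvidia-xconfig output below is an assumption, not captured from a real card:

```ruby
# Extract the PCI BusID from `nvidia-xconfig --query-gpu-info`-style output,
# mirroring the grep/sed pipeline used by the custom fact.
def parse_bus_id(query_output)
  line = query_output.lines.find { |l| l.include?('BusID') }
  # Strip everything up to the last "PCI:" marker, as the sed expression does.
  line && line.sub(/.*PCI:/, 'PCI:').strip
end

sample = "  Name      : GRID K520\n  PCI BusID : PCI:0:3:0\n"
puts parse_bus_id(sample)  # → PCI:0:3:0
```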

@nuclearsandwich
Contributor

@nuclearsandwich are we targetting Puppet 4.x or Puppet 3.x in general?

Puppet 3.8 using the --future parser so that we're using the updated puppet language parser.

@nuclearsandwich
Contributor

the inclusion of nvidia-xconfig inside the xorg template is making the installation files like seems to be evaluated before nvidia packages are installed (even if the template option requires nvidia package):

Facts are collected before puppet runs, which means I don't think we can use a fact that comes from a command installed by puppet.

I can think of a couple of really dirty solutions (like shelling out in the embedded Ruby template 😱) but it may be that we need to make it a part of the hiera configuration.
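A hiera-based alternative might look like the following sketch; the key name and the resource details are assumptions, not part of the PR:

```puppet
# Sketch (assumption): read the bus ID from hiera per node, e.g.
#   gpu_device_bus_id: 'PCI:0:3:0'
# so nothing needs nvidia-xconfig at fact-collection time.
$gpu_device_bus_id = hiera('gpu_device_bus_id')

file { '/etc/X11/xorg.conf':
  content => template('agent_files/xorg.conf.erb'),
  require => Package['nvidia-375'],
}
```

The trade-off is that each GPU instance type would need its bus ID recorded in the hiera data rather than discovered at run time.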

ERB files are evaluated before puppet resolves the dependency
ordering of the manifests, so nvidia-xconfig is not yet available
when the ERB file is parsed
@j-rivero
Contributor Author

I've come back to test this PR today and everything seems to work fine, except that a service lightdm restart is needed after provisioning. I think the xhost.sh script is being executed (the touch command in there leaves a file in /tmp) but for some reason the X server does not have the right permissions.

@nuclearsandwich
Contributor

service lightdm restart

This is usually a resource dependency issue, but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the notify => Service[lightdm] in the /etc/lightdm/lightdm.conf resource). Especially on repeated runs, as the notification won't be triggered if the lightdm.conf file is unchanged.
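Notifying from the script resource directly might look like this sketch (file path and source location are assumptions based on the discussion):

```puppet
# Sketch (assumption): let a change to xhost.sh itself trigger a lightdm
# restart, instead of relying only on the lightdm.conf resource's notify.
file { '/etc/lightdm/xhost.sh':
  mode   => '0755',
  source => 'puppet:///modules/agent_files/xhost.sh',
  notify => Service['lightdm'],
}
```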

@j-rivero
Contributor Author

service lightdm restart

This is usually a resource dependency issue, but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the notify => Service[lightdm] in the /etc/lightdm/lightdm.conf resource). Especially on repeated runs, as the notification won't be triggered if the lightdm.conf file is unchanged.

I moved the notify clause from lightdm.conf to the xhost.sh file and made the latter depend on the former. No good result.

The only workaround I can see is to run the restart command in reconfigure.sh to restart lightdm after provisioning:

+if [[ $buildfarm_role == 'agent_gpu' ]]; then
+  service lightdm restart
+fi
