Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gpu support #537

Merged
merged 14 commits into from
May 31, 2017
Merged

Add gpu support #537

merged 14 commits into from
May 31, 2017

Conversation

jyundt
Copy link
Contributor

@jyundt jyundt commented Apr 17, 2017

This pull request adds support for capturing GPU information in collins. Currently, it is hardcoded for NVIDIA GPUs.

I wasn't clear if the definitions in AsssetMeta were still in use, so I added GPU_COUNT and GPU_DESCRIPTION to both AssetMeta.scala and the asset_meta table.

Jacob Yundt and others added 4 commits April 14, 2017 13:46
Updating the lshw parser to capture GPU information, WIP
This renders the proper number of GPUs and also correctly shows the
description if the Server has multiple types or an onboard graphics card.
Only add NVIDIA GPUs to asset list. Previously all display adapters (including the onboard VGA adapters) were listed as GPUs.
@michaeljs1990
Copy link
Contributor

michaeljs1990 commented Apr 17, 2017

Since we couldn't find a great way to separate the onboard graphics card from the external GPUs (The lshw xml is identical) we do have it hard coded currently. I am not sure how this should be handled since if we just allow it to parse all the GPUs every box would show up with a count of one and then you get into the case of checking NUM_NVIDIA_GPUS and NUM_AMD_GPUS and whatever other flavor you have.

The whitelist was an easy fix for us but possibly not a great way to handle it for a wider audience.

cc @byxorna @roymarantz @defect

@michaeljs1990 michaeljs1990 mentioned this pull request Apr 17, 2017
Create two new tests to verify GPU functionality works as expected.
@byxorna
Copy link
Contributor

byxorna commented Apr 20, 2017

@jyundt @michaeljs1990 a common pattern is to make a configuration array of strings to match against LSHW output to identify external graphics adapters (by product, vendor, etc). Up to you if you wanna make it opt-in or opt-out. I would love this to be configurable via production.conf, with a sane default like matching on Manufacturer field with "AMD","NVIDIA"

Copy link
Contributor

@byxorna byxorna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, thanks for the diff! Comments inline

@@ -58,6 +58,9 @@ INSERT INTO asset_meta VALUES (31, 'LLDP_VLAN_NAME', -1, 'LLDP VLANE Name', 'VLA
INSERT INTO asset_meta VALUES (32, 'INTERFACE_NAME', -1, 'Interface Name', 'Name of physical interface, e.g. eth0');
INSERT INTO asset_meta VALUES (33, 'INTERFACE_ADDRESS', 0, 'IP Address', 'Address on interface, e.g. 10.0.0.1');

INSERT INTO asset_meta VALUES (34, 'GPU_COUNT', -1, 'GPU Count', 'Number of physical GPUs in asset');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, this should actually be a new migration; once an evolution is committed, we never update it. You should be able to upgrade your DB schema by only running the newest evolutions. Please move these into a $(n+1).sql, thanks

@@ -198,6 +198,10 @@ object AssetMeta extends Schema with AnormAdapter[AssetMeta] with AssetMetaKeys
val NicName = Value(32, "INTERFACE_NAME")
// DO NOT USE - Deprecated
val NicAddress = Value(33, "INTERFACE_ADDRESS")

//I know this says Deprecated, but I'm unsure of where to define Gpu vars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did this not compile without this? I would like to not add magic incantations places if it isnt strictly necessary :)

if (descr.isEmpty) {
seq
} else {
Gpu(descr, "", "") +: seq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these last 2 params ""?

@@ -86,6 +92,41 @@ object LshwHelper extends CommonHelper[LshwRepresentation] {
)
}

protected def reconstructGpu(meta: Map[Int, Seq[MetaWrapper]]): FilteredSeq[Gpu] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume this is copy+paste from another reconstruct function? its really hard to read :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I modeled reconstructGpu and collectGpus off of the existing CPU/memory/disk/NIC functions. I wasn't sure how consistent you'd want the new GPU code to be compared to the previous CPU/memory/disk/NIC stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally this would be made more legible to make support burden easier going forward

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, I didn't change this with my most recent commit (d019c27) but I'll get this cleaned up.

@@ -79,6 +82,13 @@ class LshwParser(txt: String) extends CommonParser[LshwRepresentation](txt) {
Cpu(cores, threads, speed, asset.description, asset.product, asset.vendor)
}

val gpuMatcher: PartialFunction[NodeSeq, Gpu] = {
case n if ((n \ "@class" text) == "display" && (n \\ "vendor" text).contains("NVIDIA Corporation")) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yikes, we cant have this. We should check a config parameter that has a list of valid GPU vendors instead

@@ -2,9 +2,9 @@

# --- !Ups

UPDATE asset_meta SET priority = -1 WHERE id=33;
UPDATE asset_meta SET priority = -1 WHERE id=35;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👎
dont modify existing evolutions

@@ -492,5 +492,18 @@ class LshwParserSpec extends mutable.Specification {
}
} // wonky opterons

"Parse GPU server" in {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have many new GPU-enabled LSHW fixtures to test this functionality. Can you commit a bunch of sample LSHWs from different hw to have a more robust test suite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on this, but it will probably have to be a separate PR.

We (and by "we" I mean @michaeljs1990) found that our lshw XML files broke all of the existing model tests in LshwHelperSpec. In order to get the new GPU code to pass the tests we had to "add" GPUs to the existing XML test files.

Unfortunately it looks like there is something unique about our hardware. But with the exception of the model tests, everything else seems to work fine with our lshw XML data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... ok. I would definitely like any new functionality (especially that which has the potential to break existing functionality) to have unit test coverage before landing. Are you cool with that??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have clarified: I added two additional tests for LshwHelperSpec and LshwParserSpec to test the new GPU functionality. These new tests (and corresponding lshw XML files) work as expected.

However, for some reason the lshw XML output from all of our new GPU servers will fail all of the existing collins tests. I validated this on the current master branch and add_gpu_support branch.

I've been poking around with this, but there appears to be something 'funky' with our hardware, not the recent changes (or the tests).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a sanity check, I grabbed lshw output from a Dell device with one gpu (not one of our regular GPU servers) and added two new tests (there are four GPU tests total). Everything works as expected: 7212e38

I'm reasonably confident that there is something off with our GPU servers / our lshw output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the tests are failing because you are asserting a Vector()==List(). @jyundt

Expected: LshwRepresentation(Vector(Cpu(8,16....
Actual:      LshwRepresentation(List     (Cpu(8,16...

I think if you sort that out (and ensure that equals() on Cpu and Gpu classes work with value equality and not object identity), it should work!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the vector vs list discrepancy previously, but I just stumbled onto the problem: our CPU frequency is different per processor unlike the existing XML test cases.

I manually edited <size units="Hz">###</size> from one of our sample XML files and all tests are now passing from our funky gpu servers.

Interestingly enough, I think this problem would have bubbled up previously if lshw-b0216.xml was used as a test case for the LshwHelperSpec test (instead of just LshwParserSpec).

Thanks for the help, lemme take another look at this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@byxorna I poked around this a bit more and I think our usage of dynamic frequency scaling is causing this behavior. The reported CPU speed is "changing" in the lshw XML files depending on the load of the system.

Additionally, it looks like the current default behavior is to capture CPU information from the first socket and persist that information for all sockets.

As a result of these two behaviors, the existing LshwHelperSpec tests will always fail.

TBH, I'm not sure what is "expected behavior" for this. Should collins assume the max clock speed for CpuSpeedGhz for all sockets? Should collins honor the data in the lshw XML files and permit different clock speeds per socket?

I'm leaning towards opening a separate GH issue for this because it is unrelated to any of the GPU changes.

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think putting more assumptions into collins to address [mis]configuration of servers during lshw upload is a road we do not want to go further down. IMHO, speed stepping should be disabled by genesis (if it isnt already) before lshw collection, as a workaround. This definitely merits more discussion though, because of the assumptions baked into collins. Could you open an issue and tag me in it? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call with disabling speed stepping, I hadn't thought of that.

New GH issue: #550

Jacob Yundt and others added 3 commits April 20, 2017 16:33
* Update evolutions migration to avoid hardcoding index values.

* Remove additional hardcoded GPU AssetMeta references.

* Add initial gpu whitelist config file functionality.
Copy link
Contributor Author

@jyundt jyundt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! As per your comments I removed any references to the deprecated "static" AssetMeta references and corrected the evolutions SQL migration scripts.

I also added initial functionality for a "whitelist" for gpuVendors defined in the collins config file.

Let me know your thoughts?

@@ -492,5 +492,18 @@ class LshwParserSpec extends mutable.Specification {
}
} // wonky opterons

"Parse GPU server" in {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on this, but it will probably have to be a separate PR.

We (and by "we" I mean @michaeljs1990) found that our lshw XML files broke all of the existing model tests in LshwHelperSpec. In order to get the new GPU code to pass the tests we had to "add" GPUs to the existing XML test files.

Unfortunately it looks like there is something unique about our hardware. But with the exception of the model tests, everything else seems to work fine with our lshw XML data.

@@ -86,6 +92,41 @@ object LshwHelper extends CommonHelper[LshwRepresentation] {
)
}

protected def reconstructGpu(meta: Map[Int, Seq[MetaWrapper]]): FilteredSeq[Gpu] = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I modeled reconstructGpu and collectGpus off of the existing CPU/memory/disk/NIC functions. I wasn't sure how consistent you'd want the new GPU code to be compared to the previous CPU/memory/disk/NIC stuff.

@michaeljs1990
Copy link
Contributor

https://github.com/tumblr/collins/blob/master/support/ruby/collins-client/lib/collins/asset.rb#L206 will want to add gpus to this as well for consistency and so it's easy to access from the ruby client. Not sure if other stuff will need to be updated as well.

@michaeljs1990
Copy link
Contributor

Not sure what this is about... but I was testing this PR inside the provided dockerfile for this repo and even adding the additional config needed for gpuVendors to the production.conf file it will not intake GPUs but does show the additional field in hardware.

Likely doing something really dumb but should likely figure out exactly what that dumb thing is :/

@jyundt
Copy link
Contributor Author

jyundt commented May 5, 2017

Not sure what this is about... but I was testing this PR inside the provided dockerfile for this repo and even adding the additional config needed for gpuVendors to the production.conf file it will not intake GPUs but does show the additional field in hardware.

I think I found the culprit, the docker specific validations.conf file is missing the GpuConfig.

I'll validate and update my PR.

@michaeljs1990
Copy link
Contributor

michaeljs1990 commented May 9, 2017

FWIW we ended up having to deploy this which just happens to be the current master branch with this deployed on-top of it. It's working well so far with the exception that we had to install the NVIDIA driver for our genesis iso to get access to some extra information about the GPUs in /proc/sys by default the lshw returns "NVIDIA Corporation" for product and vendor.

Our current plan is spicing up the lshw xml for product by injecting the actual model name found in /proc/sys/**/information (can't remember exact location right now).

I should say anything older than the Titan X models do return same product information.

@michaeljs1990
Copy link
Contributor

Something strange is happening on update. I checked by uploading a valid XML file with GPUs and then changing the name of a few things such as the NIC and GPU product. The NIC is updated fine and has a new name however it seems GPU is never updated.

The farthest I have been able to trace it back is to LshwHelper.updateAsset however it seems to have the correct XML at this point.

@byxorna
Copy link
Contributor

byxorna commented May 18, 2017

@jyundt is your latest update fixing the odd behavior that @michaeljs1990 pointed out? Is this diff ready for review now? I want to make sure people dont spend time reviewing a WIP

Thanks for the diff, this is gonna be awesome!!!

Copy link
Contributor

@byxorna byxorna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice update! Looking forward to landing this :)

@@ -206,9 +206,15 @@ object AssetMeta extends Schema with AnormAdapter[AssetMeta] with AssetMetaKeys
val BaseProduct = findOrCreateFromName("BASE_PRODUCT")
val BaseVendor = findOrCreateFromName("BASE_VENDOR")
val BaseSerial = findOrCreateFromName("BASE_SERIAL")
val GpuCount = findOrCreateFromName("GPU_COUNT")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be computed on reconstruction, instead of storing this? It would make it easier to compute how many NVIDIA vs AMD GPUs there are (if in a mixed env)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I removed the extraneous GPU_COUNT

@@ -206,9 +206,15 @@ object AssetMeta extends Schema with AnormAdapter[AssetMeta] with AssetMetaKeys
val BaseProduct = findOrCreateFromName("BASE_PRODUCT")
val BaseVendor = findOrCreateFromName("BASE_VENDOR")
val BaseSerial = findOrCreateFromName("BASE_SERIAL")
val GpuCount = findOrCreateFromName("GPU_COUNT")
val GpuDescription = findOrCreateFromName("GPU_DESCRIPTION")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have both vendor and description? I believe pci enumeration has fields for both vendor and description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added product and vendor. It's a bit confusing because lshw refers to the "label" or "description" of the PCI device as the "product".

@@ -204,6 +204,14 @@
<td>@{if (aa.lshw.hasHyperthreadingEnabled) "Yes" else "No"}</td>
</tr>

<th colspan="3">GPU</th>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace is off here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, good catch.

@aa.lshw.gpus.zipWithIndex.map { case(gpu,id) =>
<tr>
<th>@id</th>
<td>@gpu.description</td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add vendor column here, and maybe sort by id, so the GPUs show up in enumeration order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added Description (internally called "product") and Vendor columns.

override val namespace = "gpu"
override val referenceConfigFilename = "gpu_reference.conf"

def gpuVendors= getStringSet("gpuVendors")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like the default values to be reflected here, instead of relying on the config to include something. I.e. the default unpopulated config should include nvidia, instead of the user being forced to set this config on every installation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added default value of "NVIDIA Corporation".

@@ -282,6 +282,10 @@ lshw {
lshw.defaultNicCapacity=10000000000
}

gpu {
gpuVendors = ["NVIDIA Corporation"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a stutter. gpu.gpuVendors? Maybe gpu.supportedVendorStrings or something to be more explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, renamed gpuVendors -> supportedVendorStrings

- Split 'description' -> 'product' and 'vendor' based on PCI enumeration
- Remove static GPU_COUNT
- Cleanup hwdetails view to include GPU product, GPU vendor
- Modify hwdetails view to only display GPUs info if GPUs are present
- Whitelist NVIDIA GPUs by default
- Rename GPU whitelist config param to avoid stutter
@jyundt
Copy link
Contributor Author

jyundt commented May 19, 2017

@jyundt is your latest update fixing the odd behavior that @michaeljs1990 pointed out? Is this diff ready for review now? I want to make sure people dont spend time reviewing a WIP

Yes, I incorporated the change from @michaeljs1990 with f301fd9.

My most recent commit (d019c27) should incorporate all of the requested changes except cleaning up reconstructGpu (it's still messy).

New lshw output provided from a Dell device with one GPU.
@@ -105,6 +113,7 @@ case class LshwRepresentation(
(cpuCoreCount == other.cpuCoreCount) &&
(cpuThreadCount == other.cpuThreadCount) &&
(cpuSpeed == other.cpuSpeed) &&
(gpuCount == other.gpuCount) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure where this method gets used, but comparing GPU counts (as opposed to deep comparison of desc/vendor/product) could produce odd behavior when you change GPUs and update LSHW, but the same number of GPUs are present. Would collins not update the meta attributes in the DB if the old reconstructed LSHW equals the newly submitted one?

I think what you have in this diff is fine, but we should test this behavior (induct with GPU A, change to GPU B, induct, ensure LSHW is updated properly) and open an issue if this is a problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this was something I intended to remove but missed.

override val namespace = "gpu"
override val referenceConfigFilename = "gpu_reference.conf"

def supportedVendorStrings= getStringSet("supportedVendorStrings", Set("NVIDIA Corporation"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving the default string to the reference config?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH i feel like the reference config is a confusing layer of indirection. When trying to figure out how collins gets some default config setting, you have to unwind the magic config loading behavior (reference configs, included base configs, environment configs, then application.conf) to understand what gets what from where. I personally prefer (and recommended) to have default config directly in the code


# --- !Ups

INSERT INTO asset_meta (name, priority, label, description) VALUES ('GPU_PRODUCT', -1, 'GPU Product', 'GPU product (description of GPU)');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesnt this need a GPU_DESCRIPTION as well? I see we extract description from LSHW, but it doesnt seem like it ever gets persisted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I omitted GPU_DESCRIPTION because I didn't see any value in storing it in the database. Based on all of the lshw dumps I could find, the description for any display adapter (including onboard adapters) will be "VGA compatible controller".

But I don't really feel too strongly about it and could add it if you think there is value in persisting this information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that is fair. If its not a useful field, lets ignore it. I was thinking it could contain some more useful information, but apparently not! 💃

Copy link
Contributor

@byxorna byxorna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me! Thanks so much for this!

Few comments inline, namely around making sure you persist GPU_DESCRIPTION if you are extracting it. More tests/validation would be great, but this looks like it is close. Ill add in more people to review now.

@byxorna byxorna requested review from defect and roymarantz May 20, 2017 09:18
Copy link
Contributor

@byxorna byxorna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as approved to remove the blocker :)

@byxorna
Copy link
Contributor

byxorna commented May 23, 2017

I think this is good to land. @jyundt can you open a new diff for gh-pages to document the new configuration bits for the GPU section?

@jyundt
Copy link
Contributor Author

jyundt commented May 23, 2017

I think this is good to land. @jyundt can you open a new diff for gh-pages to document the new configuration bits for the GPU section?

#551, let me know if you'd like anything documented/worded differently.

@byxorna
Copy link
Contributor

byxorna commented May 23, 2017

@jyundt looks great! Lets wait for @defect and @roymarantz to review before we land?

@defect
Copy link
Contributor

defect commented May 29, 2017

One teeny tiny comment, but looks good to me :)

@byxorna
Copy link
Contributor

byxorna commented May 30, 2017

@jyundt ill land this unless you have any last change you need to make, cool?

@jyundt
Copy link
Contributor Author

jyundt commented May 31, 2017

No, no additional changes. Thanks for your help with this!

@byxorna byxorna merged commit d662de2 into tumblr:master May 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants