Device Aware Serialization/Deserialization #311
narendasan
started this conversation in
RFCs
Replies: 3 comments 1 reply
-
@andi4191 What if we looked at the current device information to select the device? Say a module is on device 0: then we target device 0 unless specified otherwise in the compile spec, since we have the device info from the provided module. This would make the device field optional and give us a strong default that makes sense in terms of PyTorch and how it works.
-
Any progress on this functionality? Does it work?
-
Goal
We want TRTorch programs to be portable across machines that may not have the exact same GPU configuration but are still capable of running the embedded engine. Therefore we need to be able to select the correct device to deserialize the engine on. The proposed solution is to include, in the serialized program, information about the engine's compatibility and preferences on which devices to select.
Device Information and Hierarchy
Relevant information:
- Compute Capability
- GPU ID
- Device Name
- GPU or DLA
Device selection hierarchy: (1 is broadest allowable, N is most preferable)
There are two separate cases to consider: when the engine is created for a GPU and when it is created for DLA. For GPU, if there is no device with the compute capability used to build the engine, then the runtime should error out with an error like:
Unable to deserialize trtorch program because engines were built targeting compute capability X, however no device with compute capability X available
If there is more than one device (see the sketch after this list):
IF GPU:
IF DLA:
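As a rough sketch of the GPU branch: the compute-capability filter, the error, and the lowest-ID tie-break come from this RFC; preferring an exact device-name match among candidates is an assumption, as is the helper name select_gpu.

```cpp
#include <cuda_runtime.h>

#include <stdexcept>
#include <string>
#include <vector>

// Sketch: choose the GPU to deserialize an engine on. Only devices whose
// compute capability matches the one the engine was built for are allowable;
// an exact device-name match is treated as most preferable (assumption),
// and the lowest device ID breaks remaining ties.
int select_gpu(int target_major, int target_minor, const std::string& target_name) {
  int count = 0;
  cudaGetDeviceCount(&count);

  std::vector<int> candidates;
  for (int id = 0; id < count; id++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, id);
    if (prop.major == target_major && prop.minor == target_minor) {
      if (target_name == prop.name) {
        return id;  // first (lowest-ID) exact name match
      }
      candidates.push_back(id);
    }
  }

  if (candidates.empty()) {
    throw std::runtime_error(
        "Unable to deserialize trtorch program because engines were built "
        "targeting compute capability " + std::to_string(target_major) + "." +
        std::to_string(target_minor) +
        ", however no device with that compute capability is available");
  }
  return candidates.front();  // lowest device ID wins ties
}
```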
Compile Time:
At compile time the user is able to select the target device explicitly through existing TRTorch APIs.
User level API:
trtorch::CompileSpec::Device
https://github.com/NVIDIA/TRTorch/blob/08b2455c3ad540017c8929bbcab4669db7bdee7d/cpp/api/include/trtorch/trtorch.h#L261
Example Usecases
{device_type = DeviceType::kGPU, gpu_id = 0}
{device_type = DeviceType::kDLA, dla_core = 1}
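Putting the compile-time API together, a minimal sketch (the CompileSpec fields follow the header linked above; the model path and input shape are placeholders, and exact signatures may differ by version):

```cpp
#include <torch/script.h>
#include "trtorch/trtorch.h"

int main() {
  // Load a TorchScript module; placeholder path.
  auto mod = torch::jit::load("model.ts");

  // Input shape is a placeholder; device settings mirror the use cases above.
  auto spec = trtorch::CompileSpec({{1, 3, 224, 224}});
  spec.device.device_type = trtorch::CompileSpec::Device::DeviceType::kGPU;
  spec.device.gpu_id = 0;  // explicit target; otherwise a default would apply

  // Compile; the resulting module embeds the engine plus its device info.
  auto trt_mod = trtorch::CompileGraph(mod, spec);
  trt_mod.save("trt_model.ts");
  return 0;
}
```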
@andi4191: For DLA, do we correct the GPU ID if it's the wrong device?
Source of Truth:
Split: ConversionCtx manages the active device during conversion and building. Post conversion and build, CompileGraph will use the data in CompileSpec.device to create a runtime::DeviceInfo struct. This will then get passed to AddEngineToGraph, which passes it to the TRTEngine constructor.
Run Time:
At runtime the device information is mostly opaque to the user. It would be nice to have some API through PyTorch to move engines around, but for the time being, the device the engine is deserialized on is where it will run.
DeviceInfo Struct (a field in TRTEngine):
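A plausible sketch of this struct, derived from the "Relevant information" list above; the field names and types here are assumptions, not the actual definition:

```cpp
#include <cstdint>
#include <string>

#include "NvInfer.h"

namespace trtorch {
namespace core {
namespace runtime {

// Hypothetical layout: one field per item under "Relevant information".
struct DeviceInfo {
  int64_t id;                        // GPU ID (the device to deserialize on)
  int64_t major;                     // compute capability, major version
  int64_t minor;                     // compute capability, minor version
  nvinfer1::DeviceType device_type;  // GPU or DLA
  std::string device_name;           // e.g. "Tesla V100-SXM2-16GB"
};

} // namespace runtime
} // namespace core
} // namespace trtorch
```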
User level API:
Ideally this could be done through PyTorch APIs:
Potential WAR: Overwrite load where ...
Source of Truth:
core::runtime::TRTEngine (device_info)
Execution: (execute_engine)
If the current device != the device ID in device_info, set the correct device. If the input data is on the wrong device, we will throw a warning about having to move data and move the tensors to the correct device via aten::Tensor::to. Output tensors should be created on the target device.
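A minimal sketch of that guard-and-move step (the device_info name follows this RFC; the helper name prepare_inputs and the exact guard calls are assumptions):

```cpp
#include <c10/cuda/CUDAFunctions.h>
#include <torch/torch.h>

#include <vector>

// Sketch: before enqueueing the engine, activate its target device and move
// any inputs that live elsewhere, warning as described above.
std::vector<at::Tensor> prepare_inputs(std::vector<at::Tensor> inputs,
                                       int64_t target_gpu_id) {
  if (c10::cuda::current_device() != target_gpu_id) {
    c10::cuda::set_device(static_cast<c10::DeviceIndex>(target_gpu_id));
  }

  std::vector<at::Tensor> moved;
  moved.reserve(inputs.size());
  for (auto& in : inputs) {
    if (in.is_cuda() && in.device().index() == target_gpu_id) {
      moved.push_back(in);
    } else {
      TORCH_WARN("Input tensor is not on the engine's device, moving to GPU ",
                 target_gpu_id);
      // aten::Tensor::to, as described above
      moved.push_back(
          in.to(at::Device(at::kCUDA,
                           static_cast<c10::DeviceIndex>(target_gpu_id))));
    }
  }
  return moved;  // outputs would likewise be allocated on the target device
}
```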
Serialization/Deserialization (torch.jit.save / torch.jit.load)
Serialization Format:
std::vector<std::string> of length 2 (this should be asserted in the deserializer):
- 0: serialized engine
- 1: serialized device struct
Serialization:
Takes a TRTEngine object, serializes the CUDA engine via the TRT API and device_info via serialize_device_info.
Deserialization:
Takes a std::vector<std::string> of length 2 and passes the values to the TRTEngine constructor. The TRTEngine deserialization constructor will set the correct device before deserializing the CUDA engine, checking whether there is an available device that matches the device specifications. We use the rules defined above to select the correct device to deserialize on; to break ties, right now we use the lowest device ID. The end result: a TRTEngine is constructed, with its execution context targeting the GPU specified in device_info.
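A sketch of this round trip under the two-string format; serialize_device_info appears in this RFC, while deserialize_device_info, set_cuda_device, and the TRTEngine members shown are stand-ins declared here only for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

#include "NvInfer.h"

// Minimal stand-ins for the sketch; see the DeviceInfo sketch above.
struct DeviceInfo;
struct TRTEngine {
  nvinfer1::ICudaEngine* cuda_engine;
  DeviceInfo* device_info;
  TRTEngine(const std::string& serialized_engine, DeviceInfo* info);
};
std::string serialize_device_info(const DeviceInfo* info);
DeviceInfo* deserialize_device_info(const std::string& data);
void set_cuda_device(const DeviceInfo* info);

// Sketch: pack the engine into the length-2 vector described above.
std::vector<std::string> serialize(TRTEngine& engine) {
  // 0: serialized engine, via the TensorRT API
  nvinfer1::IHostMemory* blob = engine.cuda_engine->serialize();
  std::string serialized_engine(static_cast<const char*>(blob->data()),
                                blob->size());
  blob->destroy();

  // 1: serialized device struct
  std::string serialized_device = serialize_device_info(engine.device_info);

  return {serialized_engine, serialized_device};
}

// Sketch: unpack, select the device first, then rebuild the engine on it.
TRTEngine deserialize(const std::vector<std::string>& data) {
  assert(data.size() == 2);  // [serialized engine, serialized device struct]
  DeviceInfo* device_info = deserialize_device_info(data[1]);
  set_cuda_device(device_info);  // apply the selection rules defined above
  return TRTEngine(data[0], device_info);
}
```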
Next Steps
We need to think about how to give users control of where engines are located at runtime.