diff --git a/README.md b/README.md
index 6029258..70bb3a4 100644
--- a/README.md
+++ b/README.md
@@ -6,8 +6,8 @@ My implementation of [BiSeNetV1](https://arxiv.org/abs/1808.00897) and [BiSeNetV
 mIOUs and fps on cityscapes val set:
 | none | ss | ssc | msf | mscf | fps(fp32/fp16/int8) | link |
 |------|:--:|:---:|:---:|:----:|:---:|:----:|
-| bisenetv1 | 75.44 | 76.94 | 77.45 | 78.86 | 25/78/141 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v1_city_new.pth) |
-| bisenetv2 | 74.95 | 75.58 | 76.53 | 77.08 | 26/67/95 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v2_city.pth) |
+| bisenetv1 | 75.44 | 76.94 | 77.45 | 78.86 | 112/239/435 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v1_city_new.pth) |
+| bisenetv2 | 74.95 | 75.58 | 76.53 | 77.08 | 103/161/198 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v2_city.pth) |
 
 mIOUs on cocostuff val2017 set:
 | none | ss | ssc | msf | mscf | link |
diff --git a/tensorrt/CMakeLists.txt b/tensorrt/CMakeLists.txt
index 8ffbd39..df8925e 100644
--- a/tensorrt/CMakeLists.txt
+++ b/tensorrt/CMakeLists.txt
@@ -2,8 +2,8 @@
 CMAKE_MINIMUM_REQUIRED(VERSION 3.17)
 PROJECT(segment)
 
-set(CMAKE_CXX_FLAGS "-std=c++14 -O2")
-set(CMAKE_NVCC_FLAGS "-std=c++14 -O2")
+set(CMAKE_CXX_FLAGS "-std=c++17 -O2")
+set(CMAKE_NVCC_FLAGS "-std=c++17 -O2")
 
 link_directories(/usr/local/cuda/lib64)
@@ -21,7 +21,7 @@ add_executable(segment segment.cpp trt_dep.cpp read_img.cpp)
 target_include_directories(
     segment PUBLIC ${CUDA_INCLUDE_DIRS} ${CUDNN_INCLUDE_DIRS} ${OpenCV_INCLUDE_DIRS})
 target_link_libraries(
-    segment -lnvinfer -lnvinfer_plugin -lnvparsers -lnvonnxparser -lkernels
+    segment -lnvinfer -lnvinfer_plugin -lnvonnxparser -lkernels
     ${CUDA_LIBRARIES} ${OpenCV_LIBRARIES})
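A note on the link line above: TensorRT 10 removed the legacy Caffe/UFF parsers, so `libnvparsers` no longer exists and only the ONNX parser is linked. Since the sources below also move to TensorRT 10 APIs, one cheap way to catch builds against an older TensorRT is a compile-time guard; a minimal sketch (my suggestion, not part of the patch):

```cpp
// Suggestion only: fail the build early when compiled against TensorRT < 10,
// since this patch assumes the TensorRT 10 API surface (libnvparsers is gone,
// and the code below uses the tensor-address execution API).
#include <NvInferVersion.h>

static_assert(NV_TENSORRT_MAJOR >= 10,
        "tensorrt/segment requires TensorRT >= 10 after this change");
```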
diff --git a/tensorrt/README.md b/tensorrt/README.md
index 4dafb11..bf9d4fc 100644
--- a/tensorrt/README.md
+++ b/tensorrt/README.md
@@ -17,12 +17,12 @@ Then we can use either c++ or python to compile the model and run inference.
 
 #### 1. My platform
 
-* ubuntu 18.04
-* nvidia Tesla T4 gpu, driver newer than 450.80
-* cuda 11.3, cudnn 8
-* cmake 3.22.0
+* ubuntu 22.04
+* nvidia A40 gpu, driver newer than 555.42.06
+* cuda 12.1, cudnn 8
+* cmake 3.22.1
 * opencv built from source
-* tensorrt 8.2.5.1
+* tensorrt 10.3.0.26
 
 
@@ -39,14 +39,14 @@ This would generate a `./segment` in the `tensorrt/build` directory.
 
 #### 3. Convert onnx to tensorrt model
 If you can successfully compile the source code, you can parse the onnx model to tensorrt model with one of the following commands.
-For fp32, command is:
-```
-$ ./segment compile /path/to/onnx.model /path/to/saved_model.trt
-```
-If your gpu support acceleration with fp16 inferenece, you can add a `--fp16` option to in this step:
+For fp32/fp16/bf16, the command is:
 ```
+$ ./segment compile /path/to/onnx.model /path/to/saved_model.trt --fp32
 $ ./segment compile /path/to/onnx.model /path/to/saved_model.trt --fp16
+$ ./segment compile /path/to/onnx.model /path/to/saved_model.trt --bf16
 ```
+Make sure that your gpu supports acceleration with fp16/bf16 inference when you set these options.
+
 Building an int8 engine is also supported. First, you should make sure your gpu supports int8 inference, or your model will not be faster than fp16/fp32. Then you should prepare a certain amount of images for int8 calibration. In this example, I use the train set of cityscapes for calibration. The command is like this:
 ```
 $ rm calibrate_int8 # delete this if exists
@@ -72,26 +72,21 @@ $ ./segment test /path/to/saved_model.trt
 
 #### 6. Tips:
 
-1. ~Since tensorrt 7.0.0 cannot parse well the `bilinear interpolation` op exported from pytorch, I replace them with pytorch `nn.PixelShuffle`, which would bring some performance overhead(more flops and parameters), and make inference a bit slower. Also due to the `nn.PixelShuffle` op, you **must** export the onnx model with input size to be *n* times of 32.~
-If you are using 7.2.3.4 or newer versions, you should not have problem with `interpolate` anymore.
-2. ~There would be some problem for tensorrt 7.0.0 to parse the `nn.AvgPool2d` op from pytorch with onnx opset11. So I use opset10 to export the model.~
-Likewise, you do not need to worry about this anymore with version newer than 7.2.3.4.
+The speed (fps) is tested on a single nvidia A40 gpu with `batchsize=1` and `cropsize=(1024,2048)`, which might differ from your platform and settings. You should evaluate the speed considering your own platform and cropsize. Also note that the performance would be affected if your gpu is concurrently working on other tasks. Please make sure no other program is running on your gpu when you test the speed.
 
-3. The speed(fps) is tested on a single nvidia Tesla T4 gpu with `batchsize=1` and `cropsize=(1024,2048)`. Please note that T4 gpu is almost 2 times slower than 2080ti, you should evaluate the speed considering your own platform and cropsize. Also note that the performance would be affected if your gpu is concurrently working on other tasks. Please make sure no other program is running on your gpu when you test the speed.
-4. On my platform, after compiling with tensorrt, the model size of bisenetv1 is 29Mb(fp16) and 128Mb(fp32), and the size of bisenetv2 is 16Mb(fp16) and 42Mb(fp32). However, the fps of bisenetv1 is 68(fp16) and 23(fp32), while the fps of bisenetv2 is 59(fp16) and 21(fp32). It is obvious that bisenetv2 has fewer parameters than bisenetv1, but the speed is otherwise. I am not sure whether it is because tensorrt has worse optimization strategy in some ops used in bisenetv2(such as depthwise convolution) or because of the limitation of the gpu on different ops. Please tell me if you have better idea on this.
-5. int8 mode is not always greatly faster than fp16 mode. For example, I tested with bisenetv1-cityscapes and tensorrt 8.2.5.1. With v100 gpu and driver 515.65, the fp16/int8 fps is 185.89/186.85, while with t4 gpu and driver 450.80, it is 78.77/142.31.
+### Using python (this is not updated to tensorrt 10.3)
+You can also use python script to compile and run inference of your model.
 
-### Using python
-
-You can also use python script to compile and run inference of your model.
+The following is still the usage method of tensorrt 8.2.
 
 
 #### 1. Compile model to onnx
+
 With this command:
 ```
 $ cd BiSeNet/tensorrt
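One behavior the README update does not spell out: `segment.cpp` below also reads an optional trailing integer as the optimization-profile batch size, both for `compile` (after the quantization flag, or after the calibration paths for `--int8`) and for `test`. A sketch of the argument layout, mirroring the logic in `compile_onnx`/`test_speed` further down (the function name here is hypothetical):

```cpp
// Sketch: where the optional batch-size argument sits in the new CLI.
//   ./segment compile model.onnx model.trt --fp16 8
//   ./segment compile model.onnx model.trt --int8 <data_root> <ann_file> 8
//   ./segment test model.trt 8
#include <string>
#include <vector>

int parse_opt_bsize_sketch(const std::vector<std::string>& args) {
    // args = argv[1..]: {"compile", onnx_path, trt_path, quant_flag, ...}
    int opt_bsize = 1;  // default: optimize the engine profile for batch size 1
    if (args[3] == "--int8") {
        if (args.size() > 6) opt_bsize = std::stoi(args[6]);
    } else {
        if (args.size() > 4) opt_bsize = std::stoi(args[4]);
    }
    return opt_bsize;
}
```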
diff --git a/tensorrt/batch_stream.hpp b/tensorrt/batch_stream.hpp
index 09d0262..2071204 100644
--- a/tensorrt/batch_stream.hpp
+++ b/tensorrt/batch_stream.hpp
@@ -52,7 +52,6 @@ class BatchStream : public IBatchStream
     void reset(int firstBatch) override
     {
-        cout << "mBatchCount: " << mBatchCount << endl;
         mBatchCount = firstBatch;
     }
 
diff --git a/tensorrt/segment.cpp b/tensorrt/segment.cpp
index a71b4ad..807f71e 100644
--- a/tensorrt/segment.cpp
+++ b/tensorrt/segment.cpp
@@ -13,6 +13,7 @@
 #include <iostream>
 #include <fstream>
 #include <vector>
+#include <sstream>
 
 #include "trt_dep.hpp"
 #include "read_img.hpp"
@@ -27,8 +28,7 @@ using nvinfer1::IBuilderConfig;
 using nvinfer1::IRuntime;
 using nvinfer1::IExecutionContext;
 using nvinfer1::ILogger;
-using nvinfer1::Dims3;
-using nvinfer1::Dims2;
+using nvinfer1::Dims;
 using Severity = nvinfer1::ILogger::Severity;
 
 using std::string;
@@ -39,6 +39,7 @@ using std::vector;
 using std::cout;
 using std::endl;
 using std::array;
+using std::stringstream;
 
 using cv::Mat;
@@ -53,33 +54,25 @@ void test_speed(vector<string> args);
 
 int main(int argc, char* argv[]) {
-    if (argc < 3) {
-        cout << "usage is ./segment compile/run/test\n";
-        std::abort();
-    }
+    CHECK(argc >= 3, "usage is ./segment compile/run/test");
 
     vector<string> args;
     for (int i{1}; i < argc; ++i) args.emplace_back(argv[i]);
 
     if (args[0] == "compile") {
-        if (argc < 4) {
-            cout << "usage is: ./segment compile input.onnx output.trt [--fp16|--fp32]\n";
-            cout << "or ./segment compile input.onnx output.trt --int8 /path/to/data_root /path/to/ann_file\n";
-            std::abort();
-        }
+        stringstream ss;
+        ss << "usage is: ./segment compile input.onnx output.trt [--fp16|--fp32|--bf16|--fp8]\n"
+            << "or ./segment compile input.onnx output.trt --int8 /path/to/data_root /path/to/ann_file\n";
+        CHECK(argc >= 5, ss.str());
         compile_onnx(args);
     } else if (args[0] == "run") {
-        if (argc < 5) {
-            cout << "usage is ./segment run ./xxx.trt input.jpg result.jpg\n";
-            std::abort();
-        }
+        CHECK(argc >= 5, "usage is ./segment run ./xxx.trt input.jpg result.jpg");
         run_with_trt(args);
     } else if (args[0] == "test") {
-        if (argc < 3) {
-            cout << "usage is ./segment test ./xxx.trt\n";
-            std::abort();
-        }
+        CHECK(argc >= 3, "usage is ./segment test ./xxx.trt");
         test_speed(args);
+    } else {
+        CHECK(false, "usage is ./segment compile/run/test");
     }
 
     return 0;
@@ -87,39 +80,50 @@ void compile_onnx(vector<string> args) {
+    string quant("fp32");
     string data_root("none");
     string data_file("none");
-    if ((args.size() >= 4)) {
-        if (args[3] == "--fp32") {
-            quant = "fp32";
-        } else if (args[3] == "--fp16") {
-            quant = "fp16";
-        } else if (args[3] == "--int8") {
-            quant = "int8";
-            data_root = args[4];
-            data_file = args[5];
-        } else {
-            cout << "invalid args of quantization: " << args[3] << endl;
-            std::abort();
-        }
-    }
+    int opt_bsize = 1;
+
+    std::unordered_map<string, string> quant_map{
+        {"--fp32", "fp32"},
+        {"--fp16", "fp16"},
+        {"--bf16", "bf16"},
+        {"--fp8", "fp8"},
+        {"--int8", "int8"},
+    };
+    CHECK(quant_map.find(args[3]) != quant_map.end(),
+            "invalid args of quantization: " + args[3]);
+    quant = quant_map[args[3]];
+    if (quant == "int8") {
+        data_root = args[4];
+        data_file = args[5];
+    }
+
+    if (args[3] == "--int8") {
+        if (args.size() > 6) opt_bsize = std::stoi(args[6]);
+    } else {
+        if (args.size() > 4) opt_bsize = std::stoi(args[4]);
+    }
 
-    TrtSharedEnginePtr engine = parse_to_engine(args[1], quant, data_root, data_file);
-    serialize(engine, args[2]);
+    SemanticSegmentTrt ss_trt;
+    ss_trt.set_opt_batch_size(opt_bsize);
+    ss_trt.parse_to_engine(args[1], quant, data_root, data_file);
+    ss_trt.serialize(args[2]);
 }
 
 
 void run_with_trt(vector<string> args) {
-    TrtSharedEnginePtr engine = deserialize(args[1]);
+    SemanticSegmentTrt ss_trt;
+    ss_trt.deserialize(args[1]);
 
-    Dims3 i_dims = static_cast<Dims3&&>(
-            engine->getBindingDimensions(engine->getBindingIndex("input_image")));
-    Dims3 o_dims = static_cast<Dims3&&>(
-            engine->getBindingDimensions(engine->getBindingIndex("preds")));
-    const int iH{i_dims.d[2]}, iW{i_dims.d[3]};
-    const int oH{o_dims.d[2]}, oW{o_dims.d[3]};
+    vector<int> i_dims = ss_trt.get_input_shape();
+    vector<int> o_dims = ss_trt.get_output_shape();
+
+    const int iH{i_dims[2]}, iW{i_dims[3]};
+    const int oH{o_dims[2]}, oW{o_dims[3]};
 
     // prepare image and resize
     vector<float> data; data.resize(iH * iW * 3);
@@ -127,7 +131,7 @@ void run_with_trt(vector<string> args) {
     read_data(args[2], &data[0], iH, iW, orgH, orgW);
 
     // call engine
-    vector<int> res = infer_with_engine(engine, data);
+    vector<int> res = ss_trt.inference(data);
 
     // generate colored out
     vector<vector<uint8_t>> color_map = get_color_map();
@@ -166,6 +170,11 @@ vector<vector<uint8_t>> get_color_map() {
 
 
 void test_speed(vector<string> args) {
-    TrtSharedEnginePtr engine = deserialize(args[1]);
-    test_fps_with_engine(engine);
+    int opt_bsize = 1;
+    if (args.size() > 2) opt_bsize = std::stoi(args[2]);
+
+    SemanticSegmentTrt ss_trt;
+    ss_trt.set_opt_batch_size(opt_bsize);
+    ss_trt.deserialize(args[1]);
+    ss_trt.test_speed_fps();
 }
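For orientation before the `trt_dep.cpp` changes: the engine is now built with a dynamic batch dimension (kMIN=1, kOPT=`opt_bsize`, kMAX=32), and the TensorRT 10 execution path sets shapes and tensor addresses explicitly instead of using binding indices. The rewritten `inference()` pins the batch to 1; pushing N images at once through the same engine would look roughly like this (hedged sketch, helper name hypothetical, device buffers assumed preallocated for batch N):

```cpp
// Sketch only: execute one batch of N inputs with the TensorRT 10 tensor API.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

bool run_batch_sketch(nvinfer1::IExecutionContext& ctx, cudaStream_t stream,
        const char* input_name, const char* output_name,
        void* d_input, void* d_logits, nvinfer1::Dims4 shape /* {N, C, H, W} */) {
    // The engine accepts any N within the optimization profile (1..32 here).
    if (!ctx.setInputShape(input_name, shape)) return false;
    ctx.setInputTensorAddress(input_name, d_input);    // device pointers
    ctx.setOutputTensorAddress(output_name, d_logits);
    return ctx.enqueueV3(stream);                      // async; sync the stream after
}
```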
diff --git a/tensorrt/trt_dep.cpp b/tensorrt/trt_dep.cpp
index 71f105c..65bf161 100644
--- a/tensorrt/trt_dep.cpp
+++ b/tensorrt/trt_dep.cpp
@@ -7,6 +7,7 @@
 #include <iostream>
 #include <fstream>
 #include <vector>
+#include <exception>
 
 #include "trt_dep.hpp"
 #include "batch_stream.hpp"
@@ -23,8 +24,9 @@ using nvinfer1::IBuilderConfig;
 using nvinfer1::IRuntime;
 using nvinfer1::IExecutionContext;
 using nvinfer1::ILogger;
-using nvinfer1::Dims3;
-using nvinfer1::Dims2;
+using nvinfer1::Dims;
+using nvinfer1::Dims4;
+using nvinfer1::OptProfileSelector;
 using Severity = nvinfer1::ILogger::Severity;
 
 using std::string;
@@ -40,114 +42,123 @@ using std::array;
 
 Logger gLogger;
 
-TrtSharedEnginePtr shared_engine_ptr(ICudaEngine* ptr) {
-    return TrtSharedEnginePtr(ptr, TrtDeleter());
+
+void CHECK(bool state, string msg) {
+    if (!state) {
+        cout << msg << endl;
+        std::terminate();
+    }
 }
 
 
-TrtSharedEnginePtr parse_to_engine(string onnx_pth,
+void SemanticSegmentTrt::parse_to_engine(string onnx_pth,
         string quant, string data_root, string data_file) {
-    unsigned int maxBatchSize{1};
-    long memory_limit = 1UL << 32; // 4G
 
     auto builder = TrtUnqPtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger));
-    if (!builder) {
-        cout << "create builder failed\n";
-        std::abort();
-    }
+    CHECK(static_cast<bool>(builder), "create builder failed");
 
-    const auto explicitBatch = 1U << static_cast<uint32_t>(
-            nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
     auto network = TrtUnqPtr<nvinfer1::INetworkDefinition>(
-            builder->createNetworkV2(explicitBatch));
-    if (!network) {
-        cout << "create network failed\n";
-        std::abort();
-    }
-
-    auto config = TrtUnqPtr<IBuilderConfig>(builder->createBuilderConfig());
-    if (!config) {
-        cout << "create builder config failed\n";
-        std::abort();
-    }
+            builder->createNetworkV2(0));
+    CHECK(static_cast<bool>(network), "create network failed");
 
     auto parser = TrtUnqPtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, gLogger));
-    if (!parser) {
-        cout << "create parser failed\n";
-        std::abort();
-    }
+    CHECK(static_cast<bool>(parser), "create parser failed");
 
     int verbosity = (int)nvinfer1::ILogger::Severity::kWARNING;
-    bool state = parser->parseFromFile(onnx_pth.c_str(), verbosity);
-    if (!state) {
cout << "parse model failed\n"; - std::abort(); + bool success = parser->parseFromFile(onnx_pth.c_str(), verbosity); + CHECK(success, "parse onnx file failed"); + + if (network->getNbInputs() != 1) { + cout << "expect model to have only one input, but this model has " + << network->getNbInputs() << endl; + std::terminate(); } + auto input = network->getInput(0); + auto output = network->getOutput(0); + input_name = input->getName(); + output_name = output->getName(); + + auto config = TrtUnqPtr(builder->createBuilderConfig()); + CHECK(static_cast(config), "create builder config failed"); - config->setMaxWorkspaceSize(memory_limit); - if ((quant == "fp16" or quant == "int8") && builder->platformHasFastFp16()) { - config->setFlag(nvinfer1::BuilderFlag::kFP16); // fp16 + config->setProfileStream(*stream); + + auto profile = builder->createOptimizationProfile(); + Dims in_dims = network->getInput(0)->getDimensions(); + int32_t C = in_dims.d[1], H = in_dims.d[2], W = in_dims.d[3]; + Dims dmin = Dims4{1, C, H, W}; + Dims dopt = Dims4{opt_bsize, C, H, W}; + Dims dmax = Dims4{32, C, H, W}; + profile->setDimensions(input->getName(), OptProfileSelector::kMIN, dmin); + profile->setDimensions(input->getName(), OptProfileSelector::kOPT, dopt); + profile->setDimensions(input->getName(), OptProfileSelector::kMAX, dmax); + config->addOptimizationProfile(profile); + + if (quant == "fp16") { // fp16 + if (builder->platformHasFastFp16() == false) { + cout << "fp16 is set, but platform does not support, so we ignore this\n"; + } else { + config->setFlag(nvinfer1::BuilderFlag::kFP16); + } + } + if (quant == "bf16") { // bf16 + config->setFlag(nvinfer1::BuilderFlag::kBF16); } + if (quant == "fp8") { // fp8 + config->setFlag(nvinfer1::BuilderFlag::kFP8); + } + std::unique_ptr calibrator; - if (quant == "int8" && builder->platformHasFastInt8()) { - config->setFlag(nvinfer1::BuilderFlag::kINT8); //int8 - int batchsize = 32; - int n_cal_batches = -1; - string cal_table_name = "calibrate_int8"; - string input_name = "input_image"; - - Dims indim = network->getInput(0)->getDimensions(); - BatchStream calibrationStream( - batchsize, n_cal_batches, indim, - data_root, data_file); - calibrator.reset(new Int8EntropyCalibrator2( - calibrationStream, 0, cal_table_name.c_str(), input_name.c_str())); - config->setInt8Calibrator(calibrator.get()); + if (quant == "int8") { // int8 + if (builder->platformHasFastInt8() == false) { + cout << "int8 is set, but platform does not support, so we ignore this\n"; + } else { + + int batchsize = 32; + int n_cal_batches = -1; + string cal_table_name = "calibrate_int8"; + + Dims indim = network->getInput(0)->getDimensions(); + BatchStream calibrationStream( + batchsize, n_cal_batches, indim, + data_root, data_file); + + config->setFlag(nvinfer1::BuilderFlag::kINT8); + + calibrator.reset(new Int8EntropyCalibrator2( + calibrationStream, 0, cal_table_name.c_str(), input_name.c_str(), false)); + config->setInt8Calibrator(calibrator.get()); + } } - auto output = network->getOutput(0); // output->setType(nvinfer1::DataType::kINT32); output->setType(nvinfer1::DataType::kFLOAT); - cout << " start to build \n"; - CudaStreamUnqPtr stream(new cudaStream_t); - if (cudaStreamCreate(stream.get())) { - cout << "create stream failed\n"; - std::abort(); - } - config->setProfileStream(*stream); + cout << "start to build \n"; auto plan = TrtUnqPtr(builder->buildSerializedNetwork(*network, *config)); - if (!plan) { - cout << "serialization failed\n"; - std::abort(); - } + CHECK(static_cast(plan), "build 
serialized engine failed"); - auto runtime = TrtUnqPtr(nvinfer1::createInferRuntime(gLogger)); - if (!plan) { - cout << "create runtime failed\n"; - std::abort(); - } + runtime.reset(nvinfer1::createInferRuntime(gLogger)); + CHECK(static_cast(runtime), "create runtime failed"); - TrtSharedEnginePtr engine = shared_engine_ptr( - runtime->deserializeCudaEngine(plan->data(), plan->size())); - if (!engine) { - cout << "create engine failed\n"; - std::abort(); - } + engine.reset(runtime->deserializeCudaEngine(plan->data(), plan->size())); + CHECK(static_cast(engine), "deserialize engine failed"); cout << "done build engine \n"; +} + - return engine; +void SemanticSegmentTrt::set_opt_batch_size(int bs) { + CHECK(bs > 0 and bs < 33, "batch size should be less than 32"); + opt_bsize = bs; } -void serialize(TrtSharedEnginePtr engine, string save_path) { +void SemanticSegmentTrt::serialize(string save_path) { auto trt_stream = TrtUnqPtr(engine->serialize()); - if (!trt_stream) { - cout << "serialize engine failed\n"; - std::abort(); - } + CHECK(static_cast(trt_stream), "serialize engine failed"); ofstream ofile(save_path, ios::out | ios::binary); ofile.write((const char*)trt_stream->data(), trt_stream->size()); @@ -156,13 +167,10 @@ void serialize(TrtSharedEnginePtr engine, string save_path) { } -TrtSharedEnginePtr deserialize(string serpth) { +void SemanticSegmentTrt::deserialize(string serpth) { ifstream ifile(serpth, ios::in | ios::binary); - if (!ifile) { - cout << "read serialized file failed\n"; - std::abort(); - } + CHECK(static_cast(ifile), "read serialized file failed"); ifile.seekg(0, ios::end); const int mdsize = ifile.tellg(); @@ -173,72 +181,59 @@ TrtSharedEnginePtr deserialize(string serpth) { ifile.close(); cout << "model size: " << mdsize << endl; - auto runtime = TrtUnqPtr(nvinfer1::createInferRuntime(gLogger)); - TrtSharedEnginePtr engine = shared_engine_ptr( - runtime->deserializeCudaEngine((void*)&buf[0], mdsize)); - return engine; + runtime.reset(nvinfer1::createInferRuntime(gLogger)); + engine.reset(runtime->deserializeCudaEngine((void*)&buf[0], mdsize)); + + input_name = engine->getIOTensorName(0); + output_name = engine->getIOTensorName(1); } -vector infer_with_engine(TrtSharedEnginePtr engine, vector& data) { - Dims3 out_dims = static_cast( - engine->getBindingDimensions(engine->getBindingIndex("preds"))); +vector SemanticSegmentTrt::inference(vector& data) { + Dims in_dims = engine->getTensorShape(input_name.c_str()); + Dims out_dims = engine->getTensorShape(output_name.c_str()); + + const int64_t batchsize{1}, H{out_dims.d[2]}, W{out_dims.d[3]}; + const int64_t n_classes{out_dims.d[1]}; + const int64_t in_size{static_cast(data.size())}; + const int64_t logits_size{batchsize * n_classes * H * W}; + const int64_t out_size{batchsize * H * W}; + + Dims4 in_shape(batchsize, in_dims.d[1], in_dims.d[2], in_dims.d[3]); - const int batchsize{1}, H{out_dims.d[2]}, W{out_dims.d[3]}; - const int n_classes{out_dims.d[1]}; - const int in_size{static_cast(data.size())}; - const int logits_size{batchsize * n_classes * H * W}; - const int out_size{batchsize * H * W}; vector buffs(3); vector res(out_size); - auto context = TrtUnqPtr(engine->createExecutionContext()); - if (!context) { - cout << "create execution context failed\n"; - std::abort(); - } - cudaError_t state; state = cudaMalloc(&buffs[0], in_size * sizeof(float)); - if (state) { - cout << "allocate memory failed\n"; - std::abort(); - } + CHECK(state == cudaSuccess, "allocate memory failed"); + state = cudaMalloc(&buffs[1], 
-    if (state) {
-        cout << "allocate memory failed\n";
-        std::abort();
-    }
+    CHECK(state == cudaSuccess, "allocate memory failed");
+
     state = cudaMalloc(&buffs[2], out_size * sizeof(int));
-    if (state) {
-        cout << "allocate memory failed\n";
-        std::abort();
-    }
-    CudaStreamUnqPtr stream(new cudaStream_t);
-    if (cudaStreamCreate(stream.get())) {
-        cout << "create stream failed\n";
-        std::abort();
-    }
+    CHECK(state == cudaSuccess, "allocate memory failed");
 
     state = cudaMemcpyAsync(
             buffs[0], &data[0], in_size * sizeof(float),
             cudaMemcpyHostToDevice, *stream);
-    if (state) {
-        cout << "transmit to device failed\n";
-        std::abort();
-    }
+    CHECK(state == cudaSuccess, "transmit to device failed");
 
-    context->enqueueV2(&buffs[0], *stream, nullptr);
-    // context->enqueue(1, &buffs[0], stream, nullptr);
+    auto context = TrtUnqPtr<IExecutionContext>(engine->createExecutionContext());
+    CHECK(static_cast<bool>(context), "create execution context failed");
+
+    // Dynamic shapes require setInputShape before enqueue
+    bool success = context->setInputShape(input_name.c_str(), in_shape);
+    CHECK(success, "set input shape failed");
+    context->setInputTensorAddress(input_name.c_str(), buffs[0]);
+    context->setOutputTensorAddress(output_name.c_str(), buffs[1]);
+    context->enqueueV3(*stream);
     argMaxFunc(buffs[1], buffs[2], batchsize, n_classes, H * W, stream.get());
 
     state = cudaMemcpyAsync(
             &res[0], buffs[2], out_size * sizeof(int),
             cudaMemcpyDeviceToHost, *stream);
-    if (state) {
-        cout << "transmit to host failed \n";
-        std::abort();
-    }
+    CHECK(state == cudaSuccess, "transmit back to host failed");
 
     cudaStreamSynchronize(*stream);
@@ -250,62 +245,67 @@ vector<int> infer_with_engine(TrtSharedEnginePtr engine, vector<float>& data) {
 }
 
 
-void test_fps_with_engine(TrtSharedEnginePtr engine) {
-    Dims3 in_dims = static_cast<Dims3&&>(
-            engine->getBindingDimensions(engine->getBindingIndex("input_image")));
-    Dims3 out_dims = static_cast<Dims3&&>(
-            engine->getBindingDimensions(engine->getBindingIndex("preds")));
+void SemanticSegmentTrt::test_speed_fps() {
+    Dims in_dims = engine->getTensorShape(input_name.c_str());
+    Dims out_dims = engine->getTensorShape(output_name.c_str());
 
-    const int batchsize{1};
-    const int oH{out_dims.d[2]}, oW{out_dims.d[3]};
-    const int n_classes{out_dims.d[1]};
-    const int iH{in_dims.d[2]}, iW{in_dims.d[3]};
-    const int in_size{batchsize * 3 * iH * iW};
-    const int logits_size{batchsize * n_classes * oH * oW};
-    const int out_size{batchsize * oH * oW};
+    const int batchsize{opt_bsize};
+    const int64_t oH{out_dims.d[2]}, oW{out_dims.d[3]};
+    const int64_t n_classes{out_dims.d[1]};
+    const int64_t iH{in_dims.d[2]}, iW{in_dims.d[3]};
+    const int64_t in_size{batchsize * 3 * iH * iW};
+    const int64_t logits_size{batchsize * n_classes * oH * oW};
+    const int64_t out_size{batchsize * oH * oW};
 
-    auto context = TrtUnqPtr<IExecutionContext>(engine->createExecutionContext());
-    if (!context) {
-        cout << "create execution context failed\n";
-        std::abort();
-    }
+    Dims4 in_shape(batchsize, in_dims.d[1], in_dims.d[2], in_dims.d[3]);
 
     vector<void*> buffs(3);
     cudaError_t state;
     state = cudaMalloc(&buffs[0], in_size * sizeof(float));
-    if (state) {
-        cout << "allocate memory failed\n";
-        std::abort();
-    }
+    CHECK(state == cudaSuccess, "allocate memory failed");
     state = cudaMalloc(&buffs[1], logits_size * sizeof(float));
-    if (state) {
-        cout << "allocate memory failed\n";
-        std::abort();
-    }
+    CHECK(state == cudaSuccess, "allocate memory failed");
     state = cudaMalloc(&buffs[2], out_size * sizeof(int));
-    if (state) {
-        cout << "allocate memory failed\n";
-        std::abort();
-    }
+    CHECK(state == cudaSuccess, "allocate memory failed");
+
+    auto context = TrtUnqPtr<IExecutionContext>(engine->createExecutionContext());
+    CHECK(static_cast<bool>(context), "create execution context failed");
+    bool success = context->setInputShape(input_name.c_str(), in_shape);
+    CHECK(success, "set input shape failed");
 
-    cout << "\ntest with cropsize of (" << iH << ", " << iW << ") ...\n";
+    cout << "\ntest with cropsize of (" << iH << ", " << iW << "), "
+        << "and batch size of " << batchsize << " ...\n";
 
     auto start = std::chrono::steady_clock::now();
-    const int n_loops{1000};
+    const int n_loops{2000};
     for (int i{0}; i < n_loops; ++i) {
-        // context->execute(1, &buffs[0]);
-        context->executeV2(&buffs[0]);
+        context->executeV2(buffs.data());
        argMaxFunc(buffs[1], buffs[2], batchsize, n_classes, oH * oW, nullptr);
     }
     auto end = std::chrono::steady_clock::now();
     double duration = std::chrono::duration<double, std::milli>(end - start).count();
     duration /= 1000.;
+    int n_frames = n_loops * batchsize;
     cout << "running " << n_loops << " times, use time: " << duration << "s" << endl;
-    cout << "fps is: " << static_cast<double>(n_loops) / duration << endl;
-
+    cout << "fps is: " << static_cast<double>(n_frames) / duration << endl;
 
     cudaFree(buffs[0]);
     cudaFree(buffs[1]);
     cudaFree(buffs[2]);
 }
+
+
+vector<int> SemanticSegmentTrt::get_input_shape() {
+
+    Dims i_dims = engine->getTensorShape(input_name.c_str());
+    vector<int> res(i_dims.d, i_dims.d + 4);
+    return res;
+}
+
+
+vector<int> SemanticSegmentTrt::get_output_shape() {
+
+    Dims o_dims = engine->getTensorShape(output_name.c_str());
+    vector<int> res(o_dims.d, o_dims.d + 4);
+    return res;
+}
diff --git a/tensorrt/trt_dep.hpp b/tensorrt/trt_dep.hpp
index 2b794dc..2923316 100644
--- a/tensorrt/trt_dep.hpp
+++ b/tensorrt/trt_dep.hpp
@@ -19,9 +19,13 @@ using std::endl;
 using nvinfer1::ICudaEngine;
 using nvinfer1::ILogger;
+using nvinfer1::IRuntime;
 using Severity = nvinfer1::ILogger::Severity;
 
 
+void CHECK(bool success, string msg);
+
+
 class Logger: public ILogger {
 public:
     void log(Severity severity, const char* msg) noexcept override {
@@ -53,12 +57,46 @@ using TrtSharedEnginePtr = std::shared_ptr<ICudaEngine>;
 
 extern Logger gLogger;
 
-TrtSharedEnginePtr shared_engine_ptr(ICudaEngine* ptr);
-TrtSharedEnginePtr parse_to_engine(string onnx_path, string quant,
+struct SemanticSegmentTrt {
+public:
+    TrtSharedEnginePtr engine;
+    CudaStreamUnqPtr stream;
+    TrtUnqPtr<IRuntime> runtime;
+
+    string input_name;
+    string output_name;
+    int opt_bsize{1};
+
+    SemanticSegmentTrt():
+        engine(nullptr), runtime(nullptr), stream(nullptr) {
+
+        stream.reset(new cudaStream_t);
+        auto fail = cudaStreamCreate(stream.get());
+        CHECK(!fail, "create stream failed");
+    }
+
+    ~SemanticSegmentTrt() {
+        engine.reset();
+        runtime.reset();
+        stream.reset();
+    }
+
+    void set_opt_batch_size(int bs);
+
+    void serialize(string save_path);
+
+    void deserialize(string serpth);
+
+    void parse_to_engine(string onnx_path, string quant,
         string data_root, string data_file);
-void serialize(TrtSharedEnginePtr engine, string save_path);
-TrtSharedEnginePtr deserialize(string serpth);
-vector<int> infer_with_engine(TrtSharedEnginePtr engine, vector<float>& data);
-void test_fps_with_engine(TrtSharedEnginePtr engine);
+
+    vector<int> inference(vector<float>& data);
+
+    void test_speed_fps();
+
+    vector<int> get_input_shape();
+    vector<int> get_output_shape();
+};
+
 
 #endif
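Taken together, the new `SemanticSegmentTrt` interface replaces the old free functions on both sides of serialization. A minimal end-to-end sketch of how a caller drives it, using only the interface declared in `trt_dep.hpp` above (paths and the fp16 choice are placeholders; image preprocessing via `read_img.cpp` is elided):

```cpp
// Sketch: compile once, then reload the plan and run single-image inference.
#include <vector>
#include "trt_dep.hpp"

std::vector<int> demo() {
    SemanticSegmentTrt trt;
    trt.set_opt_batch_size(4);  // kOPT of the 1..32 optimization profile
    trt.parse_to_engine("model.onnx", "fp16", "none", "none");  // no int8 data
    trt.serialize("model.trt");

    SemanticSegmentTrt runner;
    runner.deserialize("model.trt");
    std::vector<int> ishape = runner.get_input_shape();  // {N, C, H, W}
    std::vector<float> img(ishape[1] * ishape[2] * ishape[3]);
    // ... fill `img` with a normalized CHW image (see read_img.cpp) ...
    return runner.inference(img);  // per-pixel class ids, H*W ints
}
```

Note that `inference()` as written always executes batch 1 (it sets the input shape to {1, C, H, W}), regardless of the kOPT batch size chosen at build time; only `test_speed_fps()` exercises larger batches.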