Fix response preprocessing bug (#646)
* Fix empty response bug

* Fix unused variable

Fix test

Initialize logger to capture logs

Add unit test

Change to _ instead of removing

Check if args.model is not None

fix artifact path

Support Python 3.8 in GenAI-Perf (#643)

Add automation to run unit tests and check code coverage for GenAI-Perf against Python 3.10 (#640)

Changes to support Ensemble Top Level Response Caching (#560)

Support for fixed number of requests (#633)

* first pass. Hardcoded values

* Working for concurrency (hardcoded whenever count windows is used for now)

* working for req rate as well

* Add CLI. Add/fix unit tests

* Remove hack. Restore all normal functionality

* Refactor thread config into one class. Add more testing

* Rename arg to request-count

* Fix request rate bug

* Update info print

* fix corner case

* move fixme to a story tag

* add assert to avoid corner case

* rename variables

* self review #1

* copyright changes

* add doxygen to functions

* Don't allow sweeping over multiple concurrency or request rate with request-count

fix test (#637)

Support custom artifacts directory and improve default artifacts directory (#636)

* Add artifacts dir option and more descriptive profile export filename

* Clean up

* fix input data path

* Add tests

* create one to one plot dir for each profile run

* change the directory look

* add helper method

Extend genai perf plots to compare across multiple runs (#635)

* Modify PlotManager and plots classes

* Support plots for multiple runs -draft

* Fix default plot visualization

* Remove artifact

* Set default compare directory

* Support generating parquet files

* Remove annotations and fix heatmap

* Fix errors

* Fix pre-commit

* Fix CodeQL warning

* Remove unused comments

* remove x axis tick label for boxplot

* Add logging and label for heatmap subplots

* Allow users to adjust width and height

* fix grammar

---------

Co-authored-by: Hyunjae Woo <[email protected]>

Generate plot configurations for plot manager (#632)

* Introduce PlotConfig and PlotConfigParser class

* Port preprocessing steps and introduce ProfileRunData

* Create plot configs for default plots

* fix minor bug

* Fix comment

* Implement parse method in PlotConfigParser

* refactor

* fix test

* Add test

* Address feedback

* Handle custom endpoint

Add more metadata to profile export JSON file (#627)

* Add more metadata to profile export data

* Fix minor bug

* refactor

Add compare subcommand (#623)

* Move for better visibility

* Add compare subparser

* Add subcommand compare

* Fix test

* Add ticket

* add --files option and minor fix

* Fix tests

* Add unit tests

* Address feedback

* Fix minor error and add section header

Revert "Changes to support Ensemble Top Level Response Caching (#560) (#642)"

This reverts commit cc6a3b2.

Changes to support Ensemble Top Level Response Caching (#560) (#642)
lkomali authored and ganeshku1 committed May 11, 2024
1 parent ef114b2 commit a23289d
Showing 53 changed files with 2,028 additions and 593 deletions.
60 changes: 60 additions & 0 deletions .github/workflows/python-package-genai.yml
@@ -0,0 +1,60 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# This workflow will install Python dependencies, run tests and lint with a variety of Python versions for GenAI-Perf.
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package (GenAI-Perf)

on:
  pull_request:

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: ["ubuntu-22.04"]
        python-version: ["3.8", "3.10"]

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          cd src/c++/perf_analyzer/genai-perf/
          python -m pip install --upgrade pip
          python -m pip install -e .
          python -c "import genai_perf; print(genai_perf.__version__)"
      - name: Test with pytest
        run: |
          pip install pytest pytest-timeout pytest-cov psutil
          cd src/c++/perf_analyzer/genai-perf/tests
          pytest --doctest-modules --junitxml=junit/test-results.xml --cov=genai_perf --cov-report=xml --cov-report=html --ignore-glob=test_models
1 change: 1 addition & 0 deletions src/c++/perf_analyzer/CMakeLists.txt
@@ -112,6 +112,7 @@ set(
profile_data_exporter.h
periodic_concurrency_manager.h
periodic_concurrency_worker.h
+thread_config.h
)

add_executable(
2 changes: 2 additions & 0 deletions src/c++/perf_analyzer/client_backend/client_backend.h
@@ -138,6 +138,8 @@ enum BackendKind {
TRITON_C_API = 3,
OPENAI = 4
};
+std::string BackendKindToString(const BackendKind kind);

enum ProtocolType { HTTP = 0, GRPC = 1, UNKNOWN = 2 };
enum GrpcCompressionAlgorithm {
COMPRESS_NONE = 0,
49 changes: 49 additions & 0 deletions src/c++/perf_analyzer/command_line_parser.cc
@@ -137,6 +137,7 @@ CLParser::Usage(const std::string& msg)
"profiling>"
<< std::endl;
std::cerr << "\t--percentile <percentile>" << std::endl;
std::cerr << "\t--request-count <number of requests>" << std::endl;
std::cerr << "\tDEPRECATED OPTIONS" << std::endl;
std::cerr << "\t-t <number of concurrent requests>" << std::endl;
std::cerr << "\t-c <maximum concurrency>" << std::endl;
@@ -463,6 +464,14 @@ CLParser::Usage(const std::string& msg)
"that the average latency is used to determine stability",
18)
<< std::endl;
+std::cerr
+<< FormatMessage(
+" --request-count: Specifies a total number of requests to "
+"use for measurement. The default is 0, which means that there is "
+"no request count and the measurement will proceed using windows "
+"until stabilization is detected.",
+18)
+<< std::endl;
std::cerr << FormatMessage(
" --serial-sequences: Enables serial sequence mode "
"where a maximum of one request is outstanding at a time "
@@ -879,6 +888,7 @@ CLParser::ParseCommandLine(int argc, char** argv)
{"request-period", required_argument, 0, 59},
{"request-parameter", required_argument, 0, 60},
{"endpoint", required_argument, 0, 61},
{"request-count", required_argument, 0, 62},
{0, 0, 0, 0}};

// Parse commandline...
@@ -1614,6 +1624,13 @@ CLParser::ParseCommandLine(int argc, char** argv)
params_->endpoint = optarg;
break;
}
+case 62: {
+if (std::stoi(optarg) < 0) {
+Usage("Failed to parse --request-count. The value must be >= 0.");
+}
+params_->request_count = std::stoi(optarg);
+break;
+}
case 'v':
params_->extra_verbose = params_->verbose;
params_->verbose = true;
@@ -1705,6 +1722,13 @@ CLParser::ParseCommandLine(int argc, char** argv)
// Will be using user-provided time intervals, hence no control variable.
params_->search_mode = SearchMode::NONE;
}

+// When the request-count feature is enabled, override the measurement mode to
+// be count windows with a window size of the requested count
+if (params_->request_count) {
+params_->measurement_mode = MeasurementMode::COUNT_WINDOWS;
+params_->measurement_request_count = params_->request_count;
+}
}

void
@@ -1874,6 +1898,31 @@ CLParser::VerifyOptions()
"binary search mode.");
}

+if (params_->request_count != 0) {
+if (params_->using_concurrency_range) {
+if (params_->request_count < params_->concurrency_range.start) {
+Usage("request-count can not be less than concurrency");
+}
+if (params_->concurrency_range.start < params_->concurrency_range.end) {
+Usage(
+"request-count not supported with multiple concurrency values in "
+"one run");
+}
+}
+if (params_->using_request_rate_range) {
+if (params_->request_count <
+static_cast<int>(params_->request_rate_range[0])) {
+Usage("request-count can not be less than request-rate");
+}
+if (params_->request_rate_range[SEARCH_RANGE::kSTART] <
+params_->request_rate_range[SEARCH_RANGE::kEND]) {
+Usage(
+"request-count not supported with multiple request-rate values in "
+"one run");
+}
+}
+}

if (params_->kind == cb::TENSORFLOW_SERVING) {
if (params_->protocol != cb::ProtocolType::GRPC) {
Usage(
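Taken together, the hunks above define the new option's contract: --request-count is parsed as a non-negative integer, validated against the concurrency and request-rate settings, and, when non-zero, switches measurement to count windows sized to the requested total. A minimal standalone sketch of that flow (the Params struct and ApplyRequestCount helper are illustrative stand-ins, not the real CLParser API):

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

enum class MeasurementMode { TIME_WINDOWS, COUNT_WINDOWS };

// Illustrative stand-in for the relevant PerfAnalyzerParameters fields.
struct Params {
  std::size_t request_count = 0;  // 0 keeps the old run-until-stable behavior
  std::size_t concurrency_start = 1;
  std::size_t concurrency_end = 1;
  MeasurementMode measurement_mode = MeasurementMode::TIME_WINDOWS;
  std::size_t measurement_request_count = 50;
};

void ApplyRequestCount(Params& p, const std::string& optarg) {
  if (std::stoi(optarg) < 0) {
    throw std::invalid_argument("--request-count must be >= 0");
  }
  p.request_count = static_cast<std::size_t>(std::stoi(optarg));
  if (p.request_count == 0) {
    return;  // no fixed count: measure in windows until stabilization
  }
  // Mirrors VerifyOptions(): no sweeping over multiple concurrency values,
  // and the total must cover at least one full concurrency level.
  if (p.concurrency_start < p.concurrency_end) {
    throw std::invalid_argument(
        "request-count not supported with multiple concurrency values");
  }
  if (p.request_count < p.concurrency_start) {
    throw std::invalid_argument(
        "request-count can not be less than concurrency");
  }
  // Mirrors ParseCommandLine(): a fixed total implies count windows sized
  // to exactly that total.
  p.measurement_mode = MeasurementMode::COUNT_WINDOWS;
  p.measurement_request_count = p.request_count;
}

int main() {
  Params p;
  p.concurrency_start = p.concurrency_end = 4;
  ApplyRequestCount(p, "100");
  std::cout << "count windows of " << p.measurement_request_count
            << " requests\n";  // prints: count windows of 100 requests
}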
1 change: 1 addition & 0 deletions src/c++/perf_analyzer/command_line_parser.h
@@ -62,6 +62,7 @@ struct PerfAnalyzerParameters {
uint64_t latency_threshold_ms = NO_LIMIT;
double stability_threshold = 0.1;
size_t max_trials = 10;
+size_t request_count = 0;
bool zero_input = false;
size_t string_length = 128;
std::string string_data;
22 changes: 15 additions & 7 deletions src/c++/perf_analyzer/concurrency_manager.cc
@@ -1,4 +1,4 @@
-// Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -84,10 +84,10 @@ ConcurrencyManager::InitManagerFinalize()

cb::Error
ConcurrencyManager::ChangeConcurrencyLevel(
-const size_t concurrent_request_count)
+const size_t concurrent_request_count, const size_t request_count)
{
PauseSequenceWorkers();
-ReconfigThreads(concurrent_request_count);
+ReconfigThreads(concurrent_request_count, request_count);
ResumeSequenceWorkers();

std::cout << "Request concurrency: " << concurrent_request_count << std::endl;
@@ -109,7 +109,8 @@ ConcurrencyManager::PauseSequenceWorkers()
}

void
-ConcurrencyManager::ReconfigThreads(const size_t concurrent_request_count)
+ConcurrencyManager::ReconfigThreads(
+size_t concurrent_request_count, size_t request_count)
{
// Always prefer to create new threads if the maximum limit has not been met
//
@@ -121,8 +122,7 @@ ConcurrencyManager::ReconfigThreads(const size_t concurrent_request_count)
(threads_.size() < max_threads_)) {
// Launch new thread for inferencing
threads_stat_.emplace_back(new ThreadStat());
-threads_config_.emplace_back(
-new ConcurrencyWorker::ThreadConfig(threads_config_.size()));
+threads_config_.emplace_back(new ThreadConfig(threads_config_.size()));

workers_.push_back(
MakeWorker(threads_stat_.back(), threads_config_.back()));
@@ -138,13 +138,21 @@ ConcurrencyManager::ReconfigThreads(const size_t concurrent_request_count)
// and spread the remaining value
size_t avg_concurrency = concurrent_request_count / threads_.size();
size_t threads_add_one = concurrent_request_count % threads_.size();

+size_t avg_req_count = request_count / threads_.size();
+size_t req_count_add_one = request_count % threads_.size();

size_t seq_stat_index_offset = 0;
active_threads_ = 0;
for (size_t i = 0; i < threads_stat_.size(); i++) {
size_t concurrency = avg_concurrency + (i < threads_add_one ? 1 : 0);

threads_config_[i]->concurrency_ = concurrency;
threads_config_[i]->seq_stat_index_offset_ = seq_stat_index_offset;

+size_t thread_num_reqs = avg_req_count + (i < req_count_add_one ? 1 : 0);
+threads_config_[i]->num_requests_ = thread_num_reqs;

seq_stat_index_offset += concurrency;

if (concurrency) {
@@ -171,7 +179,7 @@ ConcurrencyManager::ResumeSequenceWorkers()
std::shared_ptr<IWorker>
ConcurrencyManager::MakeWorker(
std::shared_ptr<ThreadStat> thread_stat,
-std::shared_ptr<ConcurrencyWorker::ThreadConfig> thread_config)
+std::shared_ptr<ThreadConfig> thread_config)
{
uint32_t id = workers_.size();

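The request-count split above mirrors the existing concurrency split: every worker receives the average, and the remainder is handed out one per thread, so per-thread quotas differ by at most one. A self-contained sketch of the arithmetic (SplitRequests is an illustrative name, not a perf_analyzer function):

#include <cstddef>
#include <iostream>
#include <vector>

// Each thread gets the average; the first (request_count % num_threads)
// threads get one extra, matching ReconfigThreads() above.
std::vector<std::size_t> SplitRequests(std::size_t request_count,
                                       std::size_t num_threads) {
  const std::size_t avg = request_count / num_threads;
  const std::size_t add_one = request_count % num_threads;
  std::vector<std::size_t> per_thread(num_threads, avg);
  for (std::size_t i = 0; i < add_one; ++i) {
    ++per_thread[i];
  }
  return per_thread;
}

int main() {
  // 10 requests over 4 threads -> 3 3 2 2 (sums back to 10).
  for (std::size_t n : SplitRequests(10, 4)) {
    std::cout << n << ' ';
  }
  std::cout << '\n';
}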
14 changes: 8 additions & 6 deletions src/c++/perf_analyzer/concurrency_manager.h
@@ -1,4 +1,4 @@
-// Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -89,14 +89,16 @@ class ConcurrencyManager : public LoadManager {
/// Adjusts the number of concurrent requests to be the same as
/// 'concurrent_request_count' (by creating or pausing threads)
/// \param concurrent_request_count The number of concurrent requests.
+/// \param request_count The number of requests to generate. If 0, then
+/// there is no limit, and it will generate until told to stop.
/// \return cb::Error object indicating success or failure.
-cb::Error ChangeConcurrencyLevel(const size_t concurrent_request_count);
+cb::Error ChangeConcurrencyLevel(
+const size_t concurrent_request_count, const size_t request_count = 0);

protected:
// Makes a new worker
virtual std::shared_ptr<IWorker> MakeWorker(
-std::shared_ptr<ThreadStat>,
-std::shared_ptr<ConcurrencyWorker::ThreadConfig>);
+std::shared_ptr<ThreadStat>, std::shared_ptr<ThreadConfig>);

ConcurrencyManager(
const bool async, const bool streaming, const int32_t batch_size,
@@ -114,7 +116,7 @@ class ConcurrencyManager : public LoadManager {

size_t max_concurrency_;

-std::vector<std::shared_ptr<ConcurrencyWorker::ThreadConfig>> threads_config_;
+std::vector<std::shared_ptr<ThreadConfig>> threads_config_;

private:
void InitManagerFinalize() override;
@@ -126,7 +128,7 @@ class ConcurrencyManager : public LoadManager {
// Create new threads (if necessary), and then reconfigure all worker threads
// to handle the new concurrent request count
//
-void ReconfigThreads(size_t concurrent_request_count);
+void ReconfigThreads(size_t concurrent_request_count, size_t request_count);

// Restart all worker threads that were working on sequences
//
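Because request_count defaults to 0, existing call sites compile unchanged and keep the old unbounded behavior, while new callers can pin a total. A hypothetical mock of the contract (not the real ConcurrencyManager, which needs a full client-backend setup to construct):

#include <cstddef>
#include <iostream>

class MockConcurrencyManager {
 public:
  // request_count == 0 keeps the old behavior: run until told to stop.
  void ChangeConcurrencyLevel(std::size_t concurrent_request_count,
                              std::size_t request_count = 0) {
    if (request_count == 0) {
      std::cout << concurrent_request_count
                << " concurrent requests, unbounded\n";
    } else {
      std::cout << concurrent_request_count << " concurrent requests, "
                << request_count << " total\n";
    }
  }
};

int main() {
  MockConcurrencyManager mgr;
  mgr.ChangeConcurrencyLevel(4);       // legacy call sites still compile
  mgr.ChangeConcurrencyLevel(4, 100);  // new fixed-count mode
}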
37 changes: 7 additions & 30 deletions src/c++/perf_analyzer/concurrency_worker.h
@@ -1,4 +1,4 @@
-// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -29,6 +29,7 @@

#include "load_worker.h"
#include "sequence_manager.h"
#include "thread_config.h"

namespace triton { namespace perfanalyzer {

@@ -49,28 +50,6 @@ class NaggyMockConcurrencyWorker;
///
class ConcurrencyWorker : public LoadWorker {
public:
-struct ThreadConfig {
-ThreadConfig(
-size_t thread_id, size_t concurrency = 0,
-size_t seq_stat_index_offset = 0)
-: thread_id_(thread_id), concurrency_(concurrency),
-seq_stat_index_offset_(seq_stat_index_offset), is_paused_(false)
-{
-}
-
-// ID of corresponding worker thread
-size_t thread_id_;
-
-// The concurrency level that the worker should produce
-size_t concurrency_;
-
-// The starting sequence stat index for this worker
-size_t seq_stat_index_offset_;
-
-// Whether or not the thread is issuing new inference requests
-bool is_paused_;
-};

ConcurrencyWorker(
uint32_t id, std::shared_ptr<ThreadStat> thread_stat,
std::shared_ptr<ThreadConfig> thread_config,
@@ -85,11 +64,11 @@ class ConcurrencyWorker : public LoadWorker {
const std::shared_ptr<IInferDataManager>& infer_data_manager,
std::shared_ptr<SequenceManager> sequence_manager)
: LoadWorker(
-id, thread_stat, parser, data_loader, factory, on_sequence_model,
-async, streaming, batch_size, using_json_data, wake_signal,
-wake_mutex, execute, infer_data_manager, sequence_manager),
-thread_config_(thread_config), max_concurrency_(max_concurrency),
-active_threads_(active_threads)
+id, thread_stat, thread_config, parser, data_loader, factory,
+on_sequence_model, async, streaming, batch_size, using_json_data,
+wake_signal, wake_mutex, execute, infer_data_manager,
+sequence_manager),
+max_concurrency_(max_concurrency), active_threads_(active_threads)
{
}

@@ -109,8 +88,6 @@ class ConcurrencyWorker : public LoadWorker {
// threads?
size_t& active_threads_;

-std::shared_ptr<ThreadConfig> thread_config_;

// Handle the case where execute_ is false
void HandleExecuteOff();

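The nested struct is not deleted outright; it is hoisted into the new shared thread_config.h (added to CMakeLists.txt above) so that concurrency and request-rate workers can share one configuration type. A sketch of its likely shape, reconstructed from the removed definition plus the num_requests_ field that ReconfigThreads() now sets; this is inferred from the diff, not copied from the new file:

// Inferred sketch of thread_config.h; the field set is reconstructed from
// the removed nested struct and the new num_requests_ usage in this commit.
#include <cstddef>

namespace triton { namespace perfanalyzer {

struct ThreadConfig {
  explicit ThreadConfig(std::size_t thread_id) : thread_id_(thread_id) {}

  // ID of corresponding worker thread
  std::size_t thread_id_{0};

  // The concurrency level that the worker should produce
  std::size_t concurrency_{0};

  // The starting sequence stat index for this worker
  std::size_t seq_stat_index_offset_{0};

  // Whether or not the thread is issuing new inference requests
  bool is_paused_{false};

  // How many requests this worker should issue; 0 means unbounded
  std::size_t num_requests_{0};
};

}}  // namespace triton::perfanalyzer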
6 changes: 3 additions & 3 deletions src/c++/perf_analyzer/custom_load_manager.cc
@@ -1,4 +1,4 @@
-// Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -76,10 +76,10 @@ CustomLoadManager::CustomLoadManager(
}

cb::Error
-CustomLoadManager::InitCustomIntervals()
+CustomLoadManager::InitCustomIntervals(const size_t request_count)
{
PauseWorkers();
-ConfigureThreads();
+ConfigureThreads(request_count);
auto status = GenerateSchedule();
ResumeWorkers();
return status;
6 changes: 4 additions & 2 deletions src/c++/perf_analyzer/custom_load_manager.h
@@ -1,4 +1,4 @@
-// Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -88,8 +88,10 @@ class CustomLoadManager : public RequestRateManager {

/// Initializes the load manager with the provided file containing request
/// intervals
+/// \param request_count The number of requests to generate. If 0, then
+/// there is no limit, and it will generate until told to stop.
/// \return cb::Error object indicating success or failure.
-cb::Error InitCustomIntervals();
+cb::Error InitCustomIntervals(const size_t request_count);

/// Computes the request rate from the time interval file. Fails with an error
/// if the file is not present or is empty.
[Diff truncated: the remaining changed files are not shown here.]
