Stop dropping reads on the floor if more than one happens at once. (#…

…15338) Fixes #15304 The fix for #15304 is the change in the while loop condition in Engine::Run. Before that change, we would compare numReadHandled to the current count of allocated read handlers. But processing a read handler would deallocate it, so we would only end up processing half the read handlers that were outstanding on entry to Run (because after that numReadHandled would be larger than the remaining number allocated). The change in InteractionModelEngine::OnDone and the management of mRunningReadHandler are to handle a slightly more complicated situation I ran into while writing some tests for this code. If we have at least two subscriptions and some number of reads after them pending when Engine::Run is entered, when we would process the first read, it would be deallocated, mCurReadHandlerIdx would get reset to 0, we would then increment it by 1, and end up walking all but the first subscription again. Which means that our numReadHandled would increase to the point where the loop terminates before we had gotten to all the read handlers. If we had N subscriptions we would miss N-1 read handlers. Those could then get stuck in limbo indefinitely, until either a subscription heartbeat had to happen or someone else issued a read. The change in Engine::OnReportConfirm is to handle the case when we have more than CHIP_IM_MAX_REPORTS_IN_FLIGHT subscriptions that all need reporting, fire off the first CHIP_IM_MAX_REPORTS_IN_FLIGHT of them, and then never call ScheduleRun() after that, so all the other subscriptions get stuck. The issue with CHIP_IM_MAX_REPORTS_IN_FLIGHT was not being caught by the existing tests because those tests manually called Run() on the reporting engine (in a loop, in fact). That was needed because we could end up in a situation where DrainAndServiceIO() has processed all the pending messages, but we have a queued task (from ScheduleRun) that is not a message and will not get run, so Engine::Run was not getting called properly via the "normal" codepath in the test. This was fixed by removing all the manual Run() calls and making DrainAndServiceIO() do a better job of handling async things queued by message reception. TestReadAttributeTimeout had to be modified slightly because in the new setup we could no longer rely on DrainAndServiceIO() after we send the reads not triggering the reports and giving us a chance to tear down the session before the reports did get triggered. So instead of first expiring the client-to-server session, we expire the server-to-client one _before_ doing DrainAndServiceIO. This ensures that we never get replies to our reads, and then we can proceed to expire the client-to-server session, which gets treated as a timeout.
project-chip · Nov 20, 2023 · 4050107 · 4050107
1 parent 4616d19
commit 4050107
Show file tree

Hide file tree

Showing 5 changed files with 235 additions and 76 deletions.
diff --git a/src/app/InteractionModelEngine.cpp b/src/app/InteractionModelEngine.cpp
@@ -210,14 +210,14 @@ void InteractionModelEngine::OnDone(CommandHandler & apCommandObj)
 
 void InteractionModelEngine::OnDone(ReadHandler & apReadObj)
 {
-    mReadHandlers.ReleaseObject(&apReadObj);
-
     //
     // Deleting an item can shift down the contents of the underlying pool storage,
-    // rendering any tracker using positional indexes invalid. Let's reset it and
-    // have it start from index 0.
+    // rendering any tracker using positional indexes invalid. Let's reset it,
+    // based on which readHandler we are getting rid of.
     //
-    mReportingEngine.ResetReadHandlerTracker();
+    mReportingEngine.ResetReadHandlerTracker(&apReadObj);
+
+    mReadHandlers.ReleaseObject(&apReadObj);
 }
 
 CHIP_ERROR InteractionModelEngine::OnInvokeCommandRequest(Messaging::ExchangeContext * apExchangeContext,

diff --git a/src/app/reporting/Engine.cpp b/src/app/reporting/Engine.cpp
@@ -528,21 +528,29 @@ void Engine::Run()
 
     mRunScheduled = false;
 
-    while ((mNumReportsInFlight < CHIP_IM_MAX_REPORTS_IN_FLIGHT) && (numReadHandled < imEngine->mReadHandlers.Allocated()))
+    // We may be deallocating read handlers as we go.  Track how many we had
+    // initially, so we make sure to go through all of them.
+    size_t initialAllocated = imEngine->mReadHandlers.Allocated();
+    while ((mNumReportsInFlight < CHIP_IM_MAX_REPORTS_IN_FLIGHT) && (numReadHandled < initialAllocated))
     {
         ReadHandler * readHandler = imEngine->ActiveHandlerAt(mCurReadHandlerIdx % (uint32_t) imEngine->mReadHandlers.Allocated());
         VerifyOrDie(readHandler != nullptr);
 
         if (readHandler->IsReportable())
         {
-            CHIP_ERROR err = BuildAndSendSingleReportData(readHandler);
+            mRunningReadHandler = readHandler;
+            CHIP_ERROR err      = BuildAndSendSingleReportData(readHandler);
+            mRunningReadHandler = nullptr;
             if (err != CHIP_NO_ERROR)
             {
                 return;
             }
         }
 
         numReadHandled++;
+        // If readHandler removed itself from our list, we also decremented
+        // mCurReadHandlerIdx to account for that removal, so it's safe to
+        // increment here.
         mCurReadHandlerIdx++;
     }
 
@@ -678,6 +686,12 @@ void Engine::OnReportConfirm()
 {
     VerifyOrDie(mNumReportsInFlight > 0);
 
+    if (mNumReportsInFlight == CHIP_IM_MAX_REPORTS_IN_FLIGHT)
+    {
+        // We could have other things waiting to go now that this report is no
+        // longer in flight.
+        ScheduleRun();
+    }
     mNumReportsInFlight--;
     ChipLogDetail(DataManagement, "<RE> OnReportConfirm: NumReports = %" PRIu32, mNumReportsInFlight);
 }

diff --git a/src/app/reporting/Engine.h b/src/app/reporting/Engine.h
@@ -98,8 +98,25 @@ class Engine
 
     /*
      * Resets the tracker that tracks the currently serviced read handler.
-     */
-    void ResetReadHandlerTracker() { mCurReadHandlerIdx = 0; }
+     * apReadHandler can be non-null to indicate that the reset is due to a
+     * specific ReadHandler being deallocated.
+     */
+    void ResetReadHandlerTracker(ReadHandler * apReadHandlerBeingDeleted)
+    {
+        if (apReadHandlerBeingDeleted == mRunningReadHandler)
+        {
+            // Just decrement, so our increment after we finish running it will
+            // do the right thing.
+            --mCurReadHandlerIdx;
+        }
+        else
+        {
+            // No idea what to do here to make the indexing sane.  Just start at
+            // the beginning.  We need to do better here; see
+            // https://github.com/project-chip/connectedhomeip/issues/13809
+            mCurReadHandlerIdx = 0;
+        }
+    }
 
 private:
     friend class TestReportingEngine;
@@ -175,6 +192,11 @@ class Engine
      */
     uint32_t mCurReadHandlerIdx = 0;
 
+    /**
+     * The read handler we're calling BuildAndSendSingleReportData on right now.
+     */
+    ReadHandler * mRunningReadHandler = nullptr;
+
     /**
      *  mGlobalDirtySet is used to track the set of attribute/event paths marked dirty for reporting purposes.
      *