Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigation: Discover any AMA appeal states that won't move forward automatically #14307

Closed
2 tasks
lomky opened this issue May 15, 2020 · 6 comments
Closed
2 tasks
Assignees
Labels
Priority: High Escalations from Support, blocking issue/NO workaround, or "first in" priority for new work. Product: caseflow-queue Source: BVA Reported Team: Echo 🐬 Type: Investigation

Comments

@lomky
Copy link
Contributor

lomky commented May 15, 2020

Description

We recently had two appeals reported in an unexpected state where their Distribution Tasks were cancelled, but they had other trees open. We believe these were affected by a bug we corrected last year.

The goal of this ticket is to look into the state of current AMA appeals and discover any potential orphan path in our codebase currently, and if any are found.

Acceptance criteria

  • Any extant Appeals in a bad state are either fixed, or grouped into new issues to be fixed immediately following on this.
  • Any issues discovered in our current task code flow have issues filed to address them

Background/context/resources

Technical notes

@lomky lomky added Type: Investigation Product: caseflow-queue Team: Echo 🐬 Priority: High Escalations from Support, blocking issue/NO workaround, or "first in" priority for new work. Source: BVA Reported labels May 15, 2020
@hschallhorn
Copy link
Contributor

is this a delta ticket?

@yoomlam
Copy link
Contributor

yoomlam commented May 21, 2020

Timebox to 3.
Difference from StuckAppealsCheckerJob or OpenTasksWithClosedParentsCheckerJob?
How did appeals get here? What's their status? Where should they go?
May be relevant to RAMP appeals only, but should check if AMA appeals face similar problems.

@pkarman
Copy link
Contributor

pkarman commented Jun 1, 2020

One pattern is that we have appeals with closed root tasks but open child tasks.

See https://dsva.slack.com/archives/CJL810329/p1590527026173900 and AppealsWithClosedRootTaskOpenChildrenQuery.

Appeals with open child tasks should have open root tasks.

@yoomlam
Copy link
Contributor

yoomlam commented Sep 10, 2020

Approach

  • understand previous causes for stuck AMA appeals
    • examine existing tickets, jobs, and Sentry alerts -- done in this comment
  • query all appeals, eliminate appeals in a "good state" (see "Good states" section in next comment below). Then examine the rest of the appeals.
  • consider other possible causes by examining code (if time permits)

(Bullet items prefixed with "TODO" indicate a potential ticket to be created. TODO items that refer to creating a job imply a manual fix may be needed after investigation of alerts from the job.)

Related tickets

Some stuck-appeal scenarios are being addressed/fixed by these tickets:

Related jobs

These jobs send alerts to #appeals-job-alerts.

  • Job StuckAppealsChecker performs the following queries:
    • AppealsWithNoTasksOrAllTasksOnHoldQuery finds:
      • active appeals with open (assigned, in_progress, on_hold) tasks that are all on_hold
        • As long as a task is assigned to an active user, appeal should progress.
        • Task.where(status: :in_progress).group(:type).count returns a list of 42 task types.
        • on_hold tasks should have an active (assigned, in_progress) descendant task -- Slack convo
          • TODO: (Ticket Create job to check that on_hold tasks should have an active descendant task #15267) Create job to check that on_hold tasks should have an active descendant task; a more general version of OpenHearingTasksWithoutActiveDescendantsChecker like this (but with better performance):
            Task.on_hold.where.not(type: [RootTask.name, TrackVeteranTask.name])
              .find_in_batches(batch_size: 5000).each {|group| group.select { |task|
                descendants_excluding_self = task.descendants - [task]
                descendants_excluding_self.map(&:active?).exclude?(true)
            } }
            => nil
      • active appeals with 0 or 1 task
        • I assume the 1 task is a RootTask.
      • active appeals with 2 tasks and no DistributionTask
        • Reasoning behind this check? Kat: "Every AMA appeal should have a DistributionTask created during the completion of the intake process. I presume we're looking for 'at least two' to avoid an edge case where we've created the root but not yet the distribution (narrow window)"
      • dispatched appeals (completed BvaDispatchTask) with on_hold RootTask
        • TODO: Should we alert on other open tasks after an appeal is dispatched? Do we care about the progress of dispatched appeals (if so, how is progress measured)? Related ticket Discover and correct improperly undispatched appeals #14884.
          • Sentry alerts are already sent by OpenTasksWithClosedAtChecker
    • AppealsWithClosedRootTaskOpenChildrenQuery: appeals with closed RootTask and open children tasks, ignoring tasks without parents
  • Job PendingIncompleteAndUncancelledTaskTimersChecker finds pending and incomplete TaskTimers.
  • Job UntrackedLegacyAppealsChecker finds "legacy appeals charged to CASEFLOW in VACOLS with no active Caseflow tasks".
  • Other jobs reviewed but do not indicate a stuck appeal (that Echo needs to address):
    • OpenTasksWithClosedAtChecker: task is open but closed_at is set or parent is closed
    • StuckVirtualHearingsChecker: if stuck for too long, then Tango is notified for manual intervention

Related Sentry alerts

Other scenarios

  • Task assignee becomes inactive
    • UI allows another user to reassign task
    • Rely on users to know which tasks to reassign (i.e. all tasks assigned to inactive user)
      • TODO: (Ticket Create job to check for inactive users that are the task assignee or assigner #15269) Create job to check for inactive users that are the assignee or assigner? Some are manually fixed in Reassign Tasks for Inactive Accounts Part 2! #14483
        ius=User.inactive.pluck(:id);
        
        ts=Task.includes(:assigned_to).open.where(assigned_to_id: ius);
        ts.count => 13
        pp ts.group(:type).count
        => {"AttorneyRewriteTask"=>2,
           "AttorneyTask"=>3,
           "ReconsiderationMotionMailTask"=>1,
           "ReturnedUndeliverableCorrespondenceMailTask"=>1,
           "StatusInquiryMailTask"=>6}
        
        ts2=Task.includes(:assigned_by).open.where(assigned_by_id: ius);
        ts2.count => 87
        pp ts2.group(:type).count
        {"ExtensionColocatedTask"=>7,
         "FoiaColocatedTask"=>21,
         "FoiaTask"=>40,
         "IhpColocatedTask"=>2,
         "MissingRecordsColocatedTask"=>4,
         "OtherColocatedTask"=>8,
         "PrivacyActTask"=>2,
         "TranscriptionTask"=>1,
         "TranslationColocatedTask"=>1,
         "TranslationTask"=>1}
  • Removing all issues on an appeal causes appeal to be canceled
    • not considered stuck, if action was intentional.
    • TODO: review code for similar scenarios where Caseflow automatically cancels or closes an appeal or RootTask like these:
      app/workflows/ama_appeal_dispatch.rb:74:    dispatch_task.root_task.update!(status: Constants.TASK_STATUSES.completed)
      app/workflows/legacy_appeal_dispatch.rb:37:    @appeal.root_task.update!(status: Constants.TASK_STATUSES.completed)
    • Alec: seems like a worthwhile investigation.  That said what do we want to do with the information once we have it? Simply document or perhaps actionable depending on the findings? What are we considering “intentional action” when it comes to removing all issues on an appeal?  (don’t expect an actual answer here just thoughts that occurred)
    • Kat might have written a ticket on this Discover and correct improperly undispatched appeals #14884
  • If appeal is in a hearing docket and the hearing is withdrawn, then the appeal be cancelled.
  • TODO (may not be worth the effort since user should see 2 tasks in their queue and can cancel one of them to unstuck the appeal; job might be useful for detecting bugs): Job to check for duplicate tasks, i.e. tasks of the same type (and same parent task?) and assigned to the same user. Appeal will be stuck if the assignee closes only one of them. Motivated by "[Epic] Prevent duplicate TranslationTasks for an appeal" [Epic] Prevent duplicate TranslationTasks for an appeal #11176 and a recent user-caused duplication https://github.com/department-of-veterans-affairs/dsva-vacols/issues/145.

@yoomlam
Copy link
Contributor

yoomlam commented Sep 11, 2020

"Bad states"

"bad state" = "stuck" appeal that will not proceed without engineer intervention

Appeals

  • with insufficient number of tasks (see StuckAppealsChecker job in prior comment that checks for active appeals with 0, 1, or 2 tasks)
  • with open tasks that are all on_hold
  • with unexpected task tree structure
    • dispatched appeals with open tasks (i.e., RootTask)
    • closed RootTask with open children tasks
    • TODO: Job to check for closed parent with open children tasks. AKA: open child task should have open parents and ancestors. (Not sure if this could cause a stuck appeal, but it's a good data integrity check.) (Related data clean-up ticket Clean up open tasks with closed parents #13438 and Investigate UNKNOWN Appeal status Investigate UNKNOWN Appeal status #14255)
      • Already covered by OpenTasksWithClosedAtChecker
    • TODO: (Ticket Appeals with unusual task tree structures #15271) What are other unexpected task trees that could occur? Should we create a ticket where we can add appeals with unusual tree structures?
  • with no updates in N days (used to alert us to investigate other reasons for stuck appeals)
    • TODO: (Ticket Determine the max number of days an appeal should be without activity #15272) To determine a good value for N, get statistics for the length of time since last update (using the latest task updated_at or appeal updated_at) for active appeals.
      as=Appeal.active.first(10).map{|a| [(Time.now - a.updated_at).to_i / (24 * 60 * 60), a.id] }.select{|d,a| d>90}.sort_by(&:first);
      as.count => 16321
      as.max_by(&:first) => [353, 31568]  # max 353 days since appeal was updated
      
      as2=as.map{|d,a| [(Time.now - Appeal.find(a).tasks.max_by(&:updated_at).updated_at).to_i / (24 * 60 * 60), a]}.select{|d,a| d>90}.sort_by(&:first);
      as2.count => 14895   # about 1500 appeals had their task updated within last 90 days
      as2.max_by(&:first) => [470, 6448]  # max of 470 days since task was updated (among appeals above)
      • Alec: sort of waffling on this one tbh. I _feel _like this is a good metric to track, and although I don’t know immediately how we could use the info it would probably be good to have at least for reporting purposes

Tasks

  • with pending and incomplete TaskTimers -- see PendingIncompleteAndUncancelledTaskTimersChecker in prior comment
  • assigned to inactive or incorrect users
  • TODO: For org-tasks (those assigned to orgs) with a def self.default_assignee, get periodic statistics on tasks that aren't assigned to the default_assignee.
    • Tasks that define default_assignee: many *MailTasks, ColocatedTask, FoiaColocatedTask, MissingHearingTranscriptsColocatedTask, ScheduleHearingColocatedTask, and TranslationColocatedTask
      # Ignoring MailTasks for now
      tts=%w[ ColocatedTask FoiaColocatedTask MissingHearingTranscriptsColocatedTask ScheduleHearingColocatedTask TranslationColocatedTask]
      tts.map{|tt| [tt, tt.constantize.default_assignee.name]}
      => [["ColocatedTask", "VLJ Support Staff"],
       ["FoiaColocatedTask", "Privacy Team"],
       ["MissingHearingTranscriptsColocatedTask", "Transcription"],
       ["ScheduleHearingColocatedTask", "Hearings Management"],
       ["TranslationColocatedTask", "Translation"]]
      
      tts.map{|tt| [tt, tt.constantize.open.where.not(assigned_to_id: tt.constantize.default_assignee.id).count]}
      => [["ColocatedTask", 5153],
       ["FoiaColocatedTask", 0],
       ["MissingHearingTranscriptsColocatedTask", 0],
       ["ScheduleHearingColocatedTask", 3],
       ["TranslationColocatedTask", 0]]
      
      # Let's examine one of them
      tt=ScheduleHearingColocatedTask
      # what organizations do the assignees belong to?
      tt.open.where.not(assigned_to_id: tt.default_assignee.id).map(&:assigned_to).map{|u| u.organizations.pluck(:name)}
      => [["Hearings Management", "Transcription", "Hearing Admin"],
       ["Hearings Management", "Transcription", "Hearing Admin"],
       ["Hearings Management", "Transcription", "Hearing Admin", "Case Movement Team"]]
      # Good. They are all in the "Hearings Management" org (aka ScheduleHearingColocatedTask.default_assignee)

"Good states"

  • Task is assigned to an active user and assignee has actions to perform on task
    • TODO: Create job or add validation to check that an assignee (excluding BVA org and probably others) is active and has actions to perform on their task -- Task#available_actions(user).count > 0. Has this ever been a problem?
    • TODO: Job to check that assigned user is part of the proper team that can complete the task:
      LitigationSupport.singleton.users.include?(user)
    • Note: DistributionTask is not assigned to a user: SCM user could assign to any user
  • tasks assigned to orgs with active admins should progress
    • TODO: Job to check that all relevant orgs have an active admin
    • Some orgs don't have admins; members can assign tasks; e.g. VSOs
  • As long as a blocking task is assigned to an active user (or org?), the appeal should progress back to the task being blocked. Related ticket: Allow tasks to block dispatch Allow tasks to block dispatch #14056

TODO: Analyze code

  • verify in code when a child user-task completes, that the parent-task completes or is assigned (becomes active again)
  • investigate what happens when each task type is cancelled or closed
  • Based on MailTasks in prod, 3 MailTasks (i.e., AppealWithdrawalMailTask, PrivacyActRequestMailTask, ClearAndUnmistakeableErrorMailTask) are assigned to orgs only - Why do they not have child user tasks? What task actions are available and do they make sense?

@yoomlam
Copy link
Contributor

yoomlam commented Sep 17, 2020

Whew!

@yoomlam yoomlam closed this as completed Sep 17, 2020
va-bot pushed a commit that referenced this issue Jul 23, 2021
### Description
Add code comments to `AppealsWithNoTasksOrAllTasksOnHoldQuery` based on
#14307 (comment) and
https://github.com/department-of-veterans-affairs/caseflow/pull/15595/files#r522410495

### Acceptance Criteria
- [ ] Code compiles correctly

### Testing Plan
N/A
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: High Escalations from Support, blocking issue/NO workaround, or "first in" priority for new work. Product: caseflow-queue Source: BVA Reported Team: Echo 🐬 Type: Investigation
Projects
None yet
Development

No branches or pull requests

4 participants