
Fix compaction tasks reports getting overwritten #15981


adithyachakilam commented Feb 27, 2024

Description

A single compaction task can be split into multiple index tasks based on the interval given in the spec. In such cases, all the index tasks run with the same ID, so each one overwrites the task completion report written by the one before it. This PR skips writing the report in each parallel index task and instead writes all the reports once the compaction task completes.
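A minimal sketch of the failure mode and the fix, in plain Java. It models the reports file as an in-memory map; the class and method names (`ReportOverwriteDemo`, `writePerSubTask`, `writeOnceAtCompletion`) are hypothetical and are not Druid's `CompactionTask` or `TaskReportFileWriter` API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: the reports file is a map of report key -> report payload.
// None of these names come from Druid; they only illustrate the idea.
class ReportOverwriteDemo {

    // Before: each index sub-task shares the compaction task's ID and writes the
    // same report key, so every write replaces the previous one.
    static Map<String, String> writePerSubTask(List<String> subTaskReports) {
        Map<String, String> reportFile = new LinkedHashMap<>();
        for (String report : subTaskReports) {
            reportFile.put("ingestionStatsAndErrors", report); // last write wins
        }
        return reportFile;
    }

    // After: sub-tasks skip writing; the parent collects every report and
    // writes them once on completion, each under its own indexed key.
    static Map<String, String> writeOnceAtCompletion(List<String> subTaskReports) {
        Map<String, String> reportFile = new LinkedHashMap<>();
        for (int i = 0; i < subTaskReports.size(); i++) {
            reportFile.put("ingestionStatsAndErrors_" + i, subTaskReports.get(i));
        }
        return reportFile;
    }
}
```

With three sub-task reports, the first method returns a single entry holding only the last report, while the second returns three entries keyed `ingestionStatsAndErrors_0` to `ingestionStatsAndErrors_2`.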

With this change, instead of overwriting the reports file, we write one entry per index task, and the file looks something like this:

{
  "ingestionStatsAndErrors_2": {
    "type": "ingestionStatsAndErrors",
    "taskId": "compact_test_klkggepm_2024-02-28T17:27:17.327Z",
    "payload": {
      "ingestionState": "COMPLETED",
      "unparseableEvents": {
        "buildSegments": []
      },
      "rowStats": {
        "totals": {
          "buildSegments": {
            "processed": 3,
            "processedBytes": 1500,
            "processedWithError": 0,
            "thrownAway": 0,
            "unparseable": 0
          }
        }
      },
      "errorMsg": null,
      "segmentAvailabilityConfirmed": false,
      "segmentAvailabilityWaitTimeMs": 0,
      "recordsProcessed": {}
    }
  },
  "ingestionStatsAndErrors_1": {
    "type": "ingestionStatsAndErrors",
    "taskId": "compact_test_klkggepm_2024-02-28T17:27:17.327Z",
    "payload": {
      "ingestionState": "COMPLETED",
      "unparseableEvents": {
        "buildSegments": []
      },
      "rowStats": {
        "totals": {
          "buildSegments": {
            "processed": 3,
            "processedBytes": 1500,
            "processedWithError": 0,
            "thrownAway": 0,
            "unparseable": 0
          }
        }
      },
      "errorMsg": null,
      "segmentAvailabilityConfirmed": false,
      "segmentAvailabilityWaitTimeMs": 0,
      "recordsProcessed": {}
    }
  },
  "ingestionStatsAndErrors_0": {
    "type": "ingestionStatsAndErrors",
    "taskId": "compact_test_klkggepm_2024-02-28T17:27:17.327Z",
    "payload": {
      "ingestionState": "COMPLETED",
      "unparseableEvents": {
        "buildSegments": []
      },
      "rowStats": {
        "totals": {
          "buildSegments": {
            "processed": 3,
            "processedBytes": 1500,
            "processedWithError": 0,
            "thrownAway": 0,
            "unparseable": 0
          }
        }
      },
      "errorMsg": null,
      "segmentAvailabilityConfirmed": false,
      "segmentAvailabilityWaitTimeMs": 0,
      "recordsProcessed": {}
    }
  }
}
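Because the combined report now holds one entry per sub-task, a consumer that wants overall row counts has to aggregate across entries. A minimal sketch, assuming each entry has already been deserialized into a hypothetical `BuildSegmentsTotals` record (a real client would first parse the JSON with a library such as Jackson); `CompactionReportTotals` is likewise a hypothetical name.

```java
import java.util.Map;

// Hypothetical, trimmed-down view of rowStats.totals.buildSegments in one entry.
record BuildSegmentsTotals(long processed, long processedBytes, long processedWithError,
                           long thrownAway, long unparseable) {}

class CompactionReportTotals {
    static final String PREFIX = "ingestionStatsAndErrors_";

    // Sums buildSegments totals over every indexed ingestionStatsAndErrors_N entry,
    // ignoring any other keys that might appear in the report.
    static BuildSegmentsTotals sum(Map<String, BuildSegmentsTotals> report) {
        long processed = 0, bytes = 0, withError = 0, thrownAway = 0, unparseable = 0;
        for (Map.Entry<String, BuildSegmentsTotals> e : report.entrySet()) {
            if (!e.getKey().startsWith(PREFIX)) {
                continue;
            }
            BuildSegmentsTotals t = e.getValue();
            processed += t.processed();
            bytes += t.processedBytes();
            withError += t.processedWithError();
            thrownAway += t.thrownAway();
            unparseable += t.unparseable();
        }
        return new BuildSegmentsTotals(processed, bytes, withError, thrownAway, unparseable);
    }
}
```

For the example above, summing the three entries gives 9 processed rows and 4500 processed bytes.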

Key changed/added classes in this PR
  • CompactionTask
  • ParallelIndexSupervisorTask

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

adithyachakilam force-pushed the adithyachakilam/fix-compaction-task-reports-getting-overwritten branch from ed286b1 to f9a736a on February 27, 2024 19:05

georgew5656 left a comment

This looks good to me. IMO we don't need the task ID in the keys of the reports, though.

I think ingestionStatsAndErrors_0, ingestionStatsAndErrors_1 (similar to what is done by query_controller tasks), or maybe the interval being compacted, would be better.

adithyachakilam commented

@georgew5656 Modified the keys to ingestionStatsAndErrors_0, ingestionStatsAndErrors_1.

arunramani commented

@cryptoe / @kfaraz this PR changes the structure of the task report for compaction. Are task reports considered part of a Druid API contract? In other words, are "breaking" changes okay for this?

gianm commented Mar 1, 2024


Is the current format documented? If so, we should consider options that preserve compatibility with previous documentation.

If not, changing the format is fair game. Although, if you are depending on the specific output format, you might want to add documentation, as otherwise it might be changed later in a way you don't expect.

suneet-s commented Mar 1, 2024

I tested this change a bit and found that if you run a compact task with no additional subtasks, the report is an empty JSON object. Before this change, the report contained some details.

I think we should fix it so that the task report details are preserved for a compact task with no sub-tasks, even if we change the format of the report.

+1 for Gian's suggestion of documenting the report so that other users don't break it once we've decided on a particular format. I think it would also be a good idea to add a test of some sort that validates the format of the report. There are a few integration tests for compaction; perhaps one of those would be a good place to add validation for the format of the task report.
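A minimal sketch of the kind of format check suggested here, written as a plain Java helper rather than against Druid's integration-test framework. It assumes the report has already been parsed into nested maps and that keys follow the `ingestionStatsAndErrors_<index>` pattern from the example report in this PR; `CompactionReportFormatCheck` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class CompactionReportFormatCheck {
    // Assumed key format, taken from the example report in this PR.
    private static final Pattern KEY = Pattern.compile("ingestionStatsAndErrors_\\d+");

    // Returns a list of problems; an empty list means the report looks well-formed.
    static List<String> validate(Map<String, Map<String, Object>> report) {
        List<String> problems = new ArrayList<>();
        if (report.isEmpty()) {
            // Catches an empty {} report, as seen for a compact task with no sub-tasks.
            problems.add("report is empty");
        }
        for (Map.Entry<String, Map<String, Object>> e : report.entrySet()) {
            String key = e.getKey();
            Map<String, Object> entry = e.getValue();
            if (!KEY.matcher(key).matches()) {
                problems.add("unexpected key: " + key);
            }
            if (!"ingestionStatsAndErrors".equals(entry.get("type"))) {
                problems.add("unexpected type for " + key + ": " + entry.get("type"));
            }
            if (!(entry.get("payload") instanceof Map)) {
                problems.add("missing payload object for " + key);
            }
        }
        return problems;
    }
}
```

Returning a list of problems rather than failing on the first one makes a test failure report everything that drifted from the documented format at once.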

Resolved review thread on docs/ingestion/tasks.md (outdated)
georgew5656 self-requested a review March 1, 2024 20:52

suneet-s left a comment


+1 from me on the approach. Thanks for the integration test and docs @adithyachakilam!

georgew5656 merged commit ec52f68 into apache:master Mar 4, 2024
83 checks passed

kfaraz left a comment

Sorry, @adithyachakilam, I couldn't get to reviewing this PR sooner.

I have left some comments for better code readability. There is also a minor concern about holding too many reports in memory and causing an OOM.
Since this PR has already been merged, the comments can be addressed in a follow-up PR.

@@ -499,6 +502,7 @@ public TaskStatus runTask(TaskToolbox toolbox) throws Exception
log.info("Generated [%d] compaction task specs", totalNumSpecs);

int failCnt = 0;
Map<String, TaskReport> completionReports = new HashMap<>();

If the compaction is being run on several intervals (not very likely, but still possible), could holding all the task reports in memory cause an OOM? Currently, most task reports contain only ingestionStatsAndErrors, but they may contain other reports in the future.

In the future, we should consider writing out the sub-reports in a streaming fashion, along with the required changes to the TaskReportFileWriter API.
For now, we should add a guardrail here so that we don't try to hold too many reports in memory and fail with an OOM.
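One possible shape for such a guardrail, as a sketch only: a hypothetical `BoundedReportCollector` that caps how many sub-task reports are kept in memory and counts the ones it drops. The class, the cap value, and the choice to drop rather than fail are assumptions, not taken from Druid or from the follow-up PR.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical guardrail: keeps at most maxReports sub-task reports in memory.
class BoundedReportCollector<R> {
    private final int maxReports;
    private final Map<String, R> reports = new LinkedHashMap<>();
    private int dropped = 0;

    BoundedReportCollector(int maxReports) {
        if (maxReports <= 0) {
            throw new IllegalArgumentException("maxReports must be positive");
        }
        this.maxReports = maxReports;
    }

    // Stores the report and returns true, or drops it and returns false once the
    // cap is reached (replacing the report for an existing key is always allowed).
    boolean add(String key, R report) {
        if (reports.size() >= maxReports && !reports.containsKey(key)) {
            dropped++;
            return false;
        }
        reports.put(key, report);
        return true;
    }

    // Read-only view of what would be written to the reports file.
    Map<String, R> reports() {
        return Collections.unmodifiableMap(reports);
    }

    int droppedCount() {
        return dropped;
    }
}
```

Dropping keeps the compaction itself from failing because of its reports; a caller that cannot accept a partial report could instead throw as soon as `add` returns false.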

adithyachakilam replied

What do you think is a good size to hold, then?

adithyachakilam commented

@kfaraz Trying to address them here: #16042

georgew5656 pushed a commit that referenced this pull request Mar 6, 2024
…6042)

* initial commit

* comments

* typo

* comments

* comments

* remove var

* initialize global var early

* remove new line

* small test fix

* same fix another test
adarshsanjeev added this to the 30.0.0 milestone May 6, 2024