-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minion Task to support automatic Segment Refresh #14300
base: master
Are you sure you want to change the base?
Conversation
43691e7
to
f485324
Compare
@vvivekiyer : the idea is quite interesting and the value add I see here is:
But for Upserts I think one of the biggest cost is recomputing the validDocId map, so for Upsert tables we won't see any specific benefits right? (outside of the ones which are applicable for Realtime tables too). |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14300 +/- ##
============================================
+ Coverage 61.75% 63.72% +1.97%
- Complexity 207 1566 +1359
============================================
Files 2436 2667 +231
Lines 133233 146351 +13118
Branches 20636 22414 +1778
============================================
+ Hits 82274 93264 +10990
- Misses 44911 46198 +1287
- Partials 6048 6889 +841
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Are you suggesting an increase in concurrency at the minion level or the server level? At the server level, it seems we would still issue a SegmentRefreshTask, which means the default concurrency would remain at 1. We can investigate performance improvements that might allow us to adjust the concurrency configuration. Overall, this appears to be a valuable feature to reduce index build time and associated costs for servers! However, we need to consider the trade-off between SegmentRefresh and SegmentReload costs. Ultimately, we would still issue a SegmentRefresh call to the servers, if I understand correctly. For upsert tables with snapshot enabled, we risk losing validDocIDSnapshot during downloads from deep store since deep store lacks snapshot copies. This could potentially increase refresh times for these tables, as we wouldn't be able to utilize the preload feature. |
Yes, that's right. Exploring possibilities here - if we couple segment refresh minion task to also do other things (like upsert compaction, etc), will that help? |
The benefits of including compaction in this task will vary from use-case to use-case depending on the number of invalid docIDs. |
...in/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskGenerator.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskGenerator.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskGenerator.java
Outdated
Show resolved
Hide resolved
// We set _taskStartTime before fetching the tableConfig. Task Generation relies on tableConfig/Schema updates | ||
// happening after the last processed time. So we explicity use the timestamp before fetching tableConfig as the | ||
// processedTime. | ||
_taskStartTime = System.currentTimeMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking for table divergence in the task generator, between segment metadata & table config, will avoid this state/make the detection stateless.
getSegmentsMetadata in TableMetadataReader, gives the segment metadata and we can compare the metadata with table config to determine which ones need Refresh, in the generator itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are pros and cons to either approach:
- With the current way - we pay the computation cost on the Minions (and the controllers/helix).
- If we use
getSegmentsMetadata
we will pay computation cost on the Pinot Servers.
I resolved to (1) to avoid overloading the servers. Do you see any compelling reason for going with approach (2) and issuing a call to Pinot Servers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 2), the cost should be minimal since the API will be called once every scheduled trigger. However we are shipping the metadata payload to the controller which can add to the controller heap (if you have 1000's of segments). This is one downside. The controller/task generator uses the metadata to check if a segment has diverged from table config.
The reason I suggested that is we will know exactly whether subtasks need to be triggered or not, instead of triggering subtasks for every table update even if its unrelated (if I understood the trigger logic).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this. The compute/memory on minion is cheaper and I'm inclined to doing it this way. Moreover, doing it at minion gives us the flexibility of also support compression type changes on segments (where we have to initialize the forward index reader to get the compression info)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concerns are, we are unnecessarily doing the following when refresh is not required (partition changes, transformer/filter changes etc.)
- Creating a large number of subtasks
- Going through BaseSingleSegmentConversionExecutor which modifies Zk metadata and even uploads the segment. Although looking at the code, it seems that we dont do either when segment file is not present in conversion result. What is expected from the executor when a segment need not be refreshed? Do we have to update Zk so that its not picked again.
@vrajat is working on the server side API to detect diverged segments. This will reduce the payload by not shipping the segment metadata/but only sending the segment names that require refresh. The call to the server will only happen during task scheduled trigger (order of minutes/hours) which in my opinion should not put load on the server. Until then we can use getSegmentMetadata call on the server?
We cannot get this on the server?
Moreover, doing it at minion gives us the flexibility of also support compression type changes on segments (where we have to initialize the forward index reader to get the compression info)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see recently we added an API on server / controller to see if reload is required or not: #12117
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the pass through comment. I've updated the description with the detailed reasoning for not using server APIs to detect this. Please take a look at the "Notes on Implementation" section. @swaminathanmanish I've also pinged on slack to quickly align on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with @swaminathanmanish offline. We'll go with these changes now to avoid the controller heap blowup by transporting all segment metadata. We can use the API in #14450 once it is merged.
...ain/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskExecutor.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskExecutor.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskExecutor.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskGenerator.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/pinot/plugin/minion/tasks/segmentrefresh/SegmentRefreshTaskExecutor.java
Outdated
Show resolved
Hide resolved
@@ -112,7 +112,7 @@ | |||
import org.apache.pinot.segment.local.utils.TableConfigUtils; | |||
import org.apache.pinot.spi.config.table.SegmentsValidationAndRetentionConfig; | |||
import org.apache.pinot.spi.config.table.TableConfig; | |||
import org.apache.pinot.spi.config.table.TableStats; | |||
import org.apache.pinot.spi.config.table.TableStatsHumanReadable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving this change (TableStats -> TableStatsHumanReadable) to a separate PR for cleaner rollbacks / cherry-picks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to do it in this PR because I'm making use of TableStats and there's a dependency.
...in/java/org/apache/pinot/plugin/minion/tasks/refreshsegment/RefreshSegmentTaskGenerator.java
Outdated
Show resolved
Hide resolved
// tableMTime > segmentZKMetadata.getCreationTime() || schemaMTime > segmentZKMetadata.getCreationTime(); | ||
|
||
boolean segmentProcessedBeforeUpdate = tableMTime > lastProcessedTime || schemaMTime > lastProcessedTime; | ||
return segmentProcessedBeforeUpdate; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also add a crc check to figure out if we need to trigger a refresh or not. This way we also ensure deepstore copy gets updated with latest indexes / schemas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get this. Are you suggesting we need to check the crc in the ZK metadata against the crc in deepstore? They are bound to always be the same right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necessarily not! I have seen this a lot of times in Upsert-compaction task as well: #13491.
This is some race-condition which we should solve fundamentally but I think for now we can let this task refresh the segment in deepstore anyways.
...ain/java/org/apache/pinot/plugin/minion/tasks/refreshsegment/RefreshSegmentTaskExecutor.java
Show resolved
Hide resolved
import static org.testng.Assert.*; | ||
|
||
|
||
public class RefreshSegmentMinionClusterIntegrationTest extends BaseClusterIntegrationTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test where refresh is not expected to happen and we validate that task is not running ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test.
c145418
to
2ee3b96
Compare
2ee3b96
to
9a7814c
Compare
Thanks for capturing our discussion in the description.
|
|
||
|
||
@TaskExecutorFactory | ||
public class SegmentRefreshTaskExecutorFactory implements PinotTaskExecutorFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you rename this as well to RefreshSegment...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually prefer "SegmentRefreshTask" over "RefreshSegmentTask". Whichever name you pick, please make sure it's used everywhere.
|
||
|
||
@EventObserverFactory | ||
public class SegmentRefreshTaskProgressObserverFactory extends BaseMinionProgressObserverFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you rename this as well to RefreshSegment...?
// We just need to update the ZK metadata with the last refresh time to avoid getting picked up again. As the CRC | ||
// check will match, this will only end up being a ZK update. | ||
return new SegmentConversionResult.Builder().setTableNameWithType(tableNameWithType) | ||
.setFile(indexDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you verify that the upload path is lightweight and updates only Zk path? (i.e checks only when there's crc mismatch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than a few minor comments, LGTM.
if (fieldSpecInSchema.isVirtualColumn()) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do virtual columns show up in the schema?
SegmentConversionResult segmentConversionResult) { | ||
return new SegmentZKMetadataCustomMapModifier(SegmentZKMetadataCustomMapModifier.ModifyMode.UPDATE, | ||
Collections.singletonMap(MinionConstants.RefreshSegmentTask.TASK_TYPE + MinionConstants.TASK_TIME_SUFFIX, | ||
String.valueOf(_taskStartTime))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put human readable time string? Something like 2024-11-09T03:21:59.989Z
makes debugging much easier.
|
||
|
||
@TaskExecutorFactory | ||
public class SegmentRefreshTaskExecutorFactory implements PinotTaskExecutorFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually prefer "SegmentRefreshTask" over "RefreshSegmentTask". Whichever name you pick, please make sure it's used everywhere.
Currently, when new columns are added or indexes are added/removed, the segment reloads happen on the server. There are a number of issues with this approach:
This PR creates a minion task to automatically refresh segments when there are index/column updates to table config/schema. It can support automatic refresh for the following operations:
Followup Work:
Notes on Implementation
The premise used to solve this was: Keep the deepstore segment in sync with table Config (this will automatically make sure that the servers have the updated segments). Please see #9360. Keeping deep store in sync becomes crucial for: Reducing server startup times when servers are replaced/migrated.
Task Generation Flow:
Task Execution Flow:
Relying on a server API call to indicate whether a segment needs to be refreshed was not preferred because:
Servers might indicate that a segment doesn’t need refresh (using a mechanism like Support API for checking if segments need to be reloaded for a table #12117) just because they were restarted. This will still leave the segments on deepstore outdated.
Server Preprocess currently supports very limited operations. As we add more capability like datatype changes/compression changes, relying on server Preprocess will give the wrong signal just because serverPreprocess doesn’t support the operation.
Using server APIs to get all segment Metadata to the controller for all tables every time the periodic task runs can be overkill.
Cons of this approach is that there will be minion tasks created for all segments for each table config update.
To overcome this problem, we can use a server side API that will return the list of segments to be refreshed. It is being developed in #14450. We can incorporate these changes in the Task Generation Flow once it is merged. (cc: @swaminathanmanish)