Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, API: Support scanning from refs #5364

Merged
merged 2 commits into from
Oct 3, 2022

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Jul 27, 2022

This change adds support for scanning from refs, where the scan can either be a time travel from a tag or a time travel from the tip of a branch.

@amogh-jahagirdar amogh-jahagirdar force-pushed the scan-from-branch branch 3 times, most recently from 94fe662 to 78f7071 Compare July 27, 2022 03:48
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review August 7, 2022 17:39
@amogh-jahagirdar amogh-jahagirdar force-pushed the scan-from-branch branch 2 times, most recently from 920b5d0 to f3df820 Compare August 23, 2022 16:40
@amogh-jahagirdar amogh-jahagirdar force-pushed the scan-from-branch branch 3 times, most recently from 9dfc330 to dd0062d Compare August 26, 2022 01:02
@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Aug 26, 2022

Discussed offline with @rdblue , I got an understanding of how history entries are different than the snapshot ancestor lineage, posting the information here so folks referring to the PR have context:

"History is the state of main at any given time. That can be changed, much like master in git can be changed.
Ancestors can't be changed."

The current main can change when rollback is performed. The snapshot which gets resolved for a given time T should remain constant regardless of what is the current main ancestor lineage. History entries capture all the changes on the main table state and are what enable this ability to time travel to a snapshot which are outside the current table ancestors which is ultimately desired behavior. It also doesn't make sense to treat main as a separate case, and have a different semantic for non-main branches when it comes to time travel.

Maybe we could maintain separate history metadata for branches, but most likely it would need to be opt-in by users for specific branches and would need to not add too much metadata to manage.

So for now, it makes sense just to do time travel to a branch tip or time travel to a given tag. Updated the PR @rdblue @namrathamyske @jackye1995

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. However I am a bit concerned about the fact that we cannot do time travel in branches, I see that as a pretty important feature for many use cases. I agree with the argument of the difference of tracing ancestor vs history entry, but in that case it seems like we need to update the spec to accommodate that change and record history entries in a way that allows time travel by branch.

@amogh-jahagirdar I think after this PR we should raise another PR against spec to discuss what is the best way to solve the time travel by branch use case.

@@ -39,13 +39,25 @@ public interface TableScan extends Scan<TableScan, FileScanTask, CombinedScanTas
*/
TableScan useSnapshot(long snapshotId);

/**
* Create a new {@link TableScan} from this scan's configuration that will use the given
* reference.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use the snapshot ID of the given reference?

*
* @param timestampMillis a timestamp in milliseconds.
* @return a new scan based on this with the current snapshot at the given time
* @throws IllegalArgumentException if the snapshot cannot be found
* @throws IllegalArgumentException if the snapshot cannot be found or time travel is attempted on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the doc change here is no longer needed?

@@ -85,7 +85,7 @@ public TableScan appendsAfter(long fromSnapshotId) {
@Override
public TableScan useSnapshot(long scanSnapshotId) {
Preconditions.checkArgument(
snapshotId() == null, "Cannot override snapshot, already set to id=%s", snapshotId());
snapshotId() == null, "Cannot override snapshot, already set snapshot id=%s", snapshotId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "set to snapshot id=%s"?

@@ -46,6 +46,14 @@ public TableScan asOfTime(long timestampMillis) {
timestampMillis, context().fromSnapshotId(), context().toSnapshotId()));
}

@Override
public TableScan useRef(String ref) {
throw new UnsupportedOperationException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine for now, but we may have a case where we want incremental to be able to specify a branch. It doesn't matter currently because the snapshot IDs are always explicit.

@rdblue
Copy link
Contributor

rdblue commented Oct 2, 2022

Looks great. Thanks, @amogh-jahagirdar! I'll merge when tests are passing. I had to fix a conflict in revapi.

@rdblue rdblue merged commit 8b8a103 into apache:master Oct 3, 2022
@amogh-jahagirdar
Copy link
Contributor Author

Thanks @rdblue @yzhaoqin @namrathamyske @hililiwei for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants