Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use lastEvaluatedKey? #31

Open
patriknw opened this issue Jun 5, 2024 · 3 comments
Open

Use lastEvaluatedKey? #31

patriknw opened this issue Jun 5, 2024 · 3 comments

Comments

@patriknw
Copy link
Member

patriknw commented Jun 5, 2024

We could look into using response.lastEvaluatedKey and use that as exclusiveStartKey in query, instead of the timestamp for subsequent queries. Not sure how that works with GSI where the sort key isn't unique (same timestamp). If DynamoDB can keep track of the exact offset and not emit duplicates would not need the seen Map and that filter. Well, we still need it for the first query because we want the external offset to be TimestampOffset and that can include seen Map.

There is a corresponding FIXME for this.

@pvlugter
Copy link
Contributor

pvlugter commented Aug 15, 2024

Had a closer look at this, for a different reason. The paginated queries already use last evaluated key, so wanted to make sure that it wouldn't skip any events if it was the same timestamp at a page boundary. The last evaluated key for secondary indexes includes both the primary key from the index and the primary key from the base table. While it doesn't use the base table sorting, it looks like it sorts based on some hash of the table primary key — so it's a deterministic order, and will be able to do an exclusive start even if the index primary key is the same. So looks like we could use this in place of starting inclusive of the last timestamp and deduplicating.

@patriknw
Copy link
Member Author

Good to know. Is it worth changing? We kind of have to handle two ways if we use lastEvaluatedKey as mentioned above:

we still need it for the first query because we want the external offset to be TimestampOffset and that can include seen Map.

Maybe we just keep it as a nice to have future improvement? The cost benefit would be loading one less item in each such running query?

@pvlugter
Copy link
Contributor

Maybe worthwhile, to simplify the implementation side. Could always be used in place of full seen map — if it becomes an offset that includes the full key (index and base table primary key) and this is stored. Could be encoded in TimestampOffset — would be a single entry in the seen map, with the pid and seq nr for exclusive start key.

A similar approach could also be used for r2dbc. We already extended the queries to start from seq nr for the initial timestamp (for the 1000 events issue). That was also looking to be schema compatible. If we updated the sorted index to include pid as well, then could do an exclusive start query there as well.

Maybe points to some abstraction leakage, if the offset can't be specific to the implementation. Also with needing to add TimestampOffsetBySlice in core library. But we could use the current TimestampOffset, with only a single entry in the seen map for exclusive start.

Makes sense as a possible future improvement. Can try it out later. And could also try with r2dbc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants