[DISCUSSION]: How to uniquely identify a row in FOCUS file? #639
Comments
Hi @raghibomar786 -- thanks for adding an issue to the backlog! We have previously discussed adding an identity column, but it has not been prioritized, per the limitations described in the link you provided:
Given these limitations, could you please define a use case for this column as it is currently defined? That would really help us scope and prioritize this request. Thanks!
Action Items from the TF-1 call on November 5:
I achieve something like this in my data by hashing the records (technically a subset of the fields).
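For readers following along, here is a minimal sketch of what hashing a subset of fields could look like. The field list in `KEY_FIELDS` is an assumption chosen for illustration (the comment above does not say which fields are used), and `row_hash` is a hypothetical helper name. Two rows that agree on every key field get the same hash, so this only yields a unique identifier if the chosen fields really distinguish every row.

```python
import hashlib

# Assumed key fields, for illustration only; pick whatever subset
# actually distinguishes rows in your own data.
KEY_FIELDS = (
    "BillingAccountId",
    "ChargePeriodStart",
    "ChargePeriodEnd",
    "ResourceId",
    "SkuId",
    "ChargeCategory",
)


def row_hash(row: dict, key_fields=KEY_FIELDS) -> str:
    """Return a SHA-256 hex digest computed over the selected fields of a row."""
    # Missing or null values are hashed as empty strings.
    parts = ["" if row.get(f) is None else str(row[f]) for f in key_fields]
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Fields outside `key_fields` (a cost column, for example) do not affect the result, so the hash stays stable if those values are restated.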
Action Items from the Members' call on Nov 7:
@shawnalpay In our use case, we create rules that allocate the cost of each row to an entity defined in the rule. A row's cost may be split among multiple entities, depending on the rule. For auditing purposes, we need to store how the cost was allocated. To do this, we create an audit report that records the identity column against the rule(s) for each partition, so that the actual allocation can be traced back using join queries.
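A rough sketch of the kind of audit trail described above, to make the dependency on a row identifier concrete. Everything here is assumed for illustration: the `allocate` helper, the record layout, and the idea that a rule expresses its split as relative shares per entity. The only point is that each audit record carries a `row_id`, which is what lets an allocation be traced back to its source row.

```python
from decimal import Decimal


def allocate(row_id: str, billed_cost: Decimal, rule_id: str, shares: dict) -> list:
    """Split one row's cost across entities and return one audit record per entity.

    `shares` maps a (hypothetical) entity name to its relative weight in the rule.
    """
    total = sum(shares.values())
    return [
        {
            "row_id": row_id,          # identifier of the source cost row
            "rule_id": rule_id,        # rule that produced this allocation
            "entity": entity,
            "allocated_cost": billed_cost * share / total,
        }
        for entity, share in shares.items()
    ]


# Usage: split a 10.00 charge 3:1 between two entities, then trace it back.
audit = allocate("row-0001", Decimal("10.00"), "rule-7", {"team-a": 3, "team-b": 1})
traced = [rec for rec in audit if rec["row_id"] == "row-0001"]
```

Without a stable `row_id` in the source data, the lookup in the last line has nothing to join on, which is the gap this issue is asking about.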
@kk09v We considered hashing, and it works, but for large customers (around 500MM rows in their CUR files), computing a hash for each row becomes a costly operation that hurts our performance.
@raghibomar786 I just want to make sure I'm very clear on the use case. Which of the following are you requesting?
Above, @kk09v was describing how he fulfills that second need by performing his own downstream hashing in a post-processing routine. In your original post, you've requested the AWS identity column, but I fear that it may not fulfill the use case as you've described it. If you are using … Please let me know which of the two needs above best represents your ask. Thanks!
@shawnalpay Use case 2 that you mentioned would be ideal. But the first use case also works for us, since the audit reports that we create store information about the partition as well. So a unique identifier within a partition would also solve our use case, though a universal global identifier would certainly be better. As I mentioned in the comment above, hashing works but becomes an expensive solution for large datasets.
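To illustrate the difference between the two scopes, here is a minimal sketch of a partition-scoped identifier built on the consumer side. This is not something FOCUS or any provider supplies; `with_partition_row_ids` and the `PartitionRowId` field are hypothetical names, and the approach assumes the row order within a partition file is stable between reads. The identifier is only unique when combined with the partition name, which matches the case where the audit report already stores the partition.

```python
from typing import Iterable, Iterator


def with_partition_row_ids(partition: str, rows: Iterable[dict]) -> Iterator[dict]:
    """Yield each row with a 'PartitionRowId' made of the partition name and row position."""
    for index, row in enumerate(rows):
        # The position alone is unique only within this partition;
        # prefixing the partition name makes the pair unique across partitions.
        yield {**row, "PartitionRowId": f"{partition}#{index}"}
```

This avoids hashing every row, at the price of depending on file order, so a regenerated partition with reordered rows would produce different identifiers.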
Summary from the Maintainers' call on Nov 25
Context:
@jpradocueva Thanks for the context. Is there a way for us to access the PR so that we can stay up to date with the changes?
Description
How can we identify a row uniquely in a FOCUS CUR file? AWS CUR V1 has Identity columns for this: https://docs.aws.amazon.com/cur/latest/userguide/identity-columns.html
Couldn't find a counterpart in FOCUS CUR files.
Proposed Approach
TBD by maintainers
GitHub Issue or Reference
No response
Context
No response
Data Submission for Discussion
No response