
[DISCUSSION]: How to uniquely identify a row in FOCUS file? #639

Open
raghibomar786 opened this issue Nov 5, 2024 · 10 comments
Assignees
Labels
  • csp: Cloud service providers
  • dimensionality: Fields that describe / group / filter metrics
  • discussion topic: Item or question to be discussed by the community
  • needs backlog review: Items to review with members and confirm whether to close or carry forward

Comments

@raghibomar786

Description

How can we uniquely identify a row in a FOCUS file? AWS CUR V1 has identity columns for this purpose: https://docs.aws.amazon.com/cur/latest/userguide/identity-columns.html
We couldn't find a counterpart in FOCUS files.

Proposed Approach

TBD by maintainers

GitHub Issue or Reference

No response

Context

No response

Data Submission for Discussion

No response

@raghibomar786 added the discussion topic label Nov 5, 2024
github-project-automation bot moved this to Triage in FOCUS WG Nov 5, 2024
@shawnalpay
Contributor

Hi @raghibomar786 -- thanks for adding an issue to the backlog! We have previously discussed adding an identity column, but it has not been prioritized, per the limitations described in the link you provided:

This field is generated for each line item and is unique in a given partition. This does not guarantee that the field will be unique across an entire delivery (that is, all partitions in an update) of the AWS CUR. The line item ID isn't consistent between different Cost and Usage Reports and can't be used to identify the same line item across different reports.

Given these limitations, could you please define a use case for this column as it is currently defined? That would really help us scope and prioritize this request. Thanks!

@shawnalpay added the needs backlog review, dimensionality, and csp labels Nov 5, 2024
@jpradocueva
Contributor

Action Items from the TF-1 call on November 5:

  • [#639] Shawn (@shawnalpay): Respond to the submitter, requesting more details on their specific use case for a unique row identifier to better understand their requirements.
  • [#639] Documentation Team: Add guidance in the FOCUS documentation on possible column combinations practitioners could use to approximate a unique identifier.
  • [#639] Alex (@ahullah): Explore whether certain FOCUS columns could have optional metadata attributes (e.g., "immutable") to help practitioners with tracking and row identification.

@kk09v
Contributor

kk09v commented Nov 7, 2024

I achieve something like this in my data by hashing the records (technically a subset of the fields).
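
A minimal sketch of this kind of composite hashing, assuming a hypothetical subset of FOCUS columns as the key. Which columns actually make a row unique varies by provider and dataset, and truly identical rows would still produce the same hash, so treat the column selection as an assumption to validate against your own data:

```python
import hashlib

# Hypothetical subset of FOCUS columns used as the hash input.
KEY_COLUMNS = [
    "BillingPeriodStart",
    "ChargePeriodStart",
    "ChargePeriodEnd",
    "ResourceId",
    "ChargeDescription",
    "SkuPriceId",
    "BilledCost",
]

def row_identifier(row: dict) -> str:
    """Build a deterministic identifier by hashing the chosen columns."""
    # Join values with a separator unlikely to appear in the data;
    # missing columns are encoded as empty strings so the hash stays stable.
    material = "\x1f".join(str(row.get(col, "")) for col in KEY_COLUMNS)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Example row with made-up values, for illustration only.
example_row = {
    "BillingPeriodStart": "2024-11-01T00:00:00Z",
    "ChargePeriodStart": "2024-11-05T00:00:00Z",
    "ChargePeriodEnd": "2024-11-06T00:00:00Z",
    "ResourceId": "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc",
    "ChargeDescription": "USD 0.096 per On Demand Linux t3.large Instance Hour",
    "SkuPriceId": "ABCDEF123456",
    "BilledCost": "2.304",
}
print(row_identifier(example_row))
```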

@jpradocueva
Contributor

Action Items from the Members' call on Nov 7:

  • [#639] Alex (@ahullah): Document the proposed method for generating unique row identifiers using composite hashing of key columns and share it with the group for review.
  • [#639] All Members: Provide feedback on the composite hashing method and discuss any additional suggestions for unique row identification.

@raghibomar786
Author

@shawnalpay In our use case, we define rules that allocate the cost of each row to an entity specified in the rule. A row's cost may be split among multiple entities depending on the rule. For auditing purposes, we need to record how the cost was allocated, so we create an audit report that stores the identity column against the rule(s) for each partition, allowing the actual allocation to be backtracked with joined queries.
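
A rough sketch of that audit pattern, with hypothetical rule, entity, and field names, showing how a per-row identifier lets each split be backtracked later by joining the audit records to the billing data on row_id:

```python
from dataclasses import dataclass

@dataclass
class AllocationAudit:
    row_id: str           # identifier of the source billing row (e.g. a hash or provider ID)
    rule_id: str          # allocation rule that produced this split
    entity: str           # entity receiving the share
    allocated_cost: float # portion of the row's billed cost assigned to the entity

def allocate(row_id: str, billed_cost: float, rule: dict) -> list[AllocationAudit]:
    """Split one row's cost across entities according to a rule's weights."""
    return [
        AllocationAudit(row_id, rule["rule_id"], entity, round(billed_cost * share, 6))
        for entity, share in rule["shares"].items()
    ]

# Hypothetical rule: 60% of the cost to team-a, 40% to team-b.
rule = {"rule_id": "rule-42", "shares": {"team-a": 0.6, "team-b": 0.4}}
audit_rows = allocate(row_id="3f9c...e1", billed_cost=2.304, rule=rule)
for record in audit_rows:
    print(record)
```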

@raghibomar786
Author

@kk09v We considered hashing, and it works, but for large customers (around 500MM rows in their CUR files) computing a hash for every row is a costly operation that impacts our performance.

@shawnalpay
Contributor

@raghibomar786 I just want to make sure I'm very clear on the use case. Which of the following are you requesting?

  1. The AWS identity column as it currently exists, knowing that it is not unique (or consistent) across files and therefore does not represent uniqueness.
  2. Net-new functionality in the form of a truly unique identifier that, to be clear, does not yet exist in any provider-native or FOCUS format.

Above, @kk09v was describing how he fulfills that second need by performing his own downstream hashing in a post-processing routine. In your original post, you've requested the AWS identity column, but I fear that it may not fulfill the use case as you've described it. If you are using identity/LineItemId for the purposes of allocation and then assuming uniqueness downstream, then you may encounter the same line item ID multiple times, and that could generate incorrect allocations.

Please let me know which of those two above needs best represents your ask. Thanks!

@raghibomar786
Author

@shawnalpay Option 2 that you mentioned would be ideal. But the first option also works for us, since the audit reports we create store information about the partition as well. A unique identifier within a partition would therefore solve our use case, though a globally unique identifier would be even better. As I mentioned in the comment above, hashing works but becomes expensive for large datasets.

@jpradocueva
Contributor

Summary from the Maintainers' call on Nov 25

Context:
This PR aims to add a requirement for a unique identifier for each row in the FOCUS dataset. The goal is to ensure data integrity and traceability, particularly in scenarios involving reconciliation, auditing, or merging datasets from multiple sources.

@raghibomar786
Author

@jpradocueva Thanks for the context. Is there a way for us to access the PR so that we can stay up to date on the changes?
