
[DISCUSSION]: How to uniquely identify a row in FOCUS file? #639

Open
raghibomar786 opened this issue Nov 5, 2024 · 10 comments
Assignees
Labels
  • csp: Cloud service providers
  • dimensionality: Fields that describe / group / filter metrics
  • discussion topic: Item or question to be discussed by the community
  • needs backlog review: Items to review with members and confirm whether to close or carry forward

Comments

@raghibomar786

Description

How can we uniquely identify a row in a FOCUS file? AWS CUR V1 has identity columns for this purpose: https://docs.aws.amazon.com/cur/latest/userguide/identity-columns.html
We couldn't find a counterpart in FOCUS files.

Proposed Approach

TBD by maintainers

GitHub Issue or Reference

No response

Context

No response

Data Submission for Discussion

No response

@raghibomar786 added the discussion topic label Nov 5, 2024
github-project-automation bot moved this to Triage in FOCUS WG Nov 5, 2024
@shawnalpay
Contributor

Hi @raghibomar786 -- thanks for adding an issue to the backlog! We have previously discussed adding an identity column, but it has not been prioritized, per the limitations described in the link you provided:

This field is generated for each line item and is unique in a given partition. This does not guarantee that the field will be unique across an entire delivery (that is, all partitions in an update) of the AWS CUR. The line item ID isn't consistent between different Cost and Usage Reports and can't be used to identify the same line item across different reports.

Given these limitations, could you please define a use case for this column as it is currently defined? That would really help us scope and prioritize this request. Thanks!

@shawnalpay added the needs backlog review, dimensionality, and csp labels Nov 5, 2024
@jpradocueva
Contributor

Action Items from the TF-1 call on November 5:

  • [#639] Shawn (@shawnalpay): Respond to the submitter, requesting more details on their specific use case for a unique row identifier to better understand their requirements.
  • [#639] Documentation Team: Add guidance in the FOCUS documentation on possible column combinations practitioners could use to approximate a unique identifier.
  • [#639] Alex (@ahullah): Explore whether certain FOCUS columns could have optional metadata attributes (e.g., "immutable") to help practitioners with tracking and row identification.

@kk09v
Contributor

kk09v commented Nov 7, 2024

I achieve something like this in my data by hashing the records (technically a subset of the fields).
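
A minimal sketch of this kind of composite hashing, assuming a hypothetical subset of FOCUS columns as the key. Which columns actually make a row unique varies by provider and dataset, and truly identical rows would still produce the same hash, so treat the column selection as an assumption to validate against your own data:

```python
import hashlib

# Hypothetical subset of FOCUS columns used as the hash input.
KEY_COLUMNS = [
    "BillingPeriodStart",
    "ChargePeriodStart",
    "ChargePeriodEnd",
    "ResourceId",
    "ChargeDescription",
    "SkuPriceId",
    "BilledCost",
]

def row_identifier(row: dict) -> str:
    """Build a deterministic identifier by hashing the chosen columns."""
    # Join values with a separator unlikely to appear in the data;
    # missing columns are encoded as empty strings so the hash stays stable.
    material = "\x1f".join(str(row.get(col, "")) for col in KEY_COLUMNS)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Example row with made-up values, for illustration only.
example_row = {
    "BillingPeriodStart": "2024-11-01T00:00:00Z",
    "ChargePeriodStart": "2024-11-05T00:00:00Z",
    "ChargePeriodEnd": "2024-11-06T00:00:00Z",
    "ResourceId": "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc",
    "ChargeDescription": "USD 0.096 per On Demand Linux t3.large Instance Hour",
    "SkuPriceId": "ABCDEF123456",
    "BilledCost": "2.304",
}
print(row_identifier(example_row))
```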

@jpradocueva
Contributor

Action Items from the Members' call on Nov 7:

  • [#639] Alex (@ahullah): Document the proposed method for generating unique row identifiers using composite hashing of key columns and share it with the group for review.
  • [#639] All Members: Provide feedback on the composite hashing method and discuss any additional suggestions for unique row identification.

@raghibomar786
Author

@shawnalpay In our use case, we define rules that allocate the cost of each row to an entity specified in the rule. A row's cost may be split among multiple entities depending on the rule. For auditing purposes, we need to record how the cost was allocated, so we create an audit report that stores the identity column against the rule(s) for each partition, allowing the actual allocation to be backtracked with joined queries.
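
A rough sketch of that audit pattern, with hypothetical rule, entity, and field names, showing how a per-row identifier lets each split be backtracked later by joining the audit records to the billing data on row_id:

```python
from dataclasses import dataclass

@dataclass
class AllocationAudit:
    row_id: str           # identifier of the source billing row (e.g. a hash or provider ID)
    rule_id: str          # allocation rule that produced this split
    entity: str           # entity receiving the share
    allocated_cost: float # portion of the row's billed cost assigned to the entity

def allocate(row_id: str, billed_cost: float, rule: dict) -> list[AllocationAudit]:
    """Split one row's cost across entities according to a rule's weights."""
    return [
        AllocationAudit(row_id, rule["rule_id"], entity, round(billed_cost * share, 6))
        for entity, share in rule["shares"].items()
    ]

# Hypothetical rule: 60% of the cost to team-a, 40% to team-b.
rule = {"rule_id": "rule-42", "shares": {"team-a": 0.6, "team-b": 0.4}}
audit_rows = allocate(row_id="3f9c...e1", billed_cost=2.304, rule=rule)
for record in audit_rows:
    print(record)
```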

@raghibomar786
Author

@kk09v We considered hashing, and it works, but for large customers (around 500MM rows in their CUR files) computing a hash for every row is a costly operation that impacts our performance.

@shawnalpay
Contributor

@raghibomar786 I just want to make sure I'm very clear on the use case. Which of the following are you requesting?

  1. The AWS identity column as it currently exists, knowing that it is not unique (or consistent) across files and therefore does not represent uniqueness.
  2. Net-new functionality in the form of a truly unique identifier that, to be clear, does not yet exist in any provider-native or FOCUS format.

Above, @kk09v was describing how he fulfills that second need by performing his own downstream hashing in a post-processing routine. In your original post, you've requested the AWS identity column, but I fear that it may not fulfill the use case as you've described it. If you are using identity/LineItemId for the purposes of allocation and then assuming uniqueness downstream, then you may encounter the same line item ID multiple times, and that could generate incorrect allocations.

Please let me know which of those two above needs best represents your ask. Thanks!

@raghibomar786
Author

@shawnalpay Option 2 that you mentioned would be ideal. But the first option also works for us, since the audit reports we create store information about the partition as well. A unique identifier within a partition would therefore solve our use case, though a globally unique identifier would be even better. As I mentioned in the comment above, hashing works but becomes expensive for large datasets.

@jpradocueva
Contributor

Summary from the Maintainers' call on Nov 25

Context:
This PR aims to add a requirement for a unique identifier for each row in the FOCUS dataset. The goal is to ensure data integrity and traceability, particularly in scenarios involving reconciliation, auditing, or merging datasets from multiple sources.

@raghibomar786
Author

@jpradocueva Thanks for the context. Is there a way for us to access the PR so that we can stay up to date on the changes?
