Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: remove Identifier (once known as Identities) from codebase. #1845

Merged
merged 4 commits into from
Oct 28, 2022

Conversation

jpivarski
Copy link
Member

@jpivarski jpivarski commented Oct 28, 2022

Identifiers were motivated by PartiQL, an experimental combinatorics language that uses set-like referential identity to track particles through a calculation. It would require our data to track a surrogate index through all of its operations, like a Pandas Index.

The Identities (Awkward v1) and Identifier (Awkward v2) were a "foot in the door" implementation to add such a feature. Originally, it was supposed to be implemented before the Awkward 1.0.0 release, but other features were more pressing. I left them in on the theory that it's easier to take them out than add them later, but it's been 3 years now and 2.0.0 is almost ready to ship—it's not going to happen. Moreover, 2.0.0 is almost ready to ship—removing them would mean changing the public API (albeit for the low-level layer, intended for downstream dependencies, rather than data analysts). So it must be done now.

Hopefully, this won't be too hard to merge with the other PRs. The most likely complication is that all Content subclasses and Form subclasses have 1 fewer argument: instead of "identifier, parameters", it's just "parameters". "None" is a likely value to be passed as identifier, and it's a likely value to be passed to parameters, so it shows up as "None, parameters" or "None, None".


📚 The documentation for this PR will be available at https://awkward-array.readthedocs.io/en/jpivarski-remove-identifiers-aka-identities/ once Read the Docs has finished building 🔨

@jpivarski jpivarski linked an issue Oct 28, 2022 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Oct 28, 2022

Codecov Report

Merging #1845 (e2008d8) into main (6b5d46e) will increase coverage by 0.10%.
The diff coverage is 93.95%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/__init__.py 97.05% <ø> (-0.09%) ⬇️
src/awkward/_broadcasting.py 93.41% <ø> (ø)
src/awkward/_connect/pyarrow.py 88.46% <ø> (ø)
src/awkward/_util.py 82.15% <ø> (-0.18%) ⬇️
src/awkward/_v2.py 100.00% <ø> (ø)
src/awkward/operations/ak_full_like.py 100.00% <ø> (ø)
src/awkward/operations/ak_singletons.py 96.00% <ø> (ø)
src/awkward/operations/ak_with_name.py 100.00% <ø> (ø)
src/awkward/forms/unionform.py 79.67% <50.00%> (ø)
src/awkward/contents/emptyarray.py 72.28% <57.14%> (+0.23%) ⬆️
... and 38 more

@jpivarski
Copy link
Member Author

jpivarski commented Oct 28, 2022

Remove 1345 lines that were never tested (Identifiers were almost always None) and the coverage goes up by... 0.1%.

Merging #1845 (10bede1) into main (178b4e9) will increase coverage by 0.10%.
The diff coverage is 93.75%.

@jpivarski
Copy link
Member Author

The docs have (correctly) lost all references to identifier.

Copy link
Collaborator

@agoose77 agoose77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a big PR! It will be nice to drop the identities mechanism given that we don't properly leverage it; less code to read and type in future!

I went line-by-line. I'll run a few greps offline to see if we've missed anything, but once that's done I'll approve!

src/awkward/_connect/numba/layout.py Outdated Show resolved Hide resolved
src/awkward/_connect/numba/layout.py Outdated Show resolved Hide resolved
@agoose77
Copy link
Collaborator

Also, I took the liberty of removing the v1 content from the Doxygen index, as you already have to touch that file in this PR.

@agoose77
Copy link
Collaborator

I checked NumpyArray, IndexedOptionArray, and ListOffsetArray, which I expected to be the most common layout types, and found no misuse.

Do you know what the is_identifier.match excerpt refers to in _prettyprint.py (used in a few places)?

@jpivarski
Copy link
Member Author

Do you know what the is_identifier.match excerpt refers to in _prettyprint.py (used in a few places)?

That checks to see if a field name of a record fits the regex that most languages use as identifiers—in particular, the Datashape language (which inherits it from whatever parser they're using). That's /[A-Za-z_][A-Za-z_0-9]*/.

This PR was not a sed substitution. I checked each usage of strings matching /identi/i (because I wanted to be sure there wasn't anything left-over from the "Identity"/"Identities" days).

I'll also simplify the two f-string substitutions. It used to be all format method calls because we were originally allowing for Python 2. So, little by little, I've been converting them to f-strings, and it seemed to be a good time/place to do that if there were identifiers to remove.

@agoose77
Copy link
Collaborator

Do you know what the is_identifier.match excerpt refers to in _prettyprint.py (used in a few places)?

That checks to see if a field name of a record fits the regex that most languages use as identifiers—in particular, the Datashape language (which inherits it from whatever parser they're using). That's /[A-Za-z_][A-Za-z_0-9]*/.

Fab, this was my read of things, but I never closely looked at identifiers before you removed them, so I wanted to clarify.

This PR was not a sed substitution. I checked each usage of strings matching /identi/i (because I wanted to be sure there wasn't anything left-over from the "Identity"/"Identities" days).

You can tell; lots of tricky things to find beyond a guided exploration. Even just reading all the files took a while, nice effort!

I'll also simplify the two f-string substitutions. It used to be all format method calls because we were originally allowing for Python 2. So, little by little, I've been converting them to f-strings, and it seemed to be a good time/place to do that if there were identifiers to remove.

I'm 100% on-board with f-strings. In these two rare cases, I'd either assign the long statements to local variables, or just use a multi-line string.format, hence the suggestion.

@jpivarski
Copy link
Member Author

In this case, I'm going to leave these strings as f-strings, but predefine the variables to substitute. That way, all of the Numba type strings will be generated in the same way.

@jpivarski jpivarski enabled auto-merge (squash) October 28, 2022 22:31
@jpivarski
Copy link
Member Author

Auto-merging.

Thanks!

@jpivarski jpivarski merged commit ccadeec into main Oct 28, 2022
@jpivarski jpivarski deleted the jpivarski/remove-identifiers-aka-identities branch October 28, 2022 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove Identifiers/identifier/has_identifiers from the codebase
2 participants