Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/lists: introduce ifilter0, uniqBy, uniq, and fastUnique #119286

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

bb010g
Copy link
Contributor

@bb010g bb010g commented Apr 13, 2021

Motivation for this change

TeXLive combine was really slow (at least when I first wrote this patch). uniqBy, uniq, and fastUnique are significantly more efficient alternatives to current standard library functions that should be used when possible.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

This change is Reviewable

@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 6.topic: TeX Issues regarding texlive and TeX in general 8.has: module (update) This PR changes an existing module in `nixos/` labels Apr 13, 2021
bb010g added a commit to bb010g/bb010g.nix that referenced this pull request Apr 13, 2021
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Apr 13, 2021
@bb010g
Copy link
Contributor Author

bb010g commented Apr 13, 2021

I considered using filter, but given that my clone of Nix has the following FIXME:

/* Filter a list using a predicate; that is, return a list containing
   every element from the list for which the predicate function
   returns true. */
static void prim_filter(EvalState & state, const Pos & pos, Value * * args, Value & v)
{
    state.forceFunction(*args[0], pos);
    state.forceList(*args[1], pos);

    // FIXME: putting this on the stack is risky.
    Value * vs[args[1]->listSize()];
    unsigned int k = 0;

I figured not relying on it would be a better idea for robustly handling huge lists, which is where you really want to use fastUnique. If you did filter over {value, isDup} pairs and then map, I think you'd get the same amount of list allocations but would avoid computing uniqLength separately.

lib/lists.nix Outdated
genList (n: if n == 0 then 0 else go (uniqNAt (n - 1) + 1)) uniqLength;
uniqNAt = elemAt uniqNList;
in
genList (n: listElemAt (uniqNAt n)) uniqLength;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function definitely needs some comments for how it works, but this is seriously impressive how you achieved O(n) like that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost all the complexity here has been split out into the ifilter0 function, so you'll want to review the names & comments there.

fastUnique (a: b: a < b) [ 3 2 3 4 ]
=> [ 2 3 4 ]
*/
fastUnique = comparator: list: uniq (sort comparator list);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have a more generic form by implementing this with

comparator: list: uniqBy (a: b: ! comparator a b) (sort comparator list)

This works because uniqBy's predicate is only ever called on increasing list values, which are sorted already (aka a <= b). With the additional check ! a < b we can then ensure that they're equal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reminded of Rust's core::cmp::PartialEq & core::cmp::PartialOrd traits. Using inverted comparator as a uniqBy argument implies that, letting a < b = comparator a b, if a < b && !(b < a) then a == b. This is not the case if a and b are partially ordered by comparator and not totally ordered. Nix doesn't have NaN to make its floats not totally ordered, but the user-supplied data ordered by user-supplied comparator might not be totally ordered. However, this total ordering requirement can be disclosed in documentation.

I'm more worried about a user-supplied comparator being substantially slower than a user-supplied pred would be. If user-supplied data doesn't follow Nix == equality, then forcing explicit fallback to uniqBy pred (sort comparator list) (which isn't much longer) makes it clearer how uniqBy (a: b: !(comparator a b)) (sort comparator list) would repeatedly run a potentially-expensive comparator.

Expensive comparators might not be an issue in practice, though, and I defer here to those familiar with more of Nixpkgs.

Alternatively, fastUnique could be defined as fastUnique = list: uniq (sort lessThan list);, and we could force manual expansion for any other usage. I don't see much of a benefit to this though, given that builtins.sort takes a comparator. I'd like to mirror builtins.sort here. For comparison, builtins.elem does not take a pred, and uniq & unique reflect this. If the pattern of uniquniqBy was followed, there would be fastUniqueBy = pred: comparator: list: uniqBy pred (sort comparator list);. fastUniqueBy could shorten that earlier explicit expansion to fastUniqueBy (a: b: !(comparator a b)) comparator list, but I don't think defining this adds much?

bb010g added 4 commits April 14, 2021 16:52
An O(n) alternative to unique that only removes adjacent elements,
but can also take an equality predicate.

Named after the similar uniq(1).
A more efficient alternative to unique if sorted output is acceptable.
@bb010g bb010g changed the title lib/lists: introduce uniqBy, uniq, and fastUnique lib/lists: introduce ifilter0, uniqBy, uniq, and fastUnique Apr 15, 2021
@bb010g
Copy link
Contributor Author

bb010g commented Apr 15, 2021

I considered using filter, …

After starting to write a comment to this effect, explaining why filter was not used and providing a short implementation for a theoretical VLA-free filter, I realized that this filter should probably be implemented in lib/lists and then used by uniq. This turned into ifilter0, which justifies existing on a future VLA-free–filter Nix by taking an ipred that is passed the current index, similar to imap0. This also still allows uniqBy to avoid construction of {value, isDup} pairs (via an elemAt list call in ipred). ifilter0 is clearer to read than the previous uniqBy (or even similarly commented uniqBy), and uniqBy is much simpler & clearer to read.

@bb010g bb010g requested a review from infinisil April 19, 2021 22:07
let
# View `list` as a memoization.
# `nToE` maps `n` (a `list` index) to `e` (a `list` element). Memoized.
nToE = elemAt list;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining these elemAt functions doesn't do anything to memoize individual elements. You can inline this function to save a thunk allocation and some mental capacity. Same with nToKeep, keptNToNKept and eAt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elemAt list n doesn't allocate a thunk in the middle, despite being equivalent to (elemAt list) n?

Copy link
Member

@infinisil infinisil Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refreshing my past experience with the Nix evaluator, I created this wiki page to explain how thunks work in Nix: https://nixos.wiki/wiki/Nix_Evaluation_Performance#Thunks

nToKeepList = genList (n: ipred n (nToE n)) listLength;
# `keptNToEKept` maps `keptN` (a `keptList` index) to `eKept` (a kept
# `list` element). Our final result is this memoization, viewed as a list.
keptList = keptNToEKeptList;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also just inline this, no need to define a variable the same as another variable.

; in genList keptNToNKeptUnmemoized keptLength;
keptNToEKeptList = let keptNToEKeptUnmemoized = keptN:
nToE (keptNToNKept keptN)
; in genList keptNToEKeptUnmemoized keptLength;
Copy link
Member

@infinisil infinisil May 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these variable names and comments make it easier or harder to understand!

How about renaming

  • nToKeepList -> keepOrNot
  • keptLength -> resultLength
  • nToNKept -> nextResultIndex
  • keptNToNKeptList -> resultIndices
  • keptNToEKeptList -> result

Also, how about adding some visuals:

Example: Filtering for only uppercase letters in [ "b" "A" "f" "B" "E" "c" ]

       unfiltered list indices: 0 1 2 3 4 5
      unfiltered list elements: b A f B E c
       include condition holds: - Y - Y Y -  -> result length is 3
                                               
         filtered list indices:   0   1 2
 unfiltered indices to include:   1   3 4
                                  ^   ^ ^
    traverse condition list to find   Continue traversing condition list after the
the first Y, return the index of it   previously included index to find the next Y
                                 
        filtered list elements:   A   B E

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea with the naming was to keep it clear what type of data you were dealing with by using consistent "conversion" functions, especially when different index styles are being processed.

Visuals in the documentation is a good idea.

@stale
Copy link

stale bot commented Jan 6, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 6, 2022
@infinisil infinisil mentioned this pull request Feb 1, 2023
2 tasks
@infinisil
Copy link
Member

infinisil commented Feb 1, 2023

I created #214021 with a rebased and cleaned up version of just ifilter0, I hope that's okay with you @bb010g!

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Feb 1, 2023
@infinisil infinisil mentioned this pull request Jul 24, 2023
12 tasks
@risicle
Copy link
Contributor

risicle commented Feb 27, 2024

I would really love to have at least the uniq functions from this - I find it puzzling that everyone seems to be happy with lib.unique.

@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 19, 2024
@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Mar 20, 2024
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 20, 2024
@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.status: merge conflict This PR has merge conflicts with the target branch 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 6.topic: TeX Issues regarding texlive and TeX in general 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants