Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entity adapter re-sorts all ids (O(n*log(n)), instead of just inserting the new id into the "right" place (O(log(n))) #4252

Closed
dutziworks opened this issue Mar 5, 2024 · 17 comments

Comments

@dutziworks
Copy link
Contributor

Hey 👋

Why is it that when adding (or removing/updating/etc...) items inside an entity-adapter's state, entity-adapter will first push into the array, then run .sort(sortFn) on the entire array?

Wouldn't it make more sense to find the index where the new item should be at (O(log(n)), and splice the ids array?

@markerikson
Copy link
Collaborator

No specific reason other than it was the simplest thing and/or that's what the NgRx implementation was doing already.

My Replay teammate Brian Vaughn has written some utils for inserting items into a proper index here:

maybe we could swipe some of that

@plausible-phry
Copy link

It shouldn't matter much, tbh.

Re-sorting something that's already sorted except for one item should stay around O(n), and copying the array would also be O(n) in both cases, which brings both solution to about O(n) in both cases.

@dutziworks
Copy link
Contributor Author

Copying is by orders of magnitude cheaper than whatever sortFn does, which is up to the user anyhow.

@dutziworks
Copy link
Contributor Author

And I'm not sure where you got the impression that resorting an almost sorted array issues less than n*log(n) calls.

Try logging how many times sortFn is called when triggered on an already sorted array.

@phryneas
Copy link
Member

phryneas commented Mar 5, 2024

Sorting a sorted array will have exactly that amount of comparisons.

{let i=0; new Array(100).fill(0).map((_,i) => i).sort((a,b) => { console.log(a,b,++i); return a-b })}

logs 99 times.

Now we add 0 at the end, so it has to move all the way to the front of the array

{let i=0; let arr = new Array(100).fill(0).map((_,i) => i); arr.push(0); arr.sort((a,b) => { console.log(a,b,++i); return  a-b })}

logs 199 times. (Edit: I forgot the return earlier and said "100" here)

Of course, it depends on which algorithm is used in the engine (this was in Firefox), but here it's O(n).
Sorting an array in O(n*log(n)) is always the worst case, but sorting an almost-sorted array is usually a best case situation.
It's the same complexity as copying an array of n elements.

Since both solutions have to copy an array, we look at a comparison between O(n) + O(n) = O(n) and O(log(n)) + O(n) = O(n).

Of course it's not equal, and an insertion at the right place might be a tad faster, but it's the same order of magnitude.

Damn you're right, it goes into worst-case scenario. Would not, depending on the algorithm used, but it this case apparently it does :/

Edit2:
Numbers were just too small - we end up with O(2n)=O(n) here. It just looked like O(n*log(n)) because log(100)=2.

{let i=0; let arr = new Array(9999).fill(0).map((_,i) => i); arr.push(0); arr.sort((a,b) => { console.log(a,b,++i); return a-b }); arr}

logs 20003 times.

So I stay with the original answer ;)

PS: seems like most browsers nowadays use TimSort, with has a best case complexity of O(n) and a worst case of O(n*log(n)). Firefox seems to still use MergeSort, although the numbers here look too good for that.

@dutziworks
Copy link
Contributor Author

dutziworks commented Mar 10, 2024

Oh wow, you're right!

When dealing with 1000s of items + using immer (RTK) this does get a lot slower. Not sure if because of the equality function reading "through" immer or because of the big state change at the end + immer (which is unrelated to this "issue")

@Jackman3005
Copy link

With about 10k items this sort takes about ~3s anytime an entity is added/removed.

sortComparer: (a, b) => {
    // Keep the list of IDs sorted by `startedAt` DESC to avoid having to sort the list on each re-render of the Runs screen
    if (a === b) {
      return 0;
    } else if (!a.startedAt) {
      return 1;
    } else if (!b.startedAt) {
      return -1;
    }

    return b > a ? -1 : 1;
    // return new Date(b.startedAt).getTime() - new Date(a.startedAt).getTime();
  }

(Date line commented out to see if this was the issue, it was not).

@G2Jose
Copy link

G2Jose commented Apr 16, 2024

With about 10k items this sort takes about ~3s anytime an entity is added/removed

This is a significant amount of time, and I'm running into this problem on my application too. I've been considering refactoring my implementation so it doesn't rely on sortComparer, but I'm yet to figure out an elegant solution.

@markerikson
Copy link
Collaborator

markerikson commented Apr 16, 2024

For the record I'm very open to PRs to improve this, and I'd like to see it improved. I just haven't had much time for Redux maintenance work in the last couple months, and I have other priorities for the few spare minutes when I do get time to work on Redux stuff.

Note that I did link some potentially-useful insertion code up earlier in the thread, if someone would like to use that as a starting point for an implementation here.

(It would also be useful if we had some kind of perf benchmarks / tests added to the codebase as well, to help measure improvements and ensure we don't regress any improvements.)

@markerikson
Copy link
Collaborator

markerikson commented Apr 16, 2024

Took a few minutes to re-familiarize myself with some of this code.

There's a few major issues involved:

  • We use the same logic for sorting no matter what the overall operation was (add/update) or how many items (single/plural)
  • It's possible (if exceedingly unlikely) that an update may actually change the ID of an item, and we currently try to support that use case
  • The comparator is an arbitrary callback that could be comparing anything about a given pair of items, so we have to rerun that any time there's any update to any item, and re-sort all items.

Could someone do me a favor and provide a repro that shows this behaving really slowly, so I can do some checks on the perf?

I will say that this logic does seem somewhat inefficient - we're always creating an array, sorting it, and then checking to see if it seems different, and the array we create will be initially sorted based on insertion order of the items in the lookup table (and thus probably always be in badly sorted order):

 function resortEntities(state: R) {
    const allEntities = Object.values(state.entities) as T[]
    allEntities.sort(sort)

    const newSortedIds = allEntities.map(selectId)
    const { ids } = state

    if (!areArraysEqual(ids, newSortedIds)) {
      state.ids = newSortedIds
    }
  }

@G2Jose
Copy link

G2Jose commented Apr 17, 2024

Really appreciate you taking a look @markerikson.

Here's an example repo - https://github.com/G2Jose/redux-toolkit-performance-repro.

It contains

  • A very basic redux store using a redux toolkit slice
  • A single upsertOne reducer
  • Initial data (n = 100,000)
  • Jest code that dispatches an action that triggers this upsertOne reducer

With this setup I'm seeing:

sortComparer called: 1,529,276 times
Duration to upsert an item: 4,019 ms

In the real world I have an app that tracks workouts + workouts of friends. Over a long time period the list of sets can grow quite large. At which point inserting or updating any set becomes noticeably slow. I can of course refactor things so data that is likely to be updated (eg: a 'current workout') is stored in a different slice, but I was hoping for a simpler solution.

@markerikson
Copy link
Collaborator

Looking at this again briefly, it's going to be tricky to pull this off.

I think that if we could map over the already-sorted IDs to get the list of items to sort, there'd be significantly fewer comparisons. But, right now we use that resort() method kind of blindly, because it "conveniently" handles all the cases - added, updated, removed, etc.

In order to safely map over the IDs array, we would need to already have it updated with all the current IDs if any of those changed (added/updated/removed).

I'll try to think my way through this later.

@markerikson
Copy link
Collaborator

@G2Jose okay, I've got an initial PR up that seems to improve perf by about 50%. Could you try out the CI build as soon as it's done?

#4361

@Jackman3005
Copy link

Jackman3005 commented Apr 19, 2024

Hey @markerikson I ended up creating a new slice that I did the sorting in and took advantage of both the fact that the existing array of IDs was sorted and the incoming array of ids was also sorted (from the API in my case). But after researching it looks like merging of two sorted arrays is usually rather performant. In the scenario I gave above for the entity adapter, it was really slow when I had ~10k records and I was adding 1k more via upsertMany.

I imagine if you pre-sorted the IDs of the 1k in this scenario then performed a merge of two sorted arrays it might be most performant in this case. I created my own modified merge algorithm based on the knowledge that most of the time I am essentially just appending the ids I am receiving to the end of the existing list. I have shared this below, it's a bit messy and besides my own real-world analysis I do not have robust or theoretical proof that it's more performant on average than the standard merger of sorted arrays algorithms you can find in papers. Feel free to use this if it's helpful though, also open to feedback if you identify something bad/broken.

Note I did borrow some code that you linked to above for finding the insert index.

My Merger of two sorted arrays algorithm
/**
 * The goal of this function is to insert the new IDs into the sorted list of IDs in the most performant way as possible.
 * Because the existing list is already sorted and the new IDs are also sorted (from the API),
 * we can take advantage of this to avoid having to re-sort the entire list.
 * Additionally, we want to minimize the number of insertions
 * (calls to `.slice(...)`) into the list to avoid unnecessary overhead.
 */
function mergeTwoSortedArrays(
  existingSortedIds: string[],
  preSortedNewIds: string[],
  comparator: ComparisonFunction<string>
) {
  const firstInstanceId = preSortedNewIds[0];
  const lastInstanceId = preSortedNewIds[preSortedNewIds.length - 1];

  const startIndex = findInsertIndex(
    existingSortedIds,
    firstInstanceId,
    comparator
  );
  const endIndex = findInsertIndex(
    existingSortedIds,
    lastInstanceId,
    comparator,
    startIndex
  );

  const overlappingExistingIds = existingSortedIds.slice(startIndex, endIndex);
  let newIdIndexOfLastInsert = 0;
  let lastRelativeInsertIndex = 0;
  for (let i = 1; i < preSortedNewIds.length; i++) {
    const relativeInsertIndex = findInsertIndex(
      overlappingExistingIds,
      preSortedNewIds[i],
      comparator,
      lastRelativeInsertIndex
    );
    if (lastRelativeInsertIndex !== relativeInsertIndex) {
      const insertIndex =
        startIndex + newIdIndexOfLastInsert + lastRelativeInsertIndex;
      const arrayToInsert = preSortedNewIds.slice(newIdIndexOfLastInsert, i);
      existingSortedIds.splice(insertIndex, 0, ...arrayToInsert);
      newIdIndexOfLastInsert = i;
      lastRelativeInsertIndex = relativeInsertIndex;
    }
  }
  existingSortedIds.splice(
    startIndex + newIdIndexOfLastInsert + lastRelativeInsertIndex,
    0,
    ...preSortedNewIds.slice(newIdIndexOfLastInsert)
  );
}

const startedAtDescendingComparator =
  (state: QuestInstanceViewsState) => (_a: string, _b: string) => {
    const a = state.runListViewsById[_a]?.startedAt;
    const b = state.runListViewsById[_b]?.startedAt;
    if (a === b) {
      // Sort by id if `startedAt` is the same. Mimics the behavior of the API.
      return state.runListViewsById[_a]?.id > state.runListViewsById[_b]?.id
        ? -1
        : 1;
    } else if (!a) {
      return 1;
    } else if (!b) {
      return -1;
    }

    return a > b ? -1 : 1;
    // date comparison should not be needed since `startedAt` is a UTC ISO formatted string
    // return new Date(b.startedAt).getTime() - new Date(a.startedAt).getTime();
  };

/**
 * Fast insert of a new item into a sorted array.
 * Taken from: https://github.com/replayio/devtools/blob/ce97fc7abfd5ae03f767216efa44173209c27cdc/packages/replay-next/src/utils/array.ts#L131
 */
type ComparisonFunction<T> = (a: T, b: T) => number;
export function findInsertIndex<T>(
  sortedItems: T[],
  item: T,
  comparisonFunction: ComparisonFunction<T>,
  lowIndexOverride?: number
): number {
  let lowIndex = lowIndexOverride ?? 0;
  let highIndex = sortedItems.length;
  while (lowIndex < highIndex) {
    const middleIndex = (lowIndex + highIndex) >>> 1;
    const currentItem = sortedItems[middleIndex];
    if (comparisonFunction(item, currentItem) > 0) {
      lowIndex = middleIndex + 1;
    } else {
      highIndex = middleIndex;
    }
  }

  return lowIndex;
}

Edit:

  • I didn't end up doing a final performance analysis on this, because it was now so low that it was no longer the bottleneck. I imagine it is in the order of ms to tens of ms at most.
  • Note that duplicate ids are filtered before this call
  • In my case startedAt never changes, so it cannot be updated to something else causing the order to be wrong.

@markerikson
Copy link
Collaborator

@G2Jose @Jackman3005 I've updated the PR with several WIP variations. Currently it's using one that does some binary searches for insertions. Could you try that out and see how it performs?

@markerikson
Copy link
Collaborator

markerikson commented Apr 22, 2024

I think I also found that some of the existing entity adapter logic relied heavily on reads into Immer-proxied state, and when done in bulk, that was slowing things down (such as iterating state.ids in the equality check). It seems to be faster to do const ids = current(state.ids) and then compare. Ditto for doing const entities = current(state.entities). I also switched some if (id in state.entities) checks for Sets instead.

That seems to knock another 20%-ish off some of the times I'm seeing.

@markerikson
Copy link
Collaborator

Okay, this is out in https://github.com/reduxjs/redux-toolkit/releases/tag/v2.2.4 !

Going to close this, but I'm very interested in hearing feedback on how much this improves behavior in real-world apps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants