Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch race condition in PostHogFileBackedQueue.deleteFiles #218

Merged
merged 3 commits into from
Oct 22, 2024

Conversation

jverkoey
Copy link
Contributor

@jverkoey jverkoey commented Oct 17, 2024

💡 Motivation and Context

In rare cases, deleteFiles(_:) can be invoked such that, between the check for items.isEmpty and the call to items.remove, the task gets pre-empted and items is mutated. Once the thread resumes, the .remove(at: 0) operation causes a crash.

This change includes the following two parts:

  1. Enable ReadWriteLock's mutate method to return a value from the mutation event.
  2. Wrap the item read + write events in a single atomic block using the new mutate method to return the popped filename.

Reported stack trace from my app, Sidecar:

Screenshot 2024-10-16 at 9 35 05 PM

#skip-changelog

💚 How did you test it?

Tested by using my app with this patch enabled and verifying that the files are deleted as expected.

📝 Checklist

  • I reviewed the submitted code.
  • I added tests to verify the changes.
  • I updated the docs if needed.
  • No breaking change or entry added to the changelog.

In rare cases, `deleteFiles(_:)` can be invoked such that, between the check for items.isEmpty and the call to items.remove, the task gets pre-empted and items is mutated. Once the thread resumes, the `.remove(at: 0)` operation causes a crash.
@jverkoey jverkoey requested a review from marandaneto as a code owner October 17, 2024 04:36
Copy link
Contributor

@ioannisj ioannisj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! Yes, thanx for the catch!

Any multi-step operations on the collection types are not atomic as well (e.g array[i] += 1)

Left a couple of style comments - just personal preference so feel free to ignore/resolve those

PostHog/Utils/ReadWriteLock.swift Show resolved Hide resolved
PostHog/Utils/ReadWriteLock.swift Outdated Show resolved Hide resolved
PostHog/PostHogFileBackedQueue.swift Outdated Show resolved Hide resolved
@jverkoey
Copy link
Contributor Author

Style comments addressed :)

let removed = items.remove(at: 0) // We always remove from the top of the queue

deleteSafely(queue.appendingPathComponent(removed))
if let removed: String = _items.mutate({ items in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L49 does something similar, maybe this patch has to be applied there as well.

if items.isEmpty {
return nil
}
return items.remove(at: 0) // We always remove from the top of the queue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you reproduce this issue since the whole block is locked with the isFlushingLock and isFlushing flag? Can you pass the specific steps to reproduce the issue so we can check if the very same issue happens in more places as well?
I don't see how the remove(at: 0) would fail if the internal items list isn't modified anywhere else outside of this lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this call here is outside the isFlushingLock

func add(_ event: PostHogEvent) {
        if fileQueue.depth >= config.maxQueueSize {
            hedgeLog("Queue is full, dropping oldest event")
            // first is always oldest
            fileQueue.delete(index: 0)
        }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm in this case we might have a bug deleting the wrong file as well.
I think that part isn't locked to not block the calling thread on add, but we should do something about it then.
Can be a different issue/PR though, not blocking this one as its avoiding a crash already

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we definitely shouldn't wait on a lock here. Thing is that we can't assume that this is aways called on main thread either.

Maybe it's best to try and recreate this crash with unit tests with background queue and DispatchQueue.concurrentPerform(iterations:)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the thread is easy (Thread.isMainThread), but independently of the thread, the add method should be run within the dispatchQueue.async block and then it's ok to lock with isFlushingLock and await its availability.
Again, another PR, fixing the crash here is more important now, but a follow-up with the improvement above would be cool.

fileQueue.delete(index: 0)
is an edge case so I bet rarely happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw I only got one crash report with this call stack.

Copy link
Contributor

@ioannisj ioannisj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed request changes by mistake.

@ioannisj
Copy link
Contributor

Merging this one and I'll work on adding some unit tests to catch and fix these concurrency edge cases

@ioannisj ioannisj merged commit 766933a into PostHog:main Oct 22, 2024
5 of 6 checks passed
@ioannisj
Copy link
Contributor

@jverkoey released with 3.13.3. Thank you ❤️

@jverkoey
Copy link
Contributor Author

huzzah! thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants