Catch race condition in PostHogFileBackedQueue.deleteFiles #218
In rare cases, `deleteFiles(_:)` can be invoked such that, between the check for `items.isEmpty` and the call to `items.remove`, the task gets pre-empted and `items` is mutated. Once the thread resumes, the `.remove(at: 0)` operation causes a crash.
Hey there! Yes, thanks for the catch!
Multi-step operations on collection types are not atomic either (e.g. `array[i] += 1`)
Left a couple of style comments - just personal preference so feel free to ignore/resolve those
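For illustration, here is a minimal sketch of doing the emptiness check and the removal as one locked step. The `Locked` wrapper and `popFirstFileName()` are hypothetical names made up for this example, not the SDK's actual types; the only idea taken from the PR is a `mutate` that returns a value from the mutation closure.

```swift
import Foundation

// Hypothetical lock-protected box; the SDK's real wrapper may differ.
final class Locked<Value>: @unchecked Sendable {
    private let lock = NSLock()
    private var value: Value

    init(_ value: Value) {
        self.value = value
    }

    // Runs the closure while holding the lock and returns its result,
    // so a check and the mutation that depends on it form one critical section.
    func mutate<T>(_ body: (inout Value) -> T) -> T {
        lock.lock()
        defer { lock.unlock() }
        return body(&value)
    }
}

// Illustrative data only.
let fileNames = Locked(["a.event", "b.event"])

// Check-and-remove in a single step; returns nil when nothing is left.
func popFirstFileName() -> String? {
    fileNames.mutate { names in
        names.isEmpty ? nil : names.removeFirst()
    }
}
```

Because the `isEmpty` check and `removeFirst()` run under the same lock acquisition, another thread cannot empty the array between them.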
Style comments addressed :)
let removed = items.remove(at: 0) // We always remove from the top of the queue
deleteSafely(queue.appendingPathComponent(removed))
if let removed: String = _items.mutate({ items in
L49 does something similar, maybe this patch has to be applied there as well.
    if items.isEmpty {
        return nil
    }
    return items.remove(at: 0) // We always remove from the top of the queue
How did you reproduce this issue, since the whole block is locked with the `isFlushingLock` and the `isFlushing` flag? Can you share the specific steps to reproduce it so we can check whether the very same issue happens in more places as well?
I don't see how the `remove(at: 0)` would fail if the internal `items` list isn't modified anywhere else outside of this lock.
I think this call here is outside the `isFlushingLock`:
func add(_ event: PostHogEvent) {
    if fileQueue.depth >= config.maxQueueSize {
        hedgeLog("Queue is full, dropping oldest event")
        // first is always oldest
        fileQueue.delete(index: 0)
    }
Mmm, in this case we might have a bug deleting the wrong file as well.
I think that part isn't locked so that it doesn't block the calling thread on `add`, but we should do something about it then.
Can be a different issue/PR though, not blocking this one as it's avoiding a crash already
Yeah, we definitely shouldn't wait on a lock here. Thing is that we can't assume that this is always called on the main thread either.
Maybe it's best to try and recreate this crash with unit tests, using a background queue and `DispatchQueue.concurrentPerform(iterations:)`
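A rough sketch of what such a stress test could look like. `FileNameStore`, `Counter`, and `stressPopFirst(total:)` are hypothetical stand-ins written for this example, not code from the SDK or its test suite; the store already uses an atomic check-and-remove, so the sketch shows the shape of the test rather than a reproduction of the crash.

```swift
import Foundation
import Dispatch

// Hypothetical stand-in for the queue's item list, guarded by a lock.
final class FileNameStore: @unchecked Sendable {
    private let lock = NSLock()
    private var names: [String]

    init(_ names: [String]) {
        self.names = names
    }

    // Atomic check-and-remove; returns nil when nothing is left.
    func popFirst() -> String? {
        lock.lock()
        defer { lock.unlock() }
        return names.isEmpty ? nil : names.removeFirst()
    }
}

// Small thread-safe counter used to tally successful removals.
final class Counter: @unchecked Sendable {
    private let lock = NSLock()
    private var count = 0

    func increment() {
        lock.lock()
        count += 1
        lock.unlock()
    }

    var value: Int {
        lock.lock()
        defer { lock.unlock() }
        return count
    }
}

// Hammers popFirst() from many threads and reports how many items came out.
// Twice as many iterations as items, so the empty case is exercised too.
func stressPopFirst(total: Int) -> Int {
    let store = FileNameStore((0..<total).map { "\($0).event" })
    let removed = Counter()
    DispatchQueue.concurrentPerform(iterations: total * 2) { _ in
        if store.popFirst() != nil {
            removed.increment()
        }
    }
    return removed.value
}
```

In an XCTest case the returned count would be compared against `total`; swapping `popFirst()` for a separate `isEmpty` check followed by `remove(at: 0)` without a shared lock is the variant one would expect to crash intermittently.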
Checking the thread is easy (`Thread.isMainThread`), but independently of the thread, the `add` method should be run within the `dispatchQueue.async` block and then it's OK to lock with `isFlushingLock` and await its availability.
Again, another PR, fixing the crash here is more important now, but a follow-up with the improvement above would be cool.
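One possible shape for that follow-up, as a simplified sketch only. `SerializedEventQueue`, its `String` events, and `snapshot()` are invented for illustration and are not the SDK's API; in this sketch a serial `DispatchQueue` stands in for `isFlushingLock`, which is a swapped-in simplification of the suggestion above.

```swift
import Foundation
import Dispatch

// Hypothetical, simplified queue; names are illustrative, not the SDK's API.
final class SerializedEventQueue: @unchecked Sendable {
    // Serial queue: blocks submitted to it run one at a time, in order.
    private let dispatchQueue = DispatchQueue(label: "example.event-queue")
    private var items: [String] = []
    private let maxQueueSize: Int

    init(maxQueueSize: Int) {
        self.maxQueueSize = maxQueueSize
    }

    // Returns immediately to the caller. The drop-oldest check and the
    // append run together on the serial queue, so they cannot interleave
    // with other mutations submitted to the same queue.
    func add(_ event: String) {
        dispatchQueue.async {
            if self.items.count >= self.maxQueueSize, !self.items.isEmpty {
                self.items.removeFirst() // first is always oldest
            }
            self.items.append(event)
        }
    }

    // Waits for pending work, then returns a copy (intended for tests).
    func snapshot() -> [String] {
        dispatchQueue.sync { items }
    }
}

// Usage: with room for two events, the third add drops the oldest one.
let exampleQueue = SerializedEventQueue(maxQueueSize: 2)
exampleQueue.add("a")
exampleQueue.add("b")
exampleQueue.add("c")
```

The caller never blocks on `add`, which was the concern raised earlier in the thread, while the flush path would need to go through the same queue (or lock) for the guarantee to hold.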
posthog-ios/PostHog/PostHogQueue.swift
Line 212 in 1bb1709
fileQueue.delete(index: 0)
fwiw I only got one crash report with this call stack.
Removed request changes by mistake.
Merging this one and I'll work on adding some unit tests to catch and fix these concurrency edge cases
huzzah! thank you!
💡 Motivation and Context
In rare cases, `deleteFiles(_:)` can be invoked such that, between the check for `items.isEmpty` and the call to `items.remove`, the task gets pre-empted and `items` is mutated. Once the thread resumes, the `.remove(at: 0)` operation causes a crash.
This change includes the following two parts:
`mutate` method to return a value from the mutation event.
Reported stack trace from my app, Sidecar:
#skip-changelog
💚 How did you test it?
Tested by using my app with this patch enabled and verifying that the files are deleted as expected.
📝 Checklist