Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] PMM Deadlock #343

Open
jcjgraf opened this issue Nov 29, 2024 · 4 comments · May be fixed by #345
Open

[BUG] PMM Deadlock #343

jcjgraf opened this issue Nov 29, 2024 · 4 comments · May be fixed by #345
Assignees
Labels
bug Something isn't working

Comments

@jcjgraf
Copy link
Contributor

jcjgraf commented Nov 29, 2024

Describe the bug
There is a bug in the PMM where it locks-up (deadlock?). It happens when the allocator runs out of 4k frames and splits a higher-order frame. Maybe no 4k page was reserved for the new signling page?

To Reproduce

  1. The testcase test_kernel_task_func2 demonstrates the issue (Unittest That Stress-Tests the VMM and PMM  #341 )
  2. Set KT2_ROUNDS to 1 and KT2_CHUNKS to 1000 to allocate enough 4k frames such that they get depleted
  3. The testcase will either result in an assertion in the pmm getting triggered or by locking-up
  4. (optional) Fix the srand seed to one where you repeatedly observe the deadlock

Expected behavior
The testcase just passes. Throwing more memory at the issue, by increasing .qemu_config:QEMU_RAM, has seemingly no impact.

Props to Sandro Rüegge (@sparchatus) for helping me debugging this.

@jcjgraf jcjgraf added the bug Something isn't working label Nov 29, 2024
@jcjgraf jcjgraf changed the title [BUG] [BUG] PMM Deadlock Nov 29, 2024
@sparchatus
Copy link
Contributor

sparchatus commented Dec 2, 2024

Analysis (take with grain of salt, am not experienced with kernels)
The test calls vmap_kern_4k which takes vmap_lock lock (in vmap_4k). The paging code sometimes requires another call to vmap_kern_4k (when running out of frame array entries) through get_free_frame which is called in get_pgentry_mfn. This appears to cause the deadlock observed here.

If ktf runs in parallel on multiple cores there is another potential deadlock scenario:
One core calls vmap_kern_4k while another core calls get_free_frame. In the unlikely event that we also run out of array frames at the same time we end up with one core having the paging lock and wanting the frame lock while the second core is in the opposite situation.

Potential Solutions?

  • Release the paging lock when calling get_free_frame. This in turn might necessitate re-walking the page table. (same for get_free_frame which would need to free the pmm lock before mapping more frame array space)
  • Another approach could be to allocate the potentially required frames before taking the paging lock (or even having a set of "ready" frames in the paging which we can refill afterwards)

I can make a pull request for either solution. Since I have no experience in kernel development I'd appreciate some feedback on whether these are viable solutions before making the pull requests.

@wipawel
Copy link
Contributor

wipawel commented Dec 2, 2024

The analysis reasoning is correct I think. The PMM implementation tries to take care of this already (see the usage of MIN_FREE_FRAMES_THRESHOLD and MAX_FREE_FRAMES_THRESHOLD at

ktf/mm/pmm.c

Line 37 in e1658d8

#define MIN_FREE_FRAMES_THRESHOLD 2
). Apparently, it fails to do so correctly.

Hence, the second approach of making sure there is always available frame for the frames array is the way to go.

@sparchatus
Copy link
Contributor

The reservation looks good to me. The problem seems to be with mapping the newly acquired frame to use it as a frames array. This mapping might require another call to get_free_frame. There are two issues here:

  1. get_free_frame might have been called by paging in the first place so we cannot just do another call to paging due to the lock.
  2. The get_free_frame needs to realize it is in array acquisition stage and give out reserved frames without trying to refill again. However, to know this it needs to take the lock which is already taken.

Intuitively I'd try to solve both these issues using re-entrant locks but they don't exist yet in ktf and bring their own load of complexity...

@sparchatus
Copy link
Contributor

Alright, new suggestion :D

Paging wants to avoid pmm 4k frame refill (and array allocation) so it gets its own get_free_frame_norefill method to avoid this case. Further, there is a way for pmm to tell the paging that the paging request is because refill is currently running and paging should use get_free_frame_norefill_nolock instead to avoid a deadlock (essentially a cheap reentry). This should ensure consistency and avoid these pesky deadlocks... We still need to make sure that the refill does occur after or before paging.

@sparchatus sparchatus linked a pull request Dec 2, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants