
Inconsistent mutability of GCWorker across VM interface #522

Closed
wks opened this issue Jan 14, 2022 · 3 comments · Fixed by #523 or mmtk/mmtk-openjdk#134
Labels
A-interface Area: Interface/API C-bug Category: Bug

Comments


wks commented Jan 14, 2022

Currently,

  • Collection<VM>::spawn_worker_thread provides a ctx: Option<&GCWorker<VM>> parameter, but
  • memory_manager::start_worker<VM> requires a worker: &mut GCWorker<VM> parameter.

The provided parameter is immutable, but the required parameter is mutable. Since spawn_worker_thread is supposed to create a thread that eventually calls start_worker, the ctx argument, if it is Some(worker), should be passed on to start_worker. As it stands, this is a compilation error.

In the following example, I am creating the GC threads directly in Rust, without involving VM-specific C functions.

impl Collection<Ruby> for VMCollection {
    // ...
    fn spawn_worker_thread(tls: VMThread, ctx: Option<&GCWorker<Ruby>>) {
        match ctx {
            None => {
                thread::Builder::new().name("MMTk Controller Thread".to_string()).spawn(move || {
                    memory_manager::start_control_collector(&SINGLETON, VMWorkerThread(tls))
                });
            }
            Some(worker) => { // worker is a &GCWorker<Ruby>
                thread::Builder::new().name("MMTk Worker Thread".to_string()).spawn(move || {
                    memory_manager::start_worker(VMWorkerThread(tls), worker, &SINGLETON)  // ERROR: expected &mut GCWorker<Ruby>
                });            
            }
        }
    }
    // ...
}

Currently, in mmtk-openjdk, the &GCWorker is force-cast to *mut GCWorker before passing to C++:

    fn spawn_worker_thread(tls: VMThread, ctx: Option<&GCWorker<OpenJDK>>) {
        let ctx_ptr = if let Some(r) = ctx {
            r as *const GCWorker<OpenJDK> as *mut GCWorker<OpenJDK>  // casting & to *const, then to *mut
        } else {
            // ... (snippet truncated in the original)
Update

After discussion, we agree that the GC worker thread should own the GCWorker data structure.

The scheduler needs to communicate with GC workers, so shared data structures should be isolated and accessed with proper synchronisation.

The stat field is written into by the GC worker during GC, but read from by the scheduler during harness_end (in mutator).

  • In stop-the-world GC, harness_end never happens while GC is running, so there should be no data race. The atomic_refcell crate provides a safe and cheap way to handle this case.
  • In concurrent GC, there is a data race. The current stat mechanism cannot handle that case well.
@qinsoon qinsoon added A-interface Area: Interface/API C-bug Category: Bug labels Jan 14, 2022

wks commented Jan 14, 2022

I think the problem is that the scheduler does not really own the worker.

If the scheduler owned the worker, it would have unrestricted access to its fields. For example, it should be able to read and modify the worker's ordinal at will, and to use the GCWorker::copy field (the copy context) assuming it (the scheduler) is the only thread that uses it. But the scheduler never does any of these things.

I think:

  • The worker thread should own the GCWorker instance. The thread is what really accesses the private fields of GCWorker.
  • The scheduler only communicates with the GCWorker in certain ways, through WorkerGroup::workers:
    1. Spawning workers. (This is trivial)
    2. In Prepare and Release, push PrepareCollector and ReleaseCollector work packets to each worker.
      • This involves the GCWorker::local_work_bucket field.
    3. In enable_stat and statistics, enable and collect summary.
      • This involves the GCWorker::stat field.
    4. Count parked workers.
      • This involves the GCWorker::parked field.

I suggest letting the worker thread own the GCWorker. This implies changing the spawn_worker_thread parameter to Option<Box<GCWorker<VM>>> (no longer a & or &mut).

To communicate with the scheduler, isolate the local_work_bucket, stat (specifically the stat.enabled sub-field) and parked fields into a separate struct, and share that struct between the scheduler and the worker via an Arc container.

struct GcWorkerCommunicator<VM> {
    pub parked: AtomicBool,                 // Atomic.
    pub local_work_bucket: WorkBucket<VM>,  // Already well-synchronised.
    pub stat_enabled: AtomicBool,           // Atomic.
}

struct Scheduler<VM> {
    worker_communicators: Vec<Arc<GcWorkerCommunicator<VM>>>,
}

struct Worker<VM> {
    // ...
    communicator: Arc<GcWorkerCommunicator<VM>>,
    stat: WorkerLocalStat, // Needs to be synchronised.
    // ...
}

And I am afraid the implementation of SchedulerStat::merge may have a data race. The following is the current code:

    pub fn merge<C>(&mut self, stat: &WorkerLocalStat<C>) {  // NOTE: stat is a &
        // Merge work packet type ID to work packet name mapping
        for (id, name) in &stat.work_id_name_map {  // ERROR: We assume stat cannot be changed, but not really.
            self.work_id_name_map.insert(*id, *name);
        }

The stat parameter is &WorkerLocalStat<C>, immutably borrowed from a &GCWorker. But we know we have already force-cast the &GCWorker to a &mut GCWorker, which means another thread may modify it; that mutation has escaped Rust's ownership checker.


qinsoon commented Jan 14, 2022

I totally agree with what you said about the GC worker. The GC thread should own its GC worker, and separating the global part from the local part is a favorable pattern that we are already using.

As for merge(), there is probably no real data race for stop-the-world GC. The merge() function is called from harness_end() (which is application code). That basically means it is not called during a GC, and at that point all the GC workers (who hold those &mut GCWorker) should be parked.


wks commented Jan 14, 2022

The atomic_refcell crate could solve this problem. It provides a low-cost way to dynamically borrow the contents of a shared data structure, and it will panic if one thread attempts to borrow mutably while another thread still holds a borrow.

The following example shows how one thread can ask another thread to temporarily surrender its borrow so that it can access the shared data mutably. If you set the nice variable to false, the program panics.

use std::sync::mpsc::{channel, Sender, Receiver};
use std::sync::Arc;
use atomic_refcell::AtomicRefCell;
use std::thread;

struct Wallet {
    money: AtomicRefCell<i32>,
}

struct Pedestrian {
    wallet: Arc<Wallet>,
    sender: Sender<Message>,
    receiver: Receiver<Message>,
}

struct Robber {
    wallet: Arc<Wallet>,
    sender: Sender<Message>,
    receiver: Receiver<Message>,
}

#[derive(PartialEq)]
enum Message {
    SurrenderYourMoney,
    DontHurtMe,
    Leave,
}

fn main() {
    let nice = true;

    let wallet = Arc::new(Wallet {
        money: AtomicRefCell::new(0),
    });
    
    let (rtp_sender, rtp_receiver) = channel();
    let (ptr_sender, ptr_receiver) = channel();

    let pedestrian = Pedestrian {
        wallet: wallet.clone(),
        sender: ptr_sender,
        receiver: rtp_receiver,
    };
    
    let robber = Robber {
        wallet: wallet.clone(),
        sender: rtp_sender,
        receiver: ptr_receiver,
    };
    
    let pedestrian_thread = thread::spawn(move || {
        let mut handle = pedestrian.wallet.money.borrow_mut();
        *handle = 100;

        let msg = pedestrian.receiver.recv().unwrap();
        assert!(msg == Message::SurrenderYourMoney);

        if nice {
            println!("Okay, I'll give it to you!");
            drop(handle);
        } else {
            println!("No!");
        }
        
        pedestrian.sender.send(Message::DontHurtMe).unwrap();
        
        let msg = pedestrian.receiver.recv().unwrap();
        assert!(msg == Message::Leave);
    });
    
    let robber_thread = thread::spawn(move || {
        robber.sender.send(Message::SurrenderYourMoney).unwrap();
        let msg = robber.receiver.recv().unwrap();
        assert!(msg == Message::DontHurtMe);

        let mut handle = robber.wallet.money.borrow_mut();
        let my_money = *handle;
        *handle = 0;

        println!("I got ${}", my_money);

        robber.sender.send(Message::Leave).unwrap();
    });
    
    pedestrian_thread.join().unwrap();
    robber_thread.join().unwrap();
}

@wks wks changed the title spawn_worker_thread should pass &mut GCWorker Inconsistent mutability of GCWorker across VM interface Jan 17, 2022
wks added a commit to wks/mmtk-core that referenced this issue Jan 17, 2022
This commit fixes an inconsistency in the VMCollection interface where a
GCWorker is exposed to the binding via `Collection::spawn_worker_thread`
as a `&GCWorker`, but later received from the binding via
`memory_manager::start_worker` as a `&mut GCWorker`.  The root cause is
that GCWorker was wrongly owned by the scheduler.

We make each GC worker thread the owner of its `GCWorker` struct, and we
pass the `GCWorker` struct across the API boundary as an owning `Box`,
fixing the interface.

We isolate the part of `GCWorker` shared with the GC scheduler into a
`GCWorkerShared` struct, and ensure its fields are properly
synchronized.

In particular, we now use `AtomicRefCell` for `WorkerLocalStat`.
`AtomicRefCell` allows it to be borrowed mutably by the GC worker and
by the mutator (via the `harness_begin/end` methods) at different times.
However, it is a known issue that in concurrent GC, a GC may happen
while `harness_begin/end` is being called.  In that case, it will
panic.

Fixes: mmtk#522
qinsoon pushed a commit that referenced this issue Jan 28, 2022