
Inconsistent mutability of GCWorker across VM interface #522

Closed
wks opened this issue Jan 14, 2022 · 3 comments · Fixed by #523 or mmtk/mmtk-openjdk#134
Labels
A-interface Area: Interface/API C-bug Category: Bug

Comments


wks commented Jan 14, 2022

Currently,

  • Collection<VM>::spawn_worker_thread provides a ctx: Option<&GCWorker<VM>> parameter, but
  • memory_manager::start_worker<VM> requires a worker: &mut GCWorker<VM> parameter.

The provided parameter is immutable, but the required parameter is mutable. Since spawn_worker_thread is supposed to create a thread that eventually calls start_worker, the ctx argument, if it is Some(worker), should be passed on to start_worker. As it stands, this is a compilation error.

In the following example, I am creating the GC threads directly in Rust, without involving VM-specific C functions.

impl Collection<Ruby> for VMCollection {
    // ...
    fn spawn_worker_thread(tls: VMThread, ctx: Option<&GCWorker<Ruby>>) {
        match ctx {
            None => {
                thread::Builder::new().name("MMTk Controller Thread".to_string()).spawn(move || {
                    memory_manager::start_control_collector(&SINGLETON, VMWorkerThread(tls))
                });
            }
            Some(worker) => { // worker is a &GCWorker<Ruby>
                thread::Builder::new().name("MMTk Worker Thread".to_string()).spawn(move || {
                    memory_manager::start_worker(VMWorkerThread(tls), worker, &SINGLETON)  // ERROR: expected &mut GCWorker<Ruby>
                });            
            }
        }
    }
    // ...
}

Currently, in mmtk-openjdk, the &GCWorker is force-cast to *mut GCWorker before passing to C++:

    fn spawn_worker_thread(tls: VMThread, ctx: Option<&GCWorker<OpenJDK>>) {
        let ctx_ptr = if let Some(r) = ctx {
            r as *const GCWorker<OpenJDK> as *mut GCWorker<OpenJDK>  // casting & to *const, then to *mut
        } else {
            // ... (snippet truncated in the original)
Update

After discussion, we agree that the GC worker thread should own the GCWorker data structure.

The scheduler needs to communicate with GC workers, so shared data structures should be isolated and accessed with proper synchronisation.

The stat field is written into by the GC worker during GC, but read from by the scheduler during harness_end (in mutator).

  • In stop-the-world GC, harness_end never happens while GC is running, so there should be no data race. The atomic_refcell crate provides a safe and cheap way to handle this case.
  • In concurrent GC, there is a data race. The current stat mechanism cannot handle that case well.
@qinsoon qinsoon added A-interface Area: Interface/API C-bug Category: Bug labels Jan 14, 2022

wks commented Jan 14, 2022

I think the problem is that the scheduler does not really own the worker.

If the scheduler owned the worker, it would have unrestricted access to its fields. For example, it should be able to read and modify the worker's ordinal at will, and to use the GCWorker::copy field (the copy context) assuming it (the scheduler) is the only thread that uses it. But the scheduler never does any of these things.

I think:

  • The worker thread should own the GCWorker instance. The thread is what really accesses the private fields of GCWorker.
  • The scheduler only communicates with the GCWorker in certain ways, through WorkerGroup::workers:
    1. Spawning workers. (This is trivial)
    2. In Prepare and Release, push PrepareCollector and ReleaseCollector work packets to each worker.
      • This involves the GCWorker::local_work_bucket field.
    3. In enable_stat and statistics, enable and collect summary.
      • This involves the GCWorker::stat field.
    4. Count parked workers.
      • This involves the GCWorker::parked field.

I suggest letting the worker thread own the GCWorker. This implies changing the spawn_worker_thread parameter to Option<Box<GCWorker<VM>>> (no longer a & or &mut).

To communicate with the scheduler, isolate the local_work_bucket, stat (specifically the stat.enabled sub-field) and parked fields into a separate struct, and share that struct between the scheduler and the worker via an Arc container.

struct GcWorkerCommunicator<VM> {
    pub parked: AtomicBool,                 // Atomic.
    pub local_work_bucket: WorkBucket<VM>,  // Already well-synchronised.
    pub stat_enabled: AtomicBool,           // Atomic.
}

struct Scheduler<VM> {
    worker_communicators: Vec<Arc<GcWorkerCommunicator<VM>>>,
}

struct Worker<VM> {
    // ...
    communicator: Arc<GcWorkerCommunicator<VM>>,
    stat: WorkerLocalStat, // Needs to be synchronised.
    // ...
}

And I am afraid the implementation of SchedulerStat::merge may have a data race. The following is the current code:

    pub fn merge<C>(&mut self, stat: &WorkerLocalStat<C>) {  // NOTE: stat is a &
        // Merge work packet type ID to work packet name mapping
        for (id, name) in &stat.work_id_name_map {  // ERROR: We assume stat cannot be changed, but not really.
            self.work_id_name_map.insert(*id, *name);
        }

The stat parameter is &WorkerLocalStat<C>, immutably borrowed from a &GCWorker. But we know we have already force-cast the &GCWorker to a &mut GCWorker, which means another thread may modify it; that mutation has escaped Rust's ownership checker.


qinsoon commented Jan 14, 2022

I totally agree with what you said about the GC worker. The GC thread should own its GC worker, and separating the global part from the local part is a favorable pattern that we are already using.

As for merge(), there is probably no real data race for stop-the-world GC. The merge() function is called from harness_end() (which is application code). That basically means it is not called during a GC, and at that point all the GC workers (who hold those &mut GCWorker) should be parked.


wks commented Jan 14, 2022

The atomic_refcell crate could solve this problem. It provides a low-cost way to dynamically borrow the contents of a shared data structure, and it will panic if one thread attempts to borrow mutably while another thread still holds a borrow.

The following example shows how one thread can ask another thread to temporarily surrender its borrow so that it can access the shared data mutably. If you set the nice variable to false, the program panics.

use std::sync::mpsc::{channel, Sender, Receiver};
use std::sync::Arc;
use atomic_refcell::AtomicRefCell;
use std::thread;

struct Wallet {
    money: AtomicRefCell<i32>,
}

struct Pedestrian {
    wallet: Arc<Wallet>,
    sender: Sender<Message>,
    receiver: Receiver<Message>,
}

struct Robber {
    wallet: Arc<Wallet>,
    sender: Sender<Message>,
    receiver: Receiver<Message>,
}

#[derive(PartialEq)]
enum Message {
    SurrenderYourMoney,
    DontHurtMe,
    Leave,
}

fn main() {
    let nice = true;

    let wallet = Arc::new(Wallet {
        money: AtomicRefCell::new(0),
    });
    
    let (rtp_sender, rtp_receiver) = channel();
    let (ptr_sender, ptr_receiver) = channel();

    let pedestrian = Pedestrian {
        wallet: wallet.clone(),
        sender: ptr_sender,
        receiver: rtp_receiver,
    };
    
    let robber = Robber {
        wallet: wallet.clone(),
        sender: rtp_sender,
        receiver: ptr_receiver,
    };
    
    let pedestrian_thread = thread::spawn(move || {
        let mut handle = pedestrian.wallet.money.borrow_mut();
        *handle = 100;

        let msg = pedestrian.receiver.recv().unwrap();
        assert!(msg == Message::SurrenderYourMoney);

        if nice {
            println!("Okay, I'll give it to you!");
            drop(handle);
        } else {
            println!("No!");
        }
        
        pedestrian.sender.send(Message::DontHurtMe).unwrap();
        
        let msg = pedestrian.receiver.recv().unwrap();
        assert!(msg == Message::Leave);
    });
    
    let robber_thread = thread::spawn(move || {
        robber.sender.send(Message::SurrenderYourMoney).unwrap();
        let msg = robber.receiver.recv().unwrap();
        assert!(msg == Message::DontHurtMe);

        let mut handle = robber.wallet.money.borrow_mut();
        let my_money = *handle;
        *handle = 0;

        println!("I got ${}", my_money);

        robber.sender.send(Message::Leave).unwrap();
    });
    
    pedestrian_thread.join().unwrap();
    robber_thread.join().unwrap();
}

@wks wks changed the title spawn_worker_thread should pass &mut GCWorker Inconsistent mutability of GCWorker across VM interface Jan 17, 2022
wks added a commit to wks/mmtk-core that referenced this issue Jan 17, 2022
This commit fixes an inconsistency in the VMCollection interface where a
GCWorker is exposed to the binding via `Collection::spawn_worker_thread`
as a `&GCWorker`, but later received from the binding via
`memory_manager::start_worker` as a `&mut GCWorker`.  The root cause is
that GCWorker was wrongly owned by the scheduler.

We make each GC worker thread the owner of its `GCWorker` struct, and we
pass the `GCWorker` struct across the API boundary as an owning `Box`,
fixing the interface.

We isolate the part of `GCWorker` shared with the GC scheduler into a
`GCWorkerShared` struct, and ensure its fields are properly
synchronized.

In particular, we now use `AtomicRefCell` for `WorkerLocalStat`.
`AtomicRefCell` allows it to be borrowed mutably by the GC worker and
by the mutator (via the `harness_begin/end` methods) at different times.
However, it is a known issue that in concurrent GC, a GC may happen
while `harness_begin/end` is being called.  In that case, it will
panic.

Fixes: mmtk#522
qinsoon pushed a commit that referenced this issue Jan 28, 2022