This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Allowing seeds for randomness #138

Open
AtheMathmo opened this issue Sep 22, 2016 · 1 comment

Comments

@AtheMathmo
Owner

We frequently use randomness throughout rusty-machine and in all cases default to the rand::thread_rng (except in the new Shuffler added by #135). It is very valuable for users to be able to set a seed to ensure that their models behave the same way on different runs.

It would be nice if we could capture this with minimal effect on the current API. It is probably a requirement that the models (and other components) own their own random number generators. We would likely need to add a new generic type parameter on the models:

pub struct MLModel<T1, T2, ..., Tn, R> where R: Rng {
    // All the previous fields
    rng: R,
}
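To make the generic-parameter option concrete, here is a minimal sketch of a model that owns a statically typed generator. It is only an illustration under stated assumptions: the `Rng` trait and `SeededRng` below are small stand-ins written so the snippet compiles without the rand crate, and `Model`, `with_rng`, `init_weights`, and `weights` are hypothetical names, not part of rusty-machine.

```rust
// Stand-in for `rand::Rng` (an assumption, so the sketch needs no external crate).
pub trait Rng {
    fn next_u64(&mut self) -> u64;
}

// Small deterministic generator (SplitMix64), for illustration only.
pub struct SeededRng {
    state: u64,
}

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        SeededRng { state: seed }
    }
}

impl Rng for SeededRng {
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

// Hypothetical model that owns its generator through a type parameter.
pub struct Model<R: Rng> {
    weights: Vec<f64>,
    rng: R,
}

impl<R: Rng> Model<R> {
    pub fn with_rng(n: usize, rng: R) -> Self {
        Model { weights: vec![0.0; n], rng }
    }

    // Fill the weights with values in [0, 1) drawn from the owned generator.
    pub fn init_weights(&mut self) {
        for w in self.weights.iter_mut() {
            // Keep the top 53 bits and scale by 2^-53.
            *w = (self.rng.next_u64() >> 11) as f64 / (1u64 << 53) as f64;
        }
    }

    pub fn weights(&self) -> &[f64] {
        &self.weights
    }
}
```

With this shape, two models constructed from the same seed should initialise identically, and calls to the generator are statically dispatched. The price is the extra type parameter on every model.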

The alternative is to use a Box for this field and provide a trait which allows the user to modify the generator for the model:

pub trait Randomizable {
    fn set_rng(&mut self, rng: Box<Rng>);

    fn rng(&self) -> Box<Rng>;
}

This approach will keep the API a little cleaner. We can also control access to shared code requiring randomness by using the Randomizable trait. For example, we could wrap some of the rand_utils functions to work on any object which implements Randomizable. I'd still like the rand_utils functions to remain accessible in their current form though.
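As a rough sketch of the boxed option, the snippet below shows a model holding a boxed generator plus a shuffle helper that works on anything implementing `Randomizable`. Several things here are assumptions: `Rng` and `SeededRng` are stand-ins for the rand crate so the code compiles on its own, the accessor takes `&mut self` and returns `&mut Box<dyn Rng>` (one possible signature, not the one proposed above), the code uses the newer `dyn` syntax, and `Model` and `shuffle_with` are hypothetical names rather than existing rusty-machine items.

```rust
// Stand-in for `rand::Rng` (an assumption, so the sketch needs no external crate).
pub trait Rng {
    fn next_u64(&mut self) -> u64;
}

// Small deterministic generator (SplitMix64), for illustration only.
pub struct SeededRng {
    state: u64,
}

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        SeededRng { state: seed }
    }
}

impl Rng for SeededRng {
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

pub trait Randomizable {
    // Replace the generator owned by the implementor.
    fn set_rng(&mut self, rng: Box<dyn Rng>);

    // Mutable access to the owned generator.
    fn rng(&mut self) -> &mut Box<dyn Rng>;
}

// Hypothetical model with no extra type parameter.
pub struct Model {
    rng: Box<dyn Rng>,
}

impl Model {
    // Real code would presumably default to `rand::thread_rng()`;
    // a fixed seed stands in for it here.
    pub fn new() -> Self {
        Model { rng: Box::new(SeededRng::new(0)) }
    }
}

impl Randomizable for Model {
    fn set_rng(&mut self, rng: Box<dyn Rng>) {
        self.rng = rng;
    }

    fn rng(&mut self) -> &mut Box<dyn Rng> {
        &mut self.rng
    }
}

// Hypothetical wrapper in the spirit of a rand_utils function:
// an in-place Fisher-Yates shuffle driven by the model's generator.
pub fn shuffle_with<M: Randomizable, T>(model: &mut M, items: &mut [T]) {
    for i in (1..items.len()).rev() {
        // `% (i + 1)` has a slight modulo bias; acceptable for a sketch.
        let j = (model.rng().next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}
```

A user who wants reproducible runs would call `set_rng` with a seeded generator before training; everyone else keeps the default and never sees the generator in the type signature.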

Thoughts on this are very welcome. I suspect we will need to implement something to get a good idea of the benefits/drawbacks of any approach.

@dyule

dyule commented Feb 21, 2017

Both approaches have the downside of requiring every module that uses randomness to duplicate the code that manages the rng. I can't see any way out of that without some sort of factory for everything that uses randomness, along with lifetimes on the generated objects, which would definitely be worse.

Of the two approaches, the second seems much easier for the end user, since it allows for a default of thread_rng() if they don't want to bother setting their own.

However, unless I'm misunderstanding box ownership (I might be, it always gets me a bit confused), the second option still results in ownership of the rng. And of course, it also results in dynamic dispatch, which is slower.
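The ownership and dispatch points can be illustrated with a small sketch. The assumptions: `Rng` and `SeededRng` are stand-ins for the rand crate, and `draw_static`, `draw_dyn`, and `Holder` are hypothetical names. Passing a `Box` into `Holder::new` moves the generator into the holder, so the holder does own it, and calls through the box go through a vtable rather than being resolved at compile time.

```rust
// Stand-in for `rand::Rng` (an assumption, so the sketch needs no external crate).
pub trait Rng {
    fn next_u64(&mut self) -> u64;
}

// Small deterministic generator (SplitMix64), for illustration only.
pub struct SeededRng {
    state: u64,
}

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        SeededRng { state: seed }
    }
}

impl Rng for SeededRng {
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

// Static dispatch: compiled separately for each concrete generator type.
pub fn draw_static<R: Rng>(rng: &mut R) -> u64 {
    rng.next_u64()
}

// Dynamic dispatch: a single compiled copy, called through a vtable.
pub fn draw_dyn(rng: &mut dyn Rng) -> u64 {
    rng.next_u64()
}

// Hypothetical holder: the boxed generator is moved in and owned from then on.
pub struct Holder {
    rng: Box<dyn Rng>,
}

impl Holder {
    pub fn new(rng: Box<dyn Rng>) -> Self {
        Holder { rng }
    }

    pub fn draw(&mut self) -> u64 {
        draw_dyn(&mut *self.rng)
    }
}
```

Both routes yield the same values for the same seed; the difference is only in how the call is resolved, so whether the overhead matters would need measuring on a real model.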

So, I think the choice boils down to either slightly more complexity for the user, or slightly longer runtimes.

I implemented a trial run on LDA, and I've discovered it causes some strange issues with mutability. That is, any method which uses randomness must be mutable, since all of the methods for generating randomness are mutable. I'm not sure this makes sense, so I've gone around it using RefCells, but I don't know how great that is either. Anyways, the relevant pieces are here:

use std::cell::{RefCell, RefMut};
use rand::{thread_rng, Rng};

pub trait Randomizable {
    fn set_rng(&mut self, rng: Box<Rng>);

    fn rng<'a>(&'a self) -> RefMut<'a, Box<Rng>>;
}

pub struct LDAFitter {
    iterations: usize,
    topic_count: usize,
    alpha: f64,
    beta: f64,
    rng: RefCell<Box<Rng>>
}

impl Default for LDAFitter {
    fn default() -> LDAFitter {
        LDAFitter {
            iterations: 30,
            topic_count: 10,
            alpha: 0.1,
            beta: 0.1,
            rng: RefCell::new(Box::new(thread_rng()))
        }
    }
}

impl Randomizable for LDAFitter {
    fn set_rng(&mut self, rng: Box<Rng>) {
        self.rng = RefCell::new(rng);
    }

    fn rng<'a>(&'a self) -> RefMut<'a, Box<Rng>> {
        self.rng.borrow_mut()
    }
}

So, this would have to be done for every module that depends on randomness. Alternatively, we could avoid the need for RefCells if the trait definitions were changed so that fit takes &mut self.
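For comparison, here is a minimal sketch of the RefCell-free alternative, assuming the fitting method is allowed to take `&mut self`. The `Rng` trait and `SeededRng` are stand-ins for the rand crate, and this `LdaFitter` with its `fit` method is a hypothetical, heavily simplified version that only draws a random topic for each token; it is not the real LDA implementation.

```rust
// Stand-in for `rand::Rng` (an assumption, so the sketch needs no external crate).
pub trait Rng {
    fn next_u64(&mut self) -> u64;
}

// Small deterministic generator (SplitMix64), for illustration only.
pub struct SeededRng {
    state: u64,
}

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        SeededRng { state: seed }
    }
}

impl Rng for SeededRng {
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

// Hypothetical fitter that stores the generator directly, without RefCell.
pub struct LdaFitter {
    topic_count: usize,
    rng: Box<dyn Rng>,
}

impl LdaFitter {
    pub fn new(topic_count: usize, rng: Box<dyn Rng>) -> Self {
        assert!(topic_count > 0, "topic_count must be positive");
        LdaFitter { topic_count, rng }
    }

    // Simplified stand-in for `fit`: assign a random topic to each token.
    // Taking `&mut self` lets the method advance the generator directly.
    pub fn fit(&mut self, token_count: usize) -> Vec<usize> {
        let k = self.topic_count as u64;
        (0..token_count)
            .map(|_| (self.rng.next_u64() % k) as usize)
            .collect()
    }
}
```

The trade-off is visible in the signature: callers need a mutable fitter, but there is no runtime borrow check and no risk of a double-borrow panic.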
