Replies: 5 comments 2 replies
-
> What are the advantages of using custos' device?

```rust
let device = CPU::new();

let a = Matrix::from((&device, 2, 3, [1., 2., 3., 4., 5., 6.]));
let b = Matrix::from((&device, 2, 3, [1., 2., 3., 4., 5., 6.]));
let c_cpu = a + b;

let device = CUDA::new(0)?;

let a = Matrix::from((&device, 2, 3, [1., 2., 3., 4., 5., 6.]));
let b = Matrix::from((&device, 2, 3, [1., 2., 3., 4., 5., 6.]));
let c_cuda = a + b;

assert_eq!(c_cpu.read(), c_cuda.read());
```

The same operations run on the host CPU and on a CUDA device; only the device construction changes. It should also be possible to create a struct that returns a device depending on the compilation target and the activated feature set.
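A minimal sketch of that idea using cargo features, assuming a downstream `cuda` feature and that the `CPU` and `CUDA` types from the example above are exported at the crate root; the `default_device` function name is made up for illustration, it is not a custos API:

```rust
// Hypothetical feature-gated device selector: with the `cuda` feature enabled
// the CUDA device is returned, otherwise the host CPU device is used.
// `CPU::new()` and `CUDA::new(0)` are the constructors used in the example above.
#[cfg(feature = "cuda")]
pub fn default_device() -> custos::CUDA {
    custos::CUDA::new(0).expect("failed to initialize CUDA device 0")
}

#[cfg(not(feature = "cuda"))]
pub fn default_device() -> custos::CPU {
    custos::CPU::new()
}
```

Because the returned type changes with the feature set, downstream code that wants to stay backend-agnostic would keep its own types generic over `D: Device`, as the `Buffer` definition below does.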
```rust
let buf = buf![2, 3, 6].to_gpu();
```

> How can we use the custos `Buffer` for arrays?

```rust
pub struct Buffer<'a, T = f32, D: Device = CPU, const N: usize = 0> {
    pub ptr: D::Ptr<T, N>,
    pub len: usize,
    pub device: Option<&'a D>,
    pub flag: BufFlag,
    pub node: Node,
}
```

The const generic optionally enables a […]
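As a side note on those type parameters: assuming `Buffer`, `CPU` and the `Device` trait are all exported from the custos crate root, the defaults let downstream code either spell everything out or omit it, while generic code only needs the public fields shown in the definition above. A rough sketch:

```rust
use custos::{Buffer, CPU, Device};

// Thanks to the defaults (`T = f32`, `D = CPU`, `N = 0`), these two aliases
// name the same type: an f32 buffer on the host CPU.
type DefaultBuf<'a> = Buffer<'a>;
type ExplicitBuf<'a> = Buffer<'a, f32, CPU, 0>;

// Backend-agnostic code can stay generic over `D: Device` and rely only on
// the public fields from the struct definition above (here: `len`).
fn element_count<T, D: Device>(buf: &Buffer<'_, T, D>) -> usize {
    buf.len
}
```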
-
Thanks, this is really helpful. I see a lot of interesting principles. We currently have a way of providing bindings for external libraries; for example, the trait […]:

```rust
let buf = buf![2, 3, 6];
let el = buf.get(0);
println!("{}", el);
// >>> 2

let a = buf!([1, 2, 3]);
let v: Vec<i32> = a.iterator(0).map(|&v| v).collect();
assert_eq!(v, vec!(1, 2, 3));

let mut a = buf!([1, 2, 3]);
a.iterator_mut(0).for_each(|v| *v = 1);
assert_eq!(a, buf!([1, 1, 1]));
```

This may be possible with something like:

```rust
impl<T: Debug + Display + Copy + Sized, D: Device> Array<T, (usize, usize)>
    for Buffer<T, D>
{
    fn get(&self, pos: (usize, usize)) -> &T {
        // ... return element at position ...
    }

    fn shape(&self) -> (usize, usize) {
        // ... return (rows, columns) shape ...
    }

    fn is_empty(&self) -> bool {
        // ...
    }

    fn iterator<'b>(&'b self, axis: u8) -> Box<dyn Iterator<Item = &'b T> + 'b> {
        assert!(
            axis == 1 || axis == 0,
            "For two dimensional array `axis` should be either 0 or 1"
        );
        match axis {
            0 => Box::new(self.iter()),
            _ => Box::new(
                (0..self.ncols()).flat_map(move |c| (0..self.nrows()).map(move |r| &self[[r, c]])),
            ),
        }
    }
}
```

This is just a generic (probably not working) example; the same can be done for […]. The part I cannot see at the moment is how to instantiate a `Device` "automatically" without the end user needing to instantiate it explicitly.
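If such an impl existed, generic smartcore-style code written against that trait would not need to know where the data lives. A small self-contained illustration (the trait declaration here is a cut-down stand-in matching the sketch above, not the actual smartcore definition):

```rust
// Minimal stand-in for the trait sketched above; only the two methods used
// below are declared here.
trait Array<T, S> {
    fn get(&self, pos: S) -> &T;
    fn shape(&self) -> S;
}

// Generic, backend-agnostic code written against the trait: it does not care
// whether the data is backed by a Vec, an ndarray or a custos Buffer.
fn sum_all<T, A>(a: &A) -> T
where
    T: Copy + std::iter::Sum<T>,
    A: Array<T, (usize, usize)>,
{
    let (rows, cols) = a.shape();
    (0..rows)
        .flat_map(move |r| (0..cols).map(move |c| *a.get((r, c))))
        .sum()
}
```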
Would it also make sense, or provide advantages, to have a "scoped" device? Like:

```rust
/// this is a high-level public API called by a client or end user
fn dot(&self, x: Buffer) { // assuming self is a Buffer
    let device = CPU::new_with((self, x)); // "load" a device with existing buffers
    device.dot().apply() // this returns the result of a dot product between the buffers in the device
    // device is dropped
}
```

This may look trivial for a dot product, but maybe for operations on trees it could make sense to load a tree into a device and use that device only for that tree? We really want to develop things consistently, from the public interface down to the allocation level; sometimes there is tension between ergonomics and "programmability", so it is better to think things through ahead. Please follow up with your ideas.
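One way to make the "scoped device" idea concrete is a helper that owns the device only for the duration of a closure. This is a hypothetical sketch (the `with_cpu` helper is not an existing custos or smartcore API; the `Matrix` constructor and `read()` follow the examples earlier in the thread):

```rust
use custos::CPU;
use custos_math::Matrix;

// Hypothetical scoped-device helper: the device lives only as long as the closure.
fn with_cpu<R>(f: impl FnOnce(&CPU) -> R) -> R {
    let device = CPU::new();
    f(&device)
    // `device` is dropped here
}

fn main() {
    with_cpu(|device| {
        let a = Matrix::from((device, 2, 3, [1., 2., 3., 4., 5., 6.]));
        let b = Matrix::from((device, 2, 3, [1., 2., 3., 4., 5., 6.]));
        let c = a + b;
        println!("{:?}", c.read());
    });
}
```

Since `Buffer` holds a `device: Option<&'a D>` reference, anything allocated inside the scope cannot outlive the device, so a scoped API would have to copy results back to plain host data before the scope ends.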
-
> The part I cannot see at the moment is how to instantiate a `Device` "automatically" without the end user needing to instantiate it explicitly

This should work too; however, this probably isn't what you want:

```rust
pub struct SomeMLAlgo<T, D: Device = custos::CPU> {
    device: D,
    a: Matrix<T, D>, // or whatever
    ...
}

impl<T, D: Device> SomeMLAlgo<T, D> {
    pub fn fit(a: &[T], ...) -> Self {
        let device = D::new();
        let a = Matrix::from((&device, ..., a));
        Self {
            device,
            a
        }
    }
}
```

What is the cost of instantiating a device?

> ```rust
> /// this is a high-level public API called by a client or end user
> fn dot(&self, x: Buffer) { // assuming self is a Buffer
>     let device = CPU::new_with((self, x)); // "load" a device with existing buffers
>     device.dot().apply() // this returns the result of a dot product between the buffers in the device
>     // device is dropped
> }
> ```

In this case I would probably do something like this:

```rust
fn dot(&self, x: Buffer) {
    self.device().dot(self, x)
}
```
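For the `self.device()` call to work, the buffer needs to hand back the device it was allocated on. Given the public `device: Option<&'a D>` field in the `Buffer` definition quoted earlier, a minimal sketch could look like this (the `device_of` accessor and its panic behaviour are assumptions for illustration, not necessarily the custos API):

```rust
use custos::{Buffer, Device};

// Hypothetical accessor: return the device this buffer was allocated on,
// reading the public `device: Option<&'a D>` field from the struct above.
fn device_of<'a, T, D: Device>(buf: &Buffer<'a, T, D>) -> &'a D {
    buf.device.expect("buffer was not created on a device")
}

// A call like `self.device().dot(self, x)` would then reduce to something
// along the lines of `device_of(self).dot(self, x)`, with `dot` implemented
// once per device backend.
```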
-
About GPUs: are you interested in doing something with Triton? Do you see something like that as a good fit for a Rust integration? If yes, how would you integrate kernels written in Triton into a Rust library?

cc @morenol @Steboss89: please contribute to the conversation if you have any ideas.
-
@Mec-iS
-
`custos` and `custos-math` are two interesting device-centered libraries:

- `custos`: "A minimal OpenCL, CUDA and host CPU array manipulation engine"
- `custos-math`: "provides CUDA, OpenCL and CPU based matrix operations using custos"

@elftausend is the creator of `custos`.
It would be interesting to use custos functionalities; here are some questions to start exploring this possibility:

- What are the advantages of using custos' device (`let device = CPU::new();`) in allocation compared to usual allocation? See example […].
- How can we use the custos `Buffer` for arrays? What would be the advantages?
- Can we provide bindings for `custos` as for the ones for `ndarrays`?
- Can we implement `linalg` using `custos-math`? How can we provide these functionalities without impacting binary size (minimise dependencies)?
- Is there anything in the `custos` codebase to replicate in smartcore?