Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] sklearn-like Class-based C++ API for encapsulating state #456

Closed
cjnolet opened this issue Apr 8, 2019 · 7 comments
Closed

[FEA] sklearn-like Class-based C++ API for encapsulating state #456

cjnolet opened this issue Apr 8, 2019 · 7 comments
Labels
feature request New feature or request proposal Change current process or code

Comments

@cjnolet
Copy link
Member

cjnolet commented Apr 8, 2019

I've had offline discussions with a few individuals about this. One of the reasons the current C++ API does not use classes for encapsulating internal state is because we would like to support a C API. The ability to someday support Fortran bindings was another reason given.

Currently, the C++ API is composed of a bunch of stateless C++ functions, which require that the user maintain any necessary state.

As we're using C++ for our API, we can, at least at some level, benefit from the ability to define classes for encapsulating this state. We can even do this whilst keeping the current stateless C++ functions, by making the calls to those functions in higher-level classes, while keeping track of state for the users.

A good example of what I'm proposing is in the current UMAP implementation (which, mind you, has not yet been refactored to use the cumlHandle API). The current UMAP implementation uses a helper class with stateless functions to both fit() and transform() data.

These helper functions, named _fit() and _transform(), rely on an instance of kNN, as well as an object I've created to maintain the parameter settings. On top of these helper functions, I have built a UMAP class, which provides fit() and transform() functions. This class holds the kNN as internal state so that we can provide an API which more closely mimics that of sklearn.

By building this simple class on top of stateless helper functions, I have been able to continue providing those stateless helper functions as part of the API, while also providing an option for users to work with a much more simplified C++-style class-based design, which is also familiar since it mimics sklearn.

The Cython layer is also able to make use of the c++ class directly, as can be seen in the Cython definitions in the codebase.

Once the cumlHandle is supported in UMAP, it can be simply passed into the constructor of the UMAP class and held as an instance variable. This also makes it more simple for users, as the argument lists to the class functions have become much smaller.

@cjnolet cjnolet added feature request New feature or request ? - Needs Triage Need team to review and classify proposal Change current process or code and removed ? - Needs Triage Need team to review and classify labels Apr 8, 2019
@cjnolet cjnolet changed the title [DISCUSS] [FEA] Class-based C++ API for encapsulating state [DISCUSS] [FEA] sklearn-like Class-based C++ API for encapsulating state Apr 8, 2019
@cjnolet
Copy link
Member Author

cjnolet commented Apr 9, 2019

Tagging @teju85 @jirikraus @oyilmaz-nvidia

@jirikraus
Copy link
Contributor

@cjnolet Can you provide some illustrative code example (or point to existing code) for the concepts your are proposing? I would also like to see an example how to provide real C-Bindings on top of what you are proposing.

@angererc
Copy link

angererc commented Apr 9, 2019

The agreement was to keep the cuML interface "flat" (stateless functions) to make integrating cuML into various systems and languages easier. You mentioned Fortran, but there are many other cool things people are doing with cuML already. I believe this is a crucial property for adoption for many use-cases we are not even thinking of yet.

The idea for providing real C-Bindings was to have a separate repo (or at least folder) that wraps the flat C++ API into a C interface. This way, maintaining the C-Bindings is not the responsibility of core cuML, which makes sense because it's not vital for cuML to have C-Bindings, but C-Bindings are a convenience add-on we can provide.

I think we should follow the exact same approach for a stateful C++ interface, which adds no functionality but only convenience: the canonical cuML API is and remains stateless functions, C++ convenience classes can be added and maintained on the side. As you have described your code, @cjnolet, you already have the stateless functions on the bottom of the call stack and just add a more convenience-oriented C++ wrapper around that, so this should be fairly easy to do. I would not like to mix those two flavors of API together if we can keep them separated.

@teju85
Copy link
Member

teju85 commented Apr 9, 2019

@angererc the idea is to provide C-bindings through a separate .so file. But for simplicity of management, it is better to keep that as part of cuML repo itself. For example, Vinay's ongoing dbscan+cumlHandle PR #394 does something similar already, although we don't the C's .so file, yet.

@cjnolet overall this is a nice proposal.
Does this simplify cython wrapper? IMO, it's not a huge gain.
What benefit then stateful C++ design provides? Besides the obvious resource management and keeping our design simple, it vastly reduces the learning curve for our users! (think scikit-learn, but in C++)
Should we provide such a binding? Definitely yes!
When should we do it? IMO, we must have a solid customer use-case backing this up, to prioritize this one.
Should this be part of cuML release? Agree with @angererc . I too would vote for no. Having both stateful and stateless C++ APIs in the same library is more confusing than it being helpful.

@cjnolet
Copy link
Member Author

cjnolet commented Apr 9, 2019

@jirikraus:

I would also like to see an example how to provide real C-Bindings on top of what you are proposing.

I am basically proposing to build a C++ class-based API on top of the flat stateless c++ functions API that we already have. This would allow the stateless c++ API to feed both the C api, as has been agreed to already, as well as a more user-friendly sklearn-like API in the C++ layer.

@angererc:

The agreement was to keep the cuML interface "flat" (stateless functions) to make integrating cuML into various systems and languages easier.

As @teju85 pointed out, my proposal is basically to build the c++ class-based API on top of the stateless functions, in the same manner the c-based API has been proposed.

You mentioned Fortran, but there are many other cool things people are doing with cuML already. I believe this is a crucial property for adoption for many use-cases we are not even thinking of yet.

@jirikraus brought the Fortran feature to my attention. I'm excited to hear about other cool things people are doing with our current API! Can you elaborate?

I would not like to mix those two flavors of API together if we can keep them separated.

I completely agree, and this was actually the intention of this proposal- I believe both the c++ stateless functions API and a more sklearn-like API should exist as separate directories within the cuml repository.

IMO, even something as simple as C++ namespacing would serve this purpose very well, similar to what we are doing for ml-prims MLCommon and cuML ML.

@teju85:

Does this simplify cython wrapper? IMO, it's not a huge gain.

I've found that this depends on the amount of state that is needed to be kept in order to interact with the stateless functions in the C++ API. UMAP is a great example of this. In fact, since I'm using an instance of a kNN class there, the simplification is magnified even more so- if I was not able to use a kNN class, I would need to manage all of the state for the kNN on top of the state for the UMAP. This would all need to be done within the cython wrapper itself.

For this reason, I would say that using C++ classes does, in fact, drastically simplify the amount of state that needs to be managed by the user (or within cython). As we are building out our implementations to reuse as much as possible between the algorithms layer and the primitives layer, this problem is going to become more and more relevant.

When should we do it? IMO, we must have a solid customer use-case backing this up, to prioritize this one.

I've been looking over merge requests for new algorithms recently and finding that several developers are using classes by default. For instance, if we were to merge algorithms as implemented currently, we would have sklearn-like APIs for kNN, UMAP, SVM, Random Forest, Decision Trees, and KMeans!

Seeing that these other developers were already building out a class-based c++ API is what actually prompted this proposal ticket.

Should this be part of cuML release? Agree with @angererc . I too would vote for no. Having both stateful and stateless C++ APIs in the same library is more confusing than it being helpful.

I completely agree that keeping "core-cuml" (cuML folder in the codebase) separate from this more sklearn-like API is probably a good idea. Though, my vote is still for keeping them in different directories within the same codebase, and using namespaces to separate them.

@jirikraus
Copy link
Contributor

Thanks for the clarification @cjnolet . I have no objections adding a convince API to cuML as long as the maintain the possibility to provide C bindings as well.

@teju85
Copy link
Member

teju85 commented Apr 9, 2019

@cjnolet umap depending on knn and thereby reducing the amount of state needed is indeed a good example. I missed this one. Thanks for pointing it out.

@cjnolet cjnolet changed the title [DISCUSS] [FEA] sklearn-like Class-based C++ API for encapsulating state [FEA] sklearn-like Class-based C++ API for encapsulating state Apr 9, 2019
@cjnolet cjnolet closed this as completed Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request proposal Change current process or code
Projects
None yet
Development

No branches or pull requests

4 participants