Patch to version 0.4.0 #257
Merged: 29 commits merged into development from patch-0.3.2 on Mar 4, 2024

Conversation

@Mec-iS (Collaborator) commented Apr 3, 2023

Collecting changes for version 0.4.0

New expected behaviour

see #256

Change logs

  • Add Logistic Regression random test
  • Add Result type to DenseMatrix constructor by @EdmundsEcho
  • BREAKING CHANGE on the DenseMatrix constructor: the constructor now returns a Result (see the sketch below)
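A minimal sketch of what the breaking change means for callers; the constructor name and the Failed error type are assumed from this thread, not verified against the final release:

use smartcore::error::Failed;
use smartcore::linalg::basic::matrix::DenseMatrix;

// The constructor now returns a Result, so callers either propagate the error with `?` ...
fn load_features() -> Result<DenseMatrix<f64>, Failed> {
    let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]])?;
    Ok(x)
}

// ... or opt into a panic explicitly at the call site:
// let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();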

@codecov-commenter commented Apr 3, 2023

Codecov Report

Attention: Patch coverage is 54.16667%, with 33 lines in your changes missing coverage. Please review.

Project coverage is 45.49%. Comparing base (80a93c1) to head (3fcd722).

Files                                 Patch %   Lines
src/linalg/basic/matrix.rs            57.35%    29 Missing ⚠️
src/algorithm/neighbour/bbd_tree.rs   0.00%     2 Missing ⚠️
src/error/mod.rs                      0.00%     2 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff               @@
##           development     #257      +/-   ##
===============================================
- Coverage        46.44%   45.49%   -0.96%     
===============================================
  Files               85       85              
  Lines             7247     7254       +7     
===============================================
- Hits              3366     3300      -66     
- Misses            3881     3954      +73     


@morenol (Collaborator) previously approved these changes Apr 3, 2023 and left a comment:

LGTM

@Mec-iS changed the title from "Patch 0.3.2" to "Patch 0.4.0" on Apr 5, 2023
@Mec-iS (Collaborator, Author) commented Apr 5, 2023

@morenol we need to fix Wasm tests

src/linalg/basic/matrix.rs: outdated review comments (resolved)
@Mec-iS (Collaborator, Author) commented Apr 6, 2023

Releasing a new minor version for these changes alone is not completely justified. We need to add some more implementations to justify a minor release.

@Mec-iS (Collaborator, Author) commented Apr 6, 2023

Also: in accordance with the library's principles, it is probably better to panic early at initialization than to return a Result. smartcore is meant to be as stable as possible after compilation; using a Result may just move problems into runtime. So there is the option of simply allowing a controlled panic!(..) instead of returning an error.
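To illustrate, a controlled panic at construction time would look roughly like this (the function and its shape check are illustrative, not the actual constructor):

// Validate eagerly and abort with a clear message instead of threading a
// Result (and the burden of handling it) through the whole API.
fn new_dense_matrix(values: &[f64], nrows: usize, ncols: usize) -> Vec<f64> {
    if values.len() != nrows * ncols {
        panic!(
            "DenseMatrix: expected {} values for a {}x{} matrix, got {}",
            nrows * ncols, nrows, ncols, values.len()
        );
    }
    values.to_vec() // hand the validated data to the real constructor
}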

@EdmundsEcho commented Apr 6, 2023

To which function are you referring?

I get that if a config or something early is flawed, the app should exit with useful feedback. Elsewhere, the user can "opt-in" to a panic by using "unwrap".

Here is a link that might be useful: https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html. I see a line between "dangerous/harmful", which warrants a panic, and everything else, which warrants an Err. Err is a valid answer given the input; a panic is not generally expected behavior. Furthermore, "fail early" is not really what we are talking about; the question is what the library returns at the time a mistake is discovered.

As a user of the library, I would expect to be able to send commands over time. A user (or service) might "get it wrong". Does that mean the user cannot "try again"?

If I run this in Tokio, I would not want the thread to panic (very bad and tough to navigate), but rather to get an answer. That answer can include an Err saying the data could not be processed because it was wrong: a match branch I can recover from.
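A sketch of that recoverable path; handle_request and reply_bad_request are hypothetical helpers, and the exact constructor name is assumed:

// In a request handler, a malformed payload becomes an Err branch we can
// answer; the worker thread keeps running and the caller can try again.
match DenseMatrix::from_2d_vec(&user_rows) {
    Ok(x) => handle_request(x),
    Err(e) => reply_bad_request(format!("could not build matrix: {e}")),
}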

Runtime errors in Python are the reason I moved to Rust. Returning Err is not a runtime error like we experience in Python; it is only a decision about what to return, an app crash/panic or an Err.

More generally about runtime errors: given Rust's compile-time checks, they do not equate to Python's (the two are not comparable).

Right now the library is fragile because it doesn't have strong typing through and through (likely mostly on the results side of the process). It will get better over time, to the point where the application cannot instantiate a bad state (like what we now have with DenseMatrix in 0.4). Until then, let the user know with an Err. Getting results out, for instance, is a fragile process.

What do you think?

@morenol (Collaborator) commented Apr 6, 2023

I also think that a Result is better than a panic for this case. I also agree that we could delay the crates.io release until we have more things to add to a breaking-change release.

@EdmundsEcho commented:

I agree there is more that can be done for the next major release. I have some ideas for how to add to the changes for 0.4. Many of the ideas come from what Axum has done for http.

  1. Reporting

I tried to use the coefficients and intercept to output probabilities (not binary classes) and was surprised by how fragile the process was; e.g., get_row(0) and get((0, 0)) require a lot of "under the hood" knowledge, not to mention the gymnastics required to extract values from &X.
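For context, the kind of extraction being described looks roughly like this (a sketch based on the accessors named above; x1 is a single feature value and the exact return types are not shown in this thread):

// Pulling scalar coefficients out of the fitted model means indexing into the
// underlying matrix types, then applying the sigmoid by hand to get a probability.
let b0 = *lr.intercept().get((0, 0));
let b1 = *lr.coefficients().get((0, 0));
let p = 1.0 / (1.0 + (-(b0 + b1 * x1)).exp());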

One way to go from a -> b -> c without the user being aware is to create what I call "transient types". Using the transient types, you build implementations of std::convert::From. This way, the end user can do something like:

let coefficients: Vec<_> = lr.coefficients().into();
let intercept: f64 = lr.intercept().into();

// or 
let coefficients: Vec<_> = lr.coefficients().iter().collect();
let intercept: f64 = lr.intercept(); // casts to f64
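A rough sketch of the transient-type idea; Coefficients and its field layout here are illustrative, not existing smartcore types:

// A small "transient" type that owns a plain copy of the fitted coefficients,
// decoupling the caller from the internal matrix representation.
pub struct Coefficients(Vec<f64>);

impl From<Coefficients> for Vec<f64> {
    fn from(c: Coefficients) -> Self {
        c.0
    }
}

// The estimator hands out the transient type and the user converts it:
// let coefficients: Vec<f64> = lr.coefficients().into();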

There are other "best practices" for increasing the usability of each method, such as using traits like AsRef and Borrow, that might be worth looking through.

  2. Augment what is included in a report, e.g., the F statistic. I'm not at all familiar with the subtle differences between ML and traditional stat modeling; I come from the latter. The incremental effort to report the "common sense" checks we should be interested in is likely good fodder for discussion. It might be a feature, or the user choosing to "see" more of what has already been computed in the default features.

  3. Move away from building your own underlying data structures for hosting and executing common math operations, e.g., use ndarray. We are already using argmin; we should do the same for the underlying data-hosting structs.

  4. Implement the builder pattern, like what we experience in many Rust crates, e.g., building an http request; I see what we are doing here in a similar vein (see the sketch after this list). This will augment the capacity to build a "default" for whatever analysis is "more often" what you want. It would also require augmenting the documentation to fully explain the additional options (accelerated by pointing users to generally available descriptions of the concepts on established stats/ML sites?).

  5. All in all, it might be useful to "draw a line" between modules that you want to be able to rely on to emit Result and other, internal private methods that only assert to convey a critical dependency, but that can be made "sound" through internal, actively monitored assertions. In general: outside "monad", inside "pure".
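A minimal sketch of the builder idea for one estimator; all names below are hypothetical, not existing smartcore API:

// Defaults cover the configuration that is "more often" what you want;
// each setter is optional and chainable.
pub struct LogisticRegressionBuilder {
    max_iter: usize,
    alpha: f64,
}

impl Default for LogisticRegressionBuilder {
    fn default() -> Self {
        Self { max_iter: 100, alpha: 0.0 }
    }
}

impl LogisticRegressionBuilder {
    pub fn max_iter(mut self, n: usize) -> Self { self.max_iter = n; self }
    pub fn alpha(mut self, a: f64) -> Self { self.alpha = a; self }
    // fit(&x, &y) would consume the builder and return Result<LogisticRegression, Failed>
}

// Usage: LogisticRegressionBuilder::default().max_iter(500).alpha(0.5).fit(&x, &y)?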

For the next major after this, or the one after that :)), I would start thinking more about performance. I hit performance issues with a 50k x 100 sample size. Part of this would include a review of when we borrow vs take ownership. Ironically, I wonder if there is too much effort to avoid copies. My main point is this: it's still too early to worry about performance, in my view. Just note that we have room to improve when the time comes. Unlike Python, it will actually make the code both better and more readable, not more convoluted :))

Finally, I've surveyed several packages to accomplish what I need for building stat models. It's a fragmented space with many "half-way" efforts. I like what you all have accomplished. I believe the core strength is assembly: data -> design -> results. I see Axum in smartcore: they assemble the best while focusing on a productive, type-safe Rust user experience.

I hope this input is helpful. I spent many years working in Rust. I'm happy to help however you all like.

@Mec-iS (Collaborator, Author) commented Apr 7, 2023

Thanks for your ideas. I am glad you found smartcore interesting and that you plan to contribute more!

I believe the library can be very useful in different domains, and I am now focusing on using the library in applications to see what it can do. My idea is to keep the library different from the others along multiple dimensions. There is already a lot of effort going into making things work for "larger" high-level use cases, and side-stepping that approach can be very interesting in the long term for domains that are maybe not in the "big data" scenario; polars and linfa are already doing great there, and we don't want to overlap too much with their use cases; please see the CONTRIBUTING guidelines.
I am currently taking feedback mostly from embedded developers and audio technologists, focusing on real-time processing as an objective, so the use cases I have in mind are more in the context of low-level, small, highly portable devices and high-frequency computations. For example, this field of applications probably needs something like the latest research on dimensionality reduction techniques more than support for billions of features, or making ML algorithms more useful on tabular datasets. In this context user experience is a bit of a second-order problem, but as long as resource usage stays at these levels (in terms of time, but maybe more importantly for small devices also space complexity) or improves, I am in general OK with better APIs and more features.

It would be great to have more idiomatic Rust. I think let's start with the transient types and the builder pattern; after that we probably want to set up good profiling (we have a bench repo, but it is very limited and difficult to run; it would be nice to have something like tarpaulin but for profiling, that runs in CI/CD). Vlad, who created the library, is also of the idea that we should drop our custom traits system on arrays, but this is something we can look at later, as long as we provide the bindings to ndarray.
Also, it would be nice to finally fix predict_proba (#211), but I don't know the functionality well enough to tell whether it is a problem with the numerical stability of the library or with the implementation itself.

Welcome! @EdmundsEcho

@EdmundsEcho commented:

To better understand how the design intent was implemented, may I ask:

  1. what are the benefits of using a trait Array2d instead of a concrete type that works optimally with the solver? e.g., why not use DenseMatrix and convert whatever users supply to this standard?
  2. are all of the methods in the Array-related traits used internally, or are they there to support pre-processing?
  3. how do you see the line between pre-processing / post-processing and smartcore?
  4. why not use argmin?

@Mec-iS (Collaborator, Author) commented Apr 8, 2023

Replying to the four questions above in order:

  1. I think it is to take advantage of Rust's zero-cost abstractions and provide zero-copy slicing and iterators; it also makes it possible to implement other custom concrete types this way (like a SparseMatrix, for example). See the sketch after this list.
  2. They are used internally, but they can also be used to define other types. I would keep the custom trait system, as we may want to have a no-dependency (or even a no_std) feature/fork at some point in the future.
  3. The library has always focused on linear algebra and matrix manipulation; the rest is left to the user. We can probably add anything that is useful behind feature configurations, but the core functionality has to stay stand-alone for a minimal installation.
  4. I guess it didn't exist at the time (2018); check the discussions for other hypotheses about possible integrations. Again, using any library that is based on ndarray in the internals would make everything dependent on it, and I don't know if it's worth it, as we would just overlap with other libraries' use cases and working patterns.
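A sketch of the kind of generic code the trait enables; the bounds are simplified and the shape/get methods are assumed from the arrays module, so treat it as illustrative:

use smartcore::linalg::basic::arrays::Array2;

// Any backing store implementing Array2 flows through the same code path,
// whether it is DenseMatrix, a future SparseMatrix, or a user-defined type.
fn column_means<M: Array2<f64>>(m: &M) -> Vec<f64> {
    let (nrows, ncols) = m.shape();
    (0..ncols)
        .map(|j| (0..nrows).map(|i| *m.get((i, j))).sum::<f64>() / nrows as f64)
        .collect()
}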

@Mec-iS marked this pull request as ready for review on March 4, 2024 08:23
@Mec-iS changed the title from "Patch 0.4.0" to "Patch 0.3.3" on Mar 4, 2024
@Mec-iS changed the title from "Patch 0.3.3" to "Patch to version 0.3.3" on Mar 4, 2024
@Mec-iS (Collaborator, Author) commented Mar 4, 2024

@morenol please take a look

CHANGELOG.md: outdated review comments (resolved)
@Mec-iS changed the title from "Patch to version 0.3.3" to "Patch to version 0.4.0" on Mar 4, 2024
@morenol (Collaborator) commented Mar 4, 2024

I think that we need to run cargo fmt

@Mec-iS requested a review from morenol on March 4, 2024 13:29
@morenol merged commit 239c004 into development on Mar 4, 2024
13 of 14 checks passed
@Mec-iS deleted the patch-0.3.2 branch on March 4, 2024 15:03
@Mec-iS mentioned this pull request on Mar 25, 2024