Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support a classify_all method #96

Closed
akprasad opened this issue Jan 1, 2024 · 7 comments
Closed

Support a classify_all method #96

akprasad opened this issue Jan 1, 2024 · 7 comments
Labels
chandas good first issue Good for newcomers

Comments

@akprasad
Copy link
Contributor

akprasad commented Jan 1, 2024

Problem

Chandas contains a classify method that returns a single result. But sometimes it is interesting to look at all possible results and let the user decide how to use them.

Proposed solution

Create a classify_all method that returns all padyas (vrttas + jatis) that don't have a None match.

@akprasad akprasad added good first issue Good for newcomers chandas labels Jan 1, 2024
@MSSRPRAD
Copy link
Contributor

MSSRPRAD commented Jan 1, 2024

How about changing the classify method entirely into a classify_vrtta method that returns a single result (which now also has a quality parameter). We can have an analogous classify_jati method for Jati metres.

pub fn classify_vrtta(&self, text: impl AsRef<str>) -> MatchResult {}
pub fn classify_jati(&self, text: impl AsRef<str>) -> MatchResult {}

classify_all can then just return the best n results of Partial Matches by updating them in some data structure while iterating:

pub fn classify_all(&self, text: impl AsRef<str>, best_n: usize) -> Vec<MatchResult> {}
pub struct MatchResult {
    matched: Option<Matched>,
    match_type: Vec<MatchType>,
    aksharas: Vec<Vec<Akshara>>,
}

pub enum Matched {
    Vrtta(Vrtta),
    Jati(Jati),
}

pub enum MatchType {
    Partial(usize), // Representing Quality of match
    Prefix,
    Pada,
    Full,
}

@akprasad

@akprasad
Copy link
Contributor Author

akprasad commented Jan 1, 2024

Thanks for the suggestions!

Having both classify_vrtta and classify_jati feels too complex -- an end user won't necessarily know whether the meter is a vrtta or jAti, and we should make their life easier with a simple classify endpoint so that they don't have to make two calls.

best_n is a nice idea but puts us in the position of having to decide what "best" means, but this seems dependent on the use case. classify_all without best_n is less sophisticated, but the return data is also more obvious and deterministic.

I do like the idea of putting usize and other fields on MatchType, e.g. Prefix(usize) to store the length of the prefix match. Partial is unclear to me. Perhaps MatchType should be split into a Scope (prefix, pada, full) and a Type (exact, EditDistance(usize), ...)

@MSSRPRAD
Copy link
Contributor

MSSRPRAD commented Jan 1, 2024

Having both classify_vrtta and classify_jati feels too complex -- an end user won't necessarily know whether the meter is a vrtta or jAti, and we should make their life easier with a simple classify endpoint so that they don't have to make two calls.

Can't classify_all to be that endpoint? best_n can be defaulted to 1 (and first the Jati match is checked before vrtta match)?

best_n is a nice idea but puts us in the position of having to decide what "best" means, but this seems dependent on the use case. classify_all without best_n is less sophisticated, but the return data is also more obvious and deterministic.

I felt when the database consists of several hundreds of schemes, it would be apt to ask for the best 5 or so than a match against every single scheme in the data. (If we're providing % matches it might match against many meters)

@akprasad
Copy link
Contributor Author

akprasad commented Jan 1, 2024

best_n can be defaulted to 1

Sadly Rust doesn't support default arguments, so there's no way to both have a default value and reduce friction for the user. (There are workarounds like having an Option<usize>, but then the caller needs to pass None explicitly.)

when the database consists of several hundreds of schemes

Good point!

So perhaps:

pub fn classify(&self, text: impl AsRef<str>) -> Matches {
  self.classify_all(text.as_ref(), 1)
}

pub fn classify_all(&self, text: impl AsRef<str>, n: usize) -> Matches { ... }

pub struct Match {
  // TODO: vrtta/jati, match type, etc.
}

pub struct Matches {
    matches: Vec<Match>,
    aksharas: Vec<Vec<Akshara>>,
}

@MSSRPRAD
Copy link
Contributor

MSSRPRAD commented Jan 2, 2024

pub fn classify(&self, text: impl AsRef) -> Matches {
self.classify_all(text.as_ref(), 1)
}

What kind of response struct should be returned? If classify is just returning one response, should it be the one giving Pada match / Prefix Match / None(edit_distance or any other metric) in case there is not Full match?

@akprasad
Copy link
Contributor Author

akprasad commented Jan 2, 2024

How about something like:

pub enum Padya {
  Vrtta(Vrtta),
  Jati(Jati),
}

// I don't like this name
pub enum MatchStrategy {
  Exact,
  EditDistance(usize)
}

pub enum MatchScope {
  None,
  Prefix,
  Pada,
  Full,
}

pub struct Match {
  pada: Padya,
  strategy: MatchStrategy,
  scope: MatchScope,
}

pub struct Matches {
  matches: Vec<Match>,
  aksharas: Vec<Vec<Akshara>>,
}

Then both methods can return a Matches. classify can return a Matches where the matches field has length at most 1.

@akprasad
Copy link
Contributor Author

akprasad commented Nov 4, 2024

Still tweaking the output format, but I have a workable classify_all as a starting point that I'll push out soon.

@akprasad akprasad closed this as completed Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chandas good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants