Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vectorization in fuzz_m_ratio() #11

Open
luketudge opened this issue Jul 31, 2020 · 2 comments
Open

vectorization in fuzz_m_ratio() #11

luketudge opened this issue Jul 31, 2020 · 2 comments
Assignees

Comments

@luketudge
Copy link

luketudge commented Jul 31, 2020

The documentation for fuzz_m_ratio() says:

Arguments
a A character vector of items to match to b.

b A character vector of items to match to a.

method The method to use for fuzzy matching.

Value
Returns a score of same length as b, giving the proportional dissimilarity between a and b.

Problem

I am not sure how to interpret the fact that a is permitted to be a character vector with more than a single item.

I guess that the most common use case of the fuzz_ functions is with a as a single item. In this case, fuzz_m_ratio() works as expected and recycles a over all items in b:

b <- c("a", "b", "ab")
fuzz_m_ratio("a", b)
[1] 0.0000000 1.0000000 0.3333333

According to the standard R vectorization rules (see for example R for Data Science), if a is a vector with the same number of items as b, then the items from a and b should be paired up and the result should be the pairwise dissimilarity.

But this isn't what happens. For example:

fuzz_m_ratio(c("a", "b", "ab"), b)
[1] 1 1 1

Here, each item in a is identical to the corresponding item in b so we should get 0 0 0 dissimilarity.

It isn't the case that the function is simply reporting similarity instead of dissimilarity here. For example:

fuzz_m_ratio(c("a", "b", "x"), b)
[1] 1 1 1

Here some items from a match their counterparts in b and some don't, so we should get both 0s and 1s in the answer.

It looks as though the same inconsistencies occur for at least some of the other fuzz_ functions too.

Suggestions

I am not sure what is intended for the case of multiple items in a, and from the source code I can't quite work out what is actually being done. Possibly the function is only intended for single-item a. If this is the case, then maybe the documentation needs amending, and perhaps also an error or warning message if a has multiple items, since some R users will probably expect the standard vectorization rules.

On the other hand, if vectorization is wanted here, then maybe one general solution is to first create a non-vectorized version of the function that calculates the distance for a single pair of strings, for example:

fmr_nonvectorized <- function(str1, str2){
  lengths <- nchar(c(str1, str2))
  len_shortest <- min(lengths)
  char_matches <- unlist(strsplit(str1, ""))[1:len_shortest] == unlist(strsplit(str2, ""))[1:len_shortest]
  similarity <- 2 * sum(char_matches) / sum(lengths)
  return(1 - similarity)
}

fmr_nonvectorized("a", "ab")
[1] 0.3333333

And then apply it to two vectors using mapply(), for example:

fmr_vectorized <- function(a, b){
  distances <- mapply(fmr_nonvectorized, a, b, USE.NAMES = FALSE)
  return(distances)
}

This gives the standard R-style vectorization rules for the examples above:

fmr_vectorized("a", b)
[1] 0.0000000 1.0000000 0.3333333
fmr_vectorized(c("a", "b", "ab"), b)
[1] 0 0 0
fmr_vectorized(c("a", "b", "x"), b)
[1] 0 0 1
fmr_vectorized(c("x", "x", "x"), b)
[1] 1 1 1

And as a bonus this will give the standard warning if the longer of a and b is not an integer multiple of the other:

fmr_vectorized(c("a", "b"), b)
[1] 0.0000000 0.0000000 0.3333333
Warning message:
In mapply(fmr_nonvectorized, a, b, USE.NAMES = FALSE) :
  longer argument not a multiple of length of shorter
@mjwestgate
Copy link
Owner

Hi Luke,

Thanks for picking this up and apologies for not getting back to you sooner. You are indeed correct that this function is intended for the case where length(a) == 1. It honestly hadn't occurred to me to try and make it work for cases where both a and b have length > 1, but I can see why that would be a useful feature.

As for the error - that's a shame - obviously something got missed when we ported this from revtools to synthesisr. I'll have a look at that soon.

Martin

@mjwestgate mjwestgate self-assigned this Aug 27, 2020
@luketudge
Copy link
Author

Hi @mjwestgate

Thanks for taking a look at this! I hadn't realized this came from revtools. I see now that the same issue is already present there (though the revtools documentation for fuzz_functions does specify that a should be a string).

Let me know if I can be of any help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants