vectorization in fuzz_m_ratio() #11

luketudge · 2020-07-31T15:17:56Z

The documentation for fuzz_m_ratio() says:

Arguments
a A character vector of items to match to b.

b A character vector of items to match to a.

method The method to use for fuzzy matching.

Value
Returns a score of same length as b, giving the proportional dissimilarity between a and b.

Problem

I am not sure how to interpret the fact that a is permitted to be a character vector with more than a single item.

I guess that the most common use case of the fuzz_ functions is with a as a single item. In this case, fuzz_m_ratio() works as expected and recycles a over all items in b:

b <- c("a", "b", "ab")
fuzz_m_ratio("a", b)

[1] 0.0000000 1.0000000 0.3333333

According to the standard R vectorization rules (see for example R for Data Science), if a is a vector with the same number of items as b, then the items from a and b should be paired up and the result should be the pairwise dissimilarity.

But this isn't what happens. For example:

fuzz_m_ratio(c("a", "b", "ab"), b)

[1] 1 1 1

Here, each item in a is identical to the corresponding item in b so we should get 0 0 0 dissimilarity.

It isn't the case that the function is simply reporting similarity instead of dissimilarity here. For example:

fuzz_m_ratio(c("a", "b", "x"), b)

[1] 1 1 1

Here some items from a match their counterparts in b and some don't, so we should get both 0s and 1s in the answer.

It looks as though the same inconsistencies occur for at least some of the other fuzz_ functions too.

Suggestions

I am not sure what is intended for the case of multiple items in a, and from the source code I can't quite work out what is actually being done. Possibly the function is only intended for single-item a. If this is the case, then maybe the documentation needs amending, and perhaps also an error or warning message if a has multiple items, since some R users will probably expect the standard vectorization rules.

On the other hand, if vectorization is wanted here, then maybe one general solution is to first create a non-vectorized version of the function that calculates the distance for a single pair of strings, for example:

fmr_nonvectorized <- function(str1, str2){
  lengths <- nchar(c(str1, str2))
  len_shortest <- min(lengths)
  char_matches <- unlist(strsplit(str1, ""))[1:len_shortest] == unlist(strsplit(str2, ""))[1:len_shortest]
  similarity <- 2 * sum(char_matches) / sum(lengths)
  return(1 - similarity)
}

fmr_nonvectorized("a", "ab")

[1] 0.3333333

And then apply it to two vectors using mapply(), for example:

fmr_vectorized <- function(a, b){
  distances <- mapply(fmr_nonvectorized, a, b, USE.NAMES = FALSE)
  return(distances)
}

This gives the standard R-style vectorization rules for the examples above:

fmr_vectorized("a", b)

[1] 0.0000000 1.0000000 0.3333333

fmr_vectorized(c("a", "b", "ab"), b)

[1] 0 0 0

fmr_vectorized(c("a", "b", "x"), b)

[1] 0 0 1

fmr_vectorized(c("x", "x", "x"), b)

[1] 1 1 1

And as a bonus this will give the standard warning if the longer of a and b is not an integer multiple of the other:

fmr_vectorized(c("a", "b"), b)

[1] 0.0000000 0.0000000 0.3333333
Warning message:
In mapply(fmr_nonvectorized, a, b, USE.NAMES = FALSE) :
  longer argument not a multiple of length of shorter

The text was updated successfully, but these errors were encountered:

mjwestgate · 2020-08-27T07:03:07Z

Hi Luke,

Thanks for picking this up and apologies for not getting back to you sooner. You are indeed correct that this function is intended for the case where length(a) == 1. It honestly hadn't occurred to me to try and make it work for cases where both a and b have length > 1, but I can see why that would be a useful feature.

As for the error - that's a shame - obviously something got missed when we ported this from revtools to synthesisr. I'll have a look at that soon.

Martin

luketudge · 2020-08-27T09:04:48Z

Hi @mjwestgate

Thanks for taking a look at this! I hadn't realized this came from revtools. I see now that the same issue is already present there (though the revtools documentation for fuzz_functions does specify that a should be a string).

Let me know if I can be of any help.

mjwestgate self-assigned this Aug 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vectorization in fuzz_m_ratio() #11

vectorization in fuzz_m_ratio() #11

luketudge commented Jul 31, 2020 •

edited

Loading

mjwestgate commented Aug 27, 2020

luketudge commented Aug 27, 2020

vectorization in fuzz_m_ratio() #11

vectorization in fuzz_m_ratio() #11

Comments

luketudge commented Jul 31, 2020 • edited Loading

Problem

Suggestions

mjwestgate commented Aug 27, 2020

luketudge commented Aug 27, 2020

luketudge commented Jul 31, 2020 •

edited

Loading