You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Arguments
a A character vector of items to match to b.
b A character vector of items to match to a.
method The method to use for fuzzy matching.
Value
Returns a score of same length as b, giving the proportional dissimilarity between a and b.
Problem
I am not sure how to interpret the fact that a is permitted to be a character vector with more than a single item.
I guess that the most common use case of the fuzz_ functions is with a as a single item. In this case, fuzz_m_ratio() works as expected and recycles a over all items in b:
b<- c("a", "b", "ab")
fuzz_m_ratio("a", b)
[1] 0.0000000 1.0000000 0.3333333
According to the standard R vectorization rules (see for example R for Data Science), if a is a vector with the same number of items as b, then the items from a and b should be paired up and the result should be the pairwise dissimilarity.
But this isn't what happens. For example:
fuzz_m_ratio(c("a", "b", "ab"), b)
[1] 1 1 1
Here, each item in a is identical to the corresponding item in b so we should get 0 0 0 dissimilarity.
It isn't the case that the function is simply reporting similarity instead of dissimilarity here. For example:
fuzz_m_ratio(c("a", "b", "x"), b)
[1] 1 1 1
Here some items from a match their counterparts in b and some don't, so we should get both 0s and 1s in the answer.
It looks as though the same inconsistencies occur for at least some of the other fuzz_ functions too.
Suggestions
I am not sure what is intended for the case of multiple items in a, and from the source code I can't quite work out what is actually being done. Possibly the function is only intended for single-item a. If this is the case, then maybe the documentation needs amending, and perhaps also an error or warning message if a has multiple items, since some R users will probably expect the standard vectorization rules.
On the other hand, if vectorization is wanted here, then maybe one general solution is to first create a non-vectorized version of the function that calculates the distance for a single pair of strings, for example:
And then apply it to two vectors using mapply(), for example:
fmr_vectorized<-function(a, b){
distances<- mapply(fmr_nonvectorized, a, b, USE.NAMES=FALSE)
return(distances)
}
This gives the standard R-style vectorization rules for the examples above:
fmr_vectorized("a", b)
[1] 0.0000000 1.0000000 0.3333333
fmr_vectorized(c("a", "b", "ab"), b)
[1] 0 0 0
fmr_vectorized(c("a", "b", "x"), b)
[1] 0 0 1
fmr_vectorized(c("x", "x", "x"), b)
[1] 1 1 1
And as a bonus this will give the standard warning if the longer of a and b is not an integer multiple of the other:
fmr_vectorized(c("a", "b"), b)
[1] 0.0000000 0.0000000 0.3333333
Warning message:
In mapply(fmr_nonvectorized, a, b, USE.NAMES = FALSE) :
longer argument not a multiple of length of shorter
The text was updated successfully, but these errors were encountered:
Thanks for picking this up and apologies for not getting back to you sooner. You are indeed correct that this function is intended for the case where length(a) == 1. It honestly hadn't occurred to me to try and make it work for cases where both a and b have length > 1, but I can see why that would be a useful feature.
As for the error - that's a shame - obviously something got missed when we ported this from revtools to synthesisr. I'll have a look at that soon.
Thanks for taking a look at this! I hadn't realized this came from revtools. I see now that the same issue is already present there (though the revtools documentation for fuzz_functions does specify that a should be a string).
The documentation for
fuzz_m_ratio()
says:Problem
I am not sure how to interpret the fact that
a
is permitted to be a character vector with more than a single item.I guess that the most common use case of the
fuzz_
functions is witha
as a single item. In this case,fuzz_m_ratio()
works as expected and recyclesa
over all items inb
:According to the standard R vectorization rules (see for example R for Data Science), if
a
is a vector with the same number of items asb
, then the items froma
andb
should be paired up and the result should be the pairwise dissimilarity.But this isn't what happens. For example:
Here, each item in
a
is identical to the corresponding item inb
so we should get0 0 0
dissimilarity.It isn't the case that the function is simply reporting similarity instead of dissimilarity here. For example:
Here some items from
a
match their counterparts inb
and some don't, so we should get both0
s and1
s in the answer.It looks as though the same inconsistencies occur for at least some of the other
fuzz_
functions too.Suggestions
I am not sure what is intended for the case of multiple items in
a
, and from the source code I can't quite work out what is actually being done. Possibly the function is only intended for single-itema
. If this is the case, then maybe the documentation needs amending, and perhaps also an error or warning message ifa
has multiple items, since some R users will probably expect the standard vectorization rules.On the other hand, if vectorization is wanted here, then maybe one general solution is to first create a non-vectorized version of the function that calculates the distance for a single pair of strings, for example:
And then apply it to two vectors using
mapply()
, for example:This gives the standard R-style vectorization rules for the examples above:
And as a bonus this will give the standard warning if the longer of
a
andb
is not an integer multiple of the other:The text was updated successfully, but these errors were encountered: