---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
# SoundexBR <img src="inst/figures/SoundexBR-logo.png" width="240px" align="right" />
[![lifecycle](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://www.tidyverse.org/lifecycle/#stable)
[![Build Status](https://travis-ci.org/danielmarcelino/SoundexBR.svg?branch=master)](https://travis-ci.org/danielmarcelino/SoundexBR)
![CRAN Version](https://www.r-pkg.org/badges/version/SoundexBR)
![](https://img.shields.io/badge/license-GPL%20%28%3E=%202%29-blueviolet.svg?style=flat)
![](https://cranlogs.r-pkg.org/badges/grand-total/SoundexBR)
## Phonetic-Coding For Portuguese
The SoundexBR package provides an algorithm for encoding names as phonetic codes based on how they are pronounced in Portuguese. The goal is for homophones to be encoded with the same alphanumeric representation, so that they match despite *minor differences* in spelling.
The Soundex algorithm encodes mainly consonants. A vowel is kept only when it is the first letter. The resulting code is four characters long, composed of one letter followed by three numerical digits: `[LETTER]` `[0-9]` `[0-9]` `[0-9]`. The letter is normally the first letter of the name (a few initial letters are rewritten, as described in the algorithm section at the end of this document), while the digits encode the remaining consonants.
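
As a quick illustration of the code format described above, the following base R sketch checks whether a string has the shape `[LETTER]` `[0-9]` `[0-9]` `[0-9]`. The helper name `is_soundex_code` is hypothetical and is not part of the SoundexBR package.

```r
# Hypothetical helper for illustration only; not part of SoundexBR.
# Returns TRUE for strings made of one capital letter followed by three digits.
is_soundex_code <- function(code) {
  grepl("^[A-Z][0-9]{3}$", code, perl = TRUE)
}

is_soundex_code(c("C460", "A526", "J22", "1234"))
#> [1]  TRUE  TRUE FALSE FALSE
```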
The _SoundexBR_ code is therefore useful for identifying "close" matches that would otherwise fail because of variant spellings of names or transposition errors. For instance, the difference between the names _Clair_ and _Claire_ is enough to make deterministic linkage fail when comparing them, but _SoundexBR_ returns the same code, "C460", for both names. The [vignette](vignettes/SoundexBR.html) provides a more detailed walkthrough.
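
The contrast between exact and phonetic comparison can be sketched as follows. This assumes the SoundexBR package is installed; the phonetic part is skipped if it is not.

```r
a <- "Clair"
b <- "Claire"

# Deterministic (exact) comparison: the two spellings differ
a == b
#> [1] FALSE

# Phonetic comparison: per the text above, both names should encode to "C460",
# so the two codes are expected to be equal
if (requireNamespace("SoundexBR", quietly = TRUE)) {
  codes <- SoundexBR::soundexBR(c(a, b))
  print(codes[1] == codes[2])
}
```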
## Installation
1 - From the CRAN repository:
```{r, eval=FALSE}
install.packages('SoundexBR', dependencies = TRUE)
library(SoundexBR)
```
2 - To get the current development version from GitHub:
```{r, eval=FALSE}
## install the devtools package if it is not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("danielmarcelino/SoundexBR")
library(SoundexBR)
```
## Usage
### A silly example
```{r, eval=FALSE}
names <- c(
  'Ana Karolina Kuhnen',
  'Ana Carolina Kuhnen',
  'Ana Karolina',
  'João Souza',
  'João Sousa',
  'Dilma Vana Rousseff',
  'Dilma Rousef'
)
soundexBR(names)
#> [1] "A526" "A526" "A526" "J220" "J220" "D451" "D456"
```
### The SoundexBR *vs* the original Soundex values
```{r, eval=FALSE}
names2 <- c("HILBERT", "Heilbronn", "Gauss", "Kant")
```
##### Original Soundex outcome
```{r, eval=FALSE}
soundexBR(names2, BR=FALSE)
#> [1] "H416" "H416" "G200" "K530"
```
##### The SoundexBR outcome
```{r, eval=FALSE}
soundexBR(names2)
#> [1] "I416" "E416" "G200" "C530"
```
## Example with RecordLinkage:
### Some data
```{r, eval=FALSE}
data1 <- data.frame(list(
  first_name = c('Ricardo', 'Maria', 'Tereza', 'Pedro', 'José', 'Germano'),
  last_name = c('Cunha', 'Andrade', 'Silva', 'Soares', 'Silva', 'Lima'),
  age = c(67, 89, 78, 65, 68, 67),
  birth = c(1945, 1923, 1934, 1947, 1944, 1945),
  date = c(20120907, 20120703, 20120301, 20120805, 20121004, 20121209)
))
```
```{r, eval=FALSE}
data2 <- data.frame(list(
  first_name = c('Maria', 'Lúcia', 'Paulo', 'Marcos', 'Ricardo', 'Germânio'),
  last_name = c('Andrada', 'Silva', 'Soares', 'Pereira', 'Cunha', 'Lima'),
  age = c(67, 88, 78, 60, 67, 80),
  birth = c(1945, 1924, 1934, 1952, 1945, 1932),
  date = c(20121208, 20121103, 20120302, 20120105, 20120907, 20121209)
))
```
### Comparing records (requires the RecordLinkage package)
```{r, eval=FALSE}
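library(RecordLinkage)

pairs <- compare.linkage(
  data1,
  data2,
  blockfld = list(c(1, 2, 4), c(1, 2)), # blocking keys: columns 1, 2, 4 or columns 1, 2
  phonetic = c(1, 2),                   # compare columns 1 and 2 by phonetic code
  phonfun = soundexBR,                  # phonetic function from SoundexBR
  strcmp = FALSE,
  strcmpfun = jarowinkler,
  exclude = FALSE,
  identity1 = NA,
  identity2 = NA,
  n_match = NA,
  n_non_match = NA
)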
```
```{r, eval=FALSE}
print(pairs)
#> $data1
#>   first_name last_name age birth     date
#> 1    Ricardo     Cunha  67  1945 20120907
#> 2      Maria   Andrade  89  1923 20120703
#> 3     Tereza     Silva  78  1934 20120301
#> 4      Pedro    Soares  65  1947 20120805
#> 5       José     Silva  68  1944 20121004
#> 6    Germano      Lima  67  1945 20121209
#>
#> $data2
#>   first_name last_name age birth     date
#> 1      Maria   Andrada  67  1945 20121208
#> 2      Lúcia     Silva  88  1924 20121103
#> 3      Paulo    Soares  78  1934 20120302
#> 4     Marcos   Pereira  60  1952 20120105
#> 5    Ricardo     Cunha  67  1945 20120907
#> 6   Germânio      Lima  80  1932 20121209
#>
#> $pairs
#>   id1 id2 first_name last_name age birth date is_match
#> 1   1   5          1         1   1     1    1       NA
#> 2   6   6          0         1   0     0    1       NA
#> 3   2   1          1         0   0     0    0       NA
#>
#> $frequencies
#> first_name  last_name        age      birth       date
#>  0.1000000  0.1428571  0.1250000  0.1250000  0.1000000
#>
#> $type
#> [1] "linkage"
#>
#> attr(,"class")
#> [1] "RecLinkData"
```
### Editing correspondences
```{r, eval=FALSE}
editMatch(pairs)
```
### Accessing information within object:
```{r, eval=FALSE}
weights <- epiWeights(pairs, e = 0.01, f = pairs$frequencies)
hist(weights$Wdata, plot = FALSE) # set plot = TRUE to draw the histogram
#> $breaks
#> [1] 0.2 0.4 0.6 0.8 1.0
#>
#> $counts
#> [1] 2 0 0 1
#>
#> $density
#> [1] 3.333333 0.000000 0.000000 1.666667
#>
#> $mids
#> [1] 0.3 0.5 0.7 0.9
#>
#> $xname
#> [1] "weights$Wdata"
#>
#> $equidist
#> [1] TRUE
#>
#> attr(,"class")
#> [1] "histogram"

getPairs(pairs, max.weight = Inf, min.weight = -Inf)
#>   id first_name last_name age birth     date Weight
#> 1  1    Ricardo     Cunha  67  1945 20120907
#> 2  5    Ricardo     Cunha  67  1945 20120907   <NA>
#> 3
#> 4  6    Germano      Lima  67  1945 20121209
#> 5  6   Germânio      Lima  80  1932 20121209   <NA>
#> 6
#> 7  2      Maria   Andrade  89  1923 20120703
#> 8  1      Maria   Andrada  67  1945 20121208   <NA>
```
## The Algorithm in a Nutshell

1 - Capitalize all letters in the word and drop all punctuation marks. Pad the word with trailing blanks as needed during each step.

2 - Retain the first letter of the word, with the following exceptions. If the first letter is **H**, retain the second letter instead. If the first letter is **Y**, change it to **I**. If the first two letters are one of the combinations below, change them as indicated: **WA** to **VA**, **KA** to **CA**, **KO** to **CO**, **KU** to **CU**, **CI** to **SI**, **CE** to **SE**, **GE** to **JE**, **GI** to **JI**.

3 - Change every occurrence of the following letters to '0' (zero): `A, E, I, O, U, H, W, Y`.

4 - Change letters from the following sets into the digit given:

`1 = B, F, P, V`

`2 = C, G, J, K, Q, S, X, Z`

`3 = D, T`

`4 = L`

`5 = M, N`

`6 = R`

5 - Collapse each run of identical adjacent digits in the string that results from step 4 into a single digit.

6 - Remove all zeros (placed in step 3) from the string that results from step 5.

7 - Pad the string that results from step 6 with trailing zeros and return only the first four positions, which will be of the form `[ALPHA]` `[0-9]` `[0-9]` `[0-9]`.
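
The steps above can be sketched in base R as follows. This is a simplified illustration, not the package's actual implementation: the function name `soundex_sketch` is hypothetical, the sketch assumes unaccented ASCII input (any other character is simply dropped), it assumes the first-letter rules are applied in the order listed, and it assumes only the letters after the first one are turned into digits.

```r
# Hypothetical sketch of the steps described above; NOT the SoundexBR source.
soundex_sketch <- function(x) {
  old <- "AEIOUHWYBFPVCGJKQSXZDTLMNR"
  new <- paste0(strrep("0", 8), strrep("1", 4), strrep("2", 8), "33", "4", "55", "6")

  encode_one <- function(word) {
    # Step 1: capitalize and keep only the letters A-Z (assumption: ASCII input)
    w <- gsub("[^A-Z]", "", toupper(word), perl = TRUE)

    # Step 2: first-letter rules
    if (substr(w, 1, 1) == "H") w <- substring(w, 2)  # H: retain the second letter
    if (nchar(w) == 0) return(NA_character_)
    w <- sub("^Y", "I", w)
    w <- sub("^WA", "VA", w)
    w <- sub("^K([AOU])", "C\\1", w)  # KA, KO, KU -> CA, CO, CU
    w <- sub("^C([IE])", "S\\1", w)   # CI, CE -> SI, SE
    w <- sub("^G([EI])", "J\\1", w)   # GE, GI -> JE, JI

    first <- substr(w, 1, 1)
    rest <- substring(w, 2)

    # Steps 3 and 4: vowels, H, W, Y become 0; consonants become digits 1-6
    digits <- chartr(old, new, rest)

    # Step 5: collapse runs of identical adjacent digits
    digits <- gsub("([0-9])\\1+", "\\1", digits, perl = TRUE)

    # Step 6: remove the zeros
    digits <- gsub("0", "", digits, fixed = TRUE)

    # Step 7: pad with trailing zeros and keep the first four positions
    substr(paste0(first, digits, "000"), 1, 4)
  }

  vapply(x, encode_one, character(1), USE.NAMES = FALSE)
}

soundex_sketch(c("Clair", "Claire", "Kant", "HILBERT"))
#> [1] "C460" "C460" "C530" "I416"
```

For these unaccented inputs the sketch reproduces the codes shown earlier in this README; names with accented letters would need an extra transliteration step that is not covered here.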