Generate typos and error #86
Comments
Thx @maelle ! Right, some will be easier than others for sure. And it may even take some consulting with people knowledgeable in the area to tell us what's invalid :) |
Interesting: https://github.com/mdlincoln/salty (examples of raw data are created with |
very cool, |
I'm not sure whether this would best be done within each generator, or with a separate function that you call and that applies the appropriate tweak to each variable the user selects |
Not sure either 🤔 |
@maelle I started playing with this a bit on the

    library(charlatan)  # provides CoordinateProvider
    z <- CoordinateProvider$new()
    # 1000 valid latitudes vs. 1000 deliberately invalid ones
    dat <- replicate(1000, z$lat())
    dat2 <- replicate(1000, z$lat(invalid = TRUE))
    summary(dat)
    summary(dat2)

I was looking around for a sort of framework for invalid values for coordinates, rather than just adjusting numbers, but haven't found anything yet |
@sckott in terms of common coordinate errors, there are some guidelines in the CoordinateCleaner package: https://github.com/ropensci/CoordinateCleaner They include:
|
thanks @isteves ! Yeah, values outside the valid range for lat and lon are what is built in thus far (your first bullet). However, you can imagine many ways to do this. If you set the 2nd two bullets though: those are valid values of lat and lon by themselves, but they definitely deserve a second look as to whether they are correct or not. For "validity" itself, I think only the 1st bullet fits. I'd like to stick to strictly valid or invalid data generation, because if we want to do this across the package where applicable it has to be somewhat consistent. But we could think about generating something like "common mistakes" or similar, which I think would encompass your latter 2 bullets. |
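To make the "strictly valid vs. strictly invalid" distinction concrete, here is a minimal base R sketch. `random_lat()` and `is_valid_lat()` are hypothetical helpers, not part of any package's API, and the upper bound of 180 for invalid magnitudes is an arbitrary assumption.

```r
# Hypothetical helper (not an existing package function): draw latitudes that
# are either inside the valid range [-90, 90] or strictly outside it.
random_lat <- function(n, invalid = FALSE) {
  if (!invalid) {
    return(runif(n, min = -90, max = 90))
  }
  # Assumption: "invalid" means a magnitude just beyond 90 up to 180 degrees,
  # with a random sign.
  magnitude <- runif(n, min = 90.001, max = 180)
  direction <- sample(c(-1, 1), n, replace = TRUE)
  magnitude * direction
}

# Validity check for latitude only.
is_valid_lat <- function(x) x >= -90 & x <= 90

set.seed(1)
valid <- random_lat(1000)
invalid <- random_lat(1000, invalid = TRUE)
all(is_valid_lat(valid))    # every valid draw passes the check
any(is_valid_lat(invalid))  # no invalid draw passes the check
```

"Common mistakes" such as a plausible but wrong coordinate would pass `is_valid_lat()`, which is why they probably need a separate mechanism.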
That's fair. For the first point, I really meant to give an example of another "common mistake" (20, 20 versus 20, 22), but I guess I inadvertently gave a totally invalid example 😬 I like the distinction between "valid" and "common mistakes" to keep it more general 👍 |
ah okay, i see about the common mistakes. So I guess we have the following use cases we could support:
A question for all of the above is how to approach creating them (repeating from above comment, generalizing). Should |
For typos, from some projects I've done in the past, most computer data entry typos are a letter substituted for a nearby key in the keyboard layout being used. This is particularly the case for proper names of any kind, where a spellchecker may assume from the capitalised first letter that the word is a name it does not know about, and so not flag it. |
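The nearby-key substitution described above could be sketched roughly like this in base R. The adjacency map is a small hand-written QWERTY subset for illustration only, and `add_keyboard_typo()` is a hypothetical name, not an existing function.

```r
# Partial QWERTY adjacency map (illustrative assumption, not exhaustive).
qwerty_neighbors <- list(
  a = c("q", "w", "s", "z"),
  e = c("w", "r", "d", "s"),
  i = c("u", "o", "k", "j"),
  n = c("b", "m", "h", "j"),
  o = c("i", "p", "l", "k"),
  s = c("a", "d", "w", "x"),
  t = c("r", "y", "g", "f")
)

# Hypothetical helper: replace one randomly chosen character that has an entry
# in the map with one of its adjacent keys.
add_keyboard_typo <- function(x, neighbors = qwerty_neighbors) {
  chars <- strsplit(x, "")[[1]]
  candidates <- which(tolower(chars) %in% names(neighbors))
  if (length(candidates) == 0) return(x)  # nothing we know how to perturb
  # Index into candidates so a single position is not expanded to 1:n by sample().
  pos <- candidates[sample.int(length(candidates), 1)]
  original <- chars[pos]
  replacement <- sample(neighbors[[tolower(original)]], 1)
  # Preserve capitalisation, since proper names are a common case.
  if (original != tolower(original)) replacement <- toupper(replacement)
  chars[pos] <- replacement
  paste(chars, collapse = "")
}

set.seed(42)
typo <- add_keyboard_typo("Santiago")
typo
```

A real version would need a full map per keyboard layout, which is where a dataset of observed typos could help.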
I now feel like I need to backtrack a bit... I wonder if it's best to just focus on "common typos"--whether it's a number that's way out of range or a misspelling. Perhaps lat/long-specific common mistakes are better suited to specialized packages (like CoordinateCleaner). In terms of typos, common categorical variables (jobs, color, t/f, marital status, etc...see https://github.com/trinker/wakefield for a bunch of examples) are probably the best way to go. With names/locations/etc, it's difficult to determine typos with certainty. |
thanks for your input @thoughtfulbloke ! i like that idea of a letter substituted by a nearby key. Do you know of any dataset/list of these?
of? it doesn't give typos, correct? or does it? |
You could have a look at with the dataset |
Also, I just noticed https://github.com/colinmorris/reddit-dubious-spelling |
both look promising, thanks @thoughtfulbloke |
@sckott nope no typos, just some more examples of common categorical variables (in addition to what I saw in the |
I haven't been able to find an example in the faker packages of other languages, but then maybe I have missed existing stuff.
The idea would be to have something similar to MissingDataProvider but instead of replacing the picked values with NA's, it'd modify them slightly to make them invalid (for stuff that can be valid, e.g. phone numbers have a given format) or just different (e.g. for people names). I guess making an element different isn't too difficult, but making it invalid is a bit more effort.
cc @isteves
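As a rough illustration of the idea (pick some values, then modify them instead of replacing them with NA), here is a minimal base R sketch. `make_messy()` and `drop_last_char()` are hypothetical names and do not reflect how MissingDataProvider is actually implemented. The sketch assumes a character vector as input.

```r
# Hypothetical helper: apply `tweak` to a random subset of a character vector.
# `prop` is the share of elements to modify.
make_messy <- function(x, tweak, prop = 0.1) {
  stopifnot(is.character(x), is.function(tweak), prop >= 0, prop <= 1)
  n_pick <- round(length(x) * prop)
  picked <- sample.int(length(x), n_pick)
  x[picked] <- vapply(x[picked], tweak, FUN.VALUE = character(1),
                      USE.NAMES = FALSE)
  x
}

# Example tweak: drop the last character, so a value with a fixed format
# (here, made-up phone numbers) no longer matches that format.
drop_last_char <- function(value) substr(value, 1, nchar(value) - 1)

set.seed(123)
phones <- c("555-0101", "555-0102", "555-0103",
            "555-0104", "555-0105", "555-0106")
messy <- make_messy(phones, drop_last_char, prop = 0.5)
messy
```

Keeping the tweak as a separate function means "invalid" (format-breaking) and "just different" (e.g. a typo in a name) changes can share the same selection logic.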