Recoding a factor is problematic when it is labelled. So is Prepare > Data Frame > Replace Values #7249
Comments
Having discussed this with @dannyparsons, he suggested it should be a simple task for @shadrackkibet (he said 30 minutes, but they may be Danny minutes)! If it isn't quick, then perhaps bring @volloholic or @dannyparsons back in first.
Data may come in labelled (usually from SPSS), and R-Instat also sometimes adds labels, particularly when a numeric variable is made into a factor. As an example, take the usual survey data and make the fert variable (numeric) into a factor. It is then labelled. This is neat, because you can edit the labels, e.g. make 0 into None, and still make the variable numeric again.
I hope the solution will be easy. It will be a test of the programming of R-Instat. The idea is that when you produce another (numeric?) variable from a labelled variable, the resulting variable is not labelled, i.e. any value labels are deleted. I hope there is a single place where this can be done. I assume we have the code from the Position button in each of these situations, so that should be easy.
So, nothing fancy. We just get rid of labelling when it causes, or could cause, a problem. That's when a labelled variable is numeric. We could also have problems when a labelled variable is a factor. Once you have the routine for deleting labels, we need to look systematically at what happens to labelled data across our extensive range of operations on factors. We don't need labels, they are an extra, so (again) whenever an operation could cause a problem, we simply get rid of the labels.
I hope this could be joint work for @shadrackkibet and @N-thony. I would very much like to see it resolved before the next release.
@shadrackkibet it's great that labels are easy to remove. Now I hope that @N-thony can use that to complete #7247 easily. It is a great relief that this isn't too difficult. I have been badgering Danny about it for ages!
So, the problem is not that a variable is labelled and numeric. It is that when it is labelled and numeric we (usually) can't keep the labels when it is transformed. And I think that's also when there is the Position button? So the transformed variable (usually) can't have a label.
I also had a long discussion on this issue with David @volloholic last night. He remains on hand if needed. He would like to take it a step further, namely that we could even keep the labels when transforming, if we wish.
Fixed in PR #7321
@N-thony this may also relate to your task of being able to delete labels #7073.
The example I have is the Sadore data, in Climatic > Niger - worksheet Sadore leaving 4 lines.
Make Mois into a factor. Then perhaps in the Levels/Labels dialogue add Levels. This adds value labels to the variable.
The column named Mois has 14 levels. Levels 1 and 14 are Janvier and janvier. Levels 2 and 13 are fevrier and Fevrier. (All the other months are lower case.)
I can use the Levels/Labels dialogue to change the labels. Or I can use Prepare > Column: Factor > Recode Factor.
The result is different, depending on which I use! I thought the solution was to delete the labelling, but I think there is a better one.
a) In the Recode Factor dialogue I can recode the labels in the obvious way. Then I (correctly) get 12 levels. The only possible problem is that janvier has level 14! That doesn't do any harm, but looks odd.
b) So one solution would be to allow the user (in this situation) to recode the levels too. Maybe recoding the levels would be an alternative to recoding the labels. So, in this dialogue, there is a checkbox for labelled data, default unchecked, which says Recode Levels. If checked, then the Last column above is replaced by the levels, and they have to be numeric, but can be changed. That would be neat!
c) I can also use the Right-Click Levels/Labels dialogue. This allows me to change the labels. But there are still 14 levels to the factor. I am now stuck. I think it may work ok, but it has become messy. Worse, here I can edit the levels and then it gives me an error.
One drastic solution is to not permit editing in this dialogue. Something needs to change though.
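To make the Mois case concrete, here is a rough sketch in Python (not R-Instat code) of a factor stored as a level-to-label lookup. Only levels 1, 2, 13 and 14 are given above, so the order of the other ten months is an assumption, and `recode_labels` and `recode_levels` are hypothetical names. The rule for which level survives a merge is also assumed, chosen so that janvier ends up at level 14 as described in a).

```python
# Toy model of the Mois factor: integer level -> label.
# Only levels 1, 2, 13 and 14 come from the issue text; the order of the
# other ten months is an assumption made for this sketch.
mois = {
    1: "Janvier", 2: "fevrier", 3: "mars", 4: "avril", 5: "mai",
    6: "juin", 7: "juillet", 8: "aout", 9: "septembre", 10: "octobre",
    11: "novembre", 12: "decembre", 13: "Fevrier", 14: "janvier",
}


def recode_labels(labels, rename):
    """Rename labels and merge levels that end up with the same label.

    Returns (new_labels, level_map). When two levels merge, the level that
    already carried the target spelling is kept (an assumed rule).
    """
    renamed = {lvl: rename.get(lab, lab) for lvl, lab in labels.items()}
    keep = {}
    for lvl in sorted(renamed):
        lab = renamed[lvl]
        if lab not in keep or labels[lvl] == lab:
            keep[lab] = lvl
    new_labels = {lvl: lab for lab, lvl in keep.items()}
    level_map = {lvl: keep[lab] for lvl, lab in renamed.items()}
    return new_labels, level_map


def recode_levels(labels, renumber):
    """Hypothetical 'Recode Levels' option: change the numeric levels."""
    new_labels = {renumber.get(lvl, lvl): lab for lvl, lab in labels.items()}
    if len(new_labels) != len(labels):
        raise ValueError("two labels would share one level")
    return new_labels


# a) Recode the labels: 14 levels collapse to 12, janvier keeps level 14.
merged, level_map = recode_labels(mois, {"Janvier": "janvier", "Fevrier": "fevrier"})

# b) Optionally recode the levels too, so janvier moves from 14 to 1.
tidy = recode_levels(merged, {14: 1})
```

The point of the second step is only that renumbering is a separate, explicit operation that has to keep the levels unique. Whether R-Instat should offer it is the open question.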
I suggest @N-thony can make the changes, (he made the changes recently in the Factor Recode dialogue) but it needs a small initial input from @dannyparsons.
Except I have now tried the Prepare > Data Frame > Replace Values dialogue, also on labelled data, with our usual survey data. Take Variety, make it numeric, and then make it a factor again. Now it has labels attached. Now make it numeric again and change 2 to 1. This changes all the frequencies. Also change 3 to 5 and then make it a factor again.
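A small sketch of why the frequencies shift, written in Python rather than R and using invented values and label names (A, B, C are placeholders, not the real Variety labels). It assumes the replacement touches only the numbers and leaves the value labels as they were.

```python
from collections import Counter

# Hypothetical labelled numeric column: the values and label names are
# invented for illustration, not taken from the survey data.
values = [1, 1, 2, 2, 2, 3, 3]
labels = {1: "A", 2: "B", 3: "C"}


def replace_values(values, old, new):
    """Replace every occurrence of old with new; labels are not consulted."""
    return [new if v == old else v for v in values]


def label_counts(values, labels):
    """Frequency table by label; unlabelled values are counted under None."""
    return Counter(labels.get(v) for v in values)


before = label_counts(values, labels)

# Change 2 to 1, then 3 to 5, as in the steps described above.
edited = replace_values(replace_values(values, 2, 1), 3, 5)
after = label_counts(edited, labels)
```

In this toy case everything that was B is now counted as A, B and C have no observations left, and the new value 5 has no label at all. The labels are still attached, but they no longer describe the data.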
So, I now wonder if we are getting ourselves into a mess by being able to edit levels. We don't use labelled data much - we like factors. Labels are a luxury. So I now suggest simply as follows:
a) We remain powerless to change levels - we just change labels, for Factors.
b) If a variable is labelled and is a factor, and we do a factor recode, we can warn and then, if we proceed, then the labelling is deleted.
c) If we change a number in a labelled numeric variable (in the grid, or in recode, etc?) then we warn and delete the labelling.
If you don't like this in your data, then make a copy of the variable (with labelling), before making the change.
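Points b) and c), plus the copy advice, could look roughly like this. It is a Python sketch under stated assumptions, not the R-Instat implementation: `Column` and `edit_values` are hypothetical names, and the `keep_copy` flag is just one way to express "make a copy first".

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class Column:
    """Hypothetical stand-in for a column: values plus optional value labels."""
    values: list
    labels: dict = field(default_factory=dict)


def edit_values(col, old, new, keep_copy=False):
    """Replace old with new. If the column is labelled, warn and drop the labels.

    With keep_copy=True the untouched labelled original is returned as well,
    mirroring the 'make a copy first' advice.
    """
    backup = Column(list(col.values), dict(col.labels)) if keep_copy else None
    if col.labels:
        warnings.warn("value labels will be deleted by this edit", UserWarning)
    edited = Column([new if v == old else v for v in col.values])
    return edited, backup
```

For example, `edit_values(Column([1, 2, 2], {1: "A", 2: "B"}), 2, 1, keep_copy=True)` gives back an unlabelled column `[1, 1, 1]` together with the labelled original, and emits one warning on the way.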
Though maybe we can do better, by having options to also edit the levels? I used the Village factor and lumped the values less than 8 together. Then I made it numeric - with labelled convert. Then a factor again. Here is the current result - nice! (That made the extra levels for me!)
I assume we will also have to delete labelling if we change strings? Oh I am confused!
Maybe instead, when we edit a factor, we should delete labelling and then reinstate it. Then, if we change numbers in a labelled numeric variable, could we then reinstate, with default labels. So, we reinstate levels when we edit a labelled factor or character variable and we reinstate labels, when we edit a labelled numeric variable? I like this solution?
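One possible reading of "reinstate, with default labels", sketched in Python. The assumptions here are mine: a default label is simply the value written as text, labels for values that no longer occur are dropped, and an existing label is kept when its value survives the edit. `reinstate_labels` is a hypothetical name.

```python
def reinstate_labels(values, old_labels=None):
    """Rebuild value labels after an edit.

    Each value still present keeps its old label if it had one, otherwise
    it gets a default label (the value as text). Labels for values that no
    longer occur are dropped. All of these rules are assumptions.
    """
    old_labels = old_labels or {}
    return {v: old_labels.get(v, str(v)) for v in sorted(set(values))}


# After changing 2 to 1 and 3 to 5 in a column labelled {1: A, 2: B, 3: C}:
relabelled = reinstate_labels([1, 1, 1, 5, 5], {1: "A", 2: "B", 3: "C"})

# Starting from scratch, with no old labels to carry over:
defaults = reinstate_labels([1, 1, 1, 5, 5])
```

Whether old labels should be carried over or always replaced by defaults is exactly the decision asked for below.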
This needs a @dannyparsons and/or @volloholic input and decision!