Recoding a factor is problematic when it is labelled. So is Prepare > Data Frame > Replace Values #7249

Closed
rdstern opened this issue Feb 22, 2022 · 3 comments · Fixed by #7321

@rdstern
Collaborator

rdstern commented Feb 22, 2022

@N-thony this may also relate to your task of being able to delete labels #7073.
The example I have is the Sadore data, in Climatic > Niger - worksheet Sadore leaving 4 lines.
Make Mois into a factor. Then perhaps add Levels in the Levels/Labels dialogue. This adds value labels to the variable.
The column named Mois has 14 levels. Levels 1 and 14 are Janvier and janvier. Levels 2 and 13 are fevrier and Fevrier. (All the other months are lower case.)
I can use the Levels/Labels dialogue to change the labels. Or I can use Prepare > Column: Factor > Recode Factor.
The result is different, depending on which I use! I thought the solution was to delete the labelling, but I think there is a better one.
a) In the Recode Factor I can recode the labels in the obvious way. Then I (correctly) get 12 levels. The only possible problem is that janvier has level 14! That doesn't do any harm, but looks odd.
b) So one solution would be to allow the user (in this situation) to recode the levels too. Maybe recode the levels would be an alternative to recoding the labels.
image
So, in this dialogue, there would be a checkbox for labelled data, unchecked by default, called Recode Levels. If it is checked, the last column above is replaced by the levels. They have to be numeric, but can be changed.
That would be neat!
c) I can also use the Right-Click Levels/Labels dialogue. This allows me to change the labels. But there are still 14 levels to the factor. I am now stuck. I think it may work ok, but it has become messy. Worse, here I can edit the levels and then it gives me an error.
One drastic solution is to not permit editing in this dialogue. Something needs to change though.

I suggest @N-thony can make the changes, (he made the changes recently in the Factor Recode dialogue) but it needs a small initial input from @dannyparsons.
Except I have now tried the Prepare > Data Frame > Replace Values dialogue, also on labelled data. I have tried this with our usual survey data. Take Variety, make it numeric, and then make it a factor again. Now it has labels attached. Now make it numeric again and change 2 to 1. This has now changed all the frequencies. Also change 3 to 5 and then make it a factor again.

So, I now wonder if we are getting ourselves into a mess, by being able to edit levels. We don't use labelled data much - we like factors. These are a luxury. So I now suggest simply as follows:
a) We remain powerless to change levels - we just change labels, for Factors.
b) If a variable is labelled and is a factor, and we do a factor recode, we warn and, if the user proceeds, the labelling is deleted.
c) If we change a number in a labelled numeric variable (in the grid, or in recode, etc?) then we warn and delete the labelling.

If you don't like this in your data, then make a copy of the variable (with labelling), before making the change.
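A minimal sketch of rule c) above, written in Python purely for illustration (R-Instat itself is R-based). The `LabelledColumn` class, its `replace_value` method and the example labels are all hypothetical and are not taken from the R-Instat code:

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class LabelledColumn:
    """Hypothetical stand-in for a column with optional value labels."""
    values: list
    labels: dict = field(default_factory=dict)  # value -> label text

    def replace_value(self, old, new):
        """Replace `old` by `new`; warn and drop the labels if there are any."""
        if self.labels:
            warnings.warn("column is labelled; value labels will be deleted")
        new_values = [new if v == old else v for v in self.values]
        # Labels are deliberately not copied to the result.
        return LabelledColumn(new_values)


# Made-up data: a labelled numeric column with three codes.
col = LabelledColumn([1, 2, 2, 3], {1: "first", 2: "second", 3: "third"})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    edited = col.replace_value(2, 1)  # like changing 2 to 1 in Replace Values
```

In this sketch the original `col` keeps its labels, which corresponds to making a copy of the variable before the change; only the edited result is unlabelled.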

Though maybe we can do better, by having options to also edit the levels? I used the Village factor and lumped the values less than 8 together. Then made it numeric - with labelled convert. Then factor again. Here is the current result - nice! (That made the extra levels for me!)
image

I assume we will also have to delete labelling if we change strings? Oh I am confused!

Maybe instead, when we edit a factor, we should delete labelling and then reinstate it. Then, if we change numbers in a labelled numeric variable, could we then reinstate, with default labels. So, we reinstate levels when we edit a labelled factor or character variable and we reinstate labels, when we edit a labelled numeric variable? I like this solution?

This needs a @dannyparsons and/or @volloholic input and decision!

rdstern added this to the 0.7.5 milestone on Feb 22, 2022
rdstern changed the title from "Recoding a factor is problematic when the factor is labelled." to "Recoding a factor is problematic when it is labelled." on Feb 22, 2022
rdstern changed the title from "Recoding a factor is problematic when it is labelled." to "Recoding a factor is problematic when it is labelled. So is Prepare > Data Frame > Replace Values" on Feb 27, 2022
rdstern assigned shadrackkibet and unassigned N-thony on Mar 5, 2022
@rdstern
Collaborator Author

rdstern commented Mar 5, 2022

Having discussed with @dannyparsons he suggested this should be a simple (he said 30 minutes, but they may be Danny minutes) task for @shadrackkibet! If it isn't quick, then perhaps bring @volloholic or @dannyparsons back in, first.
a) If @N-thony still needs help on the function to delete labels in the View Data dialogue - pull request #7247, then this is probably the same function that will be needed here. So perhaps Shadrack could start there.
b) Then the idea is as follows - at least for now.

Data may come in labelled (usually from SPSS) and R-Instat also sometimes adds labels. This happens particularly when a numeric variable is made into a factor. As an example, consider the usual survey data and make the fert variable (numeric) into a factor. Then it is labelled. This is neat, because you can edit the labels, e.g. make 0 into None, and then still make the variable numeric again.
However, try this (silly) sequence and you see our labelling can cause nonsense.
a) Perhaps duplicate fert first. Then use the calculator to calculate size+fert, into calc. Then make calc a factor. This adds labels, and the factor has 17 levels.
b) Make fert into a factor, and then back to numeric. Now do a) above. The calculated column is now a factor with 22 levels and the levels/labels information is a mess.

I hope the solution will be easy. It will be a test of the programming of R-Instat. When you produce another (numeric?) variable from a labelled variable, then the resulting variable should not be labelled, i.e. any value labels are deleted. I hope there is a single place where this can be done? I assume we have the code from the Position button in each of these situations, so that should be easy. So, nothing fancy. We just get rid of labelling when it causes, or could cause, a problem.

That's when a labelled variable is numeric. We could also have problems when a labelled variable is a factor. Once you have the routine for deleting labels, then we need to look systematically at what happens to labelled data, with our extensive range of operations on factors. We don't need labels - they are an extra, so (again) whenever an operation could cause a problem, we simply get rid of the labels.

I hope this could be joint work for @shadrackkibet and @N-thony . I would very much like to see it resolved before the next release.

@rdstern
Collaborator Author

rdstern commented Mar 6, 2022

@shadrackkibet it's great that it is easy to remove labels. Now I hope that @N-thony can use that to complete #7247 easily. It is a great relief that this isn't too difficult. I have been badgering Danny for ages on it!
But I would like to be a bit cleverer than that if possible. Ideally the problem isn't when you make it numeric again. Here is an example - still with the survey data - where I might want to keep the labelling when it is numeric.
a) I make the fert into a factor - it is now labelled.
b) I now edit the labels so 0 becomes None. Perhaps also 3 becomes Max.
c) Now I make it numeric. It can do that and I get the 0 to 3 back. Or I can make it character. etc.
d) Now I can make it a factor again, and it still has the labels. Nice.
So, that's one of the key uses of labelled data. And R doesn't usually do that. (Of course it is still R!)
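The round trip in a) to d) can be sketched as follows. This is Python for illustration only, and the `LabelledColumn` class with its `as_factor` and `as_numeric` methods is a hypothetical model of a labelled variable, not how R-Instat or R stores it. The data values are made up:

```python
from dataclasses import dataclass, field


@dataclass
class LabelledColumn:
    """Hypothetical model: numeric codes plus a value -> label mapping."""
    values: list
    labels: dict = field(default_factory=dict)

    def as_factor(self):
        """Show each value by its label, falling back to the value as text."""
        return [self.labels.get(v, str(v)) for v in self.values]

    def as_numeric(self):
        """Return the underlying codes; the labels stay stored on the column."""
        return list(self.values)


# a) fert becomes a factor and gets default labels "0".."3".
fert = LabelledColumn([0, 1, 3, 0], {v: str(v) for v in (0, 1, 2, 3)})

# b) Edit the labels: 0 becomes None, 3 becomes Max.
fert.labels[0] = "None"
fert.labels[3] = "Max"

# c) Numeric view gives back the 0 to 3 codes.
codes = fert.as_numeric()

# d) Factor view again still uses the edited labels.
shown = fert.as_factor()
```

The point of the sketch is that the codes and the labels are stored separately, so switching between the numeric and factor views loses neither.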

So, the problem is not that it is labelled and numeric. It is that when it is labelled and numeric we can't (usually) keep the labels when it is transformed. And I think that's also when there is the Position button? So the transformed variable (usually) can't have a label.

I also had a long discussion on this issue with David @volloholic last night. He remains on hand if needed. He would like to take it a step further, namely that we could even keep the labels when transforming, if we wish.
He is very happy that the default is to delete the label. But (when the resulting variable could be labelled) he would like a checkbox to be able to keep it, if the user desires. This doesn't have to be obvious, so could perhaps just be a checkbox on the Position sub-dialogue?
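A rough sketch of that behaviour, again in Python for illustration only. The `transform` helper and its `keep_labels` flag are hypothetical names standing in for the proposed checkbox. They are not R-Instat functions:

```python
from dataclasses import dataclass, field


@dataclass
class LabelledColumn:
    """Hypothetical stand-in for a column with optional value labels."""
    values: list
    labels: dict = field(default_factory=dict)


def transform(col, func, keep_labels=False):
    """Apply `func` to every value; drop labels unless keep_labels is set."""
    new_values = [func(v) for v in col.values]
    # Default: the derived column is unlabelled. Keeping the labels only
    # makes sense when the new values still mean the same thing.
    new_labels = dict(col.labels) if keep_labels else {}
    return LabelledColumn(new_values, new_labels)


fert = LabelledColumn([0, 1, 3], {0: "None", 3: "Max"})

plain = transform(fert, lambda v: v + 1)              # labels deleted
kept = transform(fert, lambda v: v, keep_labels=True)  # labels copied over
```

Putting the decision in one helper is one way to get the "single place" where labels are removed, with the opt-in flag as the only exception.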

@lloyddewit
Contributor

Fixed in PR #7321
