Data not correctly read in with haven::read_sav when special characters present. Character encoding problem? #560
Comments
Hi, poking around the data it looks like the SAV file has encoded a null byte in the middle of the problematic string (after "car leurs si"). ReadStat / Haven always interprets null bytes as the end of the string, but it sounds like SPSS will accept (or discard?) the null byte and use the rest of the string. Do you see any special indicator between the phrases "car leurs si" and " èges se trouvent" when opening the file in SPSS?
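To illustrate the truncation described above, here is a minimal C sketch. It is not ReadStat's code: the helper names `visible_length` and `first_null_offset` and the ASCII stand-in `sample_field` are assumptions made up for this example. It shows that `strlen` reports only the bytes before the embedded null byte, even though the fixed-width field holds more data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the problematic field: an embedded null byte
 * sits after "car leurs si". ASCII 'e' replaces the accented character
 * to keep the example byte-exact. */
static const char sample_field[] = "car leurs si" "\0" "eges se trouvent";

/* Width of the field in bytes, excluding the literal's own terminator. */
#define SAMPLE_WIDTH (sizeof sample_field - 1)

/* Length as C string functions see it: stops at the first null byte. */
size_t visible_length(const char *buf) {
    assert(buf != NULL);
    return strlen(buf);
}

/* Offset of the first null byte within a fixed-width field of `width`
 * bytes, or `width` if the field contains none. */
size_t first_null_offset(const char *buf, size_t width) {
    assert(buf != NULL);
    const char *p = memchr(buf, '\0', width);
    return p != NULL ? (size_t)(p - buf) : width;
}
```

With this sample, `visible_length(sample_field)` is 12 while `SAMPLE_WIDTH` is 29, so any code that treats the field as a plain C string loses the 17 bytes from the null byte onward.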
Thanks @evanmiller for looking into it. I don't see anything suspicious in that response between these two parts of the string. Of course, I can only check this to some extent, because in SPSS I can only look at the visual output and can't inspect the encoding or the byte structure in the background. The important question would be: what would be the correct behaviour? Would it be wrong to accept such a null byte, or would it be OK and haven should accept it? Interestingly (and I've closed this post in favour of this new one), at some point after I did some recoding of the respective character column, even opening the file in SPSS led SPSS to automatically split the column into several columns after 255 characters (and the same happened when opening in R with read_sav): #559
The null byte would need to be stripped in ReadStat somehow, because C regards null bytes as string terminators. So we'd need an extra step in ReadStat to scan the strings for internal null bytes before handing a clean C string to haven. Regarding the column-splitting issue, #547 may be related. The SAV format splits large columns internally and assigns them numbers on top of a five-character prefix. ReadStat may not mirror this logic exactly.
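The "extra step" mentioned above could look roughly like the following C sketch. This is an illustration under stated assumptions, not the fix that landed in ReadStat: the function name `copy_without_nulls` is hypothetical, and the sketch assumes the reader knows the field width in bytes and has a destination buffer with room for `width + 1` bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not ReadStat's actual code).
 * Copies a fixed-width field of `width` bytes from `src` into `dst`,
 * dropping every null byte on the way, then terminates the result.
 * `dst` must have room for at least `width + 1` bytes.
 * Returns the number of bytes written, excluding the terminator. */
size_t copy_without_nulls(char *dst, const char *src, size_t width) {
    assert(dst != NULL && src != NULL);
    size_t out = 0;
    for (size_t i = 0; i < width; i++) {
        if (src[i] != '\0') {
            dst[out++] = src[i];  /* keep every non-null byte */
        }
    }
    dst[out] = '\0';              /* single terminator at the very end */
    return out;
}
```

Because the loop runs over the declared width rather than stopping at the first null byte, the text after an embedded null survives, and the caller receives an ordinary null-terminated string. Whether dropping the byte or replacing it (for example with a space) matches SPSS's behaviour is left open in the thread, so the choice here is only one option.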
Thanks a lot. That sounds promising, and indeed like a feature (if not a bug?) that would require fixing in haven? As for #547, I saw that too, but I was skeptical that it is related, because I have several other character columns that seem to work without problems. But I'll check whether it really could be related to this issue. Thinking about it, most (if not all) of the conditions mentioned in #547 seem to hold for my case as well. Update: I can confirm that issue #547 is present in my data set as well. My original variables have the names "QB5B_1" and "QB5B_2". Renaming them and reopening them with
The null byte issue should be fixed in WizardMac/ReadStat@7b4357d. It may take a while to reach haven. We can continue the column-splitting discussion over at #547.
Just updated ReadStat, so this should now be fixed.
This is an exact copy of a question I posted on Stack Overflow. However, I think it is a direct issue in read_sav, so I'm hoping to fix the issue (if it is one in the package):
I have a problem with my data set, which I'm downloading from a website through an API call. A reproducible problem can be found here.
The data set is a .sav format file (usually opened in SPSS) and contains a character column which haven::read_sav seems to be failing to properly read in.
In fact, running the following code:
gives the following result (I shortened the output a bit for better readability):
However, the second result is wrong. The string is actually 444 characters long. Opening the file in SPSS works fine (it shows 444 characters), and downloading my data in a different format, e.g. csv or xlsx, also works fine and gives the correct results. The problem only occurs when combining a .sav file (which is a requirement in my case) with the read_sav function.
Any ideas what I can do about it?
My data: I'm sharing the data as a raw vector that comes directly from my API pipeline. Please let me know if it would be preferable to directly get a .sav file, e.g. through Google Drive.
Data in raw format (which is identical to what you get as download_content):