Support binary compression in sas7bdat #21

Closed
evanmiller opened this issue Feb 23, 2015 · 22 comments

Comments

@evanmiller
Contributor

Binary (aka Ross) compression is currently not supported. It looks like we can use the Python sas7bdat package as a template for implementing the decompression algorithm:

https://pypi.python.org/pypi/sas7bdat

(The library is MIT licensed.)

At present, compressed files fail with the error message "File has unsupported compression scheme".

@xiaodaigh

Compression support is definitely needed! I have to stick to the parso library in Java until this is supported.

@ajdamico

hi, not sure if this is the right place for a minimal reproducible example since tidyverse/haven#31 was closed? the latest version of haven still fails on compressed data.. thanks

# devtools::install_github('biostatmatt/sas7bdat.parso')
library(haven)
library(sas7bdat.parso)


tf1 <- tempfile()

download.file( "http://www.census.gov/housing/extract_files/data%20extracts/cpsasec14/hhld.sas7bdat" , tf1 , mode = 'wb' )

# fails
x1 <- read_sas( tf1 )

# works
y1 <- read.sas7bdat.parso( tf1 )

@hadley
Contributor

hadley commented May 30, 2016

@evanmiller is this on the schedule?

@evanmiller
Contributor Author

No schedule for this. Note that many compression issues were misdiagnosed as binary compression when they were actually bugs in the character decompressor. It seems that 90%+ of compressed files in the wild are character compressed.
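
To help tell the two schemes apart before filing a report, here is a minimal sketch in Python. It assumes the signature strings used by reverse-engineered readers such as the Python sas7bdat package (`SASYZCRL` for character/RLE compression, `SASYZCR2` for binary/Ross compression); these come from reverse engineering, not from SAS documentation, and a plain byte scan is only a heuristic. The function names are hypothetical.

```python
# Signature strings as used by reverse-engineered sas7bdat readers.
# Assumption: they appear verbatim near the start of a compressed file.
RLE_SIGNATURE = b"SASYZCRL"  # character (run-length) compression
RDC_SIGNATURE = b"SASYZCR2"  # binary (Ross Data Compression)


def guess_compression(data: bytes) -> str:
    """Return 'binary', 'character', or 'none' based on a raw byte scan."""
    if RDC_SIGNATURE in data:
        return "binary"
    if RLE_SIGNATURE in data:
        return "character"
    return "none"


def guess_file_compression(path: str, max_bytes: int = 1 << 20) -> str:
    """Scan at most max_bytes from the start of the file at path."""
    with open(path, "rb") as fh:
        return guess_compression(fh.read(max_bytes))
```

Calling `guess_file_compression("hhld.sas7bdat")` on a local copy of a problem file gives a quick hint about which decompressor is involved; an uncompressed file that happens to contain one of these strings as data would be misclassified.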

@evanmiller
Contributor Author

Note that the example file provided by @ajdamico was fixed in 69d5751 and 8c0463a.

@reikoch
Contributor

reikoch commented Nov 17, 2016

If a truly binary compressed SAS dataset is needed, you may use [https://github.com/reikoch/testfiles/blob/master/binary.sas7bdat]. haven 1.0 fails with "ReadStat: Error parsing page 0, bytes 8192-16383".
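
For readers decoding that message: the byte range is consistent with a file whose header occupies the first 8192 bytes and whose pages are 8192 bytes each, so page 0 starts right after the header. Both sizes are assumptions inferred from the reported range, not values read from the file, and the helper name is hypothetical. A small sketch of the arithmetic in Python:

```python
def page_byte_range(page_index: int, header_size: int = 8192, page_size: int = 8192):
    """Return the inclusive (first, last) byte offsets of a data page.

    Assumes pages are laid out back to back immediately after the header.
    """
    start = header_size + page_index * page_size
    return start, start + page_size - 1


first, last = page_byte_range(0)
print(f"page 0, bytes {first}-{last}")
```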

@sclewis23

Is this encoding error related to compression?
Unsupported character set code: 204.
tidyverse/haven#482

@evanmiller
Contributor Author

@sclewis23 No - the error is related to the file's character encoding.

If you know which encoding was used to create the file, I can try to add support.

@sclewis23

@evanmiller - the encoding is set to "any"
outencoding=any

@sclewis23

sclewis23 commented Nov 6, 2019

@evanmiller
here is a sample SAS program that creates one dataset with UTF-8 encoding and one with "any":

#SAS CODE:
# Data Weight2; 
# input IDnumber $ week1 week16; 
# AverageLoss=week1-week16; 
# datalines; 
# 2477 195 163
# 2431 220 198
# 2456 173 155
# 2412 135 116
# ;
# libname outlib '~' outencoding='UTF-8';
# data outlib.Weight_utf8;
# Set Weight2;
# Run;
# libname out_any '~' outencoding='any';
# data out_any.Weight2;
# Set Weight2;
# Run;

@sclewis23

@evanmiller
I can send example files if that helps?

@evanmiller
Contributor Author

If anyone has example files, please try them with this new code branch:

https://github.com/WizardMac/ReadStat/tree/sas-binary-compression

@xiaodaigh

> anyone has example files,

That is usually the problem!

I try to keep track of it here

https://github.com/xiaodaigh/sas7bdat-resources

@sclewis23

> If anyone has example files, please try them with this new code branch:
>
> https://github.com/WizardMac/ReadStat/tree/sas-binary-compression

I'll create a sample tomorrow.

@ofajardo

ofajardo commented Aug 26, 2020

tested OK in pyreadstat with the attached sample file I generated in SAS like this:

data SAMPLES.sample_bincompressed(compress=binary);
set SAMPLES.sample;
run;

The file is stored permanently in the pyreadstat repo at test_data/basic/sample_bincompressed.sas7bdat, for now in the sasbin_dev branch.

sample_bincompressed.sas7bdat.zip

@evanmiller
Contributor Author

@ofajardo Thanks for testing – I will wait a few days for results from other files and then merge if everything looks okay.

@sclewis23

Here is a very small sample file with binary compression (4 rows).

weigth2.zip

@evanmiller
Contributor Author

@sclewis23 Is this the correct data?

"IDnumber","week1","week16","AverageLoss"
"2477",195.000000,163.000000,32.000000
"2431",220.000000,198.000000,22.000000
"2456",173.000000,155.000000,18.000000
"2412",135.000000,116.000000,19.000000
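
One way to sanity-check a decoded file like this is to recompute the derived column. The SAS program earlier in the thread defines `AverageLoss=week1-week16`, so every row of the output above should satisfy that identity. A minimal sketch in Python, using only the standard library; the function name is hypothetical and the CSV text is copied from the output above:

```python
import csv
import io

CSV_TEXT = '''"IDnumber","week1","week16","AverageLoss"
"2477",195.000000,163.000000,32.000000
"2431",220.000000,198.000000,22.000000
"2456",173.000000,155.000000,18.000000
"2412",135.000000,116.000000,19.000000
'''


def check_average_loss(text: str):
    """Return (row_count, ids_of_rows_where_AverageLoss != week1 - week16)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    mismatches = [
        row["IDnumber"]
        for row in rows
        if float(row["week1"]) - float(row["week16"]) != float(row["AverageLoss"])
    ]
    return len(rows), mismatches


row_count, mismatches = check_average_loss(CSV_TEXT)
print(row_count, mismatches)
```

An empty mismatch list with a row count of 4 means the decompressed values are internally consistent with the program that generated them. Exact float comparison is safe here only because the values are small whole numbers.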

@sclewis23

Looks good:
image

@reikoch
Contributor

reikoch commented Aug 27, 2020

Looks good on my two testfiles dates_binary.sas7bdat and dates_longname_binary.sas7bdat in https://github.com/reikoch/testfiles - congratulations!

@evanmiller
Contributor Author

@reikoch Thanks for letting me know. The reports are all positive so I'll get this merged into dev later today.

@evanmiller evanmiller pinned this issue Aug 27, 2020
@evanmiller evanmiller unpinned this issue Aug 27, 2020
@evanmiller
Contributor Author

Merged into master and included in 1.1.4 - closing
