-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix categorical (factor) handling in native R reader #86
Conversation
@lazappi will have more comments, but one thing I'll note is that the change must be back-compatible with the previous H5AD formats. I'm not familiar with the differences but it sounds like (A better approach would be to detect the file version up-front and pass it along to each internal function, which avoids the need for guessing inside each function. However, this would involve a more dramatic set of changes.) Also, I have to say it, but the surrounding code uses 4-space indenting, and so should you. |
Thanks for catching that, fixed now.
I think the changes should be back-compatible. I did not delete the handling for the old categories format, but simply added code to handle factors encoded in the new format. Theoretically I think it could even handle an h5ad that contains a mix of factors encoded in both old & new formats (not that such a scenario would happen). I could switch to an explicit if/else instead if preferable, perhaps by checking if the EDIT: I went ahead and updated the code to wrap it in an if/else that checks if the |
Hi @jackkamm. Thanks for the contribution! The R reader is still a work in progress but I know some people have tried using it so this will be appreciated. I just have a couple of points:
|
Thanks for the reference -- looks like more work needs to be done to support nullable booleans and ints as per the spec. I will look into it.
Sounds good, I've added a test, still a work-in-progress since it needs to check the nullable booleans mentioned above.
I think it's basically the same issue, I was getting the same error message, and the PR fixed it for me (as in I was able to read in the h5ad). But the original PR only fixed a subset of the problem (factors/categories), a little more work needs to be done to make sure other types are also converted correctly. |
👍🏻 It sounds like you are still working on some things which is great, just ping me when you are ready for a review |
I've added handling for nullable ints and bools, so I believe the implementation satisfies the v0.8 spec now [1], and is ready for a review. For the test, I found that Footnotes: [1] The spec also mentions string handling, but I didn't test it because AnnData.write always converts strings to factors: https://github.com/scverse/anndata/blob/8e793af01a77d0e31e91a72f3988df7d6de9cdc5/anndata/_io/h5ad.py#L58 [2] [3] Here is the Python code I used to create |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R/read.R
Outdated
@@ -289,16 +298,57 @@ readH5AD <- function(file, X_name = NULL, use_hdf5 = FALSE, | |||
mat | |||
} | |||
|
|||
.read_convert <- function(file, path, recursive=FALSE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we think we would ever want to do this non-recursively? Not a big deal, just wondering if the argument is needed (but I think it's fine to leave it).
R/read.R
Outdated
rhdf5::h5read(file, file.path(path, col_name)) | ||
) | ||
out_cols[[col_name]] <- .read_convert(file, file.path(path, col_name), | ||
recursive=FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see we use recursive=FALSE
here, can probably ignore the earlier comment then
tests/testthat/test-read.R
Outdated
# check colData columns that Python reader is able to handle | ||
good_coldat_columns <- c('cell_type', 'dummy_bool', 'dummy_int', | ||
'dummy_num', 'dummy_num2') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the NA
thing that doesn't work with the Python reader or something else?
expect_equal(colData(sce_r)$dummy_bool2, | ||
c(FALSE, NA, rep(TRUE, 638))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. This shouldn't be happening, any thoughts on what might be going on here?
tests/testthat/test-read.R
Outdated
# a bug in the python reader?) | ||
expect_equal( | ||
as.vector(metadata(sce_py)[['dummy_bool']]), | ||
metadata(sce_r)[['dummy_bool']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm...not sure. It may be something to do with how {reticulate} converts single values.
I think we need to look into what is going on with the Python reader, I'm not sure about the environment business... That can probably be a separate PR though, but if you want to open an issue about it that would be helpful.
I don't think it always happens, but yes, most of the time it does.
Another thing to check...
This should be added to |
Thanks for the helpful reviews :) I might be a little slow getting to it due to some other deadlines right now, but hopefully should respond & have this done later next week |
a2e8183
to
f639a90
Compare
I've revised this PR now based on the feedback. I also squashed all the commits and force pushed. Biggest changes are:
Also in regard to this:
Seems like the conversion doesn't happen when the string values are all unique. I added such a column of unique strings to the test data. rhdf5 seems able to read it without any issue. |
Another minor comment I forgot to add:
This might be a relatively minor issue. When I tested it before, rhdf5 and h5py both converted the -2^31 to NA/nan when reading it into R/python, so users may not notice the issue in practice. |
R/read.R
Outdated
version <- match.arg(version, .AnnDataVersions) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is the right thing to do. The current .AnnDataVersions
records the versions there are {basilisk} environments for but that might not match up with all the possible Python versions. It probably depends a bit what this is used for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I changed it now to:
if (is.null(version)) {
version <- .AnnDataVersions[1]
}
so when it's null it'll match the default python version, but otherwise isn't constrained.
R/read.R
Outdated
# Should we wrap this in suppressWarnings? rhdf5 will warn that it | ||
# can't yet read enum (factor/boolean) in attributes | ||
element_attrs <- rhdf5::h5readAttributes(file, path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm...maybe. If it's something we handle so the warning isn't meaningful to the user then I guess that would make sense. There's just the risk that we suppress a future warning that it would be useful to know about.
R/read.R
Outdated
# Can't determine orderedness due to rhdf5 not yet supporting | ||
# enums in attributes | ||
#ord <- as.logical(element_attrs[["ordered"]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I guess this is where the warning comes in. If we aren't really addressing it maybe we should leave it (and try to fix this upstream).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree it would be good to fix this upstream...
Also, I consolidated the 2 comments into a single comment, to try and make this easier to keep track of in future.
Thinking about it some more, I may have been confused about the "version" argument, and maybe it was a mistake to try to use it at all. After all the python AnnData package is able to read in older AnnData just fine without needing to specify the version. So we should too. As it stands now, the PR won't properly read AnnData v0.7 when I'm pretty sure if I remove the Let me know if you want me to revert the explicit version handling and I'll update the PR accordingly. |
R/read.R
Outdated
) | ||
out_cols[[cat_name]] <- factor(out_cols[[cat_name]]) | ||
levels(out_cols[[cat_name]]) <- levels | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the problem I mentioned in my comment just now. AnnData 0.7 categories are only converted if version
explicitly specified.
So, in the Python reader the version is there to control which environment is used (and therefore which AnnData file version is read/written). In the R reader we aren't messing around with environments so it maybe isn't needed. It just depends whether it's easier to detect what file version has been used or ask the user to specify it. If we were looking at a writer it would be a bit different because in that case we would want to give the user control over which file version is written. |
I made 1 more commit, so that the version isn't passed into native R reader anymore. Instead, native R reader just tries to figure out the right thing to do based on the keywords/attributes it sees, like in the original version of this PR. I think it's probably better this way, so the user doesn't need to explicitly specify the version. But I can see the argument the other way also. So I leave this last change as a separate unsquashed commit -- feel free to revert it if you prefer the previous approach that explicitly passes in the version. |
Thanks for all your work on this! I suspect it might need some tweaks in the future but I have merged what we have so far. |
When I try to use the native R reader on an h5ad file I have, like so:
I get the following warning:
And the
colData
androwData
of the returned SCE are empty.Tracking it down, it's because zellkonverter was unable to convert some Categorical (factor) columns in the obs/var. It's because AnnData 0.8 changed the way Categoricals are encoded.
This updates the native R reader so it can read factors encoded with the newer format. After applying this patch I was able to read my h5ad file.
EDIT:
Related: #78
Thought I should also add why I'm using the R reader rather than the recommended python reader. It's because the python reader had errors reading X/layers matrices from my h5ad file due to unsorted indptr (seems to be the same issue as theislab/anndata2ri#51). I was able to fix some of the errors by calling
sort_indices()
and resaving the AnnData, but some matrices still wouldn't load, and the ones that did were loaded as dgCMatrix instead of DelayedArray. By contrast I found the R reader successfully read all matrices from my h5ad file as DelayedArray.