Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix categorical (factor) handling in native R reader #86

Merged
merged 2 commits into from
Mar 15, 2023

Conversation

jackkamm
Copy link
Contributor

@jackkamm jackkamm commented Feb 6, 2023

When I try to use the native R reader on an h5ad file I have, like so:

sce <- readH5AD("/path/to/my/anndata.h5ad", 
                reader='R',
                use_hdf5=TRUE,
                verbose=TRUE)

I get the following warning:

Warning messages:
1: In value[[3L]](cond) : setting 'colData' failed for
  '/path/to/my/anndata.h5ad':
  cannot coerce class "list" to a DataFrame
2: In value[[3L]](cond) : setting 'rowData' failed for
  '/path/to/my/anndata.h5ad':
  cannot coerce class "list" to a DataFrame

And the colData and rowData of the returned SCE are empty.

Tracking it down, it's because zellkonverter was unable to convert some Categorical (factor) columns in the obs/var. It's because AnnData 0.8 changed the way Categoricals are encoded.

This updates the native R reader so it can read factors encoded with the newer format. After applying this patch I was able to read my h5ad file.

EDIT:

Related: #78

Thought I should also add why I'm using the R reader rather than the recommended python reader. It's because the python reader had errors reading X/layers matrices from my h5ad file due to unsorted indptr (seems to be the same issue as theislab/anndata2ri#51). I was able to fix some of the errors by calling sort_indices() and resaving the AnnData, but some matrices still wouldn't load, and the ones that did were loaded as dgCMatrix instead of DelayedArray. By contrast I found the R reader successfully read all matrices from my h5ad file as DelayedArray.

@LTLA
Copy link
Contributor

LTLA commented Feb 6, 2023

@lazappi will have more comments, but one thing I'll note is that the change must be back-compatible with the previous H5AD formats. I'm not familiar with the differences but it sounds like __categories no longer exists in the new format, in which case you could check for that in names(fields) and switch to your approach if it doesn't exist. This avoids weird interactions between old versions of the files that happen to use the new keywords.

(A better approach would be to detect the file version up-front and pass it along to each internal function, which avoids the need for guessing inside each function. However, this would involve a more dramatic set of changes.)

Also, I have to say it, but the surrounding code uses 4-space indenting, and so should you.

@jackkamm
Copy link
Contributor Author

jackkamm commented Feb 7, 2023

Also, I have to say it, but the surrounding code uses 4-space indenting, and so should you.

Thanks for catching that, fixed now.

but one thing I'll note is that the change must be back-compatible with the previous H5AD formats

I think the changes should be back-compatible. I did not delete the handling for the old categories format, but simply added code to handle factors encoded in the new format. Theoretically I think it could even handle an h5ad that contains a mix of factors encoded in both old & new formats (not that such a scenario would happen).

I could switch to an explicit if/else instead if preferable, perhaps by checking if the __categories key exists, or maybe there's a better way to determine which format the AnnData is in?

EDIT: I went ahead and updated the code to wrap it in an if/else that checks if the __categories key exists.

@lazappi
Copy link
Member

lazappi commented Feb 7, 2023

Hi @jackkamm. Thanks for the contribution! The R reader is still a work in progress but I know some people have tried using it so this will be appreciated. I just have a couple of points:

@jackkamm
Copy link
Contributor Author

jackkamm commented Feb 9, 2023

There is now a written spec for the anndata 0.8 H5AD format https://anndata.readthedocs.io/en/latest/fileformat-prose.html. If you could check that everything is consistent that would be great.

Thanks for the reference -- looks like more work needs to be done to support nullable booleans and ints as per the spec. I will look into it.

It would be great to have at least one test for this (either a normal test or a long test). The simplest thing might be to write a file with the Python 0.8 writer and see if it can be read with the R reader (making sure it has the right kind of columns).

Sounds good, I've added a test, still a work-in-progress since it needs to check the nullable booleans mentioned above.

Do you think this solves #78 as well or is that a slightly different issue?

I think it's basically the same issue, I was getting the same error message, and the PR fixed it for me (as in I was able to read in the h5ad). But the original PR only fixed a subset of the problem (factors/categories), a little more work needs to be done to make sure other types are also converted correctly.

@lazappi
Copy link
Member

lazappi commented Feb 9, 2023

👍🏻 It sounds like you are still working on some things which is great, just ping me when you are ready for a review

@jackkamm
Copy link
Contributor Author

jackkamm commented Feb 17, 2023

I've added handling for nullable ints and bools, so I believe the implementation satisfies the v0.8 spec now [1], and is ready for a review.

For the test, I found that writeH5AD() wasn't quite consistent with the spec [2], therefore I manually created a separate AnnData v0.8 object in Python [3], which I saved to the git repo.

Footnotes:

[1] The spec also mentions string handling, but I didn't test it because AnnData.write always converts strings to factors: https://github.com/scverse/anndata/blob/8e793af01a77d0e31e91a72f3988df7d6de9cdc5/anndata/_io/h5ad.py#L58

[2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.

[3] Here is the Python code I used to create krumsiek11_augmented_v0-8.h5ad: https://gist.github.com/jackkamm/3b606d15d83063ed8e5f03ae1c7ab928

Copy link
Member

@lazappi lazappi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm pretty happy with this. I have made a few comments but they are mostly questions. I think this is a fairly significant contribution so I would be happy for you to add yourself as a contributor in the DESCRIPTION if you like.

@LTLA @ivirshup do you have any further comments?

R/read.R Outdated
@@ -289,16 +298,57 @@ readH5AD <- function(file, X_name = NULL, use_hdf5 = FALSE,
mat
}

.read_convert <- function(file, path, recursive=FALSE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think we would ever want to do this non-recursively? Not a big deal, just wondering if the argument is needed (but I think it's fine to leave it).

R/read.R Outdated Show resolved Hide resolved
R/read.R Outdated Show resolved Hide resolved
R/read.R Outdated Show resolved Hide resolved
R/read.R Outdated
rhdf5::h5read(file, file.path(path, col_name))
)
out_cols[[col_name]] <- .read_convert(file, file.path(path, col_name),
recursive=FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see we use recursive=FALSE here, can probably ignore the earlier comment then

tests/testthat/test-read.R Outdated Show resolved Hide resolved
tests/testthat/test-read.R Outdated Show resolved Hide resolved
Comment on lines 91 to 93
# check colData columns that Python reader is able to handle
good_coldat_columns <- c('cell_type', 'dummy_bool', 'dummy_int',
'dummy_num', 'dummy_num2')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the NA thing that doesn't work with the Python reader or something else?

Comment on lines +100 to +99
expect_equal(colData(sce_r)$dummy_bool2,
c(FALSE, NA, rep(TRUE, 638)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. This shouldn't be happening, any thoughts on what might be going on here?

# a bug in the python reader?)
expect_equal(
as.vector(metadata(sce_py)[['dummy_bool']]),
metadata(sce_r)[['dummy_bool']]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm...not sure. It may be something to do with how {reticulate} converts single values.

@lazappi
Copy link
Member

lazappi commented Feb 20, 2023

I've added handling for nullable ints and bools, so I believe the implementation satisfies the v0.8 spec now [1], and is ready for a review.

For the test, I found that writeH5AD() wasn't quite consistent with the spec [2], therefore I manually created a separate AnnData v0.8 object in Python [3], which I saved to the git repo.

I think we need to look into what is going on with the Python reader, I'm not sure about the environment business... That can probably be a separate PR though, but if you want to open an issue about it that would be helpful.

Footnotes:

[1] The spec also mentions string handling, but I didn't test it because AnnData.write always converts strings to factors: scverse/anndata@8e793af/anndata/_io/h5ad.py#L58

I don't think it always happens, but yes, most of the time it does.

[2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.

Another thing to check...

[3] Here is the Python code I used to create krumsiek11_augmented_v0-8.h5ad: gist.github.com/jackkamm/3b606d15d83063ed8e5f03ae1c7ab928

This should be added to inst/scripts as a description of the dataset. If you could add a comment with the important package versions that would be great as well.

@jackkamm
Copy link
Contributor Author

Thanks for the helpful reviews :) I might be a little slow getting to it due to some other deadlines right now, but hopefully should respond & have this done later next week

@jackkamm jackkamm force-pushed the master branch 2 times, most recently from a2e8183 to f639a90 Compare March 5, 2023 21:53
@jackkamm
Copy link
Contributor Author

jackkamm commented Mar 5, 2023

I've revised this PR now based on the feedback. I also squashed all the commits and force pushed.

Biggest changes are:

  • Use the encoding-type attribute during conversion. Also pass in the path to the recursive conversion function in order to do this.
  • Pass version along to internal functions as @LTLA recommended earlier. If version < 0.8, then behavior of native reader is unchanged from before.

Also in regard to this:

I don't think it [string to factor conversion] always happens, but yes, most of the time it does.

Seems like the conversion doesn't happen when the string values are all unique. I added such a column of unique strings to the test data. rhdf5 seems able to read it without any issue.

@jackkamm
Copy link
Contributor Author

jackkamm commented Mar 5, 2023

Another minor comment I forgot to add:

[2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.
Another thing to check...

This might be a relatively minor issue. When I tested it before, rhdf5 and h5py both converted the -2^31 to NA/nan when reading it into R/python, so users may not notice the issue in practice.

R/read.R Outdated
Comment on lines 110 to 113
version <- match.arg(version, .AnnDataVersions)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the right thing to do. The current .AnnDataVersions records the versions there are {basilisk} environments for but that might not match up with all the possible Python versions. It probably depends a bit what this is used for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I changed it now to:

if (is.null(version)) {
    version <- .AnnDataVersions[1]
}

so when it's null it'll match the default python version, but otherwise isn't constrained.

R/read.R Outdated
Comment on lines 313 to 315
# Should we wrap this in suppressWarnings? rhdf5 will warn that it
# can't yet read enum (factor/boolean) in attributes
element_attrs <- rhdf5::h5readAttributes(file, path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...maybe. If it's something we handle so the warning isn't meaningful to the user then I guess that would make sense. There's just the risk that we suppress a future warning that it would be useful to know about.

R/read.R Outdated
Comment on lines 323 to 325
# Can't determine orderedness due to rhdf5 not yet supporting
# enums in attributes
#ord <- as.logical(element_attrs[["ordered"]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess this is where the warning comes in. If we aren't really addressing it maybe we should leave it (and try to fix this upstream).

Copy link
Contributor Author

@jackkamm jackkamm Mar 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree it would be good to fix this upstream...

Also, I consolidated the 2 comments into a single comment, to try and make this easier to keep track of in future.

@jackkamm
Copy link
Contributor Author

jackkamm commented Mar 7, 2023

Thinking about it some more, I may have been confused about the "version" argument, and maybe it was a mistake to try to use it at all.

After all the python AnnData package is able to read in older AnnData just fine without needing to specify the version. So we should too.

As it stands now, the PR won't properly read AnnData v0.7 when version=NULL, because the old-style conversion for categories only happens when we explicitly specify a version < 0.8.

I'm pretty sure if I remove the compareVersion calls, the PR should be compatible with both old and new AnnData versions automatically.

Let me know if you want me to revert the explicit version handling and I'll update the PR accordingly.

R/read.R Outdated
)
out_cols[[cat_name]] <- factor(out_cols[[cat_name]])
levels(out_cols[[cat_name]]) <- levels
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the problem I mentioned in my comment just now. AnnData 0.7 categories are only converted if version explicitly specified.

@lazappi
Copy link
Member

lazappi commented Mar 7, 2023

So, in the Python reader the version is there to control which environment is used (and therefore which AnnData file version is read/written). In the R reader we aren't messing around with environments so it maybe isn't needed. It just depends whether it's easier to detect what file version has been used or ask the user to specify it.

If we were looking at a writer it would be a bit different because in that case we would want to give the user control over which file version is written.

@jackkamm
Copy link
Contributor Author

I made 1 more commit, so that the version isn't passed into native R reader anymore. Instead, native R reader just tries to figure out the right thing to do based on the keywords/attributes it sees, like in the original version of this PR.

I think it's probably better this way, so the user doesn't need to explicitly specify the version.

But I can see the argument the other way also. So I leave this last change as a separate unsquashed commit -- feel free to revert it if you prefer the previous approach that explicitly passes in the version.

@lazappi lazappi merged commit 69b1c92 into theislab:master Mar 15, 2023
@lazappi
Copy link
Member

lazappi commented Mar 15, 2023

Thanks for all your work on this! I suspect it might need some tweaks in the future but I have merged what we have so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants