-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial version of num records #295
Conversation
Also needs CFLAGS="-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION" which could be added to setup.py
@@ -78,7 +81,7 @@ cdef extern from "htslib/hts.h": | |||
cdef extern from "htslib/tbx.h": | |||
|
|||
ctypedef struct tbx_t: | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this so that Cython "knows about" the idx member of this struct. However, it has two other members - I'm not sure if it's safe to just ignore them? Originally I copied the full struct definition in here, but that seems worse, if it's not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to find documentation of this in cython, but failed. My memory is that you should either declare none of the fields or all of the fields. The definition of tbx_t
is simple enough that it is at least it is easy to explicitly list all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
idx = self.idx.idx | ||
assert self.hidx == NULL | ||
|
||
if idx == NULL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a bit about this, and since you need an index to compute this, it seems best to make it a hard error so that the user is alerted. Happy to discuss of course (also the type of the exception is debateable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine with me. You could just document in num_records
docstring that is will raise a ValueError if index is not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cyvcf2/tests/test_reader.py
Outdated
assert len(list(b)) == 10 | ||
|
||
b.set_index("{}/test-diff.csi".format(HERE)) | ||
print(b.num_records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails on the hts_idx_get_stat
call, but it's not clear why. I'm assuming this test case is used to check what happens when you automatically load the wrong index, but then replace with a good one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why it would fail on the first call either!?
So the print on 1365 works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you could try reindexing, make a new test.snpeff.bcf.csi and see if it still fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one was pretty weird. It seems that the return value of hts_idx_get_stat
isn't actually diagnostic of errors, and bcftools index -n still reports a value when it is negative. So, I think we just have to ignore it. This test.snpeff.bcf
file works fine when we do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks very much @jeromekelleher
Would like to understand the one failure.
@@ -78,7 +81,7 @@ cdef extern from "htslib/hts.h": | |||
cdef extern from "htslib/tbx.h": | |||
|
|||
ctypedef struct tbx_t: | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to find documentation of this in cython, but failed. My memory is that you should either declare none of the fields or all of the fields. The definition of tbx_t
is simple enough that it is at least it is easy to explicitly list all of them.
idx = self.idx.idx | ||
assert self.hidx == NULL | ||
|
||
if idx == NULL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine with me. You could just document in num_records
docstring that is will raise a ValueError if index is not found.
cyvcf2/cyvcf2.pyx
Outdated
cdef int ret, tid, nseq; | ||
cdef hts_idx_t *idx = NULL; | ||
|
||
if self.hidx == NULL and self.idx == NULL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not required for this PR, but perhaps we should extract out this stuff to a _find_index
method since it's repeated elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created an _open_index
function that's shared between num_records and seqnames, as they are very similar. I'm a bit wary of the other places the indexes are used, as it's really not clear to me when you have to use the tbx struct (which has an embedded hts_idx) and when you can use the other for queries. I thought it best to leave well enough alone for now.
cyvcf2/tests/test_reader.py
Outdated
assert len(list(b)) == 10 | ||
|
||
b.set_index("{}/test-diff.csi".format(HERE)) | ||
print(b.num_records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why it would fail on the first call either!?
So the print on 1365 works?
cyvcf2/tests/test_reader.py
Outdated
assert len(list(b)) == 10 | ||
|
||
b.set_index("{}/test-diff.csi".format(HERE)) | ||
print(b.num_records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you could try reindexing, make a new test.snpeff.bcf.csi and see if it still fails
Update testing for num_records
Remove return value check, as it wasn't diagnostic
OK, I think this is ready for another look @brentp ! I added an extra index-freeing step in |
nicely done. will tag a new version shortly. |
Awesome, thanks! |
I nerd-sniped myself into doing #294 😆
It seems to be mostly working, but I'm not familiar either with Cython or htslib, so I'm not that confident there's no nasty surprises lurking. The logic around opening indexes is tricky: I've tried to reproduce what's used elsewhere in the file, but I'm really not clear about the lifetimes etc.
It's also not clear to me when this fails - does it need to have an index created by a recent version of bcftools?
I'd like to do some more testing, where we check it against more bcf files, and ideally with indexes created by different tools.
I'll make a few inline comments on specifics.
I also fixed a few compiler warnings in 191654e - happy to drop this commit if it's off-topic.