[FIX] Parsing BAM headers with 64 reference sequences #2423
    };

-   while (is_char<'@'>(*it))
+   while (it != end && is_char<'@'>(*it))
`it != end` is used in BAM: the `stream_view` is a `view_take_exactly_or_throw` and we need to stop at the end (the size is known).
`is_char<'@'>(*it)` is used in SAM: the `stream_view` is a view over the file and we stop when we have read all lines that start with `@`.
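The two conditions can be illustrated with a minimal, standalone sketch. It does not use seqan3; `count_header_lines` is a hypothetical helper and a `std::string_view` stands in for the `stream_view`. The point is only that the loop has to stop either at the end of a bounded range (the BAM case) or at the first line that does not start with `@` (the SAM case).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of seqan3: counts the leading lines that start with '@'.
std::size_t count_header_lines(std::string_view text)
{
    auto it = text.begin();
    auto const end = text.end();
    std::size_t lines = 0;

    // `it != end` covers a range that ends right after the header (BAM-like),
    // `*it == '@'` covers a range where records follow the header (SAM-like).
    while (it != end && *it == '@')
    {
        ++lines;
        while (it != end && *it != '\n') // skip the rest of the header line
            ++it;
        if (it != end)                   // consume the newline if there is one
            ++it;
    }
    return lines;
}
```

Without the `it != end` check, the second loop condition would dereference the end of the range whenever the input consists of header lines only.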
Codecov Report
@@ Coverage Diff @@
## master #2423 +/- ##
=======================================
Coverage 98.23% 98.24%
=======================================
Files 267 267
Lines 10900 10917 +17
=======================================
+ Hits 10708 10725 +17
Misses 192 192
@@ -179,6 +179,47 @@ struct sam_file_read<seqan3::format_bam> : public sam_file_data
        '\x66', '\x66', '\x66', '\x46', '\x40', '\x7A', '\x7A', '\x5A', '\x73', '\x74', '\x72', '\x00'
    };

    std::string many_refs{[] ()
The BAM would be several hundred lines long, which is why it is generated.
The comments, together with the BAM specification, make it feasible to understand.
We could have also generated a bam file and added it as file :D
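For readers who want to see what such generated test data could look like, here is a rough sketch. It is not the code from this PR. It only builds the reference part of a BAM header as laid out in the BAM specification: a little-endian `n_ref`, followed per reference by `l_name`, the NUL-terminated name, and `l_ref`. The function names, the `ref_<i>` naming scheme and the fixed reference length are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends a 32-bit value in little-endian byte order, as BAM stores its integers.
void append_int32_le(std::string & out, std::int32_t value)
{
    auto const bits = static_cast<std::uint32_t>(value);
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<char>((bits >> shift) & 0xFFu));
}

// Hypothetical generator: n_ref, then for each reference l_name, name + NUL, l_ref.
std::string make_reference_block(std::int32_t n_ref, std::int32_t ref_length)
{
    std::string out;
    append_int32_le(out, n_ref);
    for (std::int32_t i = 0; i < n_ref; ++i)
    {
        std::string const name = "ref_" + std::to_string(i); // assumed naming scheme
        append_int32_le(out, static_cast<std::int32_t>(name.size() + 1)); // l_name includes the NUL
        out += name;
        out.push_back('\0');
        append_int32_le(out, ref_length);
    }
    return out;
}
```

With `n_ref == 64`, the first byte of the block is `\x40`, which is the character `@`. That byte directly follows the header text in a BAM file.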
@@ -80,6 +80,7 @@ If possible, provide tooling that performs the changes, e.g. a shell-script.

  * Requesting the alignment without also requesting the sequence for BAM files containing empty CIGAR strings does now
    not result in erroneous parsing ([\#2418](https://github.com/seqan/seqan3/pull/2418)).
+ * BAM files with 64 references are now parsed correctly ([\#2423](https://github.com/seqan/seqan3/pull/2423)).
Technically it's every number of references whose last byte is `\x40`, e.g. it's also affecting 320, but I didn't come up with a concise way to describe this.
(mod 64) ?
Well, it works properly for 128. The first/last(?) byte has to look like 01000000 or 11000000, i.e. the first least significant bit to be set is the seventh...
you are right: (mod 256) == 64
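A small sketch of that condition, assuming (as in the BAM specification) that `n_ref` is stored as a little-endian 32-bit integer, so the byte that follows the header text is its least significant byte. `first_byte_is_at` is a hypothetical name used only here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if the first byte of the little-endian n_ref is '@' (0x40),
// i.e. if n_ref mod 256 == 64.
constexpr bool first_byte_is_at(std::int32_t n_ref)
{
    return (static_cast<std::uint32_t>(n_ref) & 0xFFu) == 0x40u;
}

static_assert(first_byte_is_at(64));   // 0x00000040
static_assert(first_byte_is_at(320));  // 0x00000140
static_assert(!first_byte_is_at(128)); // 0x00000080
```

This matches the two examples from the thread: 64 and 320 are affected, 128 is not.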
I checked this PR regarding the syntax and what I know about I/O, but please get another review from someone who knows the sam/bam specific semantics better than me...
    auto take_until_colon_and_consume = [&it] ()
    {
        while (!is_char<':'>(*it))
            ++it;
        ++it;
    };
`views::take_until(is_char<':'>)` doesn't work here?
No, see the issue.
We would make a copy of the `stream_view` and hence not update the size of the original `stream_view`.
thank you :)
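To make the point about advancing the original iterator concrete, here is a standalone sketch of the same idea outside of seqan3. The free function below is a hypothetical stand-in for the lambda in the diff. It takes the iterator by reference, so the caller's position moves along with it. The `it != end` guard is an addition for this sketch and is not part of the lambda shown above.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical stand-in for the lambda: advances the caller's iterator past the next ':'.
template <typename iterator_t, typename sentinel_t>
void take_until_colon_and_consume(iterator_t & it, sentinel_t const end)
{
    while (it != end && *it != ':') // guard added for this sketch
        ++it;
    if (it != end)
        ++it;                       // skip the ':' itself
}
```

Called on the tag `VN:1.6`, the iterator ends up on the `1`, and any code that continues to read from the same iterator sees the updated position.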
@@ -459,16 +459,34 @@ inline void format_sam_base::read_header(stream_view_type && stream_view,
                                          ref_seqs_type & /*ref_id_to_pos_map*/)
{
    auto it = std::ranges::begin(stream_view);
    auto end = std::ranges::end(stream_view);
    std::string string_buffer{};
Please use a vector here, because `std::string` can be considerably slower due to the small string optimization.
@@ -179,6 +179,47 @@ struct sam_file_read<seqan3::format_bam> : public sam_file_data
        '\x66', '\x66', '\x66', '\x46', '\x40', '\x7A', '\x7A', '\x5A', '\x73', '\x74', '\x72', '\x00'
    };

    std::string many_refs{[] ()
We could have also generated a bam file and added it as file :D
very nice!
Resolves #2422
Todo: