Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Parsing BAM headers with 64 reference sequences #2423

Merged
merged 1 commit into from
Mar 4, 2021

Conversation

eseiler
Copy link
Member

@eseiler eseiler commented Mar 3, 2021

Resolves #2422

Todo:

  • BAM works, SAM does not
  • Test
  • Changelog
  • Check for performance regression

@vercel
Copy link

vercel bot commented Mar 3, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/seqan/seqan3/86UZWL3SkiF2gqcubyBuS4Zh1gdz
✅ Preview: https://seqan3-git-fork-eseiler-fix-bam2-seqan.vercel.app

};

while (is_char<'@'>(*it))
while (it != end && is_char<'@'>(*it))
Copy link
Member Author

@eseiler eseiler Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it != end is used in BAM, the stream_view is a view_take_exactly_or_throw and we need to stop at the end (the size is known).

is_char<'@'>(*it) is used in SAM, the stream_view is a view over the file and we stop when we read all lines that start with @.

@codecov
Copy link

codecov bot commented Mar 3, 2021

Codecov Report

Merging #2423 (674a178) into master (3cd33a0) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2423   +/-   ##
=======================================
  Coverage   98.23%   98.24%           
=======================================
  Files         267      267           
  Lines       10900    10917   +17     
=======================================
+ Hits        10708    10725   +17     
  Misses        192      192           
Impacted Files Coverage Δ
include/seqan3/io/sam_file/format_sam_base.hpp 98.12% <100.00%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 803ee1d...7d0230f. Read the comment docs.

@eseiler eseiler marked this pull request as ready for review March 3, 2021 11:38
@eseiler eseiler requested review from a team and joergi-w and removed request for a team March 3, 2021 11:39
@@ -179,6 +179,47 @@ struct sam_file_read<seqan3::format_bam> : public sam_file_data
'\x66', '\x66', '\x66', '\x46', '\x40', '\x7A', '\x7A', '\x5A', '\x73', '\x74', '\x72', '\x00'
};

std::string many_refs{[] ()
Copy link
Member Author

@eseiler eseiler Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BAM would be multiple hundred lines long, that's why it is generated.

The comments + BAM specification make it feasible to understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have also generated a bam file and added it as file :D

@@ -80,6 +80,7 @@ If possible, provide tooling that performs the changes, e.g. a shell-script.

* Requesting the alignment without also requesting the sequence for BAM files containing empty CIGAR strings does now
not result in erroneous parsing ([\#2418](https://github.com/seqan/seqan3/pull/2418)).
* BAM files with 64 references are now parsed correctly ([\#2423](https://github.com/seqan/seqan3/pull/2423)).
Copy link
Member Author

@eseiler eseiler Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it's every number of references whose last byte is \x40, e.g. it's also affecting 320, but I didn't come up with a concise way to describe this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(mod 64) ?

Copy link
Member Author

@eseiler eseiler Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it works properly for 128. The first/last(?) byte has to look like 01000000 or 11000000, i.e. the first least significant bit to be set is the seventh...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right: (mod 256) == 64

Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked this PR regarding the syntax and what I know about I/O, but please get another review from someone who knows the sam/bam specific semantics better than me...

Comment on lines 476 to 514
auto take_until_colon_and_consume = [&it] ()
{
while (!is_char<':'>(*it))
++it;
++it;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

views::take_until(is_char<':'>) doesn't work here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, see the issue.
We would make a copy of the stream_view and hence not update the size of the orignal stream_view

@eseiler eseiler requested review from a team and marehr and removed request for a team March 3, 2021 17:30
Copy link
Member

@marehr marehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you :)

@@ -80,6 +80,7 @@ If possible, provide tooling that performs the changes, e.g. a shell-script.

* Requesting the alignment without also requesting the sequence for BAM files containing empty CIGAR strings does now
not result in erroneous parsing ([\#2418](https://github.com/seqan/seqan3/pull/2418)).
* BAM files with 64 references are now parsed correctly ([\#2423](https://github.com/seqan/seqan3/pull/2423)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(mod 64) ?

include/seqan3/io/sam_file/format_sam_base.hpp Outdated Show resolved Hide resolved
include/seqan3/io/sam_file/format_sam_base.hpp Outdated Show resolved Hide resolved
include/seqan3/io/sam_file/format_sam_base.hpp Outdated Show resolved Hide resolved
include/seqan3/io/sam_file/format_sam_base.hpp Outdated Show resolved Hide resolved
@@ -459,16 +459,34 @@ inline void format_sam_base::read_header(stream_view_type && stream_view,
ref_seqs_type & /*ref_id_to_pos_map*/)
{
auto it = std::ranges::begin(stream_view);
auto end = std::ranges::end(stream_view);
std::string string_buffer{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use vector here, because std::string can be quite slower because of the small string optimization

@@ -179,6 +179,47 @@ struct sam_file_read<seqan3::format_bam> : public sam_file_data
'\x66', '\x66', '\x66', '\x46', '\x40', '\x7A', '\x7A', '\x5A', '\x73', '\x74', '\x72', '\x00'
};

std::string many_refs{[] ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have also generated a bam file and added it as file :D

Copy link
Member

@marehr marehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice!

@marehr marehr merged commit a0e4787 into seqan:master Mar 4, 2021
@eseiler eseiler deleted the fix/bam2 branch May 14, 2021 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SeqAn parses BAM headers past the value given in l_text
3 participants