Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Allow '*' at the beginning of string fields in SAM format. #2184

Merged
merged 3 commits into from
Oct 16, 2020

Conversation

smehringer
Copy link
Member

@smehringer smehringer commented Oct 2, 2020

An error that occurred for @h-2. The offending line was added as a test.

Fixes #2195

include/seqan3/io/alignment_file/format_sam_base.hpp Outdated Show resolved Hide resolved
@@ -407,7 +407,7 @@ inline void format_sam::read_alignment_record(stream_type & stream,
"The ref_offset must be a specialisation of std::optional.");

auto stream_view = views::istreambuf(stream);
auto field_view = stream_view | views::take_until_or_throw_and_consume(is_char<'\t'>);
auto field_view = stream_view | views::take_until_or_throw(is_char<'\t'>);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the initial problem was that * indicates and empty SAM field and the function read_field only checks if * is present (before this PR). But I did not check if the field ONLY contains the *.

So I need to check if there is a \t after the *but I cannot do this if the \t is consumed in general (the change you asked the reason for). Note that I cannot check if field_view is empty instead of checking if the next char is \t because on an input_range the take.. view will always "regenerate".

// input_range = "1\t2\t3\n"
auto v = input_range | take_until_and_consume('\t');
seqan3::debug_stream << v << std::endl; // prints 1
seqan3::debug_stream << v << std::endl;  // prints 2

Is that understandable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep; unfortunate that we can't keep the consume :(

Copy link
Member

@h-2 h-2 Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only problematic if there are empty fields / double tabs. This is not allowed in SAM/BAM.

Note that in your example vis "at-end" in between the calls, i.e. the first begin iterator does compare true to the end. Otherwise it wouldn't stop printing after "1". Only when you create a new iterator (which implicitly happens in your example) does this new iterator start at the next field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in your example vis "at-end" in between the calls, i.e. the first begin iterator does compare true to the end. Otherwise it wouldn't stop printing after "1". Only when you create a new iterator (which implicitly happens in your example) does this new iterator start at the next field.

Ah ok thanks. That's the fact that I was missing when trying to fix this :)

@h-2
Copy link
Member

h-2 commented Oct 2, 2020

Thank you for looking into this so quickly!

This slightly shorter patch works for me, too;

template <typename stream_view_type, std::ranges::forward_range target_range_type>
inline void format_sam_base::read_field(stream_view_type && stream_view, target_range_type & target)
{
    auto it = std::ranges::begin(stream_view);
    if (it == std::ranges::end(stream_view))
        return;

    char c = *it;
    ++it;

    using val_t = std::ranges::range_value_t<target_range_type>;

    if (it != std::ranges::end(stream_view)) // This is a string with more than one character
    {
        // the first character could be a '*'
        target.push_back(assign_char_to(c, val_t{}));
        std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target));
    }
    else if (c != '*') // single '*' means "empty sequence"; other character is 1-element range
    {
        target.push_back(assign_char_to(c, val_t{}));
    }
}

Just replace read_field with this, no need to change anything else.

format_sam_test and format_bam_test passed for me, also by problematic file.

@smehringer smehringer force-pushed the io_alignment_file_fix_read_field branch from 35a01e5 to b59ef1f Compare October 5, 2020 06:19
@codecov
Copy link

codecov bot commented Oct 5, 2020

Codecov Report

Merging #2184 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2184   +/-   ##
=======================================
  Coverage   98.07%   98.07%           
=======================================
  Files         262      262           
  Lines       10675    10676    +1     
=======================================
+ Hits        10470    10471    +1     
  Misses        205      205           
Impacted Files Coverage Δ
...clude/seqan3/io/alignment_file/format_sam_base.hpp 98.65% <100.00%> (+<0.01%) ⬆️
include/seqan3/argument_parser/argument_parser.hpp 98.67% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17c3033...0372bea. Read the comment docs.

Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but you deleted the test?

@smehringer
Copy link
Member Author

ah damnit

@smehringer smehringer force-pushed the io_alignment_file_fix_read_field branch from 53fcee6 to 01137d8 Compare October 5, 2020 09:12
@h-2
Copy link
Member

h-2 commented Oct 5, 2020

In general quite a few places like https://github.com/seqan/seqan3/blob/master/include/seqan3/io/alignment_file/format_bam.hpp#L954 use very many repeated calls to std::ranges::begin(stream_view) where I think it would be enough to just call auto it = std::ranges::begin(stream_view) once and than work with it. This is

  1. safer, because it prevents the potential "reset" of the stream discussed here
  2. easier to read (++it vs std::ranges::next(std::ranges::begin(stream_view)) )
  3. maybe even a little faster

@smehringer smehringer force-pushed the io_alignment_file_fix_read_field branch from 01137d8 to 2f327cc Compare October 5, 2020 09:38
@smehringer
Copy link
Member Author

In general quite a few places like https://github.com/seqan/seqan3/blob/master/include/seqan3/io/alignment_file/format_bam.hpp#L954 use very many repeated calls to std::ranges::begin(stream_view) where I think it would be enough to just call auto it = std::ranges::begin(stream_view) once and than work with it. This is

1. safer, because it prevents the potential "reset" of the stream discussed here

2. easier to read (`++it` vs `std::ranges::next(std::ranges::begin(stream_view))` )

3. maybe even a little faster

Thanks for the input :) I will add an issue in the backlog about this.

@eseiler eseiler requested a review from rrahn October 5, 2020 10:15
@rrahn
Copy link
Contributor

rrahn commented Oct 8, 2020

An error that occurred for @h-2. The offending line was added as a test.

issue?

Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some simplifications. And I'd like to track this via issues. Because it took a bit to understand what the problem was by just looking at the code.

@@ -328,3 +328,16 @@ TEST_F(sam_format, write_different_header)
EXPECT_EQ(ostream.str(),
"@HD\tVN:1.6\tSO:coordinate\tSS:query\tGO:reference\n@SQ\tSN:ref\tLN:34\n*\t0\tref\t1\t0\t*\t*\t0\t0\t*\t*\n");
}

TEST_F(sam_format, quality_string_starts_with_asterics)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have a tracking issue for this.

Comment on lines 378 to 389
char c = *it;
++it;

if (it != std::ranges::end(stream_view)) // This is a string with more than one character
{
target.push_back(assign_char_to(c, val_t{}));
std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target));
}
else if (c != '*') // single '*' means "empty sequence"; other character is 1-element range
{
target.push_back(assign_char_to(c, val_t{}));
}
Copy link
Contributor

@rrahn rrahn Oct 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
char c = *it;
++it;
if (it != std::ranges::end(stream_view)) // This is a string with more than one character
{
target.push_back(assign_char_to(c, val_t{}));
std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target));
}
else if (c != '*') // single '*' means "empty sequence"; other character is 1-element range
{
target.push_back(assign_char_to(c, val_t{}));
}
// Write to target if field does not represent an empty string, denoted as single '*' character.
if (char c = *it; !(++it == std::ranges::end(view) && c == '*'))
{
target.push_back(seqan3::assign_char_to(c, val_t{}));
std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work for all cases, because -- as discussed above -- you are calling std::ranges::begin() a second time implicitly in line 382 (when iterating over stream_view). In the case of an underlying view that is consuming and it was at end, this will read "past-the-end".

Another issue with this code is, that in the case where c != '*' you are not incrementing the iterator (because of lazy-evaluation of &&) and thus writing c to the target twice.

I don't like have a line twice in the code, but I think it is the easiest (correct) solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work for all cases, because -- as discussed above -- you are calling std::ranges::begin() a second time implicitly in line 382 (when iterating over stream_view). In the case of an underlying view that is consuming and it was at end, this will read "past-the-end".

Oh, that is actually true. Good point. But nothing that can be easily fixed by calling the right interface:
Since we already have the iterator we can just use it to form a subrange which does not consume anything

std::ranges::copy(std::ranges::subrange<...>{it, std::ranges::end(view)} | views::char_to<val_t>, ...);

That should work right?

Another issue with this code is, that in the case where c != '*' you are not incrementing the iterator (because of lazy-evaluation of &&) and thus writing c to the target twice.

Yeah, that's right. I had the order the other way around, which schould be guaranteed to work. In the end, I thought it is nicer to have the char check before but this would introduce a subtle bug indeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should work right?

Yes, the proposed solution should work and is more elegant! Although I think it is considerably more difficult to parse for humans while the slightly longer one is a lot more straight-forward (just thinking about how easy it was for you to make this mistake and that I needed to think hard on whether it works now 🤓 ).

But I will leave others to judge 😉 I have no strong preference about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that when writing an algorithm with ranges you should pay attention to when you call begin. I don't think that more verbose code makes the actual problem clearer, as it is quite subtle. At least not for someone who is reading the code for the first time. So I would support the concise version with a clearer comment at the beginning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyhow, the introduced issues make up some good test cases that should be added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyhow, the introduced issues make up some good test cases that should be added.

I don't really see the good test cases. The function in question is already used with a consuming and a non-consuming input range, both empty (*) in some cases and non-empty in others. The case of a field starting with * but not being empty was also added (I added additionally added for the id field which is consuming in the test case - that one is new).

std::cpp20::back_inserter(target));
else
std::ranges::next(std::ranges::begin(stream_view)); // skip '*'
using val_t = std::ranges::range_value_t<target_range_type>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using val_t = std::ranges::range_value_t<target_range_type>;
using target_range_value_t = std::ranges::range_value_t<target_range_type>;

@smehringer smehringer requested a review from rrahn October 9, 2020 07:44
@smehringer
Copy link
Member Author

@rrahn polite ping

Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks in general fine. Some small suggestions. You can already rebase.

include/seqan3/io/alignment_file/format_sam_base.hpp Outdated Show resolved Hide resolved
test/unit/io/alignment_file/format_sam_test.cpp Outdated Show resolved Hide resolved
@smehringer smehringer force-pushed the io_alignment_file_fix_read_field branch from e8a2067 to aedb691 Compare October 15, 2020 14:11
@smehringer smehringer requested a review from rrahn October 15, 2020 14:12
@smehringer smehringer force-pushed the io_alignment_file_fix_read_field branch from aedb691 to 0372bea Compare October 15, 2020 14:50
@smehringer smehringer merged commit a78360d into seqan:master Oct 16, 2020
@smehringer smehringer deleted the io_alignment_file_fix_read_field branch November 4, 2020 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SAM/BAM IO: Quality string starting with '*' mess up the parsing
4 participants