-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Allow '*' at the beginning of string fields in SAM format. #2184
[FIX] Allow '*' at the beginning of string fields in SAM format. #2184
Conversation
@@ -407,7 +407,7 @@ inline void format_sam::read_alignment_record(stream_type & stream, | |||
"The ref_offset must be a specialisation of std::optional."); | |||
|
|||
auto stream_view = views::istreambuf(stream); | |||
auto field_view = stream_view | views::take_until_or_throw_and_consume(is_char<'\t'>); | |||
auto field_view = stream_view | views::take_until_or_throw(is_char<'\t'>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the initial problem was that *
indicates and empty SAM field and the function read_field
only checks if *
is present (before this PR). But I did not check if the field ONLY contains the *
.
So I need to check if there is a \t
after the *
but I cannot do this if the \t
is consumed in general (the change you asked the reason for). Note that I cannot check if field_view
is empty instead of checking if the next char is \t
because on an input_range the take..
view will always "regenerate".
// input_range = "1\t2\t3\n"
auto v = input_range | take_until_and_consume('\t');
seqan3::debug_stream << v << std::endl; // prints 1
seqan3::debug_stream << v << std::endl; // prints 2
Is that understandable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep; unfortunate that we can't keep the consume :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only problematic if there are empty fields / double tabs. This is not allowed in SAM/BAM.
Note that in your example v
is "at-end" in between the calls, i.e. the first begin iterator does compare true to the end. Otherwise it wouldn't stop printing after "1". Only when you create a new iterator (which implicitly happens in your example) does this new iterator start at the next field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that in your example vis "at-end" in between the calls, i.e. the first begin iterator does compare true to the end. Otherwise it wouldn't stop printing after "1". Only when you create a new iterator (which implicitly happens in your example) does this new iterator start at the next field.
Ah ok thanks. That's the fact that I was missing when trying to fix this :)
Thank you for looking into this so quickly! This slightly shorter patch works for me, too; template <typename stream_view_type, std::ranges::forward_range target_range_type>
inline void format_sam_base::read_field(stream_view_type && stream_view, target_range_type & target)
{
auto it = std::ranges::begin(stream_view);
if (it == std::ranges::end(stream_view))
return;
char c = *it;
++it;
using val_t = std::ranges::range_value_t<target_range_type>;
if (it != std::ranges::end(stream_view)) // This is a string with more than one character
{
// the first character could be a '*'
target.push_back(assign_char_to(c, val_t{}));
std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target));
}
else if (c != '*') // single '*' means "empty sequence"; other character is 1-element range
{
target.push_back(assign_char_to(c, val_t{}));
}
} Just replace read_field with this, no need to change anything else. format_sam_test and format_bam_test passed for me, also by problematic file. |
35a01e5
to
b59ef1f
Compare
Codecov Report
@@ Coverage Diff @@
## master #2184 +/- ##
=======================================
Coverage 98.07% 98.07%
=======================================
Files 262 262
Lines 10675 10676 +1
=======================================
+ Hits 10470 10471 +1
Misses 205 205
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but you deleted the test?
ah damnit |
53fcee6
to
01137d8
Compare
In general quite a few places like https://github.com/seqan/seqan3/blob/master/include/seqan3/io/alignment_file/format_bam.hpp#L954 use very many repeated calls to
|
01137d8
to
2f327cc
Compare
Thanks for the input :) I will add an issue in the backlog about this. |
issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some simplifications. And I'd like to track this via issues. Because it took a bit to understand what the problem was by just looking at the code.
@@ -328,3 +328,16 @@ TEST_F(sam_format, write_different_header) | |||
EXPECT_EQ(ostream.str(), | |||
"@HD\tVN:1.6\tSO:coordinate\tSS:query\tGO:reference\n@SQ\tSN:ref\tLN:34\n*\t0\tref\t1\t0\t*\t*\t0\t0\t*\t*\n"); | |||
} | |||
|
|||
TEST_F(sam_format, quality_string_starts_with_asterics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have a tracking issue for this.
char c = *it; | ||
++it; | ||
|
||
if (it != std::ranges::end(stream_view)) // This is a string with more than one character | ||
{ | ||
target.push_back(assign_char_to(c, val_t{})); | ||
std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target)); | ||
} | ||
else if (c != '*') // single '*' means "empty sequence"; other character is 1-element range | ||
{ | ||
target.push_back(assign_char_to(c, val_t{})); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
char c = *it; | |
++it; | |
if (it != std::ranges::end(stream_view)) // This is a string with more than one character | |
{ | |
target.push_back(assign_char_to(c, val_t{})); | |
std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target)); | |
} | |
else if (c != '*') // single '*' means "empty sequence"; other character is 1-element range | |
{ | |
target.push_back(assign_char_to(c, val_t{})); | |
} | |
// Write to target if field does not represent an empty string, denoted as single '*' character. | |
if (char c = *it; !(++it == std::ranges::end(view) && c == '*')) | |
{ | |
target.push_back(seqan3::assign_char_to(c, val_t{})); | |
std::ranges::copy(stream_view | views::char_to<val_t>, std::cpp20::back_inserter(target)); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work for all cases, because -- as discussed above -- you are calling std::ranges::begin()
a second time implicitly in line 382 (when iterating over stream_view
). In the case of an underlying view that is consuming and it was at end, this will read "past-the-end".
Another issue with this code is, that in the case where c != '*'
you are not incrementing the iterator (because of lazy-evaluation of &&
) and thus writing c
to the target twice.
I don't like have a line twice in the code, but I think it is the easiest (correct) solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work for all cases, because -- as discussed above -- you are calling std::ranges::begin() a second time implicitly in line 382 (when iterating over stream_view). In the case of an underlying view that is consuming and it was at end, this will read "past-the-end".
Oh, that is actually true. Good point. But nothing that can be easily fixed by calling the right interface:
Since we already have the iterator we can just use it to form a subrange which does not consume anything
std::ranges::copy(std::ranges::subrange<...>{it, std::ranges::end(view)} | views::char_to<val_t>, ...);
That should work right?
Another issue with this code is, that in the case where c != '*' you are not incrementing the iterator (because of lazy-evaluation of &&) and thus writing c to the target twice.
Yeah, that's right. I had the order the other way around, which schould be guaranteed to work. In the end, I thought it is nicer to have the char check before but this would introduce a subtle bug indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should work right?
Yes, the proposed solution should work and is more elegant! Although I think it is considerably more difficult to parse for humans while the slightly longer one is a lot more straight-forward (just thinking about how easy it was for you to make this mistake and that I needed to think hard on whether it works now 🤓 ).
But I will leave others to judge 😉 I have no strong preference about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that when writing an algorithm with ranges you should pay attention to when you call begin. I don't think that more verbose code makes the actual problem clearer, as it is quite subtle. At least not for someone who is reading the code for the first time. So I would support the concise version with a clearer comment at the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyhow, the introduced issues make up some good test cases that should be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyhow, the introduced issues make up some good test cases that should be added.
I don't really see the good test cases. The function in question is already used with a consuming and a non-consuming input range, both empty (*
) in some cases and non-empty in others. The case of a field starting with *
but not being empty was also added (I added additionally added for the id field which is consuming in the test case - that one is new).
std::cpp20::back_inserter(target)); | ||
else | ||
std::ranges::next(std::ranges::begin(stream_view)); // skip '*' | ||
using val_t = std::ranges::range_value_t<target_range_type>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using val_t = std::ranges::range_value_t<target_range_type>; | |
using target_range_value_t = std::ranges::range_value_t<target_range_type>; |
@rrahn polite ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks in general fine. Some small suggestions. You can already rebase.
e8a2067
to
aedb691
Compare
aedb691
to
0372bea
Compare
An error that occurred for @h-2. The offending line was added as a test.
Fixes #2195