Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Option to ignore blanks before ids when reading FastA-Files #2770

Merged
merged 2 commits into from
Apr 27, 2022

Conversation

SGSSGene
Copy link
Contributor

@SGSSGene SGSSGene commented Aug 23, 2021

A simple roundtrip through seqan3 (reading and writing a fasta file) should not introduce any changes with the default options

auto fout = seqan3::sequence_file_input{std::istringstream{input}, seqan3::format_fasta{}} |
                  seqan3::sequence_file_output{std::ostringstream{}, seqan3::format_fasta{}};

The PR #2769 fixes the forced introduction of spaces before the sequence ids.
This PR fixes the removal of whitespaces when reading a fasta file.
This changes the default behavior. To achieve the old behavior a new flag is introduced:
fin.options.fasta_ignore_blank_before_id which is by default true.

Example:
File

> seq1
ACTG

seqan3 now: id="seq1"
some others: id=" seq1"

@vercel
Copy link

vercel bot commented Aug 23, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/seqan/seqan3/Fugjpfm7cJd6WKCsQNp6ZVmKY2H4
✅ Preview: https://seqan3-git-fork-sgssgene-feat-removeblankbeforeid-seqan.vercel.app

@SGSSGene SGSSGene requested review from a team and MitraDarja and removed request for a team August 23, 2021 11:15
@codecov
Copy link

codecov bot commented Aug 23, 2021

Codecov Report

Merging #2770 (048469c) into master (b4984bc) will decrease coverage by 0.03%.
The diff coverage is 72.22%.

@@            Coverage Diff             @@
##           master    #2770      +/-   ##
==========================================
- Coverage   98.22%   98.19%   -0.04%     
==========================================
  Files         267      267              
  Lines       11511    11521      +10     
==========================================
+ Hits        11307    11313       +6     
- Misses        204      208       +4     
Impacted Files Coverage Δ
include/seqan3/io/sequence_file/format_fasta.hpp 89.38% <72.22%> (-2.86%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5c63f9...048469c. Read the comment docs.

Copy link
Contributor

@MitraDarja MitraDarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one minor thing. :)

@eseiler eseiler requested a review from marehr August 25, 2021 15:37
@smehringer
Copy link
Member

Core Meeting 25.10.2021 - The default should still be "blanks are stripped" because usually blanks are usually ignored. the option is still useful to allow for a "perfect roundtrip".

@SGSSGene
Copy link
Contributor Author

I will unchange default to 'true'

@SGSSGene SGSSGene changed the base branch from master to release-3.1.0 October 26, 2021 08:39
@SGSSGene SGSSGene force-pushed the feat/remove_blank_before_id branch 3 times, most recently from e9a0f99 to c42ce88 Compare October 26, 2021 08:43
{}
if (options.fasta_ignore_blanks_before_id)
{
for (; (it != e) && (is_id || is_blank)(*it); ++it)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't you do something like

auto const is_id = options.fasta_ignore_blanks_before_id ? (is_char<'>'> || is_char<';'> || is_blank) : (is_char<'>'> || is_char<';'>);

in line 178

Copy link
Contributor Author

@SGSSGene SGSSGene Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is not possible. Line 180 checks if the character is a > or ;

        if (!is_id(*begin(stream_view)))

That line would be wrong if it included a whitespace.

I'm just noticing the parsing of ID is not correct either way.
Currently >>>>>>>> TEST would be parsed as TEST.
and with my changes: > > >>> > > > TEST would also be parsed as TEST

I am assuming that is also not what we want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a new PR to fix this: #2869

@eseiler eseiler changed the base branch from release-3.1.0 to master November 10, 2021 10:58
@smehringer smehringer requested review from a team and removed request for marehr and a team December 6, 2021 10:27
@SGSSGene
Copy link
Contributor Author

#2769 is merged, so this can be rebased? (This PR was marked blocked by #2769)

Its blocked by #2869

@SGSSGene
Copy link
Contributor Author

#2769 is merged, so this can be rebased? (This PR was marked blocked by #2769)

Its blocked by #2869

No, this is not blocked by #2869, this can be independently be merged :-)

Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should merge #2869 first, I want to have another look at the predicates afterwards :)

@eseiler eseiler force-pushed the feat/remove_blank_before_id branch from 1983b08 to f886ce0 Compare March 23, 2022 15:13
Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't find an elegant way to avoid code duplication

Copy link
Member

@smehringer smehringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although there is no change in behavior, I would add a short changelog entry for this new feature

Comment on lines +64 to +59
">TEST 1\n"
"ACGT\n"
"> Test2\n"
"AGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGNAGGCTGN\n"
"> Test3\n"
"GGAGTATAATATATATATATATAT\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be even more understandable if you use this as input AND output string. So one does not have to compare input and output_comp when looking at it but directly notices the "perfect roundtrip".

Or do you explicitly want to show that spaces in the sequence are still removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will adjust. 👍

@smehringer
Copy link
Member

there is a conflict. Please rebase

@smehringer smehringer merged commit 79b6467 into seqan:master Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants