Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IO] What is the desired behavior when writing an empty BAM file? #2497

Closed
tloka opened this issue Mar 25, 2021 · 5 comments · Fixed by #3081
Closed

[IO] What is the desired behavior when writing an empty BAM file? #2497

tloka opened this issue Mar 25, 2021 · 5 comments · Fixed by #3081
Assignees
Labels
bug faulty or wrong behaviour of code ready to tackle Fully specified, discussed and understood. Can be tackled.

Comments

@tloka
Copy link
Contributor

tloka commented Mar 25, 2021

Platform

  • SeqAn version: latest
  • Operating system: Ubuntu
  • Compiler: gcc

Question

Just a quick question:
I am using SeqAn3's BAM output. I need to create a seqan3::sam_file_output object even before I know whether there will be valid alignments to write. Thus, in some cases, it can happen that there is no entry being written.

However, the resulting file is actually invalid (at least for samtools), as the header is only written with the first entry. Thus, while the file exists and has a valid EOF marker, samtools returns an error when trying to view such an empty BAM file:

> samtools view [filename]
[main_samview] fail to read the header from "[filename]".

I am using two different samtools version, the error is actually only showing in the newer version (1.10 vs. 1.07). I didn't test the most recent version, but I guess the error is produced by the new HTSLib header which was introduced in samtools 1.10 and performs additional validity checks for the header (see here; Samtools now uses the new HTSlib header API. As this adds more checks for invalid headers, it is possible that some illegal files will now be rejected when they would have been allowed by earlier versions.)

I am wondering whether this is the expected behavior, or whether the header should also be written for empty files to get a valid file as an output!?

Minimal example:

using sam_fields_t = seqan3::fields<seqan3::field::id>;
using sam_format_type_list_t = seqan3::type_list<seqan3::format_sam, seqan3::format_bam>;
using sam_file_output_t = seqan3::sam_file_output<sam_fields_t, sam_format_type_list_t, std::vector<std::string>>;

std::vector<std::string> names {"SEQ1", "SEQ2"};
std::vector<uint16_t> lengths {2463, 8273};
sam_file_output_t out{"/home/tobias/livetools_test/test.bam", names, lengths};
//out.emplace_back("testseq1");  <<< Only including this line would produce the header in the output file

Writing SAM accordingly also produces an empty file without header.

I am wondering if this is expected behavior, as the resulting file seems to not be valid according to samtools. However, I think it should be possible to create a valid empty file, which would also be a nice indicator that the resulting empty file is not due to some error, but does really not contain any alignments.

@tloka tloka added the question a user question how to do certain things label Mar 25, 2021
@marehr
Copy link
Member

marehr commented Mar 25, 2021

Hi @tloka,

this is currently not possible. Thank you for this use case. When re-discussing the design, I had the feeling that it would make sense to explicitly allow writing the alignment header.

Something along the lines:

seqan3::sam_file_output fout {...};
seqan3::sam_file_header header{...};
seqan3::sam_file_record record1{...}
seqan3::sam_file_record record1{...};

fout.push_back(header); // write out header
fout.push_back(record1); // write out record1
fout.push_back(record2); // write out record2

// this will throw, because header can only be written once
fout.push_back(header);

It seems to be a bam specification violation on our side, because BAM requires a header.

The bottom-line is that in any case we should write a valid empty header at deconstruction if no records were provided.

@marehr marehr added bug faulty or wrong behaviour of code and removed question a user question how to do certain things labels Mar 25, 2021
@tloka
Copy link
Contributor Author

tloka commented Mar 25, 2021

The bottom-line is that in any case we should write a valid empty header at deconstruction if no records were provided.

I agree.

For the rest, I am actually not sure it would be necessary/intuitive to construct the header separately and specify the push_back() function for this. For me using push_back() for something that can only be pushed at the beginning feels a bit like a misuse.

I would rather think of something like a function seqan3::sam_file_output::flush_header() to explicitly write the header to the output file. You could even handle the case that the header was already written internally instead of throwing an exception when trying to push back after records were already written.

Somehing like this:

// Example 1: Header is implicitely written with the first record.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
seqan3::sam_file_record record1{...};
fout.push_back(record1);
}

// Example 2: Header is explicitely flushed. Same result as Example 1.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
fout.flush_header();
seqan3::sam_file_record record1{...};
fout.push_back(record1);
}

// Example 3: Empty output file. Valid empty header needs to be written on destruction.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
}

// Example4: Empty output file. However, the full header is included as it was explicitely flushed.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
fout.flush_header();
}

// Example5: Invalid call of flush_header() does nothing. Same result as in Example 1 and 2
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
seqan3::sam_file_record record1{...};
fout.push_back(record1);
fout.flush_header(); // no effect; Header is already flushed with the first record.
}

@marehr
Copy link
Member

marehr commented Mar 25, 2021

(Agree! We still need some way to modify a header that wasn't constructed before.)

@joshuak94
Copy link
Contributor

Hello, can I ask about an update to this? I'd be happy to work on this myself if nobody else is currently doing so. I'd also like "valid" BAM files written when no records are passed.

@smehringer smehringer added the ready to tackle Fully specified, discussed and understood. Can be tackled. label Oct 20, 2022
@eseiler eseiler changed the title What is the desired behavior when writing an empty BAM file? [IO] What is the desired behavior when writing an empty BAM file? Oct 20, 2022
@tloka
Copy link
Contributor Author

tloka commented Nov 2, 2022

Great to see this resolved, really appreciate it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug faulty or wrong behaviour of code ready to tackle Fully specified, discussed and understood. Can be tackled.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants