
cram support #54

Open · 1 of 4 tasks
brentp opened this issue Dec 20, 2016 · 32 comments

@brentp
Contributor

brentp commented Dec 20, 2016

We are starting to see a push to shift from BAM to CRAM for storage reasons. I know this will be a huge development effort, but perhaps we can start with a few discrete tasks. Here are some obvious ones:

  • implement Golomb-Rice en/decoding
  • implement ITF8 en/decoding
  • implement Elias gamma en/decoding
  • implement subexponential en/decoding

I could prototype some of those if we agree on an API and if there is interest in having CRAM support.
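
As a sketch of what one of these might look like, here is a minimal ITF8 decoder in Go following the byte layout used by htslib's itf8_get. The itf8Decode name, the package name, and the io.ByteReader choice are illustrative only, not a proposed API:

package cram

import (
    "io"
    "math/bits"
)

// itf8Decode reads one ITF8-encoded int32 from r. The count of leading
// 1-bits in the first byte gives the number of following bytes (0-4).
func itf8Decode(r io.ByteReader) (int32, error) {
    b0, err := r.ReadByte()
    if err != nil {
        return 0, err
    }
    extra := bits.LeadingZeros8(^b0) // leading 1-bits of b0
    if extra == 0 {
        return int32(b0), nil // single-byte form: 0xxxxxxx
    }
    if extra < 4 {
        // Two- to four-byte forms: the remaining bits of the first
        // byte are the high bits of the value.
        v := uint32(b0) & (0xff >> uint(extra+1))
        for i := 0; i < extra; i++ {
            b, err := r.ReadByte()
            if err != nil {
                return 0, err
            }
            v = v<<8 | uint32(b)
        }
        return int32(v), nil
    }
    // Five-byte form (first byte 1111xxxx): the low nibble of the
    // first byte holds bits 28-31, and only the low nibble of the
    // fifth byte is used.
    v := uint32(b0 & 0x0f)
    for i := 0; i < 3; i++ {
        b, err := r.ReadByte()
        if err != nil {
            return 0, err
        }
        v = v<<8 | uint32(b)
    }
    b4, err := r.ReadByte()
    if err != nil {
        return 0, err
    }
    return int32(v<<4 | uint32(b4&0x0f)), nil
}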

@kortschak
Member

I have no objection to CRAM support, it's just that I don't use it and the spec is pretty big. In the long run I guess we should support it.

Would you like to write up a proposal for the API here?

@brentp
Contributor Author

brentp commented Dec 31, 2016

Yes, I will write up a proposal here. A cursory look shows that a cursory look is insufficient for me to get a good handle on what is needed, but I'll tackle this in January.

@brentp
Contributor Author

brentp commented Jan 5, 2017

For the codecs I listed above, do you agree that BinaryUnmarshaler is the obvious interface?

There will be a lot of machinery around that for CRAM, but implementing each of those codecs is no small task either...
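
For reference, the interface in question is the standard library's encoding.BinaryUnmarshaler:

// From the Go standard library's encoding package.
type BinaryUnmarshaler interface {
    UnmarshalBinary(data []byte) error
}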

@kortschak
Member

For the codecs I listed above, do you agree that BinaryUnmarshaler is the obvious interface?

That seems fair in the first instance.

@kortschak
Member

Does that have implications for the bam writing method name in hts/bam and the implementation of BinaryUnmarshaler for sam.Record?

@brentp
Contributor Author

brentp commented Jan 5, 2017

I'm not sure.
After looking at it more, I thought maybe

type Decoder interface {
    UnmarshalInts([]int) (n int, err error)
    UnmarshalBytes([]byte) (n int, err error)
}

might be a better interface, unless we unsafely convert the []byte.

After that, I was struck by the amount of work this will be.

@kortschak
Member

After that, I was struck by the amount of work this will be.

Indeed. I spent a day reading and sketching about a year back and decided I didn't care enough.

@sb10

sb10 commented Jan 25, 2017

I'll note that the Sanger Institute now produces cram as our "raw" sequencing data, that the EBI's sequencing archive is pushing toward cram (and may have to go back and convert everything to cram to save space), and major collaborative projects tend to use cram as their interchange format.

Sooner rather than later, cram support is going to be vital. I'd be interested in investigating this Go API for use at the Sanger, but without cram support it's a non-starter. I'd suggest looking at how the C htslib added sort-of-seamless cram support (in a relatively short amount of time?), and making cram/bam support in this API even more seamless.

@kortschak
Member

@sb10 I had a look at the htslib cram source and it is not trivial - as is the case for the spec. Yes, it's something that we should have (hence this issue), but developer resources are limited. Just the codec code is 6.3 ksloc of dense htslib-style C, and the entire cram source is 14.5 ksloc. An alternative, which I have used previously, would be to interface to the C (this is what sambamba does), but this is not an option here since we want hts to be pure Go, and the data representation differences would require significant massaging at the interface.

@jkbonfield

jkbonfield commented Jan 26, 2017

There are two C implementations with a shared common history:

https://github.com/samtools/htslib/tree/develop/cram
https://github.com/jkbonfield/io_lib/tree/master/io_lib

Io_lib was the original, which got copied to htslib, but some divergence happened immediately thereafter (htslib-ifying the code basically). Both have been maintained since then.

My code can never generate gamma, elias gamma or subexponential coded data, and frankly it was an accident that I left in the beta code. Furthermore it only uses huffman in the most trivial case of a single symbol with a 0-bit length code. It's entirely up to you which codecs you use for the encoder too - so start with the minimal necessary. I'm not sure which codecs the Java implementation can use, but it's reduced over time. The basic premise now is that ideally the CORE block shouldn't be used at all, as it harms the partial decoding ability (eg "I don't care about qualities and aux tags, but get me the rest").
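
For implementers, that trivial huffman case amounts to a constant codec; a minimal Go sketch (the names are illustrative, not from any existing implementation):

// constCodec models huffman with a single symbol whose code is zero
// bits long: decoding returns the symbol and consumes no input at all.
type constCodec struct{ symbol int32 }

func (c constCodec) Decode() (int32, error) { return c.symbol, nil }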

I'd also suggest you use the cram_dump -v tool from Staden io_lib, as this will basically walk you through the steps taken when decoding a record. I'd still be interested to know where this explains something that the specification didn't! We fixed a lot of omissions in the early days, and for sure there are many more left in the spec, but it's harder to spot them once you've learnt how it works.

PS. Learn from my mistake - separate out reference handling into its own module, preferably not CRAM specific as lots of tools need reference querying. It's a complicated enough beast by itself that it warrants being its own thing. I think that code has been my biggest source of bugs!

@sb10

sb10 commented Jan 26, 2017

@kortschak I agree that remaining pure Go is an important feature.

If you do find the time and resources to work on this though, as @jkbonfield alludes to, the current implementations are more important than the spec (which may need to be improved).

So just stick to codecs that are already implemented in Go (if possible); James told me the main ones used right now are gzip, bzip2, lzma, rANS, huffman in single-code-only mode (so not huffman at all; just "const").

As long as it passes round-trip tests with samtools, it will be good enough for use.

@brentp
Contributor Author

brentp commented Jan 26, 2017

@jkbonfield thanks for the insight. So what codecs are actually seen in the wild? Just golomb-rice (and itf8?)?

@kortschak maybe biogo/hts could get a summer of code student to work on cram support.

@brentp
Contributor Author

brentp commented Feb 1, 2017

Here's my immediate method for supporting cram in the short term:

package cram

import (
    "io"
    "os"
    "os/exec"

    "github.com/biogo/hts/bam"
)

// reader wraps the stdout pipe of a samtools child process so that
// closing the reader also reaps the process.
type reader struct {
    io.ReadCloser
    cmd *exec.Cmd
}

func (r *reader) Close() error {
    // Close the pipe before Wait: Wait closes the pipe itself, so
    // waiting first would always report the pipe as already closed.
    if err := r.ReadCloser.Close(); err != nil {
        r.cmd.Wait()
        return err
    }
    return r.cmd.Wait()
}

// NewReader returns a bam.Reader, using rd decompression helpers, from
// any path that samtools can read against the given fasta reference.
func NewReader(path string, rd int, fasta string) (*bam.Reader, error) {
    cmd := exec.Command("samtools", "view", "-T", fasta, "-b", "-u", "-h", path)
    cmd.Stderr = os.Stderr
    pipe, err := cmd.StdoutPipe()
    if err != nil {
        return nil, err
    }
    if err = cmd.Start(); err != nil {
        pipe.Close()
        return nil, err
    }
    cr := &reader{ReadCloser: pipe, cmd: cmd}
    return bam.NewReader(cr, rd)
}
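
Usage would then look something like this (the paths and thread count are placeholders; records come back through the usual bam.Reader.Read loop):

br, err := NewReader("sample.cram", 2, "ref.fa")
if err != nil {
    log.Fatal(err)
}
for {
    rec, err := br.Read()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    _ = rec // process the *sam.Record
}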

@kortschak
Member

kortschak commented Feb 1, 2017 via email

@jkbonfield

jkbonfield commented Feb 2, 2017

API-wise, things to consider for reading would be a way to indicate which fields are required by the caller and which are not (eg maybe we don't care about qualities, or aux tags) and whether there is a need to generate MD and NM tags. This is irrelevant in BAM as it costs the same time regardless, but can be significant in CRAM (consider samtools flagstat timings).

On the encoding side there are more options than BAM's compression level: the basic CRAM compression level too of course, but also how many reads/bases per slice, whether to do reference-based encoding, embedded references or referenceless, and what compression codecs you are willing to make use of (zlib for sure, but maybe bzip2 and/or lzma have their uses too, and they are part of the CRAM spec).

These (except currently bzip2/lzma) are all controllable directly from the samtools command line too, via the --input-fmt-option and --output-fmt-option arguments, or as -I and -O format adjustments, eg -O cram,seqs_per_slice=1000. So you can still provide an API capable of setting these things while falling back to an external implementation for now.
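
As a concrete example of that fallback, the external-process approach sketched earlier could pass the format adjustments straight through (the seqs_per_slice value is from the example above; the file names are placeholders):

cmd := exec.Command("samtools", "view",
    "-O", "cram,seqs_per_slice=1000", // output format adjustments
    "-T", "ref.fa", // reference used for CRAM encoding
    "-o", "out.cram", "in.bam")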

@kortschak
Member

The issue with the API above is that it takes a path rather than an io.Reader. It would have to be marked as temporary from the outset.

@kortschak
Member

@jkbonfield

API-wise, things to consider for reading would be a way to indicate which fields are required by the caller and which are not (eg maybe we don't care about qualities, or aux tags) and whether there is a need to generate MD and NM tags. This is irrelevant in BAM as it costs the same time regardless, but can be significant in CRAM (consider samtools flagstat timings).

At the moment the biogo/bam.Reader supports field omission based on whether the fields are in bam1_core_t, are variable length (sequence and quality), or are auxiliary (exposed here). This is based on profiling, which shows that the variable-length data, and the auxiliary fields in particular, are expensive as a proportion of a record's reading.
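
For example, against the current bam package (assuming the Omit method and the None/AuxTags/AllVariableLengthData constants stay as they are today; f is an already-opened BAM stream):

br, err := bam.NewReader(f, 1)
if err != nil {
    log.Fatal(err)
}
// Skip parsing of auxiliary tags; the core and other variable-length
// fields are still decoded.
br.Omit(bam.AuxTags)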

Are you thinking of per field or at the level of granularity we have here?

@brentp
Contributor Author

brentp commented Feb 8, 2017

I have started a stub proposal here that's intentionally vague: https://gist.github.com/brentp/57850c10342c21e9b60fa2f83999464f

Please let me know what should be expanded and/or changed.

@kortschak
Member

kortschak commented Mar 5, 2017

Sorry to take so long. Would you send that as a proposal PR, adding it as github.com/biogo/hts/proposal/cram.md? Then I can comment on specific lines and sections.

@kortschak
Member

@brentp ping

@brentp
Contributor Author

brentp commented Mar 23, 2017

Shoot, this slipped past me somehow. Will open a PR today.

@kortschak
Member

kortschak commented Mar 30, 2017

@jkbonfield I have made a start on writing the major plumbing for cram support here, working from the spec. I have a few questions.

  1. In the container header, the itf8[] field landmarks gives the starting positions of each block in blocks. Is that correct? Resolved.

  2. The definition of T[] in section 2.3 (an itf8 value giving the length, followed by len elements of type T) is at odds with how some arrays are used. For example, in the container header blocks is defined by the first field of container, length, together with the blocks field; the equivalent in the block header is closer to the spec, with an itf8 raw size field prior to the block data. Should I just ignore the spec here? Partially resolved (it works now but this is still confused).

  3. Related to 2: nowhere that I can see does the spec explain how the starting positions of CRAM records in a core data block are specified. Is this encoded in the way specified in section 2.3, or is the start bit position encoded somewhere else?

thanks

@jkbonfield

jkbonfield commented Mar 31, 2017

  1. This is an error. For block the byte[] bit should maybe be listed as "bytes" to indicate it's not a single byte and then describe the value as "the data stored in the block, of length size in bytes from above". Similarly byte[4] is a fixed size and not length 4 followed by 4 bytes. Maybe we should define csize: compressed size in bytes and usize: raw (uncompressed) size in bytes and then define the next field as byte[csize] to imply an array where the size is already known. We could update the type glossary at the start to explain that array[] is stored as length plus data while array[len] is stored as data alone with the len parameter being known upfront.

  2. I'm not sure I follow this question, but there is no random access within a CRAM block, only of whole blocks themselves. Therefore there is no need to know the starting position of any individual CRAM record as they immediately follow the previous record. In the "external" blocks this is byte based. With "core" block it is bit-based. For what it's worth we try to avoid using the core block as much as possible in modern data as it generates too many interdependencies between data series, preventing partial decode.
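
To make point 1 concrete, the two array conventions might read like this in Go (the helper names are hypothetical, reusing the itf8Decode sketch from earlier in the thread):

// readArrayITF8 reads the array[] form: an ITF8 length followed by
// that many bytes of data.
func readArrayITF8(r *bufio.Reader) ([]byte, error) {
    n, err := itf8Decode(r)
    if err != nil {
        return nil, err
    }
    buf := make([]byte, int(n))
    _, err = io.ReadFull(r, buf)
    return buf, err
}

// readArrayKnown reads the array[len] form: the length is already
// known from an earlier field, so only the data is stored.
func readArrayKnown(r io.Reader, n int) ([]byte, error) {
    buf := make([]byte, n)
    _, err := io.ReadFull(r, buf)
    return buf, err
}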

@kortschak
Member

Thanks.

The issue in 3. is related to "core" blocks. The figure in section 8.6 shows the problem: how can I know where CRAM record n starts? While it may be the case that this block type is avoided, I still need to be able to handle it if the code encounters it.

@jkbonfield

The answer is that you don't know where CRAM record n starts without decoding records 0 to n-1 first, as there isn't an index within a block. Each record starts at the next bit in CORE, or the next byte in the EXTERNAL blocks, starting from 0 for the first.

I believe the original intention in the early days of CRAM was to entirely use CORE and to have a bit-level index so that full random access was possible, but that was before I got involved and it had changed considerably by then.

@kortschak
Member

OK. So the follow-up is: having reached the beginning of record n in the bit stream, how does a reader know when that record has finished? Is this just a function of the codec behaviour on the bit stream?

@jkbonfield

jkbonfield commented Mar 31, 2017

The bits read are entirely defined by the CRAM code structure itself. This is one of the things I struggled with early on, and I fear the order that things are encoded/decoded in is still not explicitly defined in the spec. This really needs fixing, but it's not something I've had time to do, and it's a large task to follow all the code paths and document them (sadly).

Basically see https://github.com/samtools/htslib/blob/develop/cram/cram_decode.c#L2386 for the start of the decode loop. It does things like this:

r |= c->comp_hdr->codecs[DS_BF]->decode(s, c->comp_hdr->codecs[DS_BF], blk, (char *)&bf, &out_sz);
r |= c->comp_hdr->codecs[DS_CF]->decode(s, c->comp_hdr->codecs[DS_CF], blk, (char *)&cf, &out_sz);
if (ref_id == -2)
    r |= c->comp_hdr->codecs[DS_RI]->decode(s, c->comp_hdr->codecs[DS_RI], blk, (char *)&cr->ref_id, &out_sz);
r |= c->comp_hdr->codecs[DS_RL]->decode(s, c->comp_hdr->codecs[DS_RL], blk, (char *)&cr->len, &out_sz);

etc

Each of the codecs is defined by the compression header. It gets given the core "blk" but the codecs are also able to read from the external blocks. The code structure itself is what defines where a read finishes and the next starts. It ends with decoding of the QS records.
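
In Go terms, the shape of that loop might be sketched as below. All names are hypothetical; the point is that the compression header supplies a codec per data series and the record structure fixes the decode order:

type dataSeries int

const (
    dsBF dataSeries = iota // BAM bit flags
    dsCF                   // CRAM bit flags
    dsRI                   // reference id (multi-ref slices only)
    dsRL                   // read length
    // ... remaining data series, ending with QS ...
)

// codec decodes one value for its data series, drawing bits from the
// core block and/or bytes from external blocks as its parameters say.
type codec interface {
    DecodeInt() (int32, error)
}

type cramRecord struct {
    bamFlags, cramFlags, refID, readLen int32
}

// decodeRecord follows the order used by the C decode loop: BF, CF,
// RI (when the slice is multi-ref), RL, and so on through to QS.
func decodeRecord(codecs map[dataSeries]codec, multiRef bool) (rec cramRecord, err error) {
    if rec.bamFlags, err = codecs[dsBF].DecodeInt(); err != nil {
        return rec, err
    }
    if rec.cramFlags, err = codecs[dsCF].DecodeInt(); err != nil {
        return rec, err
    }
    if multiRef {
        if rec.refID, err = codecs[dsRI].DecodeInt(); err != nil {
            return rec, err
        }
    }
    if rec.readLen, err = codecs[dsRL].DecodeInt(); err != nil {
        return rec, err
    }
    // ... decode the remaining data series in the same fixed order ...
    return rec, nil
}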

We'd welcome spec improvements though. At the very least the tables of data series should be rewritten to be in the same order that the code uses.

@brittanyhowell

Howdy, is there likely to be CRAM support in the near future?
Or is this an abandoned issue?

@kortschak
Member

There is some work towards it in this branch. It is not likely that there will be CRAM support in the near future; I don't have the work need, the available time, or the strength to cope with the spec at the moment. Someone with at least two of those three should be able to pick it up without too much trouble though - a significant portion of the work has been done.

@kortschak
Member

I have merged the CRAM support branch in. It is still a work in progress. The broad architecture of the format/container handling is implemented, but there needs to be work done on handling the details within blocks and slices.

@jkbonfield

jkbonfield commented Mar 1, 2019

For CRAM 4 (ongoing - it's a "technology demo" at the moment rather than a recognised standard) I've been documenting the codecs as pseudocode and producing some simple implementations in Javascript (as a language with minimal syntactic sugar to get in the way of an explanatory algorithm). Sorry for slowly moving the goal posts! Some of those changes though have also trickled down to improving the CRAM 3 documentation.

The latest official CRAM spec, in case you haven't downloaded it recently, is therefore far easier to understand for things like rANS implementations, due to the addition of the pseudocode. I'd also encourage you to look at GMOD/cram-js for a clean, easy-to-follow implementation, unlike the C version which is rather convoluted! GMOD/cram-js is read only, but if you want to see my Javascript rANS encoder too, the current progress is over at https://github.com/jkbonfield/hts-specs/blob/cram_reference/cram_reference/rans.js.

@jkbonfield

Oh I forgot to add, see https://github.com/samtools/hts-specs/pull/383/files too for the rANS pseudocode in the spec. It needed some edits!
