-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Demo #1589
Demo #1589
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a cursory glance through flags_demo.c and split.c so far. More tomorrow.
We need to also decide on a house style for bracing. We have multiple ones in use in this code, but we should be consistent. Personally my preference is this as it's fewer lines of code.
if (expr)
statement;
else
statement;
if (multi_line_expr) {
do1;
do2;
} else {
do3;
}
I know others have open braces in the same column as the closing one (but as I say I see both here).
samples/split.c
Outdated
//open input file | ||
infile = sam_open(inname, "r"); | ||
if (!infile) { | ||
fprintf(stderr, "Could not open %s\n", inname); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sam_open
will complain anyway, so this is redundant - for both input and output files. Eg:
@ seq4-head2[htslib/samples]; ./a.out notfound -o /tmp/nonexist
[E::hts_open_format] Failed to open file "notfound" : No such file or directory
Could not open notfound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All samtools code that invokes sam_open()
writes its own error message itself.
Personally I think it's a bug that sam_open
prints its own diagnostic. It is a library routine similar to fopen
so should be quiet by default, leaving it to the invoking code to print a diagnostic — which can print a better diagnostic because it knows what the purpose of the file is (e.g. it can write “can't open reference file” or similar to distinguish further which of several input files is missing).
Some htslib library functions print their own diagnostics because it's likely that invoking code won't print its own diagnostic. However file opening functions like sam_open
are the one case where HTSlib can depend on the invoking application code universally printing the diagnostic itself!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on the library not reporting errors this way, but right now it does so. Ideally it needs an hts_strerror type of interface, maybe optionally taking a file pointer (like ferror), or we split those into two APIs. (See also comments in #1049). A long standing wish-list!
This code serves as two purposes really. One is standalone basic starting points that people may wish to copy and build on top of - a simple template design. In that things like error checking and reporting are important to put in as they foster good design principles.
The second is as text quoted in a tutorial or how-to guide on using htslib. This isn't just doxygen type manuals, which don't easily lend themselves to guidance. For those I think we could probably do some liberal pruning of error reporting (but not detection) for the sake of brevity. We already have some of this with liberal "..." to jump to the relevant code blocks. It may be better to avoid putting too much in main so we can document entire functions as isolated code fragments. That also lends itself better to just returning error codes instead of printing up messages. That depends on the application though.
samples/split.c
Outdated
inname = "-"; | ||
//output directory | ||
if (!outdir) { | ||
fprintf(stderr, "Output directory not specified\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One for @daviesrob to consider maybe, but we never use hts_log
functions from samtools and I don't know why. We could have instead:
hts_log_warning("no output directory specified");
or
hts_log(HTS_LOG_WARNING, "split","no output directory specified");
I'm not sure why we always use fprintf here, but we even simulate the same style of output:
https://github.com/samtools/samtools/blob/develop/sam_view.c#L1164
I suspect it's just history.
That said, it can be useful to not include this in the initial demonstration programs as we don't want to bamboozle people with a large API to start with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comments for header.c.
The header API is complex and has a lot of features. We probably want to focus on the most common and basic operations people may want to do. That's things like checking the length of a reference sequence, or a general purpose find value X for key Y in header type @ZZ
.
This code specifies the -n
th line, and it then converts that to a string form using sam_hdr_line_name
and the idf
("SN" for SQ and "ID" for PG/RG) to a string, which is then looked up using sam_hdr_find_tag_id
. This doesn't really feel like something the user is ever going to want to do. We're unlikely to want to know about the 85th reference, but we may want to know about ref "hs37d5".
Furthermore there's no specific block on sam_hdr_find_tag_id
working with any key type. We can query SQ by SN (the most likely) but also by M5 if we wished to. The only real difference is that SN has an internal index (a hash table) for fast lookup while the others will do a full scan to find the line. However that's something we could describe in text if needs be.
Also as a demonstrationtool it feels a little muddled by trying to do many things. I don't think it needs the whole complexity of reporting all header lines types unless the user specifies one, which then gives rise to the whole headercnt and line types arrays etc. Viewing the entire header is easier via other mechanisms, so as demonstration of the API we should focus on the querying specific components aspect. This will make the code much simpler to follow, and hence focus on the key points we want to explain which is our API, rather than the program flow and logic.
Eg:
$ read_header
Usage: read_header file line-type key value [tag]
# Uses sam_hdr_find_line_id
$ read_header in.bam SQ SN hs37d5
@SQ SN:hs37d5 LN:35477943 M5:5b6a4b3a81a2d3c134b7d14bf6ad39f1 UR:/nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa
# Uses sam_hdr_find_tag_id
$ read_header in.bam SQ SN hs37d5 LN
35477943
We can do that already (-h SQ -n 85 -t LN ~/scratch/data/novaseq.10m.cram
) but with a -n option instead.
samples/read_header.c
Outdated
//iterate and show 1st of each type or requested one | ||
for (c = 0; headercnt && (c < (sizeof(types) / sizeof(types[0]))); ++c) { | ||
//get count of given header type | ||
linecnt = sam_hdr_count_lines(in_samhdr, *(header + c)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*(header + c)
is better just written as header[c]
and makes it clearer we're just using an array of header types.
samples/read_header.c
Outdated
//find the id_key field for the requested header type - uses NULL for HD | ||
for (c = 0; c < (sizeof(types) / sizeof(types[0])); ++c) { | ||
if (!strcmp(types[c], optarg)) { | ||
idf = &idfield[c]; | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both read_bam and read_aux would benefit from deliberately not trying to look like SAM. Maybe it more of a debug or "dump" tool.
Also for the aux handling, demonstrating the aux iterator is nice, but we should also demonstrate querying with eg
char *c = bam_aux_get(b, "NM");
int NM = c ? bam_aux2i(c) : 0;
So maybe the tool can print up entire lines or just report on specific tag types.
I still have specifics to go through with the demos, but I decided to take a more high level overview of things. I should have started with this before, sorry. Basically we have a few categories of tools, and their closest matching samtools equivalent where possible:
So we have a lot of tools, but not too many themes. That's good as it permits a better narrative in a tutorial or sets of related worked examples. Some of the APIs though I think are in the wrong place. Flagstat is a read-only tool, which is good for demonstrating decoding of the basic sequence API. This may mean read_bam is irrelevant. Or perhaps read_bam belongs in the same chapter / topic as the flagstat one. Flagstat is a perfect example of the basic The read/write loops could may be considered as a file format conversion tool, so we can introduce filename handling and setting format based on suffixes. All tools currently just use Pileup is its own thing, and definitely needs its own tutorial section. I'm happy with pileup vs mpileup, but the constructor/destructor interface is an extra complexity that has no purpose in the examples. We probably need something more advanced, such as optimising away an overhead from being computed per-column to per-read. There are choices there too - caching numerics in pileup1_t->cd.i vs caching complex things needing memory allocation, which is where the constructor/destructor come in. I'd need to think of a concise example, but test/pileup_mod.c is one such use where The index querying can come in at any point I guess, as it applies equallty to both a read-only scenario and a read-write scenario. I'm not sure which topic is the best one to add it to, but it's OK in the read/write code. The header API feels a bit disjoint, as well as split.c which I can't follow the intention of. Maybe these two can be merged into the same overall topic. Eg split by read-group requires processing of the header, and maybe we'd want to show how we can add new PG entries or remove other RG lines when splitting. Again this is a perfect example of where the thread pool can be useful as we're writing out several files at once. Aux tag manipulation could do with extending. Right now it's just regurgitating the aux tag data, which isn't so helpful (and also not the ideal way to do it). Maybe we need something to summarise tag utilisation (tagstats), or to add a new tag computed from the read data. Eg adding a cg:f tag to report the CG %age for each read. Maybe also for fun gather statistics on CG content vs MQUAL. Infact the latter half of that could be a read-BAM-only while the former is a read-write loop, so perhaps it's better than the flagstat and view alternatives? Thoughts? |
86b3809
to
d474a19
Compare
ffd8119
to
57bac5c
Compare
0347a59
to
22ca20a
Compare
The README.md file is a basic introduction to the sample code in this directory, while DEMO.md has more detailed information per program and API calls introduced. This is a work in progress and updated documentation may be available on www.htslib.org.
This pull request adds a number of sample programs to demonstrate the usage of HTSLib apis.
https://docs.google.com/document/d/1d6EECYkezwZWW4zWpyqGNHK4LIrcai8anJAvNbtxiVo/edit# gives a documentation of the same.