Merge pull request #136 from Plant-Food-Research-Open/fix/gff_validation
Now correctly validating gff files for circular sequences
GallVp authored Sep 19, 2024
2 parents d1a7cff + e494dda commit d16decb
Showing 5 changed files with 142 additions and 6 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -13,7 +13,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

1. Made the `hic` param pattern more flexible as `^SR\w+$|^\S+\{1,2\}[\w\.]*\.f(ast)?q\.gz$` [#130](https://github.com/Plant-Food-Research-Open/assemblyqc/issues/130)
2. Fixed flowchart syntax to remove '\n' [#132](https://github.com/Plant-Food-Research-Open/assemblyqc/issues/132)
3. Updated modules to remove Bioconda `defaults` channel
3. Updated modules to remove Bioconda `defaults` channel [#135](https://github.com/Plant-Food-Research-Open/assemblyqc/issues/135)
4. Now gff files for circular molecules can have end coordinates greater than the sequence length [#129](https://github.com/Plant-Food-Research-Open/assemblyqc/issues/129)

### `Dependencies`

2 changes: 1 addition & 1 deletion modules.json
@@ -131,7 +131,7 @@
},
"gff3_gt_gff3_gff3validator_stat": {
"branch": "main",
"git_sha": "775762619b57101ca800269b6ecda0b915fb9913",
"git_sha": "58c5f9e695b9e03d43e4c59d9339af7c93f0acbe",
"installed_by": ["subworkflows"]
}
}
49 changes: 46 additions & 3 deletions subworkflows/gallvp/gff3_gt_gff3_gff3validator_stat/main.nf
@@ -129,19 +129,62 @@ def checkGff3FastaCorrespondence(meta, gff3File, faiFile) {
def end = parts[4].toInteger()
def seqLength = sequenceLengths[name].toInteger()

if (start > seqLength || end > seqLength) {
if ( start > seqLength ) {
return [
meta,
[], // success log
[
"Failed to validate gff3: ${gff3File.name}",
"Coordinates exceed sequence length in GFF3 file:",
"Start coordinates exceed sequence length in the GFF3 file:",
"Sequence: $name",
"Sequence length: $seqLength",
"Start: $start"
] // error log
]
}

if ( end > seqLength ) {

// Check if the sequence is defined as a circular region
// Otherwise, fail
def regionLine = gff3Lines.find {
def _parts = it.split('\t')

_parts[0] == "$name" && _parts[2] == 'region'
}

if ( ! regionLine ) {
return [
meta,
[], // success log
[
"Failed to validate gff3: ${gff3File.name}",
"End coordinates exceed sequence length and the sequence attributes are also missing in GFF3 file:",
"Sequence: $name",
"Sequence length: $seqLength",
"End: $end"
] // error log
]
}

def regionAtts = regionLine.split('\t')[8]
def isCircular = regionAtts.contains('circular=true')

// Models on circular molecules are allowed to exceed sequence length
if ( isCircular ) { continue }

return [
meta,
[], // success log
[
"Failed to validate gff3: ${gff3File.name}",
"End coordinates exceed length of a non-circular sequence in GFF3 file:",
"Sequence: $name",
"Sequence length: $seqLength",
"Start: $start",
"End: $end"
] // error log
]

}
}
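
The decision order introduced in this hunk can be restated outside Nextflow as a minimal Python sketch. The function name `check_feature_bounds`, the `seq_lengths` dict, the skipping of comment and short lines, and the returned message strings are illustrative assumptions, not part of the subworkflow. Only the order of checks is taken from the diff: a start beyond the sequence length always fails, and an end beyond the sequence length passes only when a `region` line for that sequence carries `circular=true`.

```python
def check_feature_bounds(gff3_lines, seq_lengths):
    """Return None when every feature is acceptable, else a short error string.

    gff3_lines  -- iterable of raw GFF3 lines (tab-separated columns)
    seq_lengths -- dict mapping sequence name to length (hypothetical input shape)
    """
    # Assumption: comment lines and lines with fewer than nine columns are skipped.
    rows = [
        line.split("\t")
        for line in gff3_lines
        if line and not line.startswith("#")
    ]
    rows = [r for r in rows if len(r) >= 9]

    for parts in rows:
        name = parts[0]
        start, end = int(parts[3]), int(parts[4])
        length = seq_lengths[name]

        # A start beyond the sequence length is always an error.
        if start > length:
            return f"start exceeds sequence length: {name} {start} > {length}"

        if end > length:
            # Look for a 'region' feature describing this sequence.
            region = next(
                (r for r in rows if r[0] == name and r[2] == "region"), None
            )
            if region is None:
                return f"end exceeds sequence length and no region line: {name}"

            # Substring match, as in the diff; it also matches 'Is_circular=true'.
            if "circular=true" in region[8]:
                continue  # features on circular molecules may wrap past the end

            return f"end exceeds length of a non-circular sequence: {name}"

    return None
```

Because the attribute test is a plain substring match, a region annotated with `Is_circular=true` is treated the same as one annotated with `circular=true`.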

@@ -114,7 +114,38 @@ nextflow_workflow {
workflow.out.valid_gff3,
workflow.out.versions).match()
},
{ assert path(workflow.out.log_for_invalid_gff3[0][1]).text.contains('Coordinates exceed sequence length in GFF3 file') }
{ assert path(workflow.out.log_for_invalid_gff3[0][1]).text.contains('Start coordinates exceed sequence length in the GFF3 file') }
)
}
}

test("sarscov2 - fasta - circular_region - pass") {

when {
workflow {
"""
def circular_gff = new File('circular_gff.gff')
circular_gff.text = [
'##gff-version 3',
'MT192765.1 Genbank region 1 29829 . + . circular=true',
'MT192765.1 Genbank gene 29551 39667 . + . ID=gene1',
'MT192765.1 Genbank CDS 29551 39667 . + 0 Parent=gene1'
].join('\\n')
input[0] = Channel.of([ [ id:'test' ], // meta map
circular_gff.toPath()
])
input[1] = Channel.of([ [ id:'test' ],
file(params.modules_testdata_base_path + 'genomics/sarscov2/genome/genome.fasta', checkIfExists: true)
])
"""
}
}

then {
assertAll(
{ assert workflow.success},
{ assert snapshot(workflow.out).match() }
)
}
}
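
GFF3 columns are tab-separated, and the validation code splits each line on `'\t'`, so a fixture like the one in this test needs a tab between each of its nine columns. The following Python sketch rebuilds the same three feature lines and shows the arithmetic behind the expected pass. The helper name `gff3_row` is hypothetical; the coordinates and attributes are copied from the test.

```python
def gff3_row(seqid, source, ftype, start, end,
             score=".", strand="+", phase=".", attributes="."):
    """Join the nine GFF3 columns with tabs (hypothetical helper)."""
    columns = (seqid, source, ftype, start, end, score, strand, phase, attributes)
    return "\t".join(str(value) for value in columns)


rows = [
    gff3_row("MT192765.1", "Genbank", "region", 1, 29829,
             attributes="circular=true"),
    gff3_row("MT192765.1", "Genbank", "gene", 29551, 39667,
             attributes="ID=gene1"),
    gff3_row("MT192765.1", "Genbank", "CDS", 29551, 39667,
             phase="0", attributes="Parent=gene1"),
]
circular_gff_text = "\n".join(["##gff-version 3"] + rows)

# The gene ends past the declared end of the region, which is only
# acceptable because the region line is marked circular.
region_end = int(rows[0].split("\t")[4])
gene_end = int(rows[1].split("\t")[4])
overhang = gene_end - region_end
is_circular = "circular=true" in rows[0].split("\t")[8]
```

Here `overhang` is 9838 bases, so the same gene and CDS lines without the `circular=true` region would fall into the failure branch for non-circular sequences.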
@@ -63,6 +63,67 @@
},
"timestamp": "2024-07-29T16:22:06.684959"
},
"sarscov2 - fasta - circular_region - pass": {
"content": [
{
"0": [
[
{
"id": "test"
},
"test.gt.gff3:md5,b3bb01b18b8eeac28922ab55c5c6c939"
]
],
"1": [
[
{
"id": "test"
},
"test.yml:md5,545b8e290cfa8a93fd0ff01ad9daee08"
]
],
"2": [

],
"3": [
"versions.yml:md5,0cb9519e626e5128d8495cf29b7d59ff",
"versions.yml:md5,80555fe6e28e9564cb534f5478842286",
"versions.yml:md5,8a418ac34d045b0cdac812eb2dc9c106",
"versions.yml:md5,c89b081a13c68acc5326e43ca9104344"
],
"gff3_stats": [
[
{
"id": "test"
},
"test.yml:md5,545b8e290cfa8a93fd0ff01ad9daee08"
]
],
"log_for_invalid_gff3": [

],
"valid_gff3": [
[
{
"id": "test"
},
"test.gt.gff3:md5,b3bb01b18b8eeac28922ab55c5c6c939"
]
],
"versions": [
"versions.yml:md5,0cb9519e626e5128d8495cf29b7d59ff",
"versions.yml:md5,80555fe6e28e9564cb534f5478842286",
"versions.yml:md5,8a418ac34d045b0cdac812eb2dc9c106",
"versions.yml:md5,c89b081a13c68acc5326e43ca9104344"
]
}
],
"meta": {
"nf-test": "0.9.0",
"nextflow": "24.04.4"
},
"timestamp": "2024-09-19T13:53:32.901064"
},
"sarscov2-genome_gff3-homo_sapiens-genome_fasta-correspondence_fail": {
"content": [
[