Save RepeatMasker results #54

Closed
CeciliaDeng opened this issue Aug 1, 2024 · 14 comments

@CeciliaDeng
Collaborator

Hi @GallVp and @jasonshiller, the output of RepeatMasker can be used for visualization and other applications. Can we please keep the results in final/ if genepao/pangene runs this step? Thank you.

My current approach is:

  1. Get info from .nextflow.log
    grep REPEATMASK .nextflow.log | grep COMPLETED > tmp.list

  2. Find the RepeatMasker work folder in tmp.list and copy (or move) its files to final/. For example:
    cp -p work/8d/eb6f63fca54c6eaaa28923947e4577/Rhap1/* results/final/Rhap1/RepeatMask/

  3. Convert the .out file to gff3 using the script from @ting-hsuan-chen:
    cd results/final/Rhap1/RepeatMask; /workspace/cflthc/script/KRIP_TE/09_benchmarking/RMout2gff3.sh Rhap1.fa.out
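
For anyone scripting steps 1 and 2, here is a minimal Python sketch of the same filter. It assumes that completed tasks are logged on a single line containing a `workDir: <path>` field, which this thread does not confirm, and that work paths contain no spaces. The function name is hypothetical.

```python
import re
from pathlib import Path

# Assumed log-line shape (not confirmed in this thread):
# "... name: <PROCESS> (<sample>); status: COMPLETED; ...; workDir: /path/to/work/xx/yyyy]"
WORKDIR_RE = re.compile(r"workDir:\s*(?P<workdir>[^\s\]]+)")


def completed_repeatmask_workdirs(log_lines):
    """Return work directories of completed REPEATMASK tasks, in log order."""
    workdirs = []
    for line in log_lines:
        # Same filter as: grep REPEATMASK .nextflow.log | grep COMPLETED
        if "REPEATMASK" not in line or "COMPLETED" not in line:
            continue
        match = WORKDIR_RE.search(line)
        if match and match.group("workdir") not in workdirs:
            workdirs.append(match.group("workdir"))
    return workdirs


if __name__ == "__main__":
    log_path = Path(".nextflow.log")
    if log_path.exists():
        with log_path.open() as handle:
            for workdir in completed_repeatmask_workdirs(handle):
                print(workdir)
```

The printed folders can then be copied into results/final/ as in step 2.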

@GallVp
Member

GallVp commented Aug 2, 2024

Hi @CeciliaDeng

Thank you for raising this issue. Nextflow has a native method for saving outputs. For example, here is how the TE lib from EDTA is saved:

https://github.com/PlantandFoodResearch/pangene/blob/6713761c94c12b527c2cfa045f0bc455ac369f04/conf/modules.config#L9-L14

We won't have to parse .nextflow.log ourselves. Step 3, however, needs some additional work. The script needs to be converted to a Nextflow module and tested using a small test file. For example, see how the shorten_fasta_ids.py script has its own module: https://github.com/PlantandFoodResearch/pangene/blob/main/modules/pfr/custom/shortenfastaids/main.nf and the unit tests: https://github.com/PlantandFoodResearch/pangene/blob/6713761c94c12b527c2cfa045f0bc455ac369f04/modules/pfr/custom/shortenfastaids/tests/main.nf.test#L12

@GallVp added this to the 0.4.0 milestone Aug 6, 2024
@CeciliaDeng
Collaborator Author

Awesome that this issue will be taken care of in v0.4.0. Thank you @GallVp

@GallVp
Member

GallVp commented Aug 7, 2024

I am going to use the official converter, as it is more likely to receive an update when RepeatMasker is updated.

@ting-hsuan-chen

ting-hsuan-chen commented Aug 7, 2024

I didn't use rmOutToGFF3.pl because the information in column 9 is poor (e.g. "Target=FAM 24 180"). Alternatively, perhaps the ".out" file can be saved as well? Then it would be up to the users to convert ".out" to gff3 themselves, with whatever information they would like to add in column 9.
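
To illustrate what such a user-side conversion could look like, here is a minimal Python sketch. It is not any of the scripts mentioned in this thread. It assumes the standard whitespace-separated RepeatMasker .out columns, and the attribute keys (`repeat_class`, `repeat_start`, `repeat_end`, `perc_div`), the `repeat_<n>` ID scheme, and the `dispersed_repeat` feature type are illustrative choices only.

```python
def escape_attr(value):
    """Percent-encode characters that GFF3 reserves in column 9."""
    reserved = {"%": "%25", ";": "%3B", "=": "%3D", "&": "%26", ",": "%2C", "\t": "%09"}
    return "".join(reserved.get(ch, ch) for ch in value)


def out_line_to_gff3(line, serial, source="RepeatMasker", feature_type="dispersed_repeat"):
    """Convert one RepeatMasker .out data line into one GFF3 line.

    `serial` is a caller-supplied running number used to build a unique ID.
    """
    fields = line.split()
    score, divergence = fields[0], fields[1]
    seqid, start, end = fields[4], fields[5], fields[6]
    strand = "-" if fields[8] == "C" else "+"
    repeat_name, repeat_class = fields[9], fields[10]
    # Repeat coordinates are "begin end (left)" on + and "(left) end begin" on C.
    if strand == "+":
        rep_start, rep_end = fields[11], fields[12]
    else:
        rep_start, rep_end = fields[13], fields[12]
    attributes = ";".join([
        f"ID=repeat_{serial}",
        f"Name={escape_attr(repeat_name)}",
        f"repeat_class={escape_attr(repeat_class)}",
        f"repeat_start={rep_start}",
        f"repeat_end={rep_end}",
        f"perc_div={divergence}",
    ])
    return "\t".join([seqid, source, feature_type, start, end, score, strand, ".", attributes])


def convert(out_lines):
    """Yield GFF3 text lines for an iterable of .out lines, skipping the header."""
    yield "##gff-version 3"
    serial = 0
    for line in out_lines:
        fields = line.split()
        # Data lines start with the integer Smith-Waterman score; header lines do not.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        serial += 1
        yield out_line_to_gff3(line, serial)
```

Building the ID from a running counter rather than from the .out ID column keeps every ID unique; the extra column 9 keys can be swapped for whatever a given browser or validator expects.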

@GallVp
Member

GallVp commented Aug 7, 2024

> I didn't use rmOutToGFF3.pl because the information in column 9 is poor (e.g. "Target=FAM 24 180"). Alternatively, perhaps the ".out" file can be saved as well? Then it would be up to the users to convert ".out" to gff3 themselves, with whatever information they would like to add in column 9.

Yes, the outputs are saved if the repeatmasker_save_outputs flag is enabled.

@ting-hsuan-chen

@GallVp, in this case (generating gff3) you might only need to enable RepeatMasker's "-gff" option; then you won't need to test rmOutToGFF3.pl in the pipeline.

@GallVp
Member

GallVp commented Aug 7, 2024

> @GallVp, in this case (generating gff3) you might only need to enable RepeatMasker's "-gff" option; then you won't need to test rmOutToGFF3.pl in the pipeline.

Thank you @ting-hsuan-chen

This is actually a better option.

@CeciliaDeng
Collaborator Author

The only issue with '-gff' is that the resulting file fails gff3 validation when loaded into fairGenomes. The output.gff3 from @ting-hsuan-chen's code can be directly added as TE.gff3 and loaded as a TE track in JB2.

@rosscrowhurst

rosscrowhurst commented Aug 8, 2024

> I didn't use rmOutToGFF3.pl because the information in column 9 is poor (e.g. "Target=FAM 24 180"). Alternatively, perhaps the ".out" file can be saved as well? Then it would be up to the users to convert ".out" to gff3 themselves, with whatever information they would like to add in column 9.

Forgot I wrote this ages ago. It's old and may or may not work, but it was used for converting RepeatMasker 1 .out files to gff3 for loading in WebApollo (JBrowse 1.6):

/output/hrarnc/software/bin/convert_repeatmasker_out_2_gff3.pl

USAGE: /output/hrarnc/software/bin/convert_repeatmasker_out_2_gff3.pl -r=repeatmasker.out [-o=output.gff3 -source=RepeatMasker -type=dispersed_repeat]

Its limitation is that it allows just one type (the value in column 3) in the .out to .gff3 conversion.

It was not written to add "##sequence-region ..." lines, as I have another script that does that.
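
For readers without that second script, here is a minimal Python sketch of adding the pragmas. It is not the script referred to above; it assumes the sequence lengths are already known (for example from a FASTA index) and are passed in as a dictionary, and the function name is hypothetical.

```python
def add_sequence_regions(gff3_lines, seq_lengths):
    """Return GFF3 lines with '##sequence-region' pragmas inserted after the version line.

    seq_lengths maps sequence ID -> length. Only sequences that carry at least
    one feature are declared; a missing length raises KeyError.
    """
    lines = [ln.rstrip("\n") for ln in gff3_lines]
    used = []
    for ln in lines:
        if ln and not ln.startswith("#"):
            seqid = ln.split("\t", 1)[0]
            if seqid not in used:
                used.append(seqid)
    pragmas = [f"##sequence-region {seqid} 1 {seq_lengths[seqid]}" for seqid in used]
    if lines and lines[0].startswith("##gff-version"):
        return [lines[0]] + pragmas + lines[1:]
    # No version line present: add one so the pragmas follow it.
    return ["##gff-version 3"] + pragmas + lines
```

Each pragma has the form `##sequence-region seqid start end`, so a full-length declaration starts at 1 and ends at the sequence length.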

@GallVp - yeah, it's not published, not supported, etc., like any other of my code that you might have used in your systems at some point

@GallVp
Member

GallVp commented Aug 8, 2024

Thanks @CeciliaDeng and @rosscrowhurst

I'll investigate these issues and report back my findings here. I am removing the done on dev label for now.

@GallVp removed the done on dev label Aug 8, 2024
@GallVp
Member

GallVp commented Sep 23, 2024

Here are my findings. The test file is attached. The test scripts are also attached.

  1. The -gff flag does produce a gff file, but it is invalid because of a missing = in the attributes column.
  2. RepeatMasker/util/rmOutToGFF3.pl solves 1 but fails the validation check due to repeated IDs.
  3. /workspace/cflthc/script/KRIP_TE/09_benchmarking/RMout2gff3.sh solves 2 but fails the validation check due to non-compliant character case. This can be fixed with gt gff3 -tidy.
  4. /output/hrarnc/software/bin/convert_repeatmasker_out_2_gff3.pl solves all of the above.
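
Findings 1 and 2 can be checked quickly with a small Python sketch like the one below. It is a simplified stand-in for a real validator, not the attached test scripts: it only looks for column 9 entries without a `key=value` shape and for ID values used on more than one line. Note that it flags every reused ID, even though the GFF3 specification permits a shared ID across the lines of a single discontinuous feature.

```python
from collections import Counter


def check_gff3_attributes(gff3_lines):
    """Return (malformed, duplicate_ids).

    malformed: list of (line_number, attribute) where a column 9 entry lacks key=value.
    duplicate_ids: sorted list of ID values used on more than one line.
    """
    malformed = []
    id_counts = Counter()
    for number, line in enumerate(gff3_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            malformed.append((number, "<expected 9 tab-separated columns>"))
            continue
        for attribute in columns[8].split(";"):
            attribute = attribute.strip()
            if not attribute:
                continue
            key, sep, value = attribute.partition("=")
            if not sep or not key or not value:
                malformed.append((number, attribute))
            elif key == "ID":
                id_counts[value] += 1
    duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)
    return malformed, duplicate_ids
```

Calling `check_gff3_attributes(open("Rhap1.fa.out.gff3"))` (a hypothetical file name) returns the two lists; both being empty is necessary but not sufficient for a file to pass full validation.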

I want to use number 4, but it has no tests or test data. @rosscrowhurst, can you kindly create a repo and publish it through the PFR org? I think we should have a PFR/gffutils or a PFR/bioinf-misc repo so that we can all put our code there and include tests and test data. I am happy to add test data and set up tests once the repo has been published.

As the pipeline will be published, I cannot include untested custom scripts.

@GallVp
Member

GallVp commented Sep 23, 2024

Test data and scripts for the above comment.

rm2gff3.zip

@GallVp modified the milestones: 0.4.0, backlog Sep 23, 2024
@GallVp
Member

GallVp commented Sep 27, 2024

Got permission from @rosscrowhurst to include his script. I'll add unit tests.

@GallVp
Member

GallVp commented Oct 6, 2024

@GallVp closed this as completed Oct 6, 2024