Inputs can also be files of files, not just folders

This follows the example of other bioinformatics tools, which allow users to store their input files across many directories; keeping thousands of inputs in a single directory is a performance issue. It also gives users more flexibility. Also added a specific unit test.
microbial-pangenomes-lab · May 21, 2024 · e504cee
1 parent c37ad01

Showing 6 changed files with 144 additions and 62 deletions.
diff --git a/README.md b/README.md
@@ -115,6 +115,16 @@ among them:
 * `--start -50 --stop 100 --sample 0.1`, will restrict the plot to 10% of samples and to the -50 to +100 region relative to the start codon
 * adding `--nucleotides` to the above command will add the nucleotide letters to each plot
 
+# Working with a very large dataset
+
+**Note:** this is a new functionality introduced in v1.6.0
+
+If you are working with more than a few thousand input files, it is poor practice to have
+all the inputs in a single directory (e.g. for performance reasons). Following what
+other bioinformatic tools do to solve this issue, the `--gff` and `--fasta` arguments
+can also be provided as "files-of-files", where the path to each input file is written
+in each line.
+
 # Prerequisites:
 
 The following packages and version have been used to develop and test `panfeed`
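
The README hunk above describes a "file-of-files" but does not show how to build one. Below is a minimal sketch of one way to do it, assuming the inputs are spread over nested subdirectories of a single root folder. The helper name `write_file_of_files` and the file names in the usage note are hypothetical and not part of `panfeed`. The diff does not state what the listed paths are resolved against, so the sketch simply writes each path as found under the given root.

```python
from pathlib import Path


def write_file_of_files(root, pattern, output):
    """Write the path of every file under `root` matching `pattern`, one per line.

    Hypothetical helper for illustration; not part of panfeed.
    """
    root = Path(root)
    # rglob descends into subdirectories, so inputs can live in many folders
    paths = sorted(p for p in root.rglob(pattern) if p.is_file())
    with open(output, "w") as handle:
        for path in paths:
            handle.write(f"{path}\n")
    return paths
```

For example, `write_file_of_files("gffs", "*.gff", "gff_list.txt")` would produce a list that could then be given as the `--gff` value in place of a directory.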

diff --git a/panfeed/__init__.py b/panfeed/__init__.py
@@ -1 +1 @@
-__version__ = '1.5.2-dev'
+__version__ = '1.6.0'
diff --git a/panfeed/__main__.py b/panfeed/__main__.py
@@ -87,7 +87,9 @@ def get_options():
     parser.add_argument("-g", "--gff",
                         required=True,
                         help = "Directory containing all samples' GFF "
-                               "files (must contain nucleotide sequence as "
+                               "files, or a file listing the relative path "
+                               "to each GFF file, one per line "
+                               "(must contain nucleotide sequence as "
                                "well unless -f is used, "
                                "and samples should be named in the "
                                "same way as in the panaroo header)")
@@ -118,10 +120,12 @@ def get_options():
 
     parser.add_argument("-f", "--fasta",
                         help = "Directory containing all samples' nucleotide "
-                               "fasta files (extension either .fasta "
+                               "fasta files, or a file listing the relative "
+                               "path to each fasta file, one per line "
+                               "(extension either .fasta "
                                "or .fna, "
                                "samples should be named in the "
-                               "same way as in the panaroo header")
+                               "same way as in the panaroo header)")
 
     parser.add_argument("-k", "--kmer-length", type = int,
                         default = 31,
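
The updated help strings say that `--gff` and `--fasta` accept either a directory or a file listing one path per line. The code that implements this is not visible in the hunks shown here, so the following is only a sketch of how such an argument could be resolved into a list of paths. The function `resolve_inputs` is a hypothetical name, and skipping blank lines is an assumption rather than documented behaviour.

```python
import os


def resolve_inputs(argument, extensions):
    """Return input paths from either a directory or a file-of-files.

    Hypothetical sketch; panfeed's actual implementation may differ.
    """
    if os.path.isdir(argument):
        # Directory mode: keep only files with one of the expected extensions
        return sorted(
            os.path.join(argument, name)
            for name in os.listdir(argument)
            if name.endswith(tuple(extensions))
        )
    # File-of-files mode: one path per line, blank lines ignored (assumption)
    with open(argument) as handle:
        return [line.strip() for line in handle if line.strip()]
```

With the extensions named in the `--fasta` help text, a call might look like `resolve_inputs(args.fasta, (".fasta", ".fna"))`, where `args` is the parsed argparse namespace.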