Provenance (#54)
* GitHub actions (#13)

* unit-testing actions

* unit-testing actions

* unit-testing actions

* unit-testing actions

* installing edirect

* installing edirect

* installing edirect

* installing edirect

* installing edirect

* rm travis

* edirect through apt

* edirect through apt

* Add files via upload

* adding taxonomy_v3.5.1

* More formats (#17)

* new files for individual genes and coordinates

* m

* new flag to include optional files with --and

* Listeria unit testing (#18)

* Listeria unit testing draft

* m

* debug

* debug

* debug

* update kalamari script; add --and flags

* kraken1 db

* m

* m

* m

* m

* editing PATH

* editing PATH

* fixing src path

* m

* fixing installation dir

* jellyfish1

* jellyfish1

* m

* just two genomes

* tree kraken

* added threads 2

* added threads 2

* build kraken -x

* work on disk in kraken

* debug

* trying out kraken2

* m

* removed rebuild and work-on-disk

* kraken report

* kraken report

* more inspection of kraken output

* more inspection of kraken output

* done with unit testing for now

Co-authored-by: Lee Katz - Aspen <[email protected]>

* new parent id

* a get taxonomy script for a reduced set of dmp files

* reduced taxonomy

* testing v3.9.2

* added parentid to plasmids

* Updating some Yersinia taxid (#16)

* Add files via upload

* adding taxonomy_v3.5.1

* adding v3.9.3 taxonomy

* m

* adding in Scott's Yersinia genomes

* cleanup

* updated to correct src tax dir

* Update unit-testing.yml

* Create CITATION.cff (#20)

* Create CITATION.cff

* Update CITATION.cff

* Kraken1 unit test (#21)

* with fixed taxonomy, unit test kraken1

* shortened the minimizer length to 9

* kraken1 query

* m

* adding a query is $query statement

Co-authored-by: Lee Katz - Aspen <[email protected]>

* Database doc update (#22)

* with fixed taxonomy, unit test kraken1

* shortened the minimizer length to 9

* kraken1 query

* m

* adding a query is $query statement

* Update DATABASES.md

* added blast and ANI instructions

* updated docs to reflect more comprehensive DATABASES.md

* m

Co-authored-by: Lee Katz - Aspen <[email protected]>

* mash database

* Define contributions (#23)

* validate taxonomy script

* unit testing for taxonomy

* unit testing for taxonomy

* moved XXXXXX entries to a todo file

* validating names.dmp and added new entries to make taxonomy more complete

* Contributing.md doc

* link to contributing.md

* more description under contributions

Co-authored-by: Lee Katz - Aspen <[email protected]>

* mmseqs2 just for fun

* m

* Sepia

* fixed bacillus genus back to bacteria in the plasmids (#24)

Co-authored-by: Lee Katz - Aspen <[email protected]>

* Build sepia (#25)

* fixed bacillus genus back to bacteria in the plasmids

* sepia building v1

* m

* sepia documentation and reference generation script

* m

Co-authored-by: Lee Katz - Aspen <[email protected]>

* fixed a bug where the same fasta file would be downloaded twice and given the parent taxid in addition to its own

* validate a kraken database better

* MIDAS

* m

* m

* Update README.md with reqs and recs (#29)

* Update chromosomes.tsv

* using GITHUB_PATH to solve CI problems

* m

* m

* limit tests to target branches

* jellyfish now in path

* m

* remove -x statement

* allow this workflow to work on master

* trying out taxonomy validator workflow

* remove kraken1 from testing on this branch

* fix path to taxonomy

* Fix ci (#31)

* using GITHUB_PATH to solve CI problems

* m

* m

* limit tests to target branches

* jellyfish now in path

* m

* remove -x statement

* allow this workflow to work on master

* trying out taxonomy validator workflow

* remove kraken1 from testing on this branch

* fix path to taxonomy

* check file sizes after pulling down accessions

* more debugging in the ci just in case

* change cryptosporidium parent taxids to cryptosporidium the genus

* marged new kalamari download script

* upped the version

* getExactTaxonomy.pl: better error messages

* downloadKalamari.pl: add in retmax 1

* only accept one sequence per insdc accession

* script to download kalamari from source

* numcpus option added; new bash script to download and format

* bash downloadKalamari.sh

* update to ubuntu 20

* 2 cpus in test

* add spreadsheet as a strategy variable

* m

* m

* split jobs between runners

* fix math

* adding more retries

* switch to 1 cpu for testing

* bump tag to v5.3.0

* std output for downloadKalamari.sh

* removed bioperl

* bump version; add more standard conda db location

* trying to speed up downloads

* vast speed increase with batch downloads; cleaned up chromosomes.tsv

* moved version information to the script from Makefile.PL; removed --and; won't make kraken db in shell script

* m

* remove edirect setup unit test

* update unit tests

* just two chunks of tests

* batch more

* fix file sizes check

* just make the damn thing work

* bash file uses local repo files instead of curl; default buffer size 100

* More proper build (#42)

* Building taxonomy (#38)

* building taxonomy files but this script will be deprecated right away

* deprecated

* script to build taxonomy with src files

* m

* move old taxonomy to deprecated

* remove old 'versioned' files outside of git versioning

* filter taxonomy script

* complete the taxonomy

* updated scripts for compiling databases

* dev branch testing

* fix lmono test a bit

* .

* Fix the taxonomy tests (#39)

* building taxonomy files but this script will be deprecated right away

* deprecated

* script to build taxonomy with src files

* m

* move old taxonomy to deprecated

* remove old 'versioned' files outside of git versioning

* filter taxonomy script

* complete the taxonomy

* updated scripts for compiling databases

* dev branch testing

* fix lmono test a bit

* .

* fix paths

* updated PATH

* updated PATH

* troubleshooting

* fix PATH again

* fix ls path

* remove that step

* updated tests to reflect build-taxonomy (#40)

* fix path to taxonomy files

* download and build taxonomy

* merge Listeria into Yersinia matrix

* m

* updated output directory as matrix.GENUS

* kraken1 tests patches

* m

* Fixed two more tests (#41)

* update yml

* query fallback

* debugging msg

* fix path to taxonomydb

* print first two lines of fasta files

* helpful cut statement

* remove head statement in last step

* bump version

* fix a downloading bug where sed stalls

* update for compressed kalamari library and more efficient kraken builds

* update download script

* Validate taxonomy (#43)

* validateTaxonomy update for just taxdirs; add 1 for filtered taxonomy; added DEBUG option for downloadKalamari.sh

* updated unit tests

* updated unit tests

* remove taxonomy stuff from downloadKalamari.sh

* fix validateTaxonomy syscall

* check on filtered tax in unit test

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolenscentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* init paper

* init paper

* some revisions; taxonomy; downloading

* some revisions; taxonomy; downloading

* swap example

* swap example

* references

* references

* stole Joe's draft-pdf.yml

* stole Joe's draft-pdf.yml

* update to version 4 of artifacts

* update to version 4 of artifacts

* plasmids description

* plasmids description

* ignore rendered manuscripts

* ignore rendered manuscripts

* some minor fixes; author affiliations; code examples

* some minor fixes; author affiliations; code examples

* added Shatavia; updated example

* added Shatavia; updated example

* m

* m

* revisions from Jess

* revisions from Jess

* refs

* refs

* fix list that became italics

* fix list that became italics

* updated Andrew's affiliation

* updated Andrew's affiliation

* plasmid defined species

* plasmid defined species

* gave a name to the JOSS rendering

* gave a name to the JOSS rendering

* try experimental docx file creation

* try experimental docx file creation

* try 2 with container

* try 2 with container

* correct artifact Action

* correct artifact Action

* m

* m

* upload artifact v4

* upload artifact v4

* branch agnostic

* branch agnostic

* try multiple formats; multiple uploads

* try multiple formats; multiple uploads

* fix some citations

* fix some citations

* fixed Dr. Lauer's info

* fixed Dr. Lauer's info

* remove format arg

* remove format arg

* shatavia's orcid

* shatavia's orcid

* added Rebecca's and Jess's orcids

* added Rebecca's and Jess's orcids

* updated DOIs

* updated DOIs

* fixed comment line

* fixed comment line

* added Entrez Edirect URL

* added Entrez Edirect URL

* more Entrez citation with help from CoPilot

* more Entrez citation with help from CoPilot

* Andrew's orcid

* Andrew's orcid

* misc

* misc

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* updated revisions from coauthors

* updated revisions from coauthors

* entered Taylor's revisiosn

* entered Taylor's revisiosn

* move Katie to acknowledgements due to her request

* move Katie to acknowledgements due to her request

* update genome list; stable efetching (#49)

* Add genomes (#45)

* Corynebacterium diphtheriae

* added Bifidobacterium adolenscentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* Esearch input (#47)

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolenscentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* make symlink to avoid naming mistakes

* check whether taxonkit is loaded

* use efetch -input

* fix tr bug

* Esearch input flag (#48)

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolenscentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* make symlink to avoid naming mistakes

* check whether taxonkit is loaded

* use efetch -input

* fix tr bug

* get latest edirect

* update installation instructions

* update installation instructions: fix PATH

* bring in other tests

* update installation method for search with unit-testing

* update installation method for search with kraken2

* debug the ls statement

* debug the ls statement

* debug the ls statement

* debug building taxonomy

* exclusive unit testing for taxonomy for right now

* install taxonkit

* changes from cdc clearance process

* changes from cdc clearance process

* disable buggy docx creation

* disable buggy docx creation

* fix blast+ formatting typo

* fix blast+ formatting typo

* Change to MIT license

* Update README.md: remove CC license sticker

* update entrez ref

* update entrez ref

* MRA

* MRA

* MRA

* MRA

* misc

* misc

* 500 words or less

* 500 words or less

* nix example

* nix example

* abstract

* abstract

* abbreviate genera

* abbreviate genera

* another paper revision

* another paper revision

* added asm pandoc template

* added asm pandoc template

* provenance

* Leptospira interrogans => CP020414

* some progress

* downloadKalamari.sh: nuccleotideAcc bug fuxed

* v5.7.2

* another round of provenance

* cleared out the unknowns list

* fixed chromosomes with sources

* chromosomes

* try to run CI

* fix wildcard

* better named sources for each assembly

* polish this directory

* assembly-complete.gz

* taylor's corrected orcid

* revert back to pdf of joss paper instead of MRA

* merge

---------

Co-authored-by: Scott Nguyen <[email protected]>
Co-authored-by: Scott Nguyen <[email protected]>
Co-authored-by: Curtis Kapsak <[email protected]>
4 people authored Dec 31, 2024
1 parent abba43a commit e75b9e7
Showing 40 changed files with 131,700 additions and 300 deletions.
47 changes: 47 additions & 0 deletions .github/workflows/draft-docx.yml.bak
@@ -0,0 +1,47 @@
on:
push:
paths:
- 'paper/**'
- '.github/workflows/draft-docx.yml'

name: JOSS docx rendering
env:
OPENJOURNALS_PATH: /usr/local/share/openjournals
format: docx
article_info_option: ""

jobs:
paper:
runs-on: ubuntu-latest
name: Paper Draft
container:
image: openjournals/inara:latest
env:
GIT_SHA: $GITHUB_SHA
JOURNAL: joss
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build draft docx
# inara -o docx paper/paper.md
run: |
/usr/local/bin/pandoc \
--data-dir="${{ env.OPENJOURNALS_PATH }}/data" \
--defaults="${{ env.OPENJOURNALS_PATH }}/${{ env.format }}/defaults.yaml" \
${{ env.article_info_option}} \
--resource-path=.:${{ env.OPENJOURNALS_PATH }}/scripts \
--variable="${{ env.JOURNAL }}" \
--variable=retraction:"${{ env.retraction }}" \
--variable=draft:"${{ env.draft }}" \
--metadata=draft:"${{ env.draft }}" \
--log="$logfile" \
"$input_file" \
"$@"
- name: Upload
uses: actions/upload-artifact@v4
with:
name: paper
# This is the output path where Pandoc will write the compiled
# PDF. Note, this should be the same directory as the input
# paper.md
path: paper/paper.docx
29 changes: 29 additions & 0 deletions .github/workflows/draft-pdf.yml
@@ -0,0 +1,29 @@
on:
  push:
    paths:
      - 'paper/**'
      - '.github/workflows/draft-pdf.yml'

name: JOSS pdf rendering

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper/paper.md
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper/paper.pdf
3 changes: 2 additions & 1 deletion .github/workflows/unit-testing.Listeria.Kraken1.yml
@@ -1,7 +1,8 @@
# This is a subsampling unit test to get early results
 on:
   push:
-    branches: [master, dev]
+    branches: [master, dev, validate-taxonomy]
+  pull_request:
name: Listeria-with-Kraken1

env:
5 changes: 2 additions & 3 deletions .github/workflows/unit-testing.Yersinia.Kraken2.yml
@@ -1,7 +1,8 @@
# This is a subsampling unit test to get early results
 on:
   push:
-    branches: [master, dev]
+    branches: [master, dev, validate-taxonomy]
+  pull_request:
name: Genera-with-Kraken2

env:
@@ -34,14 +35,12 @@ jobs:
- name: env check
run: |
echo $PATH | tr ':' '\n' | sort
- name: install-edirect
run: |
sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
echo $HOME/edirect >> $GITHUB_PATH
echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
tree $HOME/edirect
- name: apt-get install
run: sudo apt-get install ca-certificates tree jellyfish ncbi-entrez-direct
- name: select for only for this genus
4 changes: 2 additions & 2 deletions .github/workflows/unit-testing.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
 on:
   push:
-    branches: [master, dev]
+    branches: [master, dev, validate-taxonomy]
+  pull_request:
name: Pull-down-all-accessions

jobs:
@@ -25,7 +26,6 @@ jobs:

- name: apt-get install
run: sudo apt-get install ca-certificates tree

- name: install-edirect
run: |
sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
1 change: 1 addition & 0 deletions .github/workflows/validateTaxonomy.yml
@@ -1,6 +1,7 @@
on:
push:
branches: [master, dev, esearch-input]
pull_request:
name: Validate taxonomy

jobs:
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,2 +1,7 @@
edirect
share
paper/paper.html
paper/paper.doc
# pixi environments
.pixi
*.egg-info
9 changes: 5 additions & 4 deletions README.md
@@ -1,5 +1,4 @@
# Kalamari
A database of completed assemblies for metagenomics-related tasks

## Synopsis

@@ -9,12 +8,14 @@ These assemblies can be further used in formatted databases such as Kraken or Bl
### Prerequisites & Recommendations

Requirements:

- clone this repo locally `git clone https://github.com/lskatz/Kalamari.git`
- NCBI entrez-utilities set of tools `edirect`, `esearch`, etc.
- install via your package manager
- debian/ubuntu: `apt install ncbi-entrez-direct`
- debian/ubuntu: `apt install ncbi-entrez-direct`

Optional, but recommended:

- `NCBI_API_KEY` environmental variable
- `EMAIL` environmental variable

@@ -27,12 +28,12 @@ After obtaining an NCBI API key, add it to your environment with

where `unique_api_key_goes_here` is a unique hexadecimal number with characters from 0-9 and a-f.

You should also set your email address in the
You should also set your email address in the
`EMAIL` environment variable as edirect tries to guess it, which is an error prone process.
Add this variable to your environment with

export [email protected]

using your own email address instead of `[email protected]`.
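
As a minimal sketch of the two optional settings described above, the following shell function reports which of `NCBI_API_KEY` and `EMAIL` are still unset before a download is started. The function name `check_ncbi_env` is hypothetical and is not part of Kalamari or edirect; only the two variable names come from this README.

```shell
# Hypothetical helper (not part of Kalamari or edirect): warn when the
# optional variables named in the README are missing from the environment.
check_ncbi_env() {
  missing=0
  for name in NCBI_API_KEY EMAIL; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "warning: $name is not set" >&2
      missing=$((missing + 1))
    fi
  done
  # Exit status is the number of unset variables (0 means both are set).
  return "$missing"
}
```

Calling `check_ncbi_env` at the top of a wrapper script and inspecting its exit status is one way to catch a missing key early; because both variables are optional, a warning is likely a better response than an abort.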

## Download instructions
1 change: 0 additions & 1 deletion bin/buildKraken1.sh
@@ -40,7 +40,6 @@ kraken-build --db $DB --build --threads 1
# Reduce the size of the database
kraken-build --db $DB --clean


if [ ! -e "$sharedir/kalamari-kraken1" ]; then
ln -sv kalamari-kraken "$sharedir/kalamari-kraken1"
fi
175 changes: 175 additions & 0 deletions paper/asm.csl
@@ -0,0 +1,175 @@
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0" demote-non-dropping-particle="never" default-locale="en-US">
<info>
<title>American Society for Microbiology</title>
<title-short>ASM</title-short>
<id>http://www.zotero.org/styles/american-society-for-microbiology</id>
<link href="http://www.zotero.org/styles/american-society-for-microbiology" rel="self"/>
<link href="https://journals.asm.org/references" rel="documentation"/>
<author>
<name>Julian Onions</name>
<email>[email protected]</email>
</author>
<contributor>
<name>Rintze Zelle</name>
<uri>http://twitter.com/rintzezelle</uri>
</contributor>
<contributor>
<name>Richard Karnesky</name>
<email>[email protected]</email>
<uri>http://arc.nucapt.northwestern.edu/Richard_Karnesky</uri>
</contributor>
<contributor>
<name>Charles Parnot</name>
<uri>http://twitter.com/cparnot</uri>
<email>[email protected]</email>
</contributor>
<contributor>
<name>Patrick O'Brien</name>
</contributor>
<category citation-format="numeric"/>
<category field="biology"/>
<summary>Style for all American Society for Microbiology journals.</summary>
<updated>2022-01-22T01:10:09+00:00</updated>
<rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
</info>
<macro name="author">
<names variable="author" suffix=".">
<name sort-separator=" " initialize-with="" name-as-sort-order="all" delimiter=", " delimiter-precedes-last="always"/>
</names>
</macro>
<macro name="issued">
<group prefix=" " suffix=".">
<choose>
<if type="patent">
<date variable="issued">
<date-part name="month" suffix=" "/>
<date-part name="year"/>
</date>
</if>
<else>
<date variable="issued">
<date-part name="year"/>
</date>
</else>
</choose>
</group>
</macro>
<macro name="chapter-specifics">
<choose>
<if type="chapter paper-conference" match="any">
<label variable="page" form="short" plural="never" prefix=", " suffix=" "/>
<text variable="page"/>
<text term="in" text-case="capitalize-first" prefix=". " suffix=" " font-style="italic"/>
<names variable="editor" delimiter=", " suffix=", ">
<name initialize-with="" delimiter=", " name-as-sort-order="all" delimiter-precedes-last="always"/>
<label form="short" prefix=" (" suffix=")"/>
</names>
</if>
</choose>
</macro>
<macro name="patent-specifics">
<text variable="number" prefix=". "/>
</macro>
<macro name="container-title">
<choose>
<if type="bill book chapter graphic legal_case legislation motion_picture paper-conference report song" match="any">
<text variable="container-title"/>
</if>
<else>
<text variable="container-title" form="short" strip-periods="true" prefix=". "/>
</else>
</choose>
</macro>
<macro name="edition">
<choose>
<if is-numeric="edition">
<group delimiter=" " prefix=", ">
<number variable="edition" form="ordinal"/>
<text term="edition" form="short"/>
</group>
</if>
<else>
<text variable="edition" suffix="."/>
</else>
</choose>
</macro>
<macro name="publisher">
<choose>
<if type="article-journal article-magazine" match="none">
<group delimiter=". " prefix=". ">
<choose>
<if type="book" match="none">
<text variable="genre"/>
</if>
</choose>
<group delimiter=", ">
<text variable="publisher"/>
<text variable="publisher-place"/>
</group>
</group>
</if>
</choose>
</macro>
<macro name="locators">
<choose>
<if type="article-journal">
<choose>
<if match="none" variable="volume page">
<text variable="DOI" prefix=" https://doi.org/"/>
</if>
<else>
<group prefix=" " delimiter=":">
<text variable="volume"/>
<text variable="page"/>
</group>
</else>
</choose>
</if>
<else-if type="article">
<text variable="DOI" prefix=" https://doi.org/"/>
</else-if>
<else-if type="book webpage post post-weblog" match="any">
<group delimiter=". " prefix=". ">
<text variable="URL"/>
<group delimiter=" ">
<text term="retrieved" text-case="capitalize-first"/>
<date variable="accessed">
<date-part name="day"/>
<date-part name="month" prefix=" "/>
<date-part name="year" prefix=" "/>
</date>
</group>
</group>
</else-if>
</choose>
</macro>
<macro name="title">
<group delimiter=" ">
<text variable="title"/>
<text variable="version" prefix="(" suffix=")"/>
</group>
</macro>
<citation collapse="citation-number">
<sort>
<key variable="citation-number"/>
</sort>
<layout prefix="(" suffix=")" delimiter=", ">
<text variable="citation-number"/>
</layout>
</citation>
<bibliography entry-spacing="1" line-spacing="2" second-field-align="flush">
<layout suffix=".">
<text variable="citation-number" suffix=". "/>
<text macro="author"/>
<text macro="issued"/>
<text macro="title" prefix=" "/>
<text macro="chapter-specifics"/>
<text macro="patent-specifics"/>
<text macro="container-title"/>
<text macro="edition"/>
<text macro="publisher"/>
<text macro="locators"/>
</layout>
</bibliography>
</style>