Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweak GENCODE loading performance #55

Merged
merged 2 commits into from
Jul 26, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 23 additions & 15 deletions src/varity/ref_gene.clj
Original file line number Diff line number Diff line change
Expand Up @@ -107,15 +107,22 @@
(when (= (count row) 9)
{:attribute (->gencode-attr (nth row 8) attr-kv-sep)})))))

(defn- ->feature-map
(def ^:private ens-t-pattern #"ENST\d+\.\d+")
(def ^:private ens-g-pattern #"ENSG\d+\.\d+")

(def find-ens-t-id (memoize #(re-find ens-t-pattern %)))
(def find-ens-g-id (memoize #(re-find ens-g-pattern %)))

(defn- build-feature-map
([features]
(->feature-map features (zipmap [:transcript :exon :cds] (repeat {}))))
([gtf-lines data]
(reduce (fn [data {:keys [seqname feature attribute] :as gtf}]
(let [base (merge (select-keys gtf [:start :end])
{:chr seqname})]
(build-feature-map features (zipmap [:transcript :exon :cds] (repeat {}))))
([features data]
(reduce (fn [data {:keys [start end seqname feature attribute] :as gtf}]
(let [base {:start start
:end end
:chr seqname}]
(if-let [t-id (get attribute "transcript_id")]
(let [t-id (re-find #"ENST\d+\.\d+" t-id)]
(let [t-id (find-ens-t-id t-id)]
(case feature
"transcript"
(assoc-in data
Expand All @@ -124,7 +131,7 @@
{:name t-id
:name2 (get attribute "gene_name")
:gene-id (when-let [g-id (get attribute "gene_id")]
(re-find #"ENSG\d+\.\d+" g-id))
(find-ens-g-id g-id))
:strand (:strand gtf)
:score (:score gtf)}))

Expand All @@ -136,6 +143,7 @@
{:exon-number (as-long (get attribute "exon_number"))
:frame (:frame gtf)
:strand (:strand gtf)}))

"CDS"
(update-in data [:cds t-id] conj base)

Expand All @@ -145,7 +153,7 @@
data))
data)))
data
gtf-lines)))
features)))

(defn- extend-cds
"Extend 3'-most cds's `:end` or `:start` depending on the `strand` value
Expand Down Expand Up @@ -193,14 +201,14 @@
:strand strand))))

(defn load-gencode
[f parse-line]
[f parse-line & {:keys [chunk-size] :or {chunk-size 10000}}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to add this optional arg to both load-gtf and load-gff because currently these functions are easier to use. Or reimplement load-gencode to pick up an appropriate function based on f's extension (sorry my first implementation is not well designed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I had missed it.

BTW, I think the GTF/GFF3 loader should use cljam's implementation (https://github.com/chrovis/cljam/blob/master/src/cljam/io/gff.clj). I am willing to do a complete rewrite on that.

(with-open [rdr (io/reader (util/compressor-input-stream f))]
(let [feature-map (->> (line-seq rdr)
(keep parse-line)
->feature-map)]
(->> (:transcript feature-map)
(map #(->region feature-map %))
doall))))
(partition-all chunk-size)
(pmap #(doall (keep parse-line %)))
(apply concat)
build-feature-map)]
(doall (map (partial ->region feature-map) (:transcript feature-map))))))

(defn load-gtf
[f]
Expand Down