Skip to content

Latest commit

 

History

History
806 lines (642 loc) · 32.4 KB

rss-saver.org

File metadata and controls

806 lines (642 loc) · 32.4 KB

RSS-saver

RSS-saver

A simple Clojure (Babashka) script to save [email protected] blog posts locally.

I use hey.com’s blog feature to write blog posts to help me clarify and improve my own thinking about life, design, and programming. It’s a cool feature of a nice email service, but I worry that I may not be able to retrieve my posts in the event of service shutdown, or if I move on to another email provider in the future.

This script is meant to be run automatically every night (or so) via a crontab entry. It downloads the feed.atom xml file from the provided URL, checks for any changes from the previous download, and saves new entries.

readme

```
 ______________
|[]            |
|  __________  |
|  |  RSS-  |  |
|  | saver  |  |
|  |________|  |
|   ________   |
|   [ [ ]  ]   |
\___[_[_]__]___|

```

# RSS-saver
A simple Clojure (Babashka) script to save [email protected] blog posts locally.

I use hey.com's blog feature to write blog posts to help me clarify and improve my own thinking about life, design, and programming. It's a cool feature of a nice email service, but I worry that I may not be able to retrieve my posts in the event of service shutdown, or if I move on to another email provider in the future.

This script is meant to be run automatically every night (or so) via a crontab entry. It downloads the feed.atom xml file from the provided URL, checks for any changes from the previous download, and saves new entries.

You can read the [design doc](./rss-saver.org#design) for a complete understanding of this project.

You can check out an interactive blog post version of this project [here](https://adam-james-v.github.io/dev/rss-saver-web/).

## Installation
If you already have the latest Babashka, all you have to do is grab the script from this repo:

```sh
git clone https://github.com/adam-james-v/rss-saver.git
cd rss-saver
```

And install the script `rss-saver.clj` wherever you like to run scripts from.

```sh
cp rss-saver.clj /path/to/your/scripts/dir
```

## Usage
To run this script once, you must supply the feed url with `-u` or `--url`. The provided URL must point to the rss feed XML file directly. For example, my URL is [https://world.hey.com/adam.james/feed.atom](https://world.hey.com/adam.james/feed.atom).

**NOTE:** Your URL *must* point to the feed.xml file for this script to work.

```sh
bb rss-saver.clj -u YOUR_URL
```

The above command will download the XML file, parse and save to .edn files each post into the ./posts folder. You can change some options with the following:

```sh
Usage:
 -h, --help
 -u, --url URL                 The URL of the RSS feed you want to save.
 -d, --dir DIR        ./posts  The directory where articles will be saved.
 -f, --format FORMAT  edn      The format of saved articles. Either 'html' or
                               'edn' for a Clojure Map with the post saved as
                               Hiccup style syntax. Defaults to edn if unspecified.
 -c, --clear                   Clear the cached copy of the previous feed.
 -s, --silent                  Silence the script's output.
```

I use this script with an automation using crontab on macOS. My crontab entry:

```sh
0 12 * * * /usr/local/bin/bb /Users/adam/scripts/rss-saver/rss-saver.clj -u https://world.hey.com/adam.james/feed.atom -d /Users/adam/scripts/rss-saver/posts
```

Which runs the script once at noon every day. It's default save directory is ./posts, and cron runs the script from your home folder, so my articles are saved in `/Users/adam/scripts/rss-saver/posts`, but you can set the path to wherever you want using the `-d` or `--dir` options. I recommend using an absolute path to avoid confusion.

## Requirements
RSS-Saver requires [Babashka](https://github.com/babashka/babashka). While writing this script, I was using *version 0.6.0*. The script uses the following libraries, which are bundled with the latest Babashka:

 - clojure.string
 - clojure.data.xml
 - clojure.java.io
 - clojure.zip
 - clojure.java.shell
 - clojure.tools.cli
 - hiccup.core

If you want to run things automatically, you need some mechanism to automate running scripts. I am using crontab.

## Tests
Since I've written this project in a literate style with org-mode, I can use Noweb and named code blocks to 'tangle' two versions of the script: **rss-saver.clj** which contains the code and the CLI but no tests, and **rss-saver-tests.clj** which contains the code and tests but no CLI.

To run tests:

```sh
bb rss-saver-tests.clj
```

Tests use `clojure.test` and invoke all tests by calling `(t/run-tests)` at the bottom of the script.

## Status
The script is complete and working as intended. Bugs will be fixed if I encounter them or if someone posts an issue. This is intended to be a *very* simple script with a small and specific scope, so new features won't be implemented. This project is *done* (Yay!).

deps

{:deps
  {org.clojure/data.xml {:mvn/version "0.0.8"}
   org.clojure/zip      {:mvn/version "1.10.2"}
   hiccup/hiccup        {:mvn/version "2.0.0-alpha2"}}}

Design

The problem that RSS-saver solves is one of backing up and keeping open access to my blog posts on https://world.hey.com/adam.james; a useful blog service offered to paying Hey email users. I enjoy this service and intend to keep using it for some time, but there’s always the potential that the service changes or disappears, or I change my mind about using it. In such a case, I want to be certain that my blog posts are still available to me in some useful form.

So, this project has the following requirements:

  • automatically download the RSS feed XML
  • parse the feed into individual entries
  • cache the feed to avoid constantly re-parsing downloaded posts
  • save entries in an open format
  • runnable as a Babashka script with no external deps
  • work with [email protected] feed URLs

And will not:

  • guarantee correct parsing of feeds from other services
  • render the posts into anything other than a basic .html page or .edn file.
  • handle automation internally
  • detect changes to the feed; only pull/compare every time the script runs

This project is considered complete when the above requirements are met with clearly working functionality. That is, the invokation of the script, with the proper URL parameter, must successfully download, parse, and save the blog entries to my save directory.

Meta-Problem

I have a problem of not always finishing my work. As a self-taught dev, I often worry that I’m missing big important skills in software development, and one thing I know for sure is that an inability to finish projects is a problem. This project is the first of a series of small yet concrete projects that can be well-designed, well-scoped, and clearly considered finished once the design goal has been met.

In short, this project aims to solve my meta-problem of having a weak ability to design and complete software projects. This design doc is a specific effort on my part to be clear up-front about the project’s goals and intent.

RSS

Here’s my RSS link that I’ll be using:

https://world.hey.com/adam.james/feed.atom

I assume that the atom file at that address is automatically updated any time a post is created, and I assume it’s just XML with all of the blog’s content.

What I’m pretty sure RSS does:

Every time the site updates, the feed.atom file is re-generated with the newest content appended. Then, the RSS reader is a separate app that polls feed.atom URLs, downloads them, and parses/displays the contents according to the app’s design.

Using these assumptions, I am making a very simple tool that just pulls the entire feed XML every time, compares it to a cached file, and parses new entries into some structure which can be saved.

Downloading the Feed XML

To download the feed, I will simply use (slurp url).

Parsing

To parse the feed, I am using clojure.data.xml and some zipper manipulation functions. The feed is parsed into an XML tree. At this point, I can grab a list of nodes that match the entry tag. It is this list of entries over which I map various functions to clean up and ultimately save the entries as files (.html or .edn) in the posts directory.

My format of choice is a .edn file which is just the Clojure map for each entry saved to a file. The map contains the following keys: (:email :content :updated :name :title :link :id :post :published). Most keys are self-explanatory, but I want to note the :post and :content keys, which are a bit ambiguous.

The :content key is the unmodified XML tree node that comes from the initial parse of the feed. This is left so that any future scripts or rendering functions still have access to the entirety of the unchanged data.

The :post key contains the parsed and modified Hiccup data structure, which follows some specific logic for formatting and improving the html’s structure. For example, instead of plain strings and <br> tags, <p> tags are used. This data manipulation is suited to my purposes, and leaves a nice, clean, hiccup structure for future rendering scripts. It is exactly this :post value that gets rendered when exporting the basic .html page. If other users wish to handle the posts differently, they can use the :content key as previously mentioned.

Caching

To cache, I save the downloaded feed.xml into the posts directory. Then, whenever the script is run, I slurp both the current feed from the URL and the previous feed from the local file. With each in memory, I parse them into XML trees and get the entry nodes into a set. Removing from the current set all entries from the previous set, I am left with only new posts. If the set is empty, no further action is taken and the script terminates with a message.

Saving

All saving (of the cache and posts) is handled with (spit (str dir file)). Formats are limited to .html and .edn, and the main reason .html is provided is because I get it ‘for free’ because I want to have my posts saved in .edn files with a clean Hiccup style structure.

Using Babashka

I want to use Babashka because I really love Clojure but want a tool that is mentally ‘lightweight’ and very quick and easy. Babashka v0.6.0 has a bunch of built in libraries already and works quickly and reliably. I won’t need any dependencies to be downloaded for this script, which keeps its portability high, and makes it straight forward for other people to fork and modify the script for their own purposes, if they desire.

I am only guaranteeing that the parsing strategy in this script will work for hey.com feeds, as I really don’t want to cover other scenarios. I can’t predict what other people might want from other feeds. The strategy in this script is quite simple, so anyone could modify things to fit the feeds they care about anyway. As well, I do also save the un-modified content node, which can be used to construct whatever render someone could want.

Other feeds may actually work fine, but I’m not guaranteeing it. Nor am I going to modify my script to handle them.

main

ns

As part of the design criteria, I want this to work without pulling any new libraries from outside of the babashka tool. This means sticking with clojure.data.xml even though other libraries might be a little more straight forward. I can build a zipper editor easily enough so it’s not a problem.

I’ll want to run it as a CLI, so I’ll need tools.cli as well.

#!/usr/local/bin/bb
(ns rss-saver.main
  (:require [clojure.string :as str]
            [clojure.data.xml :as xml]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [clojure.java.shell :as sh :refer [sh]]
            [clojure.tools.cli :as cli]
            [hiccup.core :refer [html]]))

zipper-tools

I want to get better with zippers, but for now, I can use the examples provided by https://ravi.pckl.me/short/functional-xml-editing-using-zippers-in-clojure/. I should probably make a post/video about zippers to improve my own understanding of them, and re-implement my own editor functions in that process.

;; https://ravi.pckl.me/short/functional-xml-editing-using-zippers-in-clojure/
(defn edit-nodes
  "Edit nodes from `zipper` that return `true` from the `matcher` predicate fn with the `editor` fn.
  Returns the root of the provided zipper, *not* a zipper.
  The `matcher` fn expects a zipper location, `loc`, and returns `true` (or some value) or `false` (or nil).
  The `editor` fn expects a `node` and returns a potentially modified `node`."
  [zipper matcher editor]
  (loop [loc zipper]
    (if (zip/end? loc)
      (zip/root loc)
      (if-let [matcher-result (matcher loc)]
        (let [new-loc (zip/edit loc editor)]
          (if (not (= (zip/node new-loc) (zip/node loc)))
            (recur (zip/next new-loc))
            (recur (zip/next loc))))
        (recur (zip/next loc))))))

(defn remove-nodes
  "Remove nodes from `zipper` that return `true` from the `matcher` predicate fn.
  Returns the root of the provided zipper, *not* a zipper.
  The `matcher` fn expects a zipper location, `loc`, and returns `true` (or some value) or `false` (or nil)."
  [zipper matcher]
  (loop [loc zipper]
    (if (zip/end? loc)
      (zip/root loc)
      (if-let [matcher-result (matcher loc)]
        (let [new-loc (zip/remove loc)]
          (recur (zip/next new-loc)))
        (recur (zip/next loc))))))

(defn get-nodes
  "Returns a list of nodes from `zipper` that return `true` from the `matcher` predicate fn.
  The `matcher` fn expects a zipper location, `loc`, and returns `true` (or some value) or `false` (or nil)."
  [zipper matcher]
  (loop [loc zipper
         acc []]
    (if (zip/end? loc)
      acc
      (if (matcher loc)
        (recur (zip/next loc) (conj acc (zip/node loc)))
        (recur (zip/next loc) acc)))))

(defn match-tag
  "Returns a `matcher` fn that matches any node containing the specified `key` as its `:tag` value."
  [key]
  (fn [loc]
    (let [node (zip/node loc)
          {:keys [tag]} node]
      (= tag key))))

entry-nodes

Slurp the XML from the given URL. This returns a string which can be parsed with xml/parse-str. The feed itself has some extra data we don’t need, so I want to turn it into a zipper and get a list of just the entry nodes, which are the posts in the blog.

(defn feed-str->entries
  "Returns a sequence of parsed article entry nodes from an XML feed string."
  [s]
  (-> s
      (xml/parse-str {:namespace-aware false})
      zip/xml-zip
      (get-nodes (match-tag :entry))))

entry-transforms

The entire feed has been parsed down to a sequence of entries, each of which can be considered its own tree of nodes. Node transforms can now be built to work with each entry individually.

normalize

Each entry can be ‘flattened’ down a bit, so I have a normalize function to help with that. Content within any node is a sequence of strings or other nodes. At this stage, all strings within the entry’s content are empty or newline characters and so can be filtered out.

There are two special elements: links and the author content. Links have empty :content tags but need the :href from the attributes instead, so a cond is built to handle this. The author map is built separately, using the same map function as with the rest of the content. Then, the content and author maps are merged to form the flat, normalized map, which can be processed further.

(defn normalize-entry
  "Normalizes the entry node by flattening content into a map."
  [entry]
  (let [content (filter map? (:content entry))
        f (fn [{:keys [tag content] :as node}]
            (let [val (cond (= tag :link) (get-in node [:attrs :href])
                            :else (first content))]
                {tag val}))
        author-map (->> content
                        (filter #(= (:tag %) :author))
                        first :content
                        (filter map?)
                        (map f)
                        (apply merge))]
   (apply merge (conj
                 (map f (remove #(= (:tag %) :author) content))
                 author-map))))

clean-html

Since no external libraries are used, I am manipulating XML strings slightly to keep the XML parser from complaining about html tags that don’t have terminating tags, like <br> and <img>. At the same time, I unwrap image tags from figures, which is how Hey.com wraps images in entries.

This string cleaning method is as bit of a hack, but works fine and is meant to allow clojure.data.xml to continue being used for further parsing/transforming steps later on in the script.

The clean-html function is run on every entry’s content string after normalization.

(defn unwrap-img-from-figure
  "Returns the simplified `:img` node from its parent node."
  [node]
  (let [img-node (-> node
                 zip/xml-zip
                 (get-nodes (match-tag :img))
                 first)
        new-attrs (-> img-node :attrs
                      (dissoc :srcset :decoding :loading))]
    (assoc img-node :attrs new-attrs)))

(defn clean-html
  "Cleans up the html string `s`.
  The string is well-formed html, but is coerced into XML conforming form by closing <br> and <img> tags.
  The emitted XML string has the <\\?xml...> tag stripped.
  This cleaning is done so that clojure.data.xml can continue to be used for parsing in later stages."
  [s]
  (let [s (-> s
              (str/replace "<br>" "<br></br>")
              (str/replace #"<img[\w\W]+?>" #(str %1 "</img>")))]
    (-> s
        (xml/parse-str {:namespace-aware false})
        zip/xml-zip
        (edit-nodes (match-tag :figure) unwrap-img-from-figure)
        xml/emit-str
        (str/replace #"<\?xml[\w\W]+?>" ""))))

node-transforms

The .edn file output will have a Hiccup data structure as its :post value. So, I need to build a set of functions that transform XML nodes (defrecords, which can be treated just as Clojure maps) into Hiccup-style vectors (eg. [:p {:display "inline-block"} "This is the content of a <p> tag.]).

dispatch

I want to dispatch slightly different behaviour based on the element tag, so will use a multimethod. I like to build in a simple check in the dispatch function for lists of nodes. This way, I can handle recursive use of node->hiccup by building the :list method appropriately.

(defmulti node->hiccup
  (fn [node]
    (cond
      (map? node) (:tag node)
      (and (seqable? node) (not (string? node))) :list
      :else :string)))

simple-cases

I don’t need much special behaviour, so the default ‘catch-all’ method will do most of the work. A simple string case and div case are also given.

(defmethod node->hiccup :string
  [node]
  (when-not (= (str/trim node) "") node))

(defmethod node->hiccup :br [_] [:br])
(defmethod node->hiccup :div [node] (node->hiccup (:content node)))

(defmethod node->hiccup :default
  [{:keys [tag attrs content]}]
  [tag attrs (node->hiccup content)])

(defmethod node->hiccup :img
  [{:keys [tag attrs]}]
  [tag (assoc attrs :style {:max-width 500})])

List Case

This case has a bit of machinery to it. Every time the list method is used, it means that a sequence of nodes have to be handled. To clean up the structure, I am building a flattening function that runs on each list. This flatten function will flatten everything down completely, except for hiccup vectors. I can’t simply mapcat everything because it would destry the hiccup-style structure, as vectors can be flattened down to their elements. The result of selective-flatten is a flat list of strings and/or hiccup elements.

(defn de-dupe
  "Remove only consecutive duplicate entries from the `list`."
  [list]
  (->> list
       (partition-by identity)
       (map first)))

(defn selective-flatten
  ([l] (selective-flatten [] l))
  ([acc l]
   (if (seq l)
     (let [item (first l)
           xacc (if (or (string? item)
                        (and (vector? item) (keyword? (first item))))
                 (conj acc item)
                 (into [] (concat acc (selective-flatten item))))]
       (recur xacc (rest l)))
     (apply list acc))))

(defmethod node->hiccup :list
  [node]
  (->> node
       (map node->hiccup)
       (remove nil?)
       de-dupe
       selective-flatten))

re-grouping

The flattened list of hiccup elements can then be processed and re-grouped on the basis of inline elements and string-br pairs. The html from hey.com blog posts has a lot of <br> tags and plain strings. I think that comes from the fact that it’s html formatted to be viewed by email readers. However, for re-hosting to my own site, I want to use proper html structure, and so I want to group plain strings and <br> tags into <p> tags. I also need to make sure ul, ol, li, em, and strong tags are handled appropriately, so I have some grouping to do.

I also de-dupe the list which can be helpful in eliminating extra newlines. There is a slight risk of this eliminating a deliberately duplicated sentence, but I’ll just accept that as a potential weakness to this solution. I don’t think I’ll use that writing style at all anyway.

(defn inline-elem? [item] (when (#{:em :strong :a} (first item)) true))
(defn inline? [item] (or (string? item) (inline-elem? item)))

(defn group-inline
  "Groups the `list` of strings and Hiccup elements using the `inline?` predicate and wraps them in <p> tags.
  Once all groups are wrapped, the list is flattened again and any remaining <br> tags are removed."
  [list]
  (let [groups (partition-by inline? list)
        f (fn [l]
            (if (not= (first (first l)) :br)
              (into [:p] l)
              l))]
    (->> groups
         (map f)
         selective-flatten
         (remove #(= :br (first %))))))

edn

Put all of the node transforms and list manipulations together to build an entry->edn function.

(defn html-str->hiccup
  "Parses and converts an html string `s` into a Hiccup data structure."
  [s]
  (-> s
      (xml/parse-str {:namespace-aware false})
      node->hiccup
      group-inline
      de-dupe))

(defn entry->edn
  "Converts a parsed XML entry node into a Hiccup data structure."
  [entry]
  (let [entry (normalize-entry entry)]
    {:id (:id entry)
     :file-contents (assoc entry :post (->> entry :content
                                            clean-html
                                            html-str->hiccup))}))

html

Since I have the parsing machinery, it’s trivial to build an html page export function now. I simply have to make a document structure with Hiccup and place the content from the entry inside.

NOTE: I have a (str/replace #"</br>" "") hack in this fn because I cannot figure out why my Babashka script is emitting closing br tags. In the REPL it works fine… If I leave the closing tags there, my web browser interprets it as two <br> tags instead, making the page render incorrectly.

(defn readable-date
  "Format the date string `s` into a nicer form for display."
  [s]
  (as-> s s
    (str/split s #"[a-zA-Z]")
    (str/join " " s)))

(defn entry->html
  "Converts a parsed XML entry node into an html document."
  [entry]
  (let [entry (normalize-entry entry)
        info-span (fn [label s]
                    [:span {:style {:display "block"
                                    :margin-bottom "2px"}}
                     [:strong label] s])
        post (->> entry :content
                   clean-html
                   html-str->hiccup)]
    (assoc entry :file-contents
           (->
            (str
            "<!DOCTYPE html>\n"
            (html
             {:mode :html}
             [:head
              [:meta {:charset "utf-8"}]
              [:title (:title entry)]]
             [:body
              [:div {:class "post-info"}
               (info-span "Author: " (:name entry))
               (info-span "Email: " (:email entry))
               (info-span "Published: " (readable-date (:published entry)))
               (info-span "Updated: " (readable-date (:updated entry)))]
              [:a {:href (:link entry)} [:h1 (:title entry)]]
              post]))
           (str/replace #"</br>" "")))))

CLI

The CLI handles the actual running of the program. I have a save! function that does the work, and -main is what is invoked when running the program via bb rss-saver.clj -u URL in your terminal or via a crontab entry.

The save! function can appear a bit confusing at first. It’s just doing the following things:

  1. detecting to save as edn or html from the options map.
  2. Downloading the current feed XML as a string and saving in memory.
  3. Loading the previous XML feed (if one exists) from the post directory and saving in memory.
  4. Parsing both feed strings into two lists of entry nodes.
  5. Creating the list of new entries by filtering the current nodes against previous nodes
  6. For every new entry, save with the appropriate transform function, as determined by the opts map.
(def cli-options
  [["-h" "--help"]
   ["-u" "--url URL" "The URL of the RSS feed you want to save."]
   ["-d" "--dir DIR" "The directory where articles will be saved."
    :default "./posts"]
   ["-f" "--format FORMAT" "The format of saved articles. Either 'html' or 'edn' for a Clojure Map with the post saved as Hiccup style syntax. Defaults to edn if unspecified."
    :default "edn"]
   ["-c" "--clear" "Clear the cached copy of the previous feed."]
   ["-s" "--silent" "Silence the script's output."]])

(defn clear!
  [opts]
  (let [prev-fname (str (:dir opts) "/" "previous-feed.atom")]
    (sh "rm" "-f" prev-fname)))

(defn save!
  [opts]
  (let [save-fn (get {"html" entry->html
                      "edn" entry->edn} (:format opts))
        cur-str (slurp (:url opts))
        prev-fname (str (:dir opts) "/" "previous-feed.atom")
        prev-str (when (.exists (io/file prev-fname))
                   (slurp prev-fname))
        prev (when prev-str (feed-str->entries prev-str))
        cur (feed-str->entries cur-str)
        entries (remove (into #{} prev) cur)]
    (if (> (count entries) 0)
      (do
        (when-not (:silent opts)
          (println "Handling" (count entries) "entries as" (str (:format opts) ".")))

        ;; always create the posts directory
        (sh "mkdir" "-p" (:dir opts))

        ;; for each entry, transformed with the appropriate fn, save the file with id.ext
        (doseq [{:keys [id file-contents]} (mapv save-fn entries)]
          (let [fname (str
                       (:dir opts) "/"
                       (second (str/split id #"/")) "."
                       (:format opts))]
            (spit fname file-contents)))
        (spit prev-fname cur-str))

      ;; when there are no new entries, simply tell the user and then do nothing.
      (when-not (:silent opts)
        (println "No changes found in feed.")))))

(defn -main
  [& args]
  (let [parsed (cli/parse-opts args cli-options)
        opts (:options parsed)]
    (cond
      (:help opts)
      (println "Usage:\n" (:summary parsed))

      (nil? (:url opts))
      (when-not (:silent opts)
        (println "Please specify feed URL."))

      (not (#{"html" "edn"} (:format opts)))
      (when-not (:silent opts)
        (println "Invalid format:" (:format opts)))

      :else
      (do
        (when (:clear opts) (clear! opts))
        (save! opts)))))
;; apply -main to the args because I call this script with bb rss-saver.clj -u URL
;; if you run this script with clj -m, these two s-exprs should be commented out.
(apply -main *command-line-args*)
(shutdown-agents)

script

This section collects the named src blocks and tangles them to the rss-saver.clj script file.

<<shebang>>
<<ns>>
;;  Zipper Tools
;; --------------

<<zipper-tools>>
;;  Entry Nodes
;; -------------

<<entry-nodes>>
;;  Entry Transforms
;; ------------------

<<normalize>>
<<clean-html>>
;;  Node Transforms
;; -----------------

<<mm-dispatch>>
<<mm-simple-cases>>
<<mm-list-case>>
<<re-grouping>>
;;  entry->
;; ---------

<<to-edn>>
<<to-html>>
;;  CLI
;; -----

<<CLI>>
<<invoke>>

tests

I’ll just use noweb tangling to create a version of the script with tests built in. Then, invoke (run-tests) at the bottom of the script.

This way, rss-saver.clj and rss-saver-tests.clj will both pull from the same sources and be independently runnable with bb.

ns

(ns rss-saver.main-test
  (:require [clojure.string :as str]
            [clojure.data.xml :as xml]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [clojure.java.shell :as sh :refer [sh]]
            [clojure.tools.cli :as cli]
            [clojure.test :as t :refer [deftest is]]
            [hiccup2.core :refer [html]]))

entries

Create some data for testing.

;; this is a pull from my feed.atom at a point where 7 posts existed
(def entries (feed-str->entries (slurp "sample-feed.atom")))
(def raw-entry-content (:content (nth entries 4)))
(def normalized-entry (normalize-entry (nth entries 4)))

(def normalized-entry-keys
  #{:name :email :id
    :published :updated
    :link :title :content})

(deftest entry-tests
  (is (= 7 (count entries)))
  (is (seq? raw-entry-content))
  (is (= (type (:content normalized-entry)) java.lang.String))
  (is (= normalized-entry-keys (into #{} (keys normalized-entry)))))

images

(defn get-figure-nodes
  [s]
  (let [s (-> s
              (str/replace "<br>" "<br></br>")
              (str/replace #"<img[\w\W]+?>" #(str %1 "</img>")))]
    (-> s
        (xml/parse-str {:namespace-aware false})
        zip/xml-zip
        (get-nodes (match-tag :figure)))))

(def fig-img (-> (:content normalized-entry)
                 get-figure-nodes
                 first))

(def unwrapped-img (unwrap-img-from-figure fig-img))
                    
(deftest image-tests
  (is (= :figure (:tag fig-img)))
  (is (= :img (:tag unwrapped-img))))

node->hiccup

(def some-string "Hi there!")
(def empty-string "  ")
(def br {:tag :br :attrs {} :content '()})

(def img-attrs {:src "https://www.fillmurray.com/300/200"
                :alt "Bill Murray is great!"})
(def img {:tag :img
          :attrs img-attrs
          :content '()})
(def img-result [:img (merge img-attrs {:style {:max-width 500}})])

(def div {:tag :div
          :attrs {:a 1 :b 2}
          :content (list some-string br img)})
(def div-result (list "Hi there!" [:br] img-result))

(def default (assoc div :tag :anything))
(def default-result [:anything (:attrs div) div-result])

(def duplicates ["a" "a" "a" "b" "a" "c" "c" "d" "d" "c"])
(def nested (list (list (repeat 4 "a") (repeat 3 [:br]) "a" [:em "b"] "c")))

(deftest node->hiccup-tests
  (is (= (node->hiccup some-string) "Hi there!"))
  (is (nil? (node->hiccup empty-string)))
  (is (= (node->hiccup br) [:br]))
  (is (= (node->hiccup img) img-result))
  (is (= (node->hiccup div) div-result))
  (is (= (node->hiccup default) default-result))
  (is (= (de-dupe duplicates) ["a" "b" "a" "c" "d" "c"]))
  (is (= (selective-flatten nested)
         (list "a" "a" "a" "a" [:br] [:br] [:br] "a" [:em "b"] "c")))
  (is (= (group-inline (selective-flatten nested))
         (list [:p "a" "a" "a" "a"] [:p "a" [:em "b"] "c"]))))

run-tests

(t/run-tests)

test-script

Gathering the named src blocks to build the test script.

<<test-ns>>
;;  Zipper Tools
;; --------------

<<zipper-tools>>
;;  Entry Nodes
;; -------------

<<entry-nodes>>
;;  Entry Transforms
;; ------------------

<<normalize>>
<<clean-html>>
;;  Node Transforms
;; -----------------

<<mm-dispatch>>
<<mm-simple-cases>>
<<mm-list-case>>
<<re-grouping>>
;;  entry->
;; ---------

<<to-edn>>
<<to-html>>
;;  Tests
;; -------

<<test-entries>>
<<test-images>>
<<test-node->hiccup>>
<<test-run>>