Seqs of Clojure Zippers

Thus far, in the first two parts of what is turning into a series, when walking Clojure zippers, we’ve been mostly using a pattern like this:

(require '(clojure [zip :as zip]))
(defn recursion-for-recursion
  "Walking around doing nothing."
  [loc]
  (if (zip/end? loc)
    loc
    (recur (zip/next loc))))

Now, obviously, this function wouldn’t do anything except run the zipper out to its :end. But that has been our pattern so far.

Recursion smell

But this is Clojure and every use of recur should make you wonder for a second whether or not you can accomplish the same thing using a sequence. Take a look at this function, that I ran into here:

(defn leafnodes
  "Returns all leaf nodes in loc. "
  [loc]
  (filter (complement zip/branch?) ;filter only non-branch nodes
          (take-while (complement zip/end?) ;take until the :end
                      (iterate zip/next loc))))

This could be rewritten with our familiar pattern:

(defn leafnodes-r
  "Recursive version of leafnodes."
  [loc accum]
  (if (zip/end? loc)
    accum
    (recur (zip/next loc)
           (if (zip/branch? loc)
             accum
             (conj accum loc)))))

It is interesting to see these functions side by side, or at least one above the other. (iterate zip/next loc) does the actual (lazy) recursion; (take-while (complement zip/end?)…) limits the recursion so that we stop at the end. By this time, it is pretty clear that we are getting a seq of locations that we could do basically anything with. filter is used here to take only the leaf nodes, which is what we get in the end.

Let’s try some stuff:

(def zzz (zip/vector-zip [:a [:b [:c :d] :e]]))
(map zip/node (leafnodes zzz))
; (:a :b :c :d :e)

Mission accomplished: all leaf nodes. What about the recursive version?

(map zip/node (leafnodes-r zzz []))
; (:a :b :c :d :e)
(= (leafnodes zzz) (leafnodes-r zzz []))
; true

;; just for fun
(flatten (zip/root zzz))
; (:a :b :c :d :e)

That’s fine then, it works. The powerful thing here is that we now have the full combined firepower of the seq and zipper abstractions for extracting information from a tree.

Parsing HTML

Let’s have some fun with this pattern playing with everyone’s favorite form of XML. We are going to play around with extracting the links from an HTML page. Here is our namespace declaration:

(ns info.josf.examples.zipper-seq
  (:require [clojure.zip :as zip]
            [clojure.data.xml :as dx]))

I’ve actually put all these examples into a Github repo, which could be helpful if you want to follow along in your REPL.

First, we grab a page and stuff it into a zipper:

(defn zappy
  "grab a webpage we can trust, and parse it into a zipper"
  []
  (-> "http://richhickey.github.io/clojure/clojure.zip-api.html"
      clojure.java.io/input-stream
      dx/parse
      zip/xml-zip))

There are three simple steps here: dx/parse wants a stream, hence java.io. It parses the page then we pass that on to zip/xml-zip, which is the specialized zipper-builder for XML. Even here there is the node-space/zipper-space thing going on: the first function builds the actual tree, and zip/xml-zip wraps it up as as zipper. Now let’s do something.

(defn get-links
"Returns a list of href attributes from :a nodes in an xml zipper."
  [loc]
  (map #(get-in (zip/node %) [:attrs :href])
       (filter #(= :a (:tag (zip/node %)))
               (take-while (complement zip/end?)
                           (iterate zip/next loc)))))
(get-links zappy)
; ("index.html" "index.html" "#" "index.html"
; "branch-1.1.x/index.html" "index.html" "api-index.html"
; "#" "clojure.core-api.html".... etc.)

There is almost nothing to explain here. It is the same basic function as before, except we’ve adapted the filter predicate to work with an XML tree, and we’re using map to grab the :href value from each node.

If you’ve never dealt with XML trees in Clojure before, this might look slightly confusing. The basic idea with this kind of tree representation is that each node is a map with three keys. (clojure.data.xml actually uses records; non-text nodes are represented as Elements, but it amounts to the same thing.) It looks like this:

{:tag :a,
 :attrs {:href "http://example.com"}
 :content ("The link text")}

In our function, when we say (map #(get-in (zip/node %) [:attrs :href]), we are just grabbing the href out of the node’s attrs map.

This is a pretty basic task: walk a tree, grab some data, report back with a list. Any language can do this of course. Clojure zippers let you do it fairly simply, and in fact this little function seems at least as simple as something comparable in XPath/XSLT.

A contrived twist

Now let’s try something a little crazier. We’ll parse the same document, but this time we want more than just the href values: we also want to know the depth, relative to the document root, of each link, as well as how many sub-nodes there are in the <a/>. For a given link, we want something like this:

{:href "index.html" :depth 5 :subnodes 1}

Step one: just a seq

The get-links function returns a seq of URL strings, but for this we need locs, so we have to simplify the function:

(defn get-a-zips
  "Return a lazy seq of :a locations"
  [loc]
  (filter #(= :a (:tag (zip/node %)))
          (take-while (complement zip/end?)
                      (iterate zip/next loc))))
;;; Let's run put the result in a var. Printing the result
;;; will eat up a lot of space in your REPL.
(def a-list (get-a-zips zapi))

This is the same function as before, less the mapping that extracted the URLs. It is just going to give us the bare list of locations. The rest of the work is coming up with a function to map over this seq.

Step two: ancestor count

Suddenly we realize that each element in the list of link node zippers has access to the entire tree. Even though it feels like we are pulling the tree apart, filtering out everything but the parts that we are interested in, each part is actually one of those grains of sand that contains the entire universe (so to speak).

Now, we all love Clojure’s seq interface, but there is something that can trip you up. Or at least that is what happened to me. I’ve talked a lot about the distinction between “zipper space” and “node space”, because, in my experience anyway, this is often the difference between getting it or not.

So when we then start thinking about all of this through the seq interface, it can get a little bit more confusing: we’ve got a seq, but a seq of what?

I expected to get a bunch of nodes. This is such a common pattern: walk a tree, look at the nodes, accumulate data, do something. What is really cool, and possibly confusing, about Clojure zippers, is that often you don’t get nodes, you get locations. That means that you aren’t extracting data, you are locating it in the tree. Each element of the seq has access to the entire tree. Each element that gets returned is still a zipper. (Unless it’s not, of course: functions like zip/children return nodes, not locs.)

So, given a loc, how do you count its ancestors? Well, you write a function.

(defn ancestor-count
  "How many ancestors do I have?"
  [loc]
  (count (take-while zip/up
               (iterate zip/up loc))))

This is the same pattern as before, except that we iterate using zip/up, and we keep going up until zip/up returns nil, which is what can only happen at the root node.

Let’s take it for a spin:

(map ancestor-count a-list)
; (4 5 7 9 9 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 11 12 12 12 12 12 12
12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 10 10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
10 10 4)

Looks good to me… except wait a second! We can do the same thing in fewer characters using the zipper’s path:

(defn ancestor-count
  "How many ancestors do I have?"
  [loc]
  (count (zip/path loc)))

Step three: how many kids?

For simplicity’s sake, we’ll just count the total number of descendant nodes. This is one of those times when using a sub-zipper seems like a good idea. By that I mean, as I discussed in my first post on zippers, that zippers are cheap to make, and that it can be useful sometimes to turn a given node into a “sub-zipper”.

(defn descendant-count
  "How many descendants do I have?"
  [loc]
  (let [sub-zip (zip/xml-zip (zip/node loc))]
    (count (take-while (complement zip/end?)
                       (iterate zip/next (zip/down sub-zip))))))

(map descendant-count a-list)
; (1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
; 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
; 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)

Not the most exciting results, since all the links on the page apparently only have text content. You get the idea though. We could have counted siblings, for example, or done any other kind of calculation.

Step four : all together

Just for consistency we need one more little function to grab the href as well:

(defn extract-url
  "Get the href value from an <a> zipper. Takes a loc."
  [loc]
  (get-in (zip/node loc) [:attrs :href]))

And now we can do what we set out to do:

(defn depth-href-kidcount
  "loc-seq is a seq of <a> node zipper locations."
  [loc-seq]
  (map
   (fn [loc]
    {:depth (ancestor-count loc)
     :href (extract-url loc)
     :subnodes (descendant-count loc)})
   loc-seq))

Deep breath…

(depth-href-kidcount a-list)
; ({:depth 4, :href "index.html", :subnodes 1}
; {:depth 5, :href "index.html", :subnodes 1} {:depth 7,
; :href "#", :subnodes 1} {:depth 9, :href "index.html",
; :subnodes 1} {:depth 9, :href "branch-1.1.x/index.html", :subnodes 1}
; ...etc.)

Simple as that!

Obviously, this was a contrived example, but I hope that it shows how there are a lot of creative to mix the seq and the zipper interfaces, and that it can be a powerful way to work with your data. In my opinion, zippers aren’t really that complicated, but they do require us to adjust our thinking. Once we do, there are lots of fun things to do.

There is one inherent limit to this pattern, and that is that there is no way to systematically edit the tree once it has been iterated into a sequence. All the locs in the are snapshots of, or references to, the exact same tree. Any cumulative editing will just produce a sequence of new trees, each one modified once, with no way to combine all of the edits into the same tree.

Comments