Seqs of Clojure Zippers

Thus far, in the first two parts of what is turning into a series, when walking Clojure zippers, we’ve been mostly using a pattern like this:

(require '(clojure [zip :as zip]))
(defn recursion-for-recursion
  "Walking around doing nothing."
  [loc]
  (if (zip/end? loc)
    loc
    (recur (zip/next loc))))

Now, obviously, this function wouldn’t do anything except run the zipper out to its :end. But that has been our pattern so far.

Recursion smell

But this is Clojure, and every use of recur should make you wonder for a second whether you can accomplish the same thing using a sequence. Take a look at this function, which I ran into here:

(defn leafnodes
  "Returns all leaf nodes in loc. "
  [loc]
  (filter (complement zip/branch?) ;filter only non-branch nodes
          (take-while (complement zip/end?) ;take until the :end
                      (iterate zip/next loc))))

This could be rewritten with our familiar pattern:

(defn leafnodes-r
  "Recursive version of leafnodes."
  [loc accum]
  (if (zip/end? loc)
    accum
    (recur (zip/next loc)
           (if (zip/branch? loc)
             accum
             (conj accum loc)))))

It is interesting to see these functions side by side, or at least one above the other. (iterate zip/next loc) does the actual (lazy) recursion; (take-while (complement zip/end?)…) limits the recursion so that we stop at the end. By this time, it is pretty clear that we are getting a seq of locations that we could do basically anything with. filter is used here to take only the leaf nodes, which is what we get in the end.

Let’s try some stuff:

(def zzz (zip/vector-zip [:a [:b [:c :d] :e]]))
(map zip/node (leafnodes zzz))
; (:a :b :c :d :e)

Mission accomplished: all leaf nodes. What about the recursive version?

(map zip/node (leafnodes-r zzz []))
; (:a :b :c :d :e)
(= (leafnodes zzz) (leafnodes-r zzz []))
; true

;; just for fun
(flatten (zip/root zzz))
; (:a :b :c :d :e)

That’s fine then, it works. The powerful thing here is that we now have the full combined firepower of the seq and zipper abstractions for extracting information from a tree.
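To see that firepower in a tiny, self-contained sketch (loc-seq is a helper name introduced here, not part of clojure.zip):

```clojure
(require '[clojure.zip :as zip])

(defn loc-seq
  "Lazy seq of every location in depth-first order."
  [loc]
  (take-while (complement zip/end?) (iterate zip/next loc)))

(def zzz (zip/vector-zip [:a [:b [:c :d] :e]]))

;; any seq function now applies to the tree walk
(count (filter (complement zip/branch?) (loc-seq zzz)))
; 5
```

Once the walk is an ordinary seq of locs, counting, filtering, grouping and so on all come for free.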

Parsing HTML

Let’s have some fun with this pattern playing with everyone’s favorite form of XML. We are going to play around with extracting the links from an HTML page. Here is our namespace declaration:

(ns info.josf.examples.zipper-seq
  (:require [clojure.zip :as zip]
            [clojure.data.xml :as dx]))

I’ve actually put all these examples into a GitHub repo, which could be helpful if you want to follow along in your REPL.

First, we grab a page and stuff it into a zipper:

(defn zappy
  "grab a webpage we can trust, and parse it into a zipper"
  []
  (-> "http://richhickey.github.io/clojure/clojure.zip-api.html"
      clojure.java.io/input-stream
      dx/parse
      zip/xml-zip))

There are three simple steps here: dx/parse wants a stream, hence java.io. It parses the page, and then we pass the result on to zip/xml-zip, which is the specialized zipper-builder for XML. Even here there is the node-space/zipper-space thing going on: the first function builds the actual tree, and zip/xml-zip wraps it up as a zipper. Now let’s do something.

(defn get-links
"Returns a list of href attributes from :a nodes in an xml zipper."
  [loc]
  (map #(get-in (zip/node %) [:attrs :href])
       (filter #(= :a (:tag (zip/node %)))
               (take-while (complement zip/end?)
                           (iterate zip/next loc)))))
(get-links (zappy))
; ("index.html" "index.html" "#" "index.html"
; "branch-1.1.x/index.html" "index.html" "api-index.html"
; "#" "clojure.core-api.html".... etc.)

There is almost nothing to explain here. It is the same basic function as before, except we’ve adapted the filter predicate to work with an XML tree, and we’re using map to grab the :href value from each node.

If you’ve never dealt with XML trees in Clojure before, this might look slightly confusing. The basic idea with this kind of tree representation is that each node is a map with three keys. (clojure.data.xml actually uses records; non-text nodes are represented as Elements, but it amounts to the same thing.) It looks like this:

{:tag :a,
 :attrs {:href "http://example.com"}
 :content ("The link text")}

In our function, when we say (map #(get-in (zip/node %) [:attrs :href]) …), we are just grabbing the href out of the node’s attrs map.

This is a pretty basic task: walk a tree, grab some data, report back with a list. Any language can do this of course. Clojure zippers let you do it fairly simply, and in fact this little function seems at least as simple as something comparable in XPath/XSLT.

A contrived twist

Now let’s try something a little crazier. We’ll parse the same document, but this time we want more than just the href values: we also want to know the depth, relative to the document root, of each link, as well as how many sub-nodes there are in the <a/>. For a given link, we want something like this:

{:href "index.html" :depth 5 :subnodes 1}

Step one: just a seq

The get-links function returns a seq of URL strings, but for this we need locs, so we have to simplify the function:

(defn get-a-zips
  "Return a lazy seq of :a locations"
  [loc]
  (filter #(= :a (:tag (zip/node %)))
          (take-while (complement zip/end?)
                      (iterate zip/next loc))))
;;; Let's put the result in a var. Printing the result
;;; will eat up a lot of space in your REPL.
(def a-list (get-a-zips (zappy)))

This is the same function as before, less the mapping that extracted the URLs. It is just going to give us the bare list of locations. The rest of the work is coming up with a function to map over this seq.

Step two: ancestor count

Suddenly we realize that each element in the list of link node zippers has access to the entire tree. Even though it feels like we are pulling the tree apart, filtering out everything but the parts that we are interested in, each part is actually one of those grains of sand that contains the entire universe (so to speak).

Now, we all love Clojure’s seq interface, but there is something that can trip you up. Or at least that is what happened to me. I’ve talked a lot about the distinction between “zipper space” and “node space”, because, in my experience anyway, this is often the difference between getting it or not.

So when we then start thinking about all of this through the seq interface, it can get a little bit more confusing: we’ve got a seq, but a seq of what?

I expected to get a bunch of nodes. This is such a common pattern: walk a tree, look at the nodes, accumulate data, do something. What is really cool, and possibly confusing, about Clojure zippers, is that often you don’t get nodes, you get locations. That means that you aren’t extracting data, you are locating it in the tree. Each element of the seq has access to the entire tree. Each element that gets returned is still a zipper. (Unless it’s not, of course: functions like zip/children return nodes, not locs.)

So, given a loc, how do you count its ancestors? Well, you write a function.

(defn ancestor-count
  "How many ancestors do I have?"
  [loc]
  (count (take-while zip/up
               (iterate zip/up loc))))

This is the same pattern as before, except that we iterate using zip/up, and we keep going up until zip/up returns nil, which can only happen at the root node.

Let’s take it for a spin:

(map ancestor-count a-list)
; (4 5 7 9 9 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 11 12 12 12 12 12 12
; 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 10 10
; 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
; 10 10 4)

Looks good to me… except wait a second! We can do the same thing in fewer characters using the zipper’s path:

(defn ancestor-count
  "How many ancestors do I have?"
  [loc]
  (count (zip/path loc)))
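Since a-list requires fetching the page, here is a quick sanity check of the zip/path version on a small local tree, no network required:

```clojure
(require '[clojure.zip :as zip])

(defn ancestor-count
  "How many ancestors do I have?"
  [loc]
  (count (zip/path loc)))

;; the 3 in [1 [2 [3]]] has three ancestors: the root, [2 [3]] and [3]
(-> (zip/vector-zip [1 [2 [3]]])
    zip/down zip/right   ;; at [2 [3]]
    zip/down zip/right   ;; at [3]
    zip/down             ;; at 3
    ancestor-count)
; 3
```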

Step three: how many kids?

For simplicity’s sake, we’ll just count the total number of descendant nodes. This is one of those times when using a sub-zipper seems like a good idea. By that I mean, as I discussed in my first post on zippers, that zippers are cheap to make, and that it can be useful sometimes to turn a given node into a “sub-zipper”.

(defn descendant-count
  "How many descendants do I have?"
  [loc]
  (let [sub-zip (zip/xml-zip (zip/node loc))]
    (count (take-while (complement zip/end?)
                       (iterate zip/next (zip/down sub-zip))))))

(map descendant-count a-list)
; (1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
; 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
; 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)

Not the most exciting results, since all the links on the page apparently only have text content. You get the idea though. We could have counted siblings, for example, or done any other kind of calculation.
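For instance, a sibling counter might look like this (sibling-count is a hypothetical helper, sketched with zip/lefts and zip/rights):

```clojure
(require '[clojure.zip :as zip])

(defn sibling-count
  "How many siblings (left and right) does this loc have?"
  [loc]
  (+ (count (zip/lefts loc))
     (count (zip/rights loc))))

;; the 1 has three siblings: 2, [:a] and 3
(-> (zip/vector-zip [1 2 [:a] 3]) zip/down sibling-count)
; 3
```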

Step four : all together

Just for consistency we need one more little function to grab the href as well:

(defn extract-url
  "Get the href value from an <a> zipper. Takes a loc."
  [loc]
  (get-in (zip/node loc) [:attrs :href]))

And now we can do what we set out to do:

(defn depth-href-kidcount
  "loc-seq is a seq of <a> node zipper locations."
  [loc-seq]
  (map
   (fn [loc]
    {:depth (ancestor-count loc)
     :href (extract-url loc)
     :subnodes (descendant-count loc)})
   loc-seq))

Deep breath…

(depth-href-kidcount a-list)
; ({:depth 4, :href "index.html", :subnodes 1}
; {:depth 5, :href "index.html", :subnodes 1} {:depth 7,
; :href "#", :subnodes 1} {:depth 9, :href "index.html",
; :subnodes 1} {:depth 9, :href "branch-1.1.x/index.html", :subnodes 1}
; ...etc.)

Simple as that!

Obviously, this was a contrived example, but I hope that it shows how there are a lot of creative ways to mix the seq and the zipper interfaces, and that doing so can be a powerful way to work with your data. In my opinion, zippers aren’t really that complicated, but they do require us to adjust our thinking. Once we do, there are lots of fun things to do.

There is one inherent limit to this pattern, and that is that there is no way to systematically edit the tree once it has been iterated into a sequence. All the locs in the seq are snapshots of, or references to, the exact same tree. Any cumulative editing will just produce a sequence of new trees, each one modified once, with no way to combine all of the edits into the same tree.

Clojure Zippers: Structure Editing With Your Mind

This is a follow-up to my Getting acquainted with Clojure zippers post. That was more of an introduction; this article is going to be more about some of the difficulties we run into when using zippers to modify tree structures. We are going to try to write a function to modify a tree, and along the way we will run into lots of problems that, once we deal with them, will eventually lead to zipper enlightenment. The code here isn’t meant as recipes or a perfect example, just a way to work through some issues that come up when working with Clojure zippers.

Immutable…

The neat thing about zippers in Clojure is that they allow tree manipulation to be purely functional. Zipper manipulation functions can be pure: given the same zipper they will always return the same thing. This is true of movement functions like zip/down and zip/next, since position inside the tree is part of what we are feeding into the function: given the same tree, and the same position inside the tree, the result will always be the same. The tree itself, obviously, doesn’t change just because we’ve moved. The zipper here is an abstraction that separates the tree itself from the notion of our location in the tree, and thus allows us to deal with both explicitly.

…but editable?

What about modifying the tree, though? What happens then?

It turns out that simple edits are simple. There are several functions in clojure.zip :

  • edit

  • append-child

  • insert-child

  • insert-left

  • insert-right

  • remove

  • replace

zip/edit is interesting because its arguments are a location and a function to modify the current node. This kind of editing could go rather far, since the current node can be the top of subtree. In a simple case, though, it could look something like this:

(require '(clojure [zip :as zip]))
(def zzz (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]]))
; #'user/zzz
(-> zzz zip/down (zip/edit inc) zip/root)
; [2 [:a :b] 2 3 [40 50 60]]

The first node in the tree has been inc’ed up from 1 to 2. Editing is possible!

In an XML based tree we might have more complex editing needs. XML trees are map-based. Each node is a map with a :tag key, an :attrs key and a :content key. If we wanted to change a <ul> to an <ol> in an HTML document, the call to zip/edit would look something like this:

(zip/edit loc #(assoc % :tag :ol))
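Since zip/xml-zip only cares about the :tag/:attrs/:content shape, we can try that edit on a tiny hand-built node (the map literal below is made up for illustration, no parsing needed):

```clojure
(require '[clojure.zip :as zip])

(def ul-loc
  (zip/xml-zip {:tag :ul
                :attrs {}
                :content [{:tag :li :attrs {} :content ["one"]}]}))

(:tag (zip/node (zip/edit ul-loc #(assoc % :tag :ol))))
; :ol
```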

We’ll stick to vector trees here, in the interest of simplicity and short code snippets.

Let’s try inserting a node now. We’ll try to put a :c after the :b in the first subvector.

(-> zzz
    zip/down              ;; we are at 1
    zip/right             ;; this is the subvector
    (zip/append-child :c) ;; add as the last child
    zip/root)             ;; back to the top to appreciate our work
; [1 [:a :b :c] 2 3 [40 50 60]]

zip/append-child is a good choice here because we are appending to the end of the vector, and that is what it always does. What about prepending? This time we’ll put a :aa at the beginning.

(-> zzz
    zip/down zip/right zip/down ;; down to the :a node
    (zip/insert-left :aa)
    zip/root)
; [1 [:aa :a :b] 2 3 [40 50 60]]

insert-left: does just what it says. Let’s delete any numbers greater than 10:

(defn death-to-big-numbers
  [loc]
  (if (zip/end? loc)
    loc
    (if (or (not (integer? (zip/node loc)))
            (< (zip/node loc) 11))
      (recur (zip/next loc)) ;; base case: keep moving
      (recur (zip/next (zip/remove loc)))))) ;; evil big numbers
(zip/root (death-to-big-numbers zzz))
; [1 [:a :b] 2 3 []] ;; they're gone!

This is our standard zipper-walking format. Most of the work is just figuring out which locations need to be removed and passing them off to zip/remove.

Immutable gotchas : let’s wrap

Now let’s try something more complicated. We’re going to start over with a new vector tree.

(def z2 (zip/vector-zip [1 2 :a :b [3 4 :c :d 5] :e]))

Our goal this time is to make sure that all keyword nodes are wrapped together in their own subtrees, with adjacent keywords wrapped together. We want to be able to take a tree like z2 and produce this:

[1 2 [:a :b] [3 4 [:c :d] 5] [:e]]

(I encountered this problem in an HTML document, trying to wrap adjacent, indented <p> elements inside <blockquote>s.)

Here we go. Strategy number one. We’ll look for keywords and, if we find one, we’ll wrap it and then start appending any right siblings.

First, let’s just try wrapping each keyword individually:

Strategy 1 - a: wrap and go… down
(defn strategy-one-a [loc]
  (if (zip/end? loc)
    loc
    (if (keyword? (zip/node loc))
      (recur (zip/next (zip/edit loc #(vector %))))
      (recur (zip/next loc)))))
;; wait, don't do this: infinite loop

What? Why is this infinite? Well, when we edit the node, we create a new node, [:a], and then we say zip/next. Except that the next node in this case is down… to that :a again! Which we wrap again, and then go find again and again and again. next is relative to the new zipper, not the old one from before the edit. This is one of those immutable/pure-function mind warp things: you want to be able to grab the new state and hold on to the old one too. But here, the old one, where zip/next was going to logically lead to the right sibling, is no longer available to us. Or, to be more precise, it is still available, but it is now a parallel universe where the edit we want to do never really happened. We can’t have the new edit and the old position at the same time, because the old position only exists in a world where the edit never happened.

Back to reality: let’s introduce some logic to make sure we don’t try to wrap keywords that are already inside vectors containing only keywords. In other words, if we run into [:x :y :z] we leave it alone.

First, we write a little function that looks up at a keyword’s parent to see if all the siblings are already keywords:

(defn parent-ok? [loc]
  (when (zip/up loc)
    (every? keyword? (zip/children (zip/up loc)))))

(Note here that zip/children returns nodes, not locs, so we can feed the results directly to keyword? without having to call zip/node.)
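A quick check of parent-ok? on two small throwaway trees:

```clojure
(require '[clojure.zip :as zip])

(defn parent-ok? [loc]
  (when (zip/up loc)
    (every? keyword? (zip/children (zip/up loc)))))

;; the :x below is already inside an all-keyword vector...
(parent-ok? (-> (zip/vector-zip [1 [:x :y]]) zip/down zip/right zip/down))
; true

;; ...but a top-level keyword sitting next to a number is not
(parent-ok? (-> (zip/vector-zip [1 :x]) zip/down zip/right))
; false
```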

And then we plug it into our previous tree walking function:

Strategy 1 b
(defn strategy-one-b [loc]
  (if (zip/end? loc)
    loc
    (if (and (keyword? (zip/node loc))
             (not (parent-ok? loc)))
      (recur (zip/next (zip/edit loc #(vector %))))
      (recur (zip/next loc)))))

(strategy-one-b z2)
; [[1 2 [:a] [:b] [3 4 [:c] [:d] 5] [:e]] :end] ;; here we see the ended-out zipper

Good: no infinite recursion, and all of our keywords are wrapped, but individually, not grouped together like we want.

Let’s try grabbing them from the left-most keyword. As soon as we hit a keyword, we’ll try to gobble up any keywords to the right before moving on.

Strategy 1 c (starting to get ugly)
(defn strategy-one-c [loc]
  (if (zip/end? loc)
    loc
    (if (and (keyword? (zip/node loc))
             (not (parent-ok? loc))) ;; up to here, same as before
      (let [right-keys (take-while #(keyword? %) (zip/rights loc))]
        (recur (zip/next
                (reduce (fn [z n]
                          (zip/append-child z n))
                        (zip/edit loc #(vector %))
                        right-keys))))
      (recur (zip/next loc)))))

The basic structure is the same as before, except that now we try to incorporate keyword nodes to the right of the locations that need to be dealt with (ie. keyword nodes that are not in a parent node containing only keyword children, as established by the parent-ok? function we wrote earlier). Conveniently, the zip library provides zip/rights, which gives us a sequence of nodes (not locs, mind you) that we grab with take-while.
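A minimal illustration of zip/rights on its own (throwaway tree, just for the demo):

```clojure
(require '[clojure.zip :as zip])

;; zip/rights hands back the sibling *nodes*, ready for plain seq functions:
(-> (zip/vector-zip [1 :a :b :c 2])
    zip/down   ;; at 1
    zip/right  ;; at :a
    zip/rights)
; (:b :c 2)

;; ...which is why take-while can trim it to just the keywords:
(take-while keyword? '(:b :c 2))
; (:b :c)
```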

So far nothing extraordinary.

If we do have interesting siblings, we need to bring them into the node we are going to create. That is what we do with reduce here, and it is probably worth breaking down the logic there.

(reduce (fn [z n]
          (zip/append-child z n))
        (zip/edit loc #(vector %))
        right-keys)

This is the three argument form of reduce, where the second argument provides an initial value. Here our starting value is going to be the original keyword node, edited to be wrapped inside a new vector. Then, for each of the keyword siblings to the right, we are going to keep calling zip/append-child.

It would be the equivalent of this, if we could know ahead of time that right-keys contained :a, :b and :c:

(zip/append-child (zip/append-child (zip/append-child (zip/edit loc #(vector %)) :a) :b) :c)

So, we gather up some more courage and we run the thing:

(strategy-one-c z2)
; [[1 2 [:a :b] [:b] [3 4 [:c :d] [:d] 5] [:e]] :end]

Well, this kind of works, in that we are getting our keywords grouped together. Obviously, we haven’t done anything to remove the nodes that we’ve copied into our aggregating nodes. And look what happens if we have more successive keyword nodes in our input:

(strategy-one-c (zip/vector-zip [1 2 :a :b :c :d :e 3]))
; [[1 2 [:a :b :c :d :e] [:b :c :d :e] [:c :d :e] [:d :e] [:e] 3] :end]

Oh dear. We can’t just leave those siblings out there, because they end up getting siblingized themselves. Before we send our big reduced, zip/appended, zip/edited zipper back to zip/next, we need to prune the original keyword siblings from the zipper. Okay, one last try.

Works but… (Strategy 1 d)
(defn strategy-one-d [loc]
  (if (zip/end? loc)
    loc
    (if (and (keyword? (zip/node loc))
             (not (parent-ok? loc)))
      (let [right-keys (take-while #(keyword? %) (zip/rights loc))
            loc-no-rights (reduce (fn [z _] (zip/remove (zip/right z))) loc right-keys)
            loc-w-incorporated-keys (reduce (fn [z n]
                                              (zip/append-child z n))
                                            (zip/edit loc-no-rights #(vector %))
                                            right-keys)]
        (recur (zip/next loc-w-incorporated-keys)))
      (recur (zip/next loc)))))

I’ve shifted things around a little here for clarity, but it really isn’t too different from the previous rendition. Inside the let, loc-w-incorporated-keys is the result of the same reduce call. The difference here is that before building our final zipper, we remove the siblings.

zip/remove, after blowing away the node, automatically moves to whatever the previous location would be, as determined by calling zip/prev. This could be a problem if we didn’t know that we were working on right siblings, because if we removed the leftmost node, we would end up at the parent node suddenly, which could confuse things considerably. That won’t be a problem here though. This behaviour also allows us to use the combination of zip/right and zip/remove for as many times as we like, because we just keep removing the next node over each time.
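Here is that landing behaviour in miniature (a sketch on a throwaway tree):

```clojure
(require '[clojure.zip :as zip])

;; zip/remove lands on the previous location, here the left sibling:
(let [b-loc (-> (zip/vector-zip [1 [:a :b] 2])
                zip/down zip/right   ;; at [:a :b]
                zip/down zip/right)] ;; at :b
  [(zip/node (zip/remove b-loc))  ;; where we land after removing :b
   (zip/root (zip/remove b-loc))]) ;; the tree afterwards
; [:a [1 [:a] 2]]
```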

loc-no-rights is the intermediary version of the zipper. And because right-keys is holding nodes and not zipper locations, we can come back to them and append them before moving on with zip/next.

Deep breath…

(strategy-one-d z2)
; [[1 2 [:a :b] [3 4 [:c :d] 5] [:e]] :end]
(strategy-one-d (zip/vector-zip [1 2 :a :b :c :d :e 3]))
; [[1 2 [:a :b :c :d :e] 3] :end]

I confess that at this point I’m just glad to have something that works. I think, however, that you’ll agree that this is just too complicated, too messy and too painful for what should be a relatively simple operation. Going through all these contortions was meant to show some of the problems that crop up when learning about zippers, and how we have to bend our thinking a little (or a lot) when using them. Now let’s try to do this right…

Strategy 2

This time we are going to do everything by editing parent nodes. Most of what we do will be in what I’ve been calling “node space” and should be simpler. Each time we hit a branch node, we will look at its children, wrap any keyword nodes and then keep moving.

The first step is to write a generic wrapping function. We want it to take a list of things and group together adjacent keywords. Here is what I came up with:

(defn wrap-keywords [lst]
  (into []
        (mapcat
         #(if (keyword? (first %))
            (list (apply vector %))
            %)
         (partition-by keyword? lst))))
(wrap-keywords [:a :b :c 1 2 [:d :e] 3 4 :f])
; [[:a :b :c] 1 2 [:d :e] 3 4 [:f]]

The trick here is that partition-by just breaks up the initial sequence into segments depending on how the predicate reacts to each item. With the sequence in the example, it produces this:

(partition-by keyword?  [:a :b :c 1 2 [:d :e] 3 4 :f])
; ((:a :b :c) (1 2 [:d :e] 3 4) (:f))

The rest of our function, the mapcat part, just turns the keyword segments into vectors (apply vector), wraps each vector in a list to put it back on the same level as its siblings, and then concats the whole thing back together, essentially erasing the lists but leaving the vectors. And finally, we use into to box everything up in a vector.
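We can watch the mapcat step work by feeding it a pre-partitioned sequence by hand:

```clojure
;; keyword segments become vectors (wrapped in a list to stay at the same
;; level); non-keyword segments pass through and get concatenated back:
(mapcat #(if (keyword? (first %))
           (list (apply vector %))
           %)
        '((:a :b :c) (1 2 [:d :e] 3 4) (:f)))
; ([:a :b :c] 1 2 [:d :e] 3 4 [:f])
```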

(Note that we can use vectors here because we are using vector trees. In more serious code it would be wiser to use zip/make-node.)

Now it’s time to try this out on a tree.

Strategy 2: go after the parents
(defn strategy-two [loc]
  (if (zip/end? loc)
    loc
    (if (and (zip/branch? loc)
             (some keyword? (zip/children loc))
             (not (every? keyword? (zip/children loc))))
      (recur (zip/next (zip/edit loc wrap-keywords)))
      (recur (zip/next loc)))))

And…

(strategy-two z2)
; [[1 2 [:a :b] [3 4 [:c :d] 5] [:e]] :end]
(strategy-two zzz)
; [[1 [:a :b] 2 3 [40 50 60]] :end]

In the interest of full disclosure, I messed up the first time through this and fell into the same infinite recursion trap as before. We still need to make sure that the keyword nodes aren’t already wrapped, since otherwise we’ll end up wrapping them forever. That’s what the (not (every? keyword?…)) stuff is all about.

The rest should be pretty self-explanatory by now. We check to see if we are on a parent node, using zip/branch?. We peek at the node’s children to see if any of them are keywords, and then, when the planets are correctly aligned, we edit the node as though it were a simple seq, because zip/edit modifies the node, not the zipper.
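One more tiny illustration of that last point, zip/edit handing our function the bare node:

```clojure
(require '[clojure.zip :as zip])

;; the function we pass to zip/edit sees a plain vector, not a loc:
(-> (zip/vector-zip [1 [2 3]])
    zip/down zip/right        ;; at [2 3]
    (zip/edit #(mapv inc %))  ;; a node-space edit
    zip/root)
; [1 [3 4]]
```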

This, in my opinion, is one of the tricky parts to zippers: knowing when to be in zipper-space or in node-space, knowing how to leverage one or the other depending on what you need to do. At the same time, having these two options, and being able to quickly switch from one to the other, is what gives Clojure zippers all their power.

Getting Acquainted With Clojure Zippers

Zippers, in Clojure, are one of those things you end up running into that can be rather confusing. A little googling of the topic seems to reveal a universe divided between people who understand zippers perfectly and those who don’t get them at all. Zippers are a core library, so they get mentioned in the Clojure books, but they are also a special case that would take several pages to really explain in detail.

Most of all, they are, for a lot of us, slightly strange, a solution to the problem of “traversing” a tree structure without any implied state. In this post, I’m going to try to help those who don’t immediately understand zippers to at least avoid some of the subtleties that tripped me up.

Why “zipper”?

“I can’t use zip files because I don’t have a Zip drive”. Why does everything have to be called “zip”? What is a zipper, anyway?

With their usual terseness, the docs simply say:

Functional hierarchical zipper, with navigation, editing, and enumeration. See Huet
API for clojure.zip
— Rich Hickey

Huet turns out to be Gérard Huet, who published Functional pearl: the Zipper in 1997. Huet insists on how simple the concept is, and almost asks forgiveness for writing a paper about something that really should have been documented long ago:

This simple idea must have been invented on numerous occasions by creative programmers, and the only justification for presenting what ought to be folklore is that it does not appear to have been published, or even well-known.
Functional Pearl: the Zipper
— Gérard Huet

Huet is unsurprisingly one of those people who just “gets it”. Here is how he explains the Zipper:

The basic idea is simple: the tree is turned inside-out like a returned glove, pointers from the root to the current position being reversed in a path structure. The current location holds both the downward current subtree and the upward path.

Now maybe this kind of spatial and logical torsion is intuitive for you. Reading this now, I actually do understand what he means, and it might be “simple” in a Rich Hickey sense, as in “uncomplected”, but it certainly isn’t that easy to understand.

A few basic things

So here are a few things I would have liked to have understood to begin with, and that I didn’t fully grasp before going through lots of trial and error and frustration.

Zipper vs. tree vs. node

We know what a tree is. In Clojure we would represent it as something like this, with nested vectors for example:

[1 [:a :b] 2 3 [40 50 60]]

This is just a tree, not a zipper, because a zipper is not a tree: it is a tree plus your position in the tree. Or to be more precise: a tree seen from a particular location in the tree.

So to make a zipper out of that tree, we do this:

user> (require '(clojure [zip :as zip]))
nil
user> (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]])
[[1 [:a :b] 2 3 [40 50 60]] nil]

Notice how the result is just the same tree wrapped in a two-element vector? No big deal, right? Can it be that simple?

Well… this is Huet’s glove, but still right side out. If you want to see the lining, just do this:

user> (->  [1 [:a :b] 2 3 [40 50 60]]
	   zip/vector-zip
	   zip/down
	   zip/right)
[[:a :b] {:l [1], :pnodes [[1 [:a :b] 2 3 [40 50 60]]], :ppath nil, :r (2 3 [40 50 60])}]

There’s the lining, or at least the beginning of it. All that extra stuff is the zipper looking back up toward the root of the tree, to the left and to the right. Moving around the tree means calling a function like zip/down on the current zipper, and getting back a new zipper with some of the nodes shifted around to represent your new position in the tree. This is functional programming. We are always creating new values, not modifying old values. Position in a tree is typically a form of implicit state: we just are somewhere. Here our location becomes explicit, and the zipper infrastructure makes it so we don’t have to think too much about how all of this happens. We don’t ever need to know about :pnodes or :ppath, we just deal with the zipper interface. (Imagine having to do this “by hand”, associng the left nodes over to the right nodes, dissocing them from the left, etc.)
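We never have to touch :pnodes directly, though: functions like zip/lefts, zip/rights and zip/path give us the same information through the interface. A sketch:

```clojure
(require '[clojure.zip :as zip])

(let [loc (-> (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]])
              zip/down zip/right)] ;; at [:a :b]
  {:lefts  (zip/lefts loc)         ;; siblings to the left
   :rights (zip/rights loc)        ;; siblings to the right
   :depth  (count (zip/path loc))}) ;; ancestors back up to the root
; {:lefts (1), :rights (2 3 [40 50 60]), :depth 1}
```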

Two forms of movement

While we are talking about movement, it’s worth pointing out that there are two ways of moving through the zipper. The first way is what we’ve just seen: zip/down, zip/up, zip/left, zip/right and functions like zip/root that take us all the way back to the top.

The second way of getting around the tree, zipper-style, is to walk, as in tree-walk. One of the added benefits of using zippers is that you get, for free, a depth-first tree traversal, just by calling zip/next as many times as you need.

Going back to our example tree, with zip/next you can do stuff like this:

;; let's make a var to be easier to read.
 user> (def zzz (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]]))
#'user/zzz
user> (-> zzz zip/next zip/node)
1
user> (-> zzz zip/next zip/next zip/node)
[:a :b]
user> (-> zzz zip/next zip/next zip/next zip/node)
:a
user> (-> zzz zip/next zip/next zip/next zip/next zip/node)
:b
user> (-> zzz zip/next zip/next zip/next zip/next zip/next zip/node)
2
(Notice that we call zip/node at the end each time, so that we get the value and not the whole zipper. More on that in a second.)

Each time we add a call to zip/next, we get one more node. Without even trying very hard, we get a complete walk of the tree just by calling zip/next a bunch of times. The function just “knows” where to go to hit each node in the tree exactly once.
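One way to convince yourself that zip/next visits each node exactly once is to count the steps before the end (a sketch that leans on seq functions):

```clojure
(require '[clojure.zip :as zip])

(def zzz (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]]))

;; one location per node: the root, two subvectors, and eight leaves
(count (take-while (complement zip/end?) (iterate zip/next zzz)))
; 11
```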

End games

Now, when walking a zipper, like in any recursive situation, we wouldn’t want to be in an infinite loop. zip/next needs to know when to quit. Which is why the Clojure zip library provides an end? function that tells us if we are there yet.

user> (defn fast-forward [loc]
	(if (zip/end? loc)
	  (zip/node loc)
	  (recur (zip/next loc))))
#'user/fast-forward
user> (fast-forward zzz)
[1 [:a :b] 2 3 [40 50 60]]

Here we’ve zipped all the way to the end, and we end up with the whole tree, that is, the root node.

Check out this slightly different version of our fast-forward function:

user> (defn fast-forward-z [loc]
	(if (zip/end? loc)
	  loc
	  (recur (zip/next loc))))

#'user/fast-forward-z
user> (fast-forward-z zzz)
[[1 [:a :b] 2 3 [40 50 60]] :end]

This time we get the zipper, and not just the node as in the previous version. And we have this strange little :end keyword as the second part of the zipper vector. The zipper has been “consumed”. We can’t walk anymore. zip/next will just keep returning the same thing, and zip/prev will just return nil. The glove has been turned so far inside out that it can’t go any more (yet, strangely, it looks right side out again).

So not only does the zipper represent our position in the tree, it also “knows” where we’ve been, or at least whether we can keep walking or not.
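
A quick sanity check of that behavior, with the fast-forward-z function repeated so the snippet is self-contained:

```clojure
(require '[clojure.zip :as zip])

(def zzz (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]]))

(defn fast-forward-z [loc]
  (if (zip/end? loc)
    loc
    (recur (zip/next loc))))

(def done (fast-forward-z zzz))

(zip/end? done)          ;; => true
(= done (zip/next done)) ;; => true: zip/next keeps returning the same loc
(zip/prev done)          ;; => nil: we can't walk back
```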

Nodes and locs

So what about nodes and this zip/node function? It is actually fairly simple: a node is a tree, a subtree, a branch or a leaf. The node is your data. But it doesn’t know what its parents or siblings are, or how far it is from the root.

On the other hand, a location (usually pronounced loc) contains a node (the node is the first element in the zipper vector), but a location is actually the entire zipper, zipped to a certain point. Sometimes you need the node, and sometimes you need the loc. Somehow, figuring out when to look at a node and when to look at a location was one of the major conceptual hurdles that I encountered.

What tripped me up, I think, looking back, was that zippers require you to go back and forth between the zipper view of the world (that is: the whole tree seen from a particular position) and a node view of the world, where we are back in a good old tree structure looking at individual values.

Say we wanted to find the first keyword (in preorder depth-first traversal) in our tired zzz vector. We could do this:

user> (defn first-keyword [loc]
	(if (zip/end? loc)
	  nil
	  (if (keyword? (zip/node loc))
	    (zip/node loc)
	    (recur (zip/next loc)))))
#'user/first-keyword
user> (first-keyword zzz)
:a

The “keywordiness” of a particular node is distinct from its location in the tree. To see if the current node is a keyword, we have to use the zip/node function to slip into nodespace. If, on the other hand, we want to know whether our node has a right or left sibling, for example, we need to be in zipper space. If we want to find nodes of a certain kind that have parents or children of another kind, our tests have to dance between nodes and zippers. This is the kind of thing that seems completely obvious once you understand it, but that, at least for me, caused plenty of head-scratching early on.
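
As a tiny illustration of that dance, here is a hypothetical keywords-with-right function (not from the post above): it tests the node in nodespace with keyword?, and the location in zipper space with zip/right:

```clojure
(require '[clojure.zip :as zip])

(def zzz (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]]))

(defn keywords-with-right
  "Collects keywords that have a sibling to their right."
  [loc accum]
  (if (zip/end? loc)
    accum
    (recur (zip/next loc)
           (if (and (keyword? (zip/node loc)) ; nodespace: is the value a keyword?
                    (zip/right loc))          ; zipper space: is there a right sibling?
             (conj accum (zip/node loc))
             accum))))

(keywords-with-right zzz [])
;; => [:a]  (:a has :b to its right; :b is the last child of its branch)
```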

Zippers are cheap

One last thing that took a while to sink in: zippers are cheap. Huet insists on how zippers are supposed to be fast since, in a very clojurey way, only the differences in path and position need to be taken into account, leaving most of the original data intact.

In a more concrete way, just looking at a zipper, we see that there isn’t much to them. Here is your data:

user> (def your-data [:a :b :c])
#'user/your-data

Here is your data in a zipper:

[[:a :b :c] nil]

It’s just a two-item vector. Well, almost: there is some metadata attached, so it doesn’t work to just stuff some data in a raw vector followed by nil. You have to use functions like vector-zip or xml-zip to create them.
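
We can see this by peeking at the metadata. In the current clojure.zip implementation (an assumption worth checking against your version), the branch?/children/make-node functions travel along in :zip/-prefixed metadata keys:

```clojure
(require '[clojure.zip :as zip])

(def z (zip/vector-zip [:a :b :c]))

;; the zipper itself really is just a two-item vector
z
;; => [[:a :b :c] nil]

;; but the functions the zip library needs ride along as metadata
(set (keys (meta z)))
;; => #{:zip/branch? :zip/children :zip/make-node}

;; a raw vector stuffed next to nil has no such metadata
(meta [[:a :b :c] nil])
;; => nil
```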

But still, the data structure is very lightweight, especially when the zipper is at the root position of the tree. As a result, you can be creative with when you decide to break out a zipper. Suppose for instance that you know you only want to walk one subtree within a larger tree. You need a zipper to navigate down to the top of the subtree that interests you. Once you’ve found it, rather than figuring out how to constrain your search to just that subtree, it is simpler to just take the node containing the subtree and plunk it into a new zipper. Then you can zip/next to your heart’s content, knowing that you’ll never end up in some weird place you didn’t expect.
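
A sketch of that move, using vector-zip and our usual vector (the navigation path here is just for illustration):

```clojure
(require '[clojure.zip :as zip])

(def big (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]]))

;; navigate down to the subtree that interests us
(def sub-loc (-> big zip/down zip/right zip/right zip/right zip/right))

(zip/node sub-loc)
;; => [40 50 60]

;; plunk the node into a brand new zipper: walking it with zip/next
;; can never wander back out into the rest of the original tree
(def sub-zip (zip/vector-zip (zip/node sub-loc)))

(zip/node sub-zip)
;; => [40 50 60]
```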

I’ll talk about “editing” zippers some other time, but I recently ran into a problem where I wanted to walk a tree several times with different functions each time. I couldn’t just hold onto a reference to the original zipper because I was modifying it as I went. I needed to be able to chain my zipper-walking functions together in a sequence. The problem was that after each walk, I was at the :end. My functions worked individually but couldn’t be chained together. The original zipper had been walked too far and just couldn’t take it anymore. What to do? hiredman helped me out on IRC with this: just make a new zipper. In my case that meant just calling zip/xml-zip again on the exhausted zipper.
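
With a vector zipper (the original problem used zip/xml-zip, but the idea is the same), the “just make a new zipper” trick looks like this:

```clojure
(require '[clojure.zip :as zip])

(defn run-to-end [loc]
  (if (zip/end? loc)
    loc
    (recur (zip/next loc))))

(def exhausted (run-to-end (zip/vector-zip [1 [:a :b] 2])))

(zip/end? exhausted)
;; => true

;; zip/root still works on an exhausted zipper, so we can rewrap it
(def fresh (zip/vector-zip (zip/root exhausted)))

(zip/end? fresh)
;; => false, ready for another walk
```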

There are some more gotchas that I’ve run into with zippers, especially when trying to edit trees, so I’ll take that on later. My feeling about zippers is that they are an extremely powerful tool that can be used in lots of unusual situations, not just parsing XML, but anywhere you have a tree that starts to get a little too complicated.

Trying Asciidoc

So about 50% of the posts here thus far have been about getting this blog going. This one is all about keeping that number up.

Two events have changed things. The most dramatic is the failure of the fan on my old Thinkpad X61. I am still trying to figure out what to do, but I am probably going to move on to a new model rather than trying to keep the old machine running. It has a few other issues and may not be worth repairing. No big deal really… except that most of my Octopress stuff was set up there.

The second “event”, so to speak, was discovering the existence of Asciidoc, while trying to build the PDF for Luke Vanderhart and Ryan Neufeld’s Clojure Cookbook. The idea of an improved kind of markdown immediately appealed to me. DocBook and LaTeX outputs seem like worthwhile goals. The purist in me always wants to be using the superior tool.

So, while setting up a new laptop environment on this tiny little notebook that I will be using for the time being, I thought: “let’s get asciidoc set up with Octopress”. There is a jekyll-asciidoc plugin, so away we go.

If you are reading this, then it works. If you aren’t, well, guess what.

How Clojure Helped Me Write Cooler PHP

Over the years, I’ve actually written quite a lot of code in PHP. I never chose PHP, I would never choose PHP except possibly when a client needs a CMS. Even now, look, no Wordpress!

But fairly often, I end up having to use PHP. I could go on and on about why I don’t like using PHP. But it is here and sometimes you can’t escape. That isn’t a reason to turn off your brain though.

One of the things that is great about learning new languages whose basic paradigms are different from what you use everyday is that you end up thinking differently. So sometimes you force even, yes, PHP to be more like the language you like.

So here is a little situation that I would have approached much differently before “getting” Clojure.

I had a fairly large “Array”, actually a map or a hash, with about 1200 words that mapped to other words. I suddenly needed to do reverse lookup on the same batch of words. I started going through the myriad “array functions” in the PHP docs and saw that there is array_flip:

array_flip() returns an array in flip order, i.e. keys from array become values and values from array become keys.

Perfect, I thought for a few seconds… But what about duplicates? Well… not really PHP’s fault for this kind of function, but:

If a value has several occurrences, the latest key will be used as its value, and all others will be lost.

And I had duplicates. And I needed to put them into arrays so that some words could map to all of their corresponding words.

A few years ago, at this point I would have just started writing a loop. But, because of Clojure, I started thinking that there should be a way to describe the kind of data I wanted to end up with, or at least to compose some transformations on the original map. Anyhow, this is what I ended up with:

// $this->dict is our original dictionary
$flipped = array_flip($this->dict);

// The second flip gives us something like a copy of the original
// dictionary, but without any duplicates
$nodupes = array_flip($flipped);

//  and now we compare the two  to see which keys are missing
$lostInTrad =  array_diff_key($this->dict, $nodupes);

// now we plug the missing values back into the flipped dictionary
// as arrays of multiple reverse translations
foreach ($lostInTrad as $key => $val) {
    if (! is_array($flipped[$val])) {
        $flipped[$val] = array($flipped[$val]);
    }
    $flipped[$val][] = $key;
}
return $flipped;

So, there is a loop at the end, but only over the duplicates, which in my case was a fairly small subset of the original batch of words. If I really wanted to pursue this, I would do an array merge with a callback.

How would I do the same thing in Clojure?

It was actually harder than I thought it would be. Here is what I ended up with:

(def dict { "word"        "wordA"
            "wordy"       "wordB"
            "wordy-word"  "wordA"
            "wordy-wordy" "wordC"})
(into {}
    (map (fn [[x y]] [x (map first y)]) 
        (group-by second dict)))

group-by is a pretty cool function that I had never used before. Once you’ve got that, most of the complexity is mangling the results back into the form we need. So not as clean as I expected but still fairly elegant because most of the hard work, the matching of duplicates, is done by a standard function.
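
For comparison, here is another sketch (not from the original post) that builds the reverse dictionary directly with reduce, using fnil to supply an empty vector the first time a value is seen:

```clojure
(def dict {"word"        "wordA"
           "wordy"       "wordB"
           "wordy-word"  "wordA"
           "wordy-wordy" "wordC"})

(defn reverse-dict
  "Maps each value in d to the vector of keys that point to it."
  [d]
  (reduce (fn [m [k v]]
            (update m v (fnil conj []) k))
          {}
          d))

(reverse-dict dict)
;; => {"wordA" ["word" "wordy-word"], "wordB" ["wordy"], "wordC" ["wordy-wordy"]}
```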

My Virtual Workflow

Workflow and tooling in general is something that interests me a lot. As an independent one-man-show, tooling is an important question, mostly because I can choose exactly how I work, and because I have to choose exactly how I work. Neal Ford’s book, The Productive Programmer, which I read four or five years ago, also helped get me thinking harder about being efficient.

For my development workflow, I’ve become a pretty big fan of virtualization. For most of my projects I have a VMWare image that I can spin up on my desktop and use as a staging server. There are a lot of advantages to this. My primary motivation when I decided to go this route was to be able to separate Apache configuration files for different projects. This was a few years ago, but the Vagrant catch-phrase, “works on my box”, sums it up pretty well. You don’t want to get a couple weeks, or months, into a project before realizing that Project A was depending on a tweak that you did for Project B.

And that is definitely a worthy reason to go this route. But there are other advantages that I didn’t anticipate, so here they are:

  1. Version control This is pretty basic, but if you are livehacking your files on your production server, you are doing something wrong. At the same time you don’t want to have your production files be a Git repository. You need to separate things. A staging server does this.

  2. Automation When I put local virtual servers into my workflow, I had already started using Ant files to build and deploy my projects. Using a local staging server, instead of hacking directly on local files, means that between every edit, when I want to see or test what I’ve done, I have to go into a terminal and run my ant command again. This is a little tedious (and I should really set up a key in Emacs that calls ant so that I don’t have to move around), but there is, in my view, a huge payoff: my deployment automation is constantly being tested. Since most of the build process is identical whether I’m deploying to the staging server or the production server, when I finally type ant deploy-live, I am rather confident that things will happen just like on the staging server.

    What I’m trying to say, I guess, is that you gain because the entire process is homogeneous. Local and distant are the same; you really have one process and two or more targets.

  3. Branching Back when Git (or Hg or Bzr) vs. Subversion flamefests were rampaging across programming forums, one of the main points people made about Git was that “branching is cheap”. Or easy. Easy, cheap branching makes our lives easier because we can just try something that might be a bad idea — but you never know until you start typing — with the certainty that if we mess up we can quickly and painlessly go back to a known point and start out again with a clean slate. This possibility removes stress from the process.

    Being able to spin up multiple versions of your project, almost as easily as a Git branch, extends the same idea. “Hmmm, I’m not sure that will work, but let’s try it. I’ll just clone this machine and give it a whirl.” I had a situation like this where I had several versions of a CMS running on different instances, trying to figure out some bug that didn’t make sense.

But more than any particular technical benefit, the important thing about giving yourself quick access to virtual machines is how it changes the way you think about machines. I remember my early days, before all of this, when changes to the server were risky and stressful, and resulted in crises and downtime. I felt like I was painting a watercolor, or carving marble: one false move and the whole thing is ruined. With virtualization in your life, you don’t think about servers the same way. They are just containers, and you are used to setting them up and tearing them down. And because all of this is part of your day-to-day dev life, you are really well prepared when something bad does happen, and you have to move your servers or make some deep, scary change. Virtualization desacralizes the server, and that is good.

My Road to Clojure, Part 1

Like a surprisingly large portion of the people who program every day, I am not a formally trained programmer. I was a French major and ended up with a doctorate in French literature from the Université Sorbonne Nouvelle-Paris 3. I happened to have some trouble with footnotes in Word and ended up deciding to write my dissertation in LaTeX, then realized that Emacs with AucTeX was the best way to write LaTeX, and then that Linux was the best platform for running both LaTeX and Emacs. (Yes, I started with MikTeX and WinEdt on Windows 95, then moved up to ntemacs.)

I had a Makefile to compile my thesis, and Perl scripts (and BibTeX of course) to generate my bibliography. By then it was too late: my mind had been sucked into the machinery.

My first language was Perl. Everyone likes to make fun of Perl, line noise jokes, etc., but Perl has a lot of strengths. Remember that Ruby is called Ruby because Perl was called… “pearl”. Despite its flaws, Perl — at least Perl with use strict; and regular reading of the Perl Monks — remains vastly superior, architecturally and structurally, to the language that succeeded it, the unfortunate PHP. You can do a lot of gnarly things in Perl, but you can also do some clean functional programming.

Mark Dominus’ Higher Order Perl was a huge eye-opener for me. It is really an amazing book, especially for someone like me. Ironically, that book was my first step away from Perl. Because after being exposed to all those functional techniques and abstractions, you start to say: “why not do that in a language that was made for functional programing?” This is where Common Lisp comes in.

As an autodidact, I felt I needed to learn something more substantive, that I was missing some CS power, and that Perl was not a serious enough language. I really didn’t want to learn C, but I did dream of compiling stuff… Python and Ruby were just getting started, and seemed too close to Perl to me, especially in the way their non-compiled runtimes work.

A few key Paul Graham articles and I was on my way.