It was time for me to figure out a new backup system. To date, I’ve tried:

System                 Result

No backups
tar with tape drive    Could never quite get used to tapes
rsync + hardlinks      Worked well [1]
Bacula                 Configuration became an issue
Time Machine           Filesystem corruption

There are three boxes in the house that need to be backed up, two running Debian and one Macbook. This time around I wanted a solution that might be able to back data up to something “cloud”, so I started looking around. I was about to install Duplicity when I stumbled onto a completely different kind of tool, called ‘Ugarit’.

What is Ugarit?

Ugarit is an outlier among backup systems that adopts a rather novel approach to backups. It appears to be almost entirely unknown outside of the Chicken Scheme world. There is very little information about it out there, and I almost passed it by because it seemed so obscure. Well, it ‘is’ obscure really. There is no chance of finding people on Stack Overflow who have already asked the same question you want to ask. I was about to move on to the next option in my list of backup software, but the utter coolness of Ugarit got to me.

The coolness of Ugarit is its conceptual clarity. The basic pattern seems perfectly adapted to the problem domain, in such a way that many “features” are just the natural consequences of the initial conceptual choices.

Did you say obscure?

From a sort of aesthetic point of view, Ugarit is intriguing. It seems to be the work of someone (smart) who wants to do things ‘their’ way.

  • It is written in Chicken Scheme. This alone could be frightening for some. I’ve never programmed in Scheme, but the fact that this was written in Lisp was certainly not going to scare me away. Chicken Scheme compiles to C and runs on all the major OS’s, so this is actually a fairly nice choice of platforms.

  • The licence appears to be very “Free as in speech”, but is neither the GPL, nor the BSD nor any other known or named licence.

  • The author uses a version control system I had never heard of, Fossil, which is apparently Git-like but with added features (bug-tracking and wiki are built in).

But…​ the documentation is fairly clear and well written, and provides enough details to get going.

Content Addressed Storage

Ugarit is a backup system based on the notion of Content Addressed Storage. I won’t try to give a deep, technical explanation. The project documentation gives a pretty good overview, and I’ll come back to that. Basically, though, what happens with Content Addressed Storage is that each of your files gets hashed and then stored at a location based on its hash. If your file hashes to “xyz123”, you would find it in the “123” subfolder of the “xyz” directory. The win for backups is that if a file doesn’t change, its hash doesn’t change either, which means there is no longer any point in storing multiple copies of the same file every month, or however often you happen to decide to do a full backup.
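To make the idea concrete, here is a rough Clojure sketch of hashing content and deriving a storage path from the hash. Ugarit itself is written in Chicken Scheme and certainly does things differently; this only illustrates the content-addressed idea.

(import 'java.security.MessageDigest)

(defn content-hash
  "Hex SHA-1 digest of some content."
  [^bytes content]
  (let [digest (.digest (MessageDigest/getInstance "SHA-1") content)]
    (apply str (map #(format "%02x" %) digest))))

(defn storage-path
  "Derive a vault-style path from the content hash. Identical content
   always lands at the same path, so it is only ever stored once."
  [content]
  (let [h (content-hash content)]
    (str (subs h 0 2) "/" (subs h 2 4) "/" h)))

;; (storage-path (.getBytes "hello"))
;; => "aa/f4/aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d"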

Storing multiple full copies is one of the problems with the traditional backup plan, where full backups need to happen every so often because of the inherent fragility of chaining together too many incremental backups. As the Ugarit docs point out, most of our current backup concepts come from the world of magnetic tapes. Content Addressed Storage takes advantage of the possibilities offered by the low cost and easy access of live, spinning disks.

Another advantage of the Content Addressed Storage pattern is that duplicate data has the same hash. In other words, two (or more) identical files at different places in the filesystem hash to the same thing and so get stored in exactly the same place. If multiple hosts are being backed up to the same store, they all benefit from this. If the same config file is present on 10 different machines, Ugarit only stores it once.

The difference between this kind of backup and traditional backups is similar to the difference between Git and Subversion (or Mercurial and CVS or whatever). In Git, there is no privileged “complete file set” that gets re-copied from time to time: everything is stored as content-addressed objects, and the system knows how to put your data back together when you ask it to.

libchop: a similar system

There is another tool in the same space, called libchop, or, more specifically, chop-backup. I almost chose libchop over Ugarit at one point, but ultimately decided against it, first because the project seems more or less dormant, and second because the backup process seems slightly more complicated. (You need to store a key for each snapshot, so the encryption is more secure, but you have the hassle of keeping track of snapshot keys.)


Installation

The documentation is quite good, and there is also a tutorial that walks you through the relatively simple setup process, which I won’t go over here in detail. Essentially, there are four steps:

  • Install Chicken Scheme

  • Install, via the Chicken Scheme installer, a couple of hashing libs and ugarit itself.

  • Create a directory somewhere for the “vault”

  • Make a ugarit.conf file (one for each client) with its own salt and key.

And that’s all.

The vault

The vault is just a directory. Once the data starts pouring in, it will end up containing a lot of directories whose names are hex numbers, from 0 to FFF. They contain similarly named subdirectories, and eventually we find some data. This is how it works: the hash of the file is converted into a path to that file.

About that key and that salt

The salt the user provides in the ugarit.conf file is what ugarit will use to hash the files it is backing up. That is straightforward enough. But what is that key for?

Encryption, of course. This was one of the things I was looking for in a backup system: simple encryption so that I could put my data in uncontrolled places, like the “cloud”. Ugarit doesn’t force you to encrypt your data, but it does offer the possibility right out of the box.

So far, I have just used this setup for storing to a local server, but my plan is to have a second data store. At this point, something like Dropbox might actually be a perfect match for Ugarit. That would just mean having a “vault” inside a local Dropbox (or whatever) directory. Ugarit is efficient in terms of both network traffic (changes are the only things that get uploaded) and total storage space (nothing is duplicated). I haven’t tried it yet, but this seems like a perfect fit.

Alaric Snell-Pym has mentioned a future S3 backend. That would be even nicer.

Over the wire

As I mentioned, my current plan with Ugarit is to just get it saving files on my local backup server. This does imply getting data from the laptops up to the server upstairs.

I wanted to avoid having to set up NFS shares just for this. In my early rsync + hardlinks setup, the clients had scripts, run through cron, that tested for network connectivity, mounted the NFS share, rsynced, and unmounted. It worked but was fragile. Part of the reason I switched to Bacula was to solve this problem.

Ugarit is designed to work over ssh and does it very well. Here is some example config to show what I mean:

This is a line in the ugarit.conf config file on a client that defines the vault it is going to use, which in this case would be a directory on the same machine.

  (storage "backend-fs fs /backup/vault")

For a remote vault, all you do is this:

  (storage "ssh remote-host 'backend-fs fs /backup/vault'")

Ugarit still needs to be installed on the remote host, but you don’t even need a remote config file. And that is all!

Ugarit has an ‘explore’ feature that lets you browse through the snapshots in your vault, and through the files in each snapshot. With a remote backup server, this feature still works completely seamlessly over ssh. It’s very nice, considering that all it takes is a single line of configuration.

So far so good…​

I’ve been using this setup for a couple of weeks now, backing up two laptops, one Debian, one OS X, to a local server. Incremental backups are fairly fast, though not instantaneous, since Ugarit does have to look at a lot of files to decide which ones to back up.

My next step will be to automate the process, taking into account the fact that laptops are constantly going to sleep or offline. I’ll also have to figure out OS X’s equivalent of cron…​

After that I will take a look at the alternate backend that ships with Ugarit, the logfile backend, that stores data in fixed size blocks (like 1GB) and keeps track of files in a separate sqlite database. This might be an interesting way to store data on something like S3 while we wait for Ugarit’s author to write that S3 backend.

And, like I mentioned, I would like to try Ugarit on Dropbox or something Dropbox-ish. So anyway, lots of fun so far, but more to do.

1 I can’t really remember why I didn’t just go back to the rsync and hardlinks method, which really is a rather nice, efficient system. When I set that up, it was entirely homespun, just based on some examples I’d seen. I think I probably wanted something more enterprisey.

Whenever I encounter some cool new technology, it immediately and unconsciously goes into either the category of “things I know I’ll never use” or the category of “maybe someday”. It is pretty easy to tell if something is part of even the widest interpretation of your personal problem space.

For a long time, Elasticsearch was on the second of those lists. It even seemed as though anyone in the server-side “web space” should have a search solution in their quiver. But search is hard, and there are always more urgent things to do. Until now!

This week I finally started using Elasticsearch in a project. This article is going to be a n00b’s eye view of what getting started with Elasticsearch feels like, with a look at some of the kinds of problems it can help with. This is certainly not a tutorial! The code I’ve included is just for explanatory purposes and is not example code of any kind!

What it is

Briefly, Elasticsearch is a Lucene-based search tool (engine/platform/database/server/whatever). Lucene is the venerable Java-based search engine. Solr is Elasticsearch’s direct competitor and does basically the same thing.

Traditionally, when choosing between Solr and Elasticsearch, one would be told that Elasticsearch was built to be distributed (hence the “Elastic” part). That difference may no longer hold: Solr has progressed recently and uses ZooKeeper, so that now they can say this on their website:

Built on tried and true Apache Zookeeper, Solr favors consistency and avoids hairy split brain issues common to other approaches


They’re talking about Elasticsearch there. There is lots more to read about this debate. You can start with Kyle Kingsbury’s (@aphyr) Jepsen testing of Elasticsearch, and move on to this piece about Solr. The problems that show up with Elasticsearch here would be important if Elasticsearch were your primary data store, which I have indeed heard people recommend. If your data is elsewhere, like in a database, and Elasticsearch is just a search engine, you should be fine.

For me though, none of that made any difference, because all my data was going to fit in a single shard. The biggest reason I chose Elasticsearch was simply that there is a solid ClojureWerkz library called Elastisch. As I started learning more about Elasticsearch, I realized that it has other features that make it an interesting choice. Which brings us to my project…​

The problem

The goal was fairly simple. We have around 100,000 documents, web pages from the past 15 years or so. We also have a list of about 12,000 keywords that we are interested in and we would like to have an idea of the evolution of their importance over the years. We also know that the results won’t be perfect because matching keywords to texts is no simple task and because we aren’t ready to get into advanced keyword detection algorithms at this point.

The basic approach is pretty simple:

  • Index the 100,000 documents.

  • Search them using the keywords.

  • Generate a report with the totals for each month over the past 15 years.

Looping through each month since 2000 and running a search on a keyword seems ungainly and possibly too slow (although who knows, with a data set this size). This is where some of Elasticsearch’s features become really useful.

Getting started

This isn’t a tutorial, at all. I will mention some of the difficulties I had along the way though.

I first installed Elasticsearch from Elasticsearch’s apt repository. This installation was simple enough but Elasticsearch would silently die whenever I started it. /etc/init.d/elasticsearch stop would report that a PID was found but that there was no running process. Is this because Elasticsearch runs as java? I don’t know. I moved on to the manual “install” which, at least for development purposes, amounts basically to untarring the tarball and running bin/elasticsearch.

Interaction usually happens via http. The examples all use curl. Since I am doing this in Clojure, using Elastisch, the Clojure REPL is also a great way to interact with the search engine. For Emacs people, there is also es-mode, which is really nice because you can have a file full of different Elasticsearch requests and then type C-c C-c to send one of them to the server. Highly recommended!
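For example, poking at the server from the REPL looks something like this. (A hedged sketch: it uses the REST flavor of Elastisch rather than the native interface used later in this post, the connection URL assumes a default local node, and the index and type names are placeholders.)

(require '[clojurewerkz.elastisch.rest :as esr]
         '[clojurewerkz.elastisch.rest.document :as esd])

;; connect to a local node (URL is an assumption)
(def conn (esr/connect "http://127.0.0.1:9200"))

;; a quick sanity check: match everything in some index
(esd/search conn "someindex" "doc" :query {:match_all {}})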

Keyword mechanics

Finding keywords

Identifying keywords in texts is obviously a huge problem. The keywords I had to deal with were more than just simple words. Here is a sample:

  • Sciences humaines

  • Sémiotique des médias

  • Proust, Marcel

  • Vigny (de), Alfred

  • Empire

  • revue Culture et musées

Exact matches would not necessarily be useful. What if someone writes “empires” or “une science humaine” or “les sciences, mêmes humaines” (which would mean something like “sciences, even if they are human sciences”)? I actually wrote an initial version of this project that just used straight-up string matching. It worked, in the sense that it produced somewhat meaningful results, for some keywords. Others would disappear completely. In the list above, something like “revue Culture et musées” might not work well, because the first word, which means “journal”, might often be omitted.

These are classic human language parsing problems. The nice thing about using a tool like Lucene, is that a lot of this has been dealt with already. As you may have noticed, these keywords are in French. Lucene knows how to index French. Just tell Elasticsearch that a field is in French.

{ :body        {:type "string", :store "yes", :analyzer "french"}
  :title       {:type "string", :store "yes", :analyzer "french"}}

(Note that I’m writing this in Clojure. Classic Elasticsearch is in JSON, so the Clojure keywords get converted into strings.)
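For what it’s worth, here is a hedged sketch of how a mapping like that might be installed when the index is created, using the REST flavor of Elastisch. The index and type names are the ones that show up later in this post; the connection URL is an assumption.

(require '[clojurewerkz.elastisch.rest :as esr]
         '[clojurewerkz.elastisch.rest.index :as esi])

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (esi/create conn "ourindex"
              :mappings {:page {:properties
                                {:body  {:type "string", :store "yes", :analyzer "french"}
                                 :title {:type "string", :store "yes", :analyzer "french"}}}}))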

Fast forward: we’ve pulled the 100,000 documents out of their MySQL table and indexed them. I can run searches against them, I can find stuff, all very cool. But search, as such, is not the primary goal here. At this point I was reassured to be moving forward with Elasticsearch but unsure about how I was going to get my keyword data.

In comes the Percolator.

The Elasticsearch percolator reverses the typical relationship between a query and the document that is found. With the percolator, queries are stored ahead of time, as documents. Then we can send a document, I mean like a real document, and see which of the pre-selected queries match. It’s like a backwards version of Google, where you would type in a URL and get back a zillion pages of queries that would have matched that web page. You use a document to query the queries.

The idea is to see which of our keywords might pertain in some way to the documents. We are using a document to search for keywords.

The basic setup is simple enough. Here is my function for installing an individual keyword as a percolator document:

(defn register-tag-percolator [es-conn tag tagnorm id]
  ;; reconstructed call: register the query against "ourindex" under a
  ;; generated name (see the explanation just below)
  (nperc/register-query es-conn "ourindex"
    (str "tag-" tagnorm "-" id)
    :query (t/query-from-tag tag)))

nperc refers to the clojurewerkz.elastisch.native.percolation namespace. (The native part means that we aren’t using the RESTful interface, but the native Java interface. This is a (small) advantage of Clojure’s Java interop.) “ourindex” is the index where all of our data is already stored. The argument after that, (str "tag-" tagnorm "-" id) is the name of the percolator query document that we are storing. Our keywords have a normalized form and a numeric id, so we are just assembling a human-readable string. This will be important later though. The t/query-from-tag function is our own function. It will build an Elasticsearch query object from the actual tag string. More about that soon.

So we loop over our 12K keywords with this function. Now we can throw a document at them and see what happens. Here is our function for doing that:

(defn percolate-doc
  [es-conn doc]
  (nperc/percolate es-conn "ourindex" "page" {:doc doc}))

I agree, there isn’t much to see here. I’m showing this function just to point out that we can percolate any arbitrary document, even if it isn’t in the index yet. The doc argument to this function can be any JSON object with the same fields as the other documents. This is very important, because it is going to allow me to grab the keywords before I index the document, then include the results of that work in the document.

Now I can send a document to the percolator and get an answer like this

{:count 117,
 :matches [ ...a vector of 117 percolator query names, "tag-<tagnorm>-<id>" style... ]}

And at this point (assuming I’m satisfied with these results, but I’ll get back to that), I can take these 117 keyword matches and make them part of the document before I index it. Here is the function that does it:

(fn [r]
  (let [page (page-to-es r)
        {tags :matches} (percolate-page es-conn page)
        clean-tags (mapv (fn [t]
                           (last (str/split t #"-" ))) tags)]
    (esnd/create es-conn "ourindex" "page"
      (assoc page :autotags clean-tags))))

This function is the :row-fn argument to a query call. That means that r is a row in an SQL result set.

So what’s going on here? page-to-es is formatting the database row as a page document for the Elasticsearch index that we’ve already defined. The next line, with {tags :matches}, sends the document to the percolator to get a list of keywords, like we just saw. We then map over that to grab just the numeric part of the tag names, and finally we index the document, associng in a vector of keyword ids as the :autotags field.

Aggregating it all back together

The lists of keywords that are now part of each document we put into the index are not going to improve search results, since they already are search results. The reason we are storing them this way is that we want to look at the entire collection at once, so that we can graph the frequency of different keywords over time.

This is where Elasticsearch’s aggregations come in. Here is how we get our report back:

GET /ourindex/page/_search?search_type=count
{
  "query": {
    "match": { "autotags": "8442" }
  },
  "aggs": {
    "by_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

(You’ll notice that I’ve switched to JSON notation here: no particular reason except that I haven’t coded this part into my application yet.)

Running this query produces something like this:

  "_shards": {
  "hits": {
  "aggregations": {
      // etc.

That little bit of query code was enough to give us a doc_count for each month. The rest, as they say, is histograms!
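Since that part isn’t in the application yet, here is a hedged sketch of how the same request might look from Clojure with Elastisch (REST flavor; the by_month aggregation name is arbitrary, and the :aggregations option assumes a reasonably recent Elastisch):

(require '[clojurewerkz.elastisch.rest :as esr]
         '[clojurewerkz.elastisch.rest.document :as esd])

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (esd/search conn "ourindex" "page"
              :search_type "count"
              :query {:match {:autotags "8442"}}
              :aggregations {:by_month {:date_histogram {:field    "date"
                                                         :interval "month"
                                                         :format   "yyyy-MM-dd"}}}))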

It works but does it mean anything?

So far I’ve basically been telling the story of a proof of concept. It was very reassuring to see that Elasticsearch’s features could get us from our raw data to the output that we needed in very few steps. And, without caching anything, that is, without building up an intermediate index somewhere from which we would have generated our statistics. Once the keywords and the documents are indexed, the report generation is live. From a nuts and bolts perspective, this part was very satisfying.

But…​ does it mean anything? How good are the results?

This is what I’ve learned so far, when working with “medium data”: it is easy to draw a nice looking chart, but how do you know that you are measuring something meaningful? In this case, the question boils down to the quality of the matches in the percolation phase. Are those 117 keywords relevant to the document that they matched?

I’ve glossed over some important details in telling this part of the story, and now I want to come back to some of the other problems that I had to solve along the way. In the percolating function that I showed above, there was a query-from-tag function. This function’s job is to take a string like “Vigny, Alfred (de)” and turn it into a useful Elasticsearch query.
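To give a feel for what such a function might produce, here is a deliberately naive, hypothetical version (the project’s real query-from-tag is not shown in this post):

(defn query-from-tag
  "Build a (very naive) Elasticsearch query map from a keyword string:
   just match the words of the tag against the body field."
  [tag]
  {:match {:body {:query tag
                  :operator "or"}}})

;; (query-from-tag "Vigny (de), Alfred")
;; => {:match {:body {:query "Vigny (de), Alfred", :operator "or"}}}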

There are several problems here. Most of the time, this nineteenth-century poet’s name will just be written as “Vigny”. The first problem is that, because of the way that Elasticsearch (and Lucene?) analyze the documents that they index, “Vigny”, a very distinctive proper noun, can also match documents containing vigne (grapevine). As far as I can tell, the question of whether to analyze a text field or not is an index-level decision. Since I want full text search on the text body of each page document, it has to be analyzed; that means that the search query will be analyzed too, and so vigne will be considered as an equivalent of “Vigny”. There might be a way to get around this but I did not find one.

So what about using “Alfred de Vigny” instead? A query could easily be written where “alfred” and “vigny” would both have to be present to get a hit. Except that would mean missing all the occurrences of “Vigny” by itself. So we can’t do that.

The next logical approach would be to give a higher score to matches on the complete name and then maybe eliminate the lower-scoring hits. This is where we run into what is, in my opinion, one of the major limitations of the Elasticsearch percolator for this kind of problem. A percolator query, i.e. throwing a document at 12,000 stored queries to see which ones stick, does not score its results the way a traditional search does. It is binary: hit or miss.

On my first try, most of the documents were getting between 700 and 1000 keyword hits. By delving into the way that Elasticsearch queries work, I was able to fine tune the matching to a certain degree, ending up in the range of 100 to 150 hits per page. If a person were to manually assign keywords, it would probably be more like ten to 30 keywords for a document. I don’t know much about machine learning, but something tells me that a discrepancy of about an order of magnitude between human and machine keyword guessing might not be so bad considering how little work actually went into setting this up.

I’m sure there are technical reasons why percolator results aren’t scored. A percolator query is a strange thing, and this version of the percolator has only been around for a short time. The results can still be useful, though. As you can see in the sample results, the document got hits from about 1% of the available keywords. This is enough to give a rough idea of the evolution of certain concepts over time, but we have to remember that the results are rough.

If we look back at the list of sample keywords I gave at the beginning, there was this one: “revue Culture et musées”. It is the name of a journal. It will receive an incredible number of false positives because all three of its terms are going to be present in various configurations in a lot of documents, most of the time not referring to the journal in question.

For our purposes this is not a major issue. The idea is to get approximate results with as little human intervention as possible. If we really needed higher quality keyword matching, one solution would be to simply invest more time in the curation of the keyword queries, possibly writing some or all of them more or less by hand, or adding metadata so that we can tell the system that the word “revue” needs special treatment in the query. Elasticsearch provides a fairly elaborate query DSL that would permit this.
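As a hypothetical example of that kind of curation, a hand-written query for the journal name might insist on the whole phrase rather than matching its words independently (a sketch, not something from the actual project):

{:bool {:should [{:match_phrase {:body  "revue Culture et musées"}}
                 {:match_phrase {:body  "Culture et musées"}}
                 {:match_phrase {:title "Culture et musées"}}]
        :minimum_should_match 1}}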

And as a surprise bonus…​

One positive result of this work is that, thanks to the machine generated keywords, we are going to be able to provide a limited set of keyword suggestions for our users, who will be able to manually tag pages. In my opinion, this alone might make the whole project worthwhile.

Not to mention that we could even use this thing as a search engine if we had to.

I’m working on a project where I need to extract textual data from a lot of HTML. Today I was working on a way to normalize my text input by parsing the HTML, especially the entities, and outputting raw text. I started to look around for a library to do this and everything seemed to point to Christophe Grand’s Enlive library.

This library was not unfamiliar to me, because I’ve used it for templating in a Ring app. I also knew that it could be used for web scraping, but I had never really paid attention to that aspect.

So I’ve got my library, I’ve got my data, and I start trying to learn how to break HTML down into simple text. I do this (en is for enlive here):

;; html-resource wants a stream or reader, so we wrap the string
(en/html-resource (java.io.StringReader. "<em>stuff</em>"))

and I get this:

({:tag :html, :attrs nil, :content
       ({:tag :body, :attrs nil, :content
                     ({:tag :em, :attrs nil, :content ("stuff")})})})

Now that data structure looks kind of familiar. It should, since that is the standard way of representing XML trees in Clojure.

Suddenly something clicks deep in my mind. “We can use a zipper!”

And indeed, after require-ing [clojure.zip :as zip], we can do this:

(zip/xml-zip (first (en/html-resource (java.io.StringReader. "<em>stuff</em>"))))
[{:tag :html, :attrs nil, :content
       ({:tag :body, :attrs nil, :content
                     ({:tag :em, :attrs nil, :content ("stuff")})})} nil]

And now we are off to the races! We have our handy zipper toolbox at our disposal. (I wrote three articles about zippers a few months ago: one, two and three.)

The last of the three articles provides a pretty good pattern for this situation. Here is what I ended up doing:

(defn normalize-text
  "Extract text: lower case everything, parse html, remove tags."
  [text]                      ; argument vector reconstructed
  (->> text
    (str/lower-case)          ; assumes clojure.string required :as str
    (java.io.StringReader.)   ; html-resource wants a reader
    (en/html-resource)
    (first)
    (zip/xml-zip)
    (iterate zip/next)
    (take-while (complement zip/end?))
    (filter (complement zip/branch?))
    (map zip/node)
    (apply str)))

The use of the ->> macro here might make this slightly unintuitive if you aren’t used to it. Note how, serendipitously, all the forms here are either single-argument functions like first, or multiple-argument functions where the key argument comes at the end. I guess today I did feel lucky.

Anyway, let’s break this down. The first few lines are what we’ve already seen, parsing HTML with Enlive’s html-resource and making an xml zipper. Let’s take it from there.

;; we make the zipper: reader -> parse -> first node -> zipper
(def hzip (->> "<p>sample <em>text</em> with words.</p>"
               java.io.StringReader.
               en/html-resource
               first
               zip/xml-zip))

;; and we walk through it and grab the text
(apply str (map zip/node
                   (filter (complement zip/branch?)
                     (take-while (complement zip/end?)
                       (iterate zip/next hzip)))))

;; with the result
"sample text with words."

How did that work? The take-while and iterate parts walk through the tree, lazily. We filter out all the branch nodes, which in this case means everything that isn’t…​ text. Which is what we are looking for. At this point all we have to do is map over the locations (remember, we are in zipper-space where everything is a location containing the entire tree — see my other articles about that) with zip/node to get the individual textual payloads. Make it all into a proper string by applying str and we’re done.

Once again, nothing particularly amazing about all of this, though I was pleased with the elegance and brevity of a solution that I was able to cobble together rather quickly. There are other ways of getting the same results too. But this does show how a little bit of familiarity with Clojure’s zipper libraries can be surprisingly useful.

So I’m working on something where I have percentile values for each month over the past ten years. I need to write an endpoint so that my javascript, which will eventually be some D3 histograms, can easily grab that data.

The way my tables are set up, there aren’t any zero values. If I have zero occurrences for a given month, there is simply no corresponding row. This means that for my histograms, I need to fill in the missing months. And I start thinking about how I should do that: server side or in the browser, looping over all the possible months, checking to see if we have data etc.

There isn’t anything particularly difficult about this, aside from the fact that we don’t particularly enjoy touching dates. It is just kind of tedious and possibly easy to get wrong.

And then suddenly I think: SQL!

After all, we are basically just taking a database query, munging it and sending it down the road.

Here is the query the way I wrote it the first time:

select tm.month as month, p.percentile as percentile
from tag_month_percentiles p
join tag_month tm on tm.id = p.tag_month_id
join tag t on tm.tag_id = t.id
where tm.tag_id = ?
order by tm.month asc

What I needed was an outer join to produce lines for all those “missing” months.

After fussing around a little, here is what I ended up with:

select tm.month as month, IfNull(peep.percentile, 0) as percentile
from tag_month tm
left join
     (select tm2.month, p.percentile
     from tag_month_percentiles p
     join tag_month tm2 on tm2.id = p.tag_month_id
     where tm2.tag_id = ?) as peep
on tm.month = peep.month
group by tm.month
order by tm.month asc

I don’t do this often enough for it to be second nature but in the end it is pretty simple. The original query becomes a subquery and we do a left join against the entire table. All the empty months, that disappeared with the original inner join, reappear as nulls. And we use IfNull (this is MySQL - not my fault) to replace the nulls with zero in the final results.

Because this is coming directly from the data source, I don’t need to figure out when my list of months should begin or end. All the months are automatically where they should be. I don’t have to think about it anymore.

There is nothing earth shattering here…​ but it does show how with about five lines of code, and the declarative nature of SQL, I avoided writing many more lines elsewhere, and avoided all the possible bugs that could have crept in.

Over the past couple of months I’ve been working on a web app, writing it in Clojurescript and using David Nolen’s Om framework. I may end up writing about some detailed aspects of Om, but for now I would like to talk a little bit about what it is like to build an Om app. This isn’t meant to be a tutorial at all, or an example; I just want to give a sense of what it feels like, and what some of the harder things to grasp are.

(If you are unfamiliar with Om, I recommend reading David Nolen’s blogpost, and listening to this Javascript Jabber episode. Or watching David Nolen’s talk at the last Clojure/West.)

My previous experience with Javascript frameworks was mostly with Backbone.js. And before Backbone existed, at least publicly, I already could sense that I needed some kind of tool to abstract the relationships between data (what would eventually become models in most frameworks) and the DOM. On one project, I wrote a kind of abstraction layer that tried to separate the data from the view. Everything was based on anonymous functions that got passed around. It worked, but when I came back four years later to debug it, I realized that it was completely incomprehensible. The only consolation I got was that because data and presentation were more or less separate, it only took about a day and a half to shoehorn it into Backbone.

Anyway, I digress…​

React and Om (from here on out I will mostly just say Om, but the two are obviously intertwined), change all of this. In my opinion, in fact, Om coupled with Clojure’s core.async, basically renders the MVC terminology obsolete. Not that we don’t have models, views and controllers anymore; it’s just that the layering of the three is no longer relevant. I would argue that in Backbone at least, the V and the C are already blended together, or that what Backbone calls a View is really a Controller, and that the DOM is in fact the View. Other frameworks handle this differently, and people have talked about MVVM as well. From what I know about other, more recent frameworks like AngularJS and Ember, they seem to be (huge) improvements over what Backbone started, but that they don’t really change the way we think about all of this.

Om (and React) are definitely view-centric. However, I would argue, for the time being anyway, that Om (with core.async) determines many other aspects of the process, in such a way that traditional Model or Controller layers are no longer relevant. In other words, this is not just a View layer in search of an M and a C. It is a different way of thinking about the whole process. And because it is so different I thought it would be worth going over some of the fundamental ideas, to perhaps alleviate some of the disorienting strangeness that we run into when we learn new patterns.


The component lifecycle

If Om is so different, it is because the abstractions the developer has to deal with are rather different from what we are used to. And the React component lifecycle is probably the hardest of those abstractions to understand, as well as being the one you have to deal with constantly.

By lifecycle, they mean the stages of how a given thing (a component) appears in the DOM, changes and perhaps disappears from the DOM. Every framework tries to take care of certain aspects of the task at hand so that we don’t have to think about them anymore. A big part of what Om does as a framework is moving your data through these different states.

Understanding what is happening here is therefore crucial, but difficult, especially if you, like me, don’t have any direct experience with React. And of course part of the point of using a library like Om is to not have to know React that well to get going.

So here is the complete lifecycle:

  • IInitState

  • IWillMount

  • IDidMount

  • IShouldUpdate

  • IWillReceiveProps

  • IWillUpdate

  • IDidUpdate

  • IRender

  • IRenderState

  • IDisplayName

  • IWillUnmount

These protocols are implemented by functions in Om components. A component must implement one of the rendering protocols; the rest are optional.

This list makes things look complicated, but we can actually pare this list down considerably. For example IRender and IRenderState are conceptually identical; IShouldUpdate is something that Om takes care of by itself, but could be an issue in pure React; IDisplayName exists really for debugging only.

With a little simplification, we really have four or five concepts to worry about. The basic lifecycle starts with an initial state, i.e. IInitState. When our component comes to life, it first needs to be mounted into the DOM, hence IWillMount and IDidMount, and finally IWillUnmount when the component is destroyed.

Rendering is the heart of this process, at least from the programmer’s point of view, because that is where we build the HTML that the user will eventually see. An important difference with other frameworks is that rendering happens a lot. For an animation, like moving a <div> from left to right, the IRenderState function might be called several times a second. Sometimes it is better to think of your app like a movie (a cartoon, I guess) rather than a static set of objects sitting around waiting for events to happen. And if you are thinking: “that must be horribly slow”…​ well, it isn’t, thanks to React’s virtual DOM that efficiently projects changes out into the real DOM.

So, back to the lifecycle, IRenderState basically gets called whenever something changes. Besides mounting and unmounting a component, there are also the “change” phases: IShouldUpdate (which we ignore, because in Om the components “know” if they should update or not), IWillReceiveProps, IWillUpdate, and IDidUpdate. The will and did parts are fairly simple: do we need to pre-process that new data coming in before updating (and rendering)? After updating, is there something we need to do, like grabbing the DOM node that might have changed, to get its dimensions, for example? They work like hooks in other systems.
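To make those protocols a little less abstract, here is a minimal, hypothetical component that touches the main lifecycle phases. The names (counter-view, :clicks, :title) are made up for this sketch.

(ns example.counter
  (:require [om.core :as om :include-macros true]
            [om.dom :as dom :include-macros true]))

(defn counter-view [app owner]
  (reify
    om/IInitState
    (init-state [_]
      {:clicks 0})                        ; component-local state
    om/IWillMount
    (will-mount [_]
      (println "about to mount"))         ; e.g. set up channels or timers
    om/IRenderState
    (render-state [_ {:keys [clicks]}]    ; called on every state change
      (dom/div nil
        (dom/p nil (str (:title app) ", clicks: " clicks))
        (dom/button
          #js {:onClick (fn [_] (om/update-state! owner :clicks inc))}
          "Click")))
    om/IWillUnmount
    (will-unmount [_]
      (println "going away"))))           ; e.g. close channels, clear timers

;; (om/root counter-view app-state {:target (. js/document (getElementById "app"))})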

I like to think of this as a river of data flowing down towards the virtual DOM. These protocols and their corresponding functions are how we guide that data to the right place and mold it into the right form.

And so what about the data?

React (and Om) are based on the idea of one-way data binding. Change happens in your data somehow, and the components in the app translate that into a new representation in the DOM. This is why the conceptual model is so appropriate for a functional programming approach, and so appropriate for Clojurescript with its immutable data structures: in a very broad sense, there is data on one side, then a series of functions that transform it, and finally DOM output on the other side. This gets us away from the traditional UI model where lots of little pieces talk to each other all the time, and state is intertwined (or “complected”) with the interface itself.

Of course, the details are a little bit more complicated than a pure data in, DOM out model, partly because we do need information from the DOM to work its way back up the river.

Om (and React) have two kinds of data: application state (which React calls “props”) and component state. This is an area that trips people up because it is not always clear what kind of data should go in what kind of state. More on that in a second.

Application state

In Om, all application state is generally contained in a single atom that has a tree structure composed of maps and vectors. Om uses something called a cursor to access individual parts of this tree. David Nolen has compared cursors to Clojure zippers. The interface to cursors is nowhere near as elaborate as zippers, but the idea is indeed similar: keep a path to wherever the data you are interested in is located.

For example, if you have a menu with a submenu in your app, you might reference it like this: [:main-menu :sub-menu].
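In code, that might look roughly like this (a sketch; the :main-menu and :sub-menu keys are just the example from above):

;; the single app-state atom, a tree of maps and vectors
(def app-state
  (atom {:main-menu {:title "Main"
                     :sub-menu {:items ["one" "two" "three"]}}
         :pages []}))

;; somewhere in a parent component's render function, only the sub-menu
;; branch is handed to the child component as a cursor:
;; (om/build sub-menu-view (get-in app [:main-menu :sub-menu]))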

The reason this is important is that most components will only have access to one or possibly a few parts of the overall tree. In a more traditional javascript framework, we would say that some part of the View was “listening” to that part of the model. This would not be accurate with Om (or React), where it makes more sense to say that changes are being pushed out from the application state (the atom) to the view and ultimately to the DOM.

It’s worth noting that, in Om, both application state and components tend to be organized in a tree-like manner: components branch out into sub-components just like application state does. This pattern works well with the top-down, data-pouring-through-the-app approach.

Component state

The other part of the state picture is component local state. What should be local is sometimes a tricky question, but as a starting point, it might be helpful to think that component state tends to be more related to the mechanics of making the component work. For instance, if two components need to communicate with each other via a core.async channel, the endpoints of the channel would belong to each component’s local state. The other classic example is an animation where an offset is being incremented on each render; that offset doesn’t need to belong to the app state. It is just a “mechanical” part of making the component do what it should.

Back to application state, with an example

Application state, on the other hand, deals with what you are trying to represent. This can still be tricky, depending on what your definition of “what” is…​

In the project I’m currently working on, most of the actual content can be either in French or in Latin, and the user can choose to toggle between the two languages inside a given component. So at first, I thought this sounded like component state, since it was a question of choosing between two different representations of the same thing, and because all that data was already available to the component.

This quickly started to break down though, because pretty soon I wanted components to behave differently depending on what language was displayed by nearby components. I started setting up channels to tell the neighbors about changes to language state and everything started to get complicated and spaghetti-ish. I finally realized that my mental distinction between component state and application state needed some adjusting.

It turns out that the French/Latin language choice is really part of what the app is supposed to be showing. It isn’t an implementation detail, so it goes into the app state.

Earlier, I mentioned a menu and a sub-menu as being part of application state. In some circumstances, we could imagine that the contents of those menus might be derived from other information in the app. A menu isn’t so much a “thing” to represent as a tool within your application. However, since it is an entity that is part of what you are trying to show, it probably deserves its own piece of app state real estate. Whether a collapsing menu is visible or not might, on the other hand, be a suitable candidate for component state…​

At any rate, this isn’t meant to be a complete discussion of the topic, but just enough to give you an idea of how our thinking has to change when using Om.

Back upstream

React is supposed to be fast, and Om possibly even faster. But that implies interaction, and so far I’ve been describing a great river of data that lands in the DOM. How does change occur?

The simplest case would be a list item with a button that adds a star to the item when we click it. The list item would be a component, with the button part of the same component. The button would have a handler that would look a little like this, but not quite:

(defn star-handler [ev]
      (.preventDefault ev)
      (om/transact! list-item (fn [li] (assoc li :star true))))

list-item refers to the part of the application state that is at the origin of this particular list item. om/transact! takes a function that operates on list-item, thus returning a new version with :star set to true.

What is nice here is that application state gets changed immediately. If our render function is set up correctly, it will render a star now that :star is true. If there is a star counter somewhere else in the app that keeps track of how many items are starred, it will be incremented immediately. Without any event listeners, except for the one listening for the click itself, these changes can be reflected everywhere in the app.

Now, this is probably the simplest possible case, because the state change concerns the item itself. If we wanted to remove the list item, instead of modifying it, we would have to do that from the parent component, or, to be more precise, the “owner” of the component. Unlike the DOM, “ownees” can’t access or manipulate their owners. Data just flows one way. So if we need to communicate with the parent or owner, to tell it to remove a child element, we can establish a core.async channel to tell it to call om/transact!, and everything will just work out. core.async allows us to jump back up the tree when we need to.
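Here is a hedged sketch of that pattern: a child announces a deletion on a core.async channel supplied by its owner, and the owner is the one that calls om/transact!. All the names here (list-view, list-item-view, :delete) are illustrative.

(ns example.list
  (:require [om.core :as om :include-macros true]
            [om.dom :as dom :include-macros true]
            [cljs.core.async :refer [chan put! <!]])
  (:require-macros [cljs.core.async.macros :refer [go-loop]]))

(defn list-item-view [item owner]
  (reify
    om/IRenderState
    (render-state [_ {:keys [delete]}]
      (dom/li nil
        (:text item)
        ;; the child only announces the deletion on the channel
        (dom/button #js {:onClick (fn [_] (put! delete @item))} "x")))))

(defn list-view [items owner]
  (reify
    om/IInitState
    (init-state [_] {:delete (chan)})
    om/IWillMount
    (will-mount [_]
      (let [delete (om/get-state owner :delete)]
        ;; the owner listens and removes the item from application state
        (go-loop []
          (let [item (<! delete)]
            (om/transact! items
              (fn [xs] (vec (remove #(= item %) xs))))
            (recur)))))
    om/IRenderState
    (render-state [_ {:keys [delete]}]
      (apply dom/ul nil
        (om/build-all list-item-view items
          {:init-state {:delete delete}})))))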

The same thing holds for new data coming in from the server. It goes into a channel and gets directly integrated into application state, and whatever changes need to be made can just flow back out through the rendering functions.

This is also why I was saying that Om (and React) are really much more than a View layer: Om has its own state management in the form of the state atom and cursors. It isn’t quite a model layer, but anything missing, like server sync, ultimately can just be added on as needed. The same is true of what would be the Controller: to the extent possible, you just want to write functions that react to input and modify data. In other words, while Om doesn’t provide everything you need to build a complete app, it is more than just a brick in a bigger edifice, because it imposes an architectural model that determines how you set up your “models”, your “controllers”. That is, to the extent that it is still relevant to talk about models and controllers in this context.

Next time, or maybe the time after that, I think I’ll talk about some of the things that are indeed somewhat difficult with Om.

Thus far, in the first two parts of what is turning into a series, when walking Clojure zippers, we’ve been mostly using a pattern like this:

(require '(clojure [zip :as zip]))
(defn recursion-for-recursion
  "Walking around doing nothing."
  [loc]
  (if (zip/end? loc)
    loc
    (recur (zip/next loc))))

Now, obviously, this function wouldn’t do anything except run the zipper out to its :end. But that has been our pattern so far.

Recursion smell

But this is Clojure and every use of recur should make you wonder for a second whether or not you can accomplish the same thing using a sequence. Take a look at this function, that I ran into here:

(defn leafnodes
  "Returns all leaf nodes in loc."
  [loc]
  (filter (complement zip/branch?) ;filter only non-branch nodes
          (take-while (complement zip/end?) ;take until the :end
                      (iterate zip/next loc))))

This could be rewritten with our familiar pattern:

(defn leafnodes-r
  "Recursive version of leafnodes."
  [loc accum]
  (if (zip/end? loc)
    accum
    (recur (zip/next loc)
           (if (zip/branch? loc)
             accum
             (conj accum loc)))))

It is interesting to see these functions side by side, or at least one above the other. (iterate zip/next loc) does the actual (lazy) recursion; (take-while (complement zip/end?)…​) limits the recursion so that we stop at the end. By this time, it is pretty clear that we are getting a seq of locations that we could do basically anything with. filter is used here to take only the leaf nodes, which is what we get in the end.

Let’s try some stuff:

(def zzz (zip/vector-zip [:a [:b [:c :d] :e]]))
(map zip/node (leafnodes zzz))
; (:a :b :c :d :e)

Mission accomplished: all leaf nodes. What about the recursive version?

(map zip/node (leafnodes-r zzz []))
; (:a :b :c :d :e)
(= (leafnodes zzz) (leafnodes-r zzz []))
; true

;; just for fun
(flatten (zip/root zzz))
; (:a :b :c :d :e)

That’s fine then, it works. The powerful thing here is that we now have the full combined firepower of the seq and zipper abstractions for extracting information from a tree.

Parsing HTML

Let’s have some fun with this pattern playing with everyone’s favorite form of XML. We are going to play around with extracting the links from an HTML page. Here is our namespace declaration:

(ns info.josf.examples.zipper-seq
  (:require [clojure.zip :as zip]
            [clojure.data.xml :as dx]
            [clojure.java.io :as io]))  ; io is assumed here, to open the page as a stream

I’ve actually put all these examples into a Github repo, which could be helpful if you want to follow along in your REPL.

First, we grab a page and stuff it into a zipper:

(def zappy
  "grab a webpage we can trust, and parse it into a zipper"
  (-> ""                ; the URL was elided here
      io/input-stream   ; dx/parse wants a stream, not a string
      dx/parse
      zip/xml-zip))

There are three simple steps here: open a stream (dx/parse wants a stream, not a URL), parse the page, then pass the result to zip/xml-zip, which is the specialized zipper-builder for XML. Even here the node-space/zipper-space distinction is at work: dx/parse builds the actual tree, and zip/xml-zip wraps it up as a zipper. Now let’s do something.

(defn get-links
  "Returns a list of href attributes from :a nodes in an xml zipper."
  [loc]
  (map #(get-in (zip/node %) [:attrs :href])
       (filter #(= :a (:tag (zip/node %)))
               (take-while (complement zip/end?)
                           (iterate zip/next loc)))))
(get-links zappy)
; ("index.html" "index.html" "#" "index.html"
; "branch-1.1.x/index.html" "index.html" "api-index.html"
; "#" "clojure.core-api.html".... etc.)

There is almost nothing to explain here. It is the same basic function as before, except we’ve adapted the filter predicate to work with an XML tree, and we’re using map to grab the :href value from each node.

If you’ve never dealt with XML trees in Clojure before, this might look slightly confusing. The basic idea with this kind of tree representation is that each node is a map with three keys. (clojure.data.xml actually uses records; non-text nodes are represented as Elements, but it amounts to the same thing.) It looks like this:

{:tag :a,
 :attrs {:href ""}
 :content ("The link text")}

In our function, when we say (map #(get-in (zip/node %) [:attrs :href]) …​), we are just grabbing the href out of the node’s attrs map.

This is a pretty basic task: walk a tree, grab some data, report back with a list. Any language can do this of course. Clojure zippers let you do it fairly simply, and in fact this little function seems at least as simple as something comparable in XPath/XSLT.

A contrived twist

Now let’s try something a little crazier. We’ll parse the same document, but this time we want more than just the href values: we also want to know the depth, relative to the document root, of each link, as well as how many sub-nodes there are in the <a/>. For a given link, we want something like this:

{:href "index.html" :depth 5 :subnodes 1}

Step one: just a seq

The get-links function returns a seq of URL strings, but for this we need locs, so we have to simplify the function:

(defn get-a-zips
  "Return a lazy seq of :a locations"
  [loc]
  (filter #(= :a (:tag (zip/node %)))
          (take-while (complement zip/end?)
                      (iterate zip/next loc))))
;;; Let's put the result in a var. Printing the result
;;; will eat up a lot of space in your REPL.
(def a-list (get-a-zips zappy))

This is the same function as before, less the mapping that extracted the URLs. It is just going to give us the bare list of locations. The rest of the work is coming up with a function to map over this seq.

Step two: ancestor count

Suddenly we realize that each element in the list of link node zippers has access to the entire tree. Even though it feels like we are pulling the tree apart, filtering out everything but the parts that we are interested in, each part is actually one of those grains of sand that contains the entire universe (so to speak).

Now, we all love Clojure’s seq interface, but there is something that can trip you up. Or at least that is what happened to me. I’ve talked a lot about the distinction between “zipper space” and “node space”, because, in my experience anyway, this is often the difference between getting it or not.

So when we then start thinking about all of this through the seq interface, it can get a little bit more confusing: we’ve got a seq, but a seq of what?

I expected to get a bunch of nodes. This is such a common pattern: walk a tree, look at the nodes, accumulate data, do something. What is really cool, and possibly confusing, about Clojure zippers, is that often you don’t get nodes, you get locations. That means that you aren’t extracting data, you are locating it in the tree. Each element of the seq has access to the entire tree. Each element that gets returned is still a zipper. (Unless it’s not, of course: functions like zip/children return nodes, not locs.)
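A quick way to convince yourself of this is to take any element of a-list and keep navigating from it:

;; each element of a-list is a full location, not a bare node, so we can
;; still move around the tree from it:
(zip/node (zip/up (first a-list)))
; => the parent element of the first <a> node; zip/root would give us back
;    the whole document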

So, given a loc, how do you count its ancestors? Well, you write a function.

(defn ancestor-count
  "How many ancestors do I have?"
  [loc]
  (count (take-while zip/up
               (iterate zip/up loc))))

This is the same pattern as before, except that we iterate using zip/up, and we keep going up until zip/up returns nil, which can only happen at the root node.

Let’s take it for a spin:

(map ancestor-count a-list)
; (4 5 7 9 9 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 11 12 12 12 12 12 12
12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 10 10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
10 10 4)

Looks good to me…​ except wait a second! We can do the same thing in fewer characters using the zipper’s path:

(defn ancestor-count
  "How many ancestors do I have?"
  [loc]
  (count (zip/path loc)))

Step three: how many kids?

For simplicity’s sake, we’ll just count the total number of descendant nodes. This is one of those times when using a sub-zipper seems like a good idea. By that I mean, as I discussed in my first post on zippers, that zippers are cheap to make, and that it can be useful sometimes to turn a given node into a “sub-zipper”.

(defn descendant-count
  "How many descendants do I have?"
  [loc]
  (let [sub-zip (zip/xml-zip (zip/node loc))]
    (count (take-while (complement zip/end?)
                       (iterate zip/next (zip/down sub-zip))))))

(map descendant-count a-list)
; (1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
; 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
; 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)

Not the most exciting results, since all the links on the page apparently only have text content. You get the idea though. We could have counted siblings, for example, or done any other kind of calculation.
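A sibling count, for instance, could look like this (a quick sketch in the same style):

(defn sibling-count
  "How many siblings, left and right, does this loc have?"
  [loc]
  (+ (count (zip/lefts loc))
     (count (zip/rights loc))))

;; (map sibling-count a-list) would give yet another column of numbers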

Step four : all together

Just for consistency we need one more little function to grab the href as well:

(defn extract-url
  "Get the href value from an <a> zipper. Takes a loc."
  [loc]
  (get-in (zip/node loc) [:attrs :href]))

And now we can do what we set out to do:

(defn depth-href-kidcount
  "loc-seq is a seq of <a> node zipper locations."
  [loc-seq]
  (map (fn [loc]
         {:depth (ancestor-count loc)
          :href (extract-url loc)
          :subnodes (descendant-count loc)})
       loc-seq))

Deep breath…​

(depth-href-kidcount a-list)
; ({:depth 4, :href "index.html", :subnodes 1}
; {:depth 5, :href "index.html", :subnodes 1} {:depth 7,
; :href "#", :subnodes 1} {:depth 9, :href "index.html",
; :subnodes 1} {:depth 9, :href "branch-1.1.x/index.html", :subnodes 1}
; ...etc.)

Simple as that!

Obviously, this was a contrived example, but I hope that it shows that there are a lot of creative ways to mix the seq and the zipper interfaces, and that it can be a powerful way to work with your data. In my opinion, zippers aren’t really that complicated, but they do require us to adjust our thinking. Once we do, there are lots of fun things to do.

There is one inherent limit to this pattern, and that is that there is no way to systematically edit the tree once it has been iterated into a sequence. All the locs in the seq are snapshots of, or references to, the exact same tree. Any cumulative editing will just produce a sequence of new trees, each one modified once, with no way to combine all of the edits into the same tree.
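A tiny illustration of that limit, with a fresh zipper: editing two locations taken from the same seq produces two separate result trees, and the two edits never meet.

(def zv (zip/vector-zip [1 2 3]))
(def locs (filter (complement zip/branch?)
                  (take-while (complement zip/end?)
                              (iterate zip/next zv))))

(zip/root (zip/edit (first locs) inc))   ; => [2 2 3]
(zip/root (zip/edit (second locs) inc))  ; => [1 3 3], the first edit is gone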

This is a follow-up to my Getting acquainted with Clojure zippers post. That was more of an introduction; this article is going to be more about some of the difficulties we run into when using zippers to modify tree structures. We are going to try to write a function to modify a tree, and along the way we will run into lots of problems that, once we deal with them, will eventually lead to zipper enlightenment. The code here isn’t meant as recipes or a perfect example, but just a way to work through some issues that come up when working with Clojure zippers.


The neat thing about zippers in Clojure is that they allow tree manipulation to be purely functional. Zipper manipulation functions can be pure: given the same zipper they will always return the same thing. This is true of movement functions like zip/down and zip/next, since position inside the tree is part of what we are feeding into the function: given the same tree, and the same position inside the tree, the result will always be the same. The tree itself, obviously, doesn’t change just because we’ve moved. The zipper here is an abstraction that separates the tree itself from the notion of our location in the tree, and thus allows us to deal with both explicitly.

…​but editable?

What about modifying the tree, though? What happens then?

It turns out that simple edits are simple. There are several editing functions in clojure.zip:

  • edit

  • append-child

  • insert-child

  • insert-left

  • insert-right

  • remove

  • replace

zip/edit is interesting because its arguments are a location and a function to modify the current node. This kind of editing could go rather far, since the current node can be the top of a subtree. In a simple case, though, it could look something like this:

(require '(clojure [zip :as zip]))
(def zzz (zip/vector-zip [1 [:a :b] 2 3 [40 50 60]]))
; #'user/zzz
(-> zzz zip/down (zip/edit inc) zip/root)
; [2 [:a :b] 2 3 [40 50 60]]

The first node in the tree, the 2, has been inc’ed up from 1. Editing is possible!

In an XML based tree we might have more complex editing needs. XML trees are map-based. Each node is a map with a :tag key, an :attrs key and a :content key. If we wanted to change a <ul> to an <ol> in an HTML document, the call to zip/edit would look something like this:

(zip/edit loc #(assoc % :tag :ol))
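Just to see that run, here is a minimal sketch on a hand-built node map (a made-up <ul> with a single <li>, not from a real document), using zip/xml-zip, which understands the :tag/:attrs/:content shape:

(def ul-loc (zip/xml-zip {:tag :ul :attrs nil
                          :content [{:tag :li :attrs nil :content ["one"]}]}))
(:tag (zip/root (zip/edit ul-loc #(assoc % :tag :ol))))
; :ol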

We’ll stick to vector trees here, in the interest of simplicity and short code snippets.

Let’s try inserting a node now. We’ll try to put a :c after the :b in the first subvector.

(-> zzz zip/down              ;; we are at 1
        zip/right             ;; this is the subvector
        (zip/append-child :c) ;; add as the last child
        zip/root)             ;; back to the top to appreciate our work
; [1 [:a :b :c] 2 3 [40 50 60]]

zip/append-child is a good choice here because we are appending to the end of the vector, and that is what it always does. What about prepending? This time we’ll put a :aa at the beginning.

(-> zzz zip/down zip/right zip/down ;; down to the :a node
        (zip/insert-left :aa)
        zip/root)
; [1 [:aa :a :b] 2 3 [40 50 60]]

zip/insert-left does just what it says. Let’s delete any numbers greater than 10:

(defn death-to-big-numbers [loc]
  (if (zip/end? loc)
    loc
    (if (or (not (integer? (zip/node loc)))
            (< (zip/node loc) 11))
      (recur (zip/next loc)) ;; base case: keep moving
      (recur (zip/next (zip/remove loc)))))) ;; evil big numbers
(zip/root (death-to-big-numbers zzz))
; [1 [:a :b] 2 3 []] ;; they're gone!

This is our standard zipper-walking format. Most of the work is just figuring out which locations need to be removed and passing them off to zip/remove.
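Pulled out on its own, the skeleton of that format is just: apply some editing function at each loc, zip/next, and stop at zip/end?. Here is a sketch, with a hypothetical walk-edit helper (not part of clojure.zip):

(defn walk-edit
  "Walk the whole tree, applying f (loc -> loc) at every location."
  [loc f]
  (if (zip/end? loc)
    loc
    (recur (zip/next (f loc)) f)))

;; death-to-big-numbers expressed with it:
(zip/root (walk-edit zzz
                     #(if (and (integer? (zip/node %)) (> (zip/node %) 10))
                        (zip/remove %)
                        %)))
; [1 [:a :b] 2 3 []]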

Immutable gotchas: let’s wrap

Now let’s try something more complicated. We’re going to start over with a new vector tree.

(def z2 (zip/vector-zip [1 2 :a :b [3 4 :c :d 5] :e]))

Our goal this time is to make sure that all keyword nodes are wrapped together in their own subtrees, with adjacent keywords wrapped together. We want to be able to take a tree like z2 and produce this:

[1 2 [:a :b] [3 4 [:c :d] 5] [:e]]

(I encountered this problem in an HTML document, trying to wrap adjacent, indented <p> elements inside <blockquote>s.)

Here we go. Strategy number one. We’ll look for keywords and, if we find one, we’ll wrap it and then start appending any right siblings.

First, let’s just try wrapping each keyword individually:

Strategy 1 - a: wrap and go…​ down
(defn strategy-one-a [loc]
  (if (zip/end? loc)
    loc
    (if (keyword? (zip/node loc))
      (recur (zip/next (zip/edit loc #(vector %))))
      (recur (zip/next loc)))))
;; wait, don't do this: infinite loop

What? Why is this infinite? Well, when we edit the node, we create a new node, [:a], and then we say zip/next. Except that the next node in this case is down…​ to that :a again! Which we wrap again, and then go find again and again and again. next is relative to the new zipper, not the old one from before the edit. This is one of those immutable/pure-function mind warp things: you want to be able to grab the new state and hold on to the old one too. But here, the old one, where zip/next was going to logically lead to the right sibling, is no longer available to us. Or, to be more precise, it is still available, but it is now a parallel universe where the edit we want to do never really happened. We can’t have the new edit and the old position at the same time, because the old position only exists in a world where the edit never happened.
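We can watch this happen on a tiny, made-up tree (a sketch): wrap a keyword, call zip/next, and we are standing on the very same keyword again, inside the vector we just created.

(def kw-loc (-> (zip/vector-zip [:a 1]) zip/down)) ;; loc of :a
(zip/node (zip/next (zip/edit kw-loc #(vector %))))
; :a  -- zip/next went *down* into the new [:a], ready to wrap it again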

Back to reality: let’s introduce some logic to make sure we don’t try to wrap keywords that are already inside vectors containing only keywords. In other words, if we run into [:x :y :z] we leave it alone.

First, we write a little function that looks up at a keyword’s parent to see if all the siblings are already keywords:

(defn parent-ok? [loc]
  (when (zip/up loc)
    (every? keyword? (zip/children (zip/up loc)))))

(Note here that zip/children returns nodes, not locs, so we can feed the results directly to keyword? without having to call zip/node.)
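A quick sanity check on some throwaway trees (a sketch, not part of the original walkthrough):

(parent-ok? (-> (zip/vector-zip [[:x :y] 1]) zip/down zip/down)) ; true,  [:x :y] is all keywords
(parent-ok? (-> (zip/vector-zip [:x 1]) zip/down))               ; false, the parent also holds 1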

And then we plug it into our previous tree walking function:

Strategy 1 b
(defn strategy-one-b [loc]
  (if (zip/end? loc)
    loc
    (if (and (keyword? (zip/node loc))
             (not (parent-ok? loc)))
      (recur (zip/next (zip/edit loc #(vector %))))
      (recur (zip/next loc)))))

(strategy-one-b z2)
; [[1 2 [:a] [:b] [3 4 [:c] [:d] 5] [:e]] :end] ;; here we see the ended-out zipper

Good: no infinite recursion, and all of our keywords are wrapped, but individually, not grouped together like we want.

Let’s try grabbing them from the left-most keyword. As soon as we hit a keyword, we’ll try to gobble up any keywords to the right before moving on.

Strategy 1 c (starting to get ugly)
(defn strategy-one-c [loc]
  (if (zip/end? loc)
    loc
    (if (and (keyword? (zip/node loc))
             (not (parent-ok? loc))) ;; up to here, same as before
      (let [right-keys (take-while #(keyword? %) (zip/rights loc))]
        (recur (zip/next
                (reduce (fn [z n]
                          (zip/append-child z n))
                        (zip/edit loc #(vector %))
                        right-keys))))
      (recur (zip/next loc)))))

The basic structure is the same as before, except that now we try to incorporate keyword nodes to the right of the locations that need to be dealt with (i.e. keyword nodes that are not in a parent node containing only keyword children, as established by the parent-ok? function we wrote earlier). Conveniently, the zip library provides zip/rights, which gives us a sequence of nodes (not locs, mind you) that we grab with take-while.
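Seen in isolation on z2 (a sketch): from the :a location, zip/rights hands back plain nodes, so take-while can test them directly.

(def a-loc (-> z2 zip/down zip/right zip/right)) ;; the :a loc
(zip/rights a-loc)                       ; (:b [3 4 :c :d 5] :e)
(take-while keyword? (zip/rights a-loc)) ; (:b)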

So far nothing extraordinary.

If we do have interesting siblings, we need to bring them into the node we are going to create. That is what we do with reduce here, and it is probably worth breaking down the logic there.

(reduce (fn [z n]
          (zip/append-child z n))
        (zip/edit loc #(vector %))
        right-keys)

This is the three argument form of reduce, where the second argument provides an initial value. Here our starting value is going to be the original keyword node edited to be wrapped inside a new vector. Then, for each of the keyword siblings to the right, we are going to keep calling zip/append-child.

It would be the equivalent of this, if we could know ahead of time that right-keys contained :a, :b and :c:

(zip/append-child (zip/append-child (zip/append-child (zip/edit loc #(vector %)) :a) :b) :c)

So, we gather up some more courage and we run the thing:

(strategy-one-c z2)
; [[1 2 [:a :b] [:b] [3 4 [:c :d] [:d] 5] [:e]] :end]

Well, this kind of works, in that we are getting our keywords grouped together. Obviously, we haven’t done anything to remove the nodes that we’ve copied into our aggregating nodes. And look what happens if we have more successive keyword nodes in our input:

(strategy-one-c (zip/vector-zip [1 2 :a :b :c :d :e 3]))
; [[1 2 [:a :b :c :d :e] [:b :c :d :e] [:c :d :e] [:d :e] [:e] 3] :end]

Oh dear. We can’t just leave those siblings out there, because they end up getting siblingized themselves. Before we send our big reduced, zip/appended, zip/edited zipper back to zip/next, we need to prune the original keyword siblings from the zipper. Okay, one last try.

Works but…​ (Strategy 1 d)
(defn strategy-one-d [loc]
  (if (zip/end? loc)
    loc
    (if (and (keyword? (zip/node loc))
             (not (parent-ok? loc)))
      (let [right-keys (take-while #(keyword? %) (zip/rights loc))
            loc-no-rights (reduce (fn [z _] (zip/remove (zip/right z))) loc right-keys)
            loc-w-incorporated-keys (reduce (fn [z n]
                                              (zip/append-child z n))
                                            (zip/edit loc-no-rights #(vector %))
                                            right-keys)]
        (recur (zip/next loc-w-incorporated-keys)))
      (recur (zip/next loc)))))

I’ve shifted things around a little here for clarity, but it really isn’t too different from the previous rendition. Inside the let, loc-w-incorporated-keys is the result of the same reduce call. The difference here is that before building our final zipper, we remove the siblings.

zip/remove, after blowing away the node, automatically moves to whatever the previous location would be, as determined by calling zip/prev. This could be a problem if we didn’t know that we were working on right siblings, because if we removed the leftmost node, we would end up at the parent node suddenly, which could confuse things considerably. That won’t be a problem here though. This behaviour also allows us to use the combination of zip/right and zip/remove for as many times as we like, because we just keep removing the next node over each time.
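Here is that behaviour on a throwaway tree (a sketch): after removing a right sibling we land back on its left neighbour, in position to do it again.

(def b-loc (-> (zip/vector-zip [1 :a :b]) zip/down zip/right zip/right)) ;; loc of :b
(zip/node (zip/remove b-loc)) ; :a  -- back on the previous location
(zip/root (zip/remove b-loc)) ; [1 :a]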

loc-no-rights is the intermediary version of the zipper. And because right-keys is holding nodes and not zipper locations, we can come back to them and append them before moving on with zip/next.

Deep breath…​

(strategy-one-d z2)
; [[1 2 [:a :b] [3 4 [:c :d] 5] [:e]] :end]
(strategy-one-d (zip/vector-zip [1 2 :a :b :c :d :e 3]))
; [[1 2 [:a :b :c :d :e] 3] :end]

I confess that at this point I’m just glad to have something that works. I think, however, that you’ll agree that this is just too complicated, too messy and too painful for what should be a relatively simple operation. Going through all these contortions was meant to show some of the problems that crop up when learning about zippers, and how we have to bend our thinking a little (or a lot) when using them. Now let’s try to do this right…

Strategy 2

This time we are going to do everything by editing parent nodes. Most of what we do will be in what I’ve been calling “node space” and should be simpler. Each time we hit a branch node, we will look at its children, wrap any keyword nodes and then keep moving.

The first step is to write a generic wrapping function. We want it to take a list of things and group together adjacent keywords. Here is what I came up with:

(defn wrap-keywords [lst]
  (into []
        (mapcat #(if (keyword? (first %))
                   (list (apply vector %))
                   %)
                (partition-by keyword? lst))))
(wrap-keywords [:a :b :c 1 2 [:d :e] 3 4 :f])
; [[:a :b :c] 1 2 [:d :e] 3 4 [:f]]

The trick here is that partition-by just breaks up the initial sequence into segments depending on how the predicate reacts to each item. With the sequence in the example, it produces this:

(partition-by keyword?  [:a :b :c 1 2 [:d :e] 3 4 :f])
; ((:a :b :c) (1 2 [:d :e] 3 4) (:f))

The rest of our function, the mapcat part, just turns the keyword segments into vectors (apply vector), wraps each vector in a list to put it back on the same level as its siblings, and then concats the whole thing back together, essentially erasing the lists but leaving the vectors. And finally, we use into to box everything up in a vector.

(Note that we can use vectors here because we are using vector trees. In more serious code it would be wiser to use zip/make-node.)
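For the record, zip/make-node builds a node of whatever type the zipper was created with, so the hard-coded vectors could be avoided. A quick sketch on z2:

(zip/make-node z2 (zip/node z2) [:wrapped :children])
; [:wrapped :children]  -- a vector, because z2 is a vector-zip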

Now it’s time to try this out on a tree.

Strategy 2: go after the parents
(defn strategy-two [loc]
  (if (zip/end? loc)
    loc
    (if (and (zip/branch? loc)
             (some keyword? (zip/children loc))
             (not (every? keyword? (zip/children loc))))
      (recur (zip/next (zip/edit loc wrap-keywords)))
      (recur (zip/next loc)))))


(strategy-two z2)
; [[1 2 [:a :b] [3 4 [:c :d] 5] [:e]] :end]
(strategy-two zzz)
; [[1 [:a :b] 2 3 [40 50 60]] :end]

In the interest of full disclosure, I messed up the first time through this and fell into the same infinite recursion trap as before. We still need to make sure that the keyword nodes aren’t already wrapped, since otherwise we’ll end up wrapping them forever. That’s what the (not (every? keyword?…)) stuff is all about.

The rest should be pretty self-explanatory by now. We check to see if we are on a parent node, using zip/branch?. We peek at the node’s children to see if any of them are keywords, and then, when the planets are correctly aligned, we edit the node as though it were a simple seq, because the function we hand to zip/edit operates on the node, not on the zipper.
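For reference, zip/branch? is simply asking “can this node have children?” (a quick sketch on z2):

(zip/branch? z2)            ; true  -- the root vector
(zip/branch? (zip/down z2)) ; false -- the leaf 1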

This, in my opinion, is one of the tricky parts to zippers: knowing when to be in zipper-space or in node-space, knowing how to leverage one or the other depending on what you need to do. At the same time, having these two options, and being able to quickly switch from one to the other, is what gives Clojure zippers all their power.