I’m working on a project where I need to extract textual data from a lot of HTML. Today I was working on a way to normalize my text input by parsing the HTML, especially the entities, and outputting raw text. I started to look around for a library to do this and everything seemed to point to Christophe Grand’s Enlive library.

This library was not unfamiliar to me, because I’ve used it for templating in a Ring app. I also knew that it could be used for web scraping, but I had never really paid attention to that aspect.

So I’ve got my library, I’ve got my data, and I start trying to learn how to break HTML down into simple text. I do this (en is my alias for net.cgrand.enlive-html here):

(en/html-resource (java.io.StringReader. "<em>stuff</em>"))

and I get this:

({:tag :html, :attrs nil, :content
       ({:tag :body, :attrs nil, :content
                     ({:tag :em, :attrs nil, :content ("stuff")})})})

Now that data structure looks kind of familiar. It should, since that is the standard way of representing XML trees in Clojure.

Suddenly something clicks deep in my mind. “We can use a zipper!”

And indeed, after require-ing clojure.zip :as zip, we can do this:

(zip/xml-zip (first (en/html-resource (java.io.StringReader. "<em>stuff</em>"))))
;; =>
[{:tag :html, :attrs nil, :content
       ({:tag :body, :attrs nil, :content
                     ({:tag :em, :attrs nil, :content ("stuff")})})} nil]
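
To get a feel for what the zipper gives us, here is a small sketch that moves around a tree of the same shape. It uses only clojure.zip; the `tree` literal is a hand-written stand-in for the parsed structure above (with vectors for :content), not actual Enlive output.

```clojure
(require '[clojure.zip :as zip])

;; Hand-written stand-in for the parsed tree shown above (assumption: same shape).
(def tree
  {:tag :html, :attrs nil, :content
   [{:tag :body, :attrs nil, :content
     [{:tag :em, :attrs nil, :content ["stuff"]}]}]})

(def loc (zip/xml-zip tree))

;; zip/node returns the node at the current location; at the root that is the whole tree.
(:tag (zip/node loc))                            ;; => :html

;; zip/down moves to the leftmost child of the current location.
(-> loc zip/down zip/node :tag)                  ;; => :body
(-> loc zip/down zip/down zip/node :tag)         ;; => :em
(-> loc zip/down zip/down zip/down zip/node)     ;; => "stuff"

;; In an xml-zip, element maps are branches and strings are leaves.
(zip/branch? loc)                                ;; => true
(-> loc zip/down zip/down zip/down zip/branch?)  ;; => false
```

Every step returns a new location, so the original `loc` is never modified.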

And now we are off to the races! We have our handy zipper toolbox at our disposal. (I wrote three articles about zippers a few months ago: one, two and three.)

The last of the three articles provides a pretty good pattern for this situation. Here is what I ended up doing:

(require '[clojure.string]
         '[clojure.zip :as zip]
         '[net.cgrand.enlive-html :as en])

(defn normalize-text
  "Extract text: lower case everything, parse html, remove tags."
  [text]
  (->> text
    clojure.string/lower-case
    java.io.StringReader.
    en/html-resource
    first
    zip/xml-zip
    (iterate zip/next)
    (take-while (complement zip/end?))
    (filter (complement zip/branch?))
    (map zip/node)
    (apply str)))

The use of the ->> macro here might make this slightly unintuitive if you aren’t used to it. Note how, serendipitously, all the forms here are either single-argument functions like first, or multiple-argument functions where the key argument comes at the end. I guess today I did feel lucky.
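
If the threading is hard to picture, this toy example (unrelated to HTML, with made-up data) puts a ->> pipeline next to the nested form it stands for. Each result is passed as the last argument of the next form, and a bare symbol like sort is treated as a one-argument call.

```clojure
(require '[clojure.string :as str])

;; Threaded form: the value flows downward, always into the LAST argument slot.
(def threaded
  (->> ["B" "a" "C"]
       (map str/lower-case)   ; (map str/lower-case ["B" "a" "C"])
       sort                   ; (sort ...)
       (apply str)))          ; (apply str ...)

;; The same computation written inside-out, without the macro.
(def nested
  (apply str (sort (map str/lower-case ["B" "a" "C"]))))

threaded ;; => "abc"
nested   ;; => "abc"
```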

Anyway, let’s break this down. The first few lines are what we’ve already seen, parsing HTML with Enlive’s html-resource and making an xml zipper. Let’s take it from there.

;; we make the zipper
(def hzip (->> "<p>sample <em>text</em> with words.</p>"
               clojure.string/lower-case
               java.io.StringReader.
               en/html-resource
               first
               zip/xml-zip))

;; and we walk through it and grab the text
(apply str (map zip/node
                (filter (complement zip/branch?)
                        (take-while (complement zip/end?)
                                    (iterate zip/next hzip)))))

;; with the result
"sample text with words."

How did that work? The take-while and iterate parts walk through the tree lazily, one location at a time, in depth-first order. We filter out all the branch nodes, which for an XML zipper means everything that isn’t a string, leaving only the text. Which is what we are looking for. At this point all we have to do is map over the locations with zip/node to get the individual textual payloads (remember, we are in zipper-space, where every location carries the entire tree; see my other articles about that). Make it all into a proper string by applying str and we’re done.
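
To make the walk visible, here is a sketch that labels every location the traversal visits. It runs without Enlive: `tree` is a hand-written approximation of the parsed paragraph (I left out the html and body wrappers), and `locs` and `labelled` are names I made up for this illustration.

```clojure
(require '[clojure.zip :as zip])

;; Hand-written stand-in for "<p>sample <em>text</em> with words.</p>"
;; (assumption: html/body wrappers omitted for brevity).
(def tree
  {:tag :p, :attrs nil, :content
   ["sample " {:tag :em, :attrs nil, :content ["text"]} " with words."]})

;; Every location in depth-first order, stopping at the end marker.
(def locs
  (->> (zip/xml-zip tree)
       (iterate zip/next)
       (take-while (complement zip/end?))))

;; Label each visited node: element maps are branches, strings are leaves.
(def labelled
  (map (fn [loc]
         (if (zip/branch? loc)
           [:branch (:tag (zip/node loc))]
           [:leaf (zip/node loc)]))
       locs))

labelled
;; => ([:branch :p] [:leaf "sample "] [:branch :em] [:leaf "text"] [:leaf " with words."])

;; Dropping the branches and joining the leaves gives the text.
(apply str (map zip/node (remove zip/branch? locs)))
;; => "sample text with words."
```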

Once again, nothing particularly amazing about all of this, though I was pleased with the elegance and brevity of a solution that I was able to cobble together rather quickly. There are other ways of getting the same results too. But this does show how a little bit of familiarity with Clojure’s zipper libraries can be surprisingly useful.
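
As one example of those other ways, here is a sketch that skips zippers and flattens the tree with tree-seq from clojure.core. The `text-of` function and the `tree` literal are my own illustrative names, and this assumes the same map-and-string node shape as above.

```clojure
;; Hand-written stand-in for the parsed paragraph (assumption: same shape as Enlive's nodes).
(def tree
  {:tag :p, :attrs nil, :content
   ["sample " {:tag :em, :attrs nil, :content ["text"]} " with words."]})

(defn text-of
  "Concatenate every string found in a tree of {:tag :attrs :content} maps."
  [node]
  (->> node
       (tree-seq map? :content)  ; depth-first seq of all nodes; maps are the branches
       (filter string?)          ; keep only the text leaves
       (apply str)))

(text-of tree)
;; => "sample text with words."
```

This is shorter, but it only reads the tree; the zipper version is the one to reach for if you also want to edit nodes as you go.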
