Whenever I encounter some cool new technology, it goes immediately and unconsciously into either the category of “things I know I’ll never use” or the category of “maybe someday”. It is pretty easy to tell whether something falls within even the widest interpretation of your personal problem space.

For a long time, Elasticsearch was on the second of those lists. It has even seemed as though anyone in the server-side “web space” should have a search solution in their quiver. But search is hard, and there are always more urgent things to do. Until now!

This week I finally started using Elasticsearch in a project. This article is a n00b’s-eye view of what getting started with Elasticsearch feels like, with a look at some of the kinds of problems it can help with. This is certainly not a tutorial! The code I’ve included is for explanatory purposes only and should not be treated as example code of any kind!

What it is

Briefly, Elasticsearch is a Lucene-based search tool (engine/platform/database/server/whatever). Lucene is the venerable Java-based search engine. Solr, Elasticsearch’s direct competitor, is also built on Lucene and does basically the same thing.

Traditionally, when choosing between Solr and Elasticsearch, one would be told that Elasticsearch was built from the ground up to be distributed (hence the “Elastic” part). That difference may no longer hold: Solr has progressed recently, and SolrCloud now uses ZooKeeper to handle distribution, so that they can say this on their website:

Built on tried and true Apache Zookeeper, Solr favors consistency and avoids hairy split brain issues common to other approaches

— http://lucene.apache.org/solr/features.html#solrcloud

They’re talking about Elasticsearch there. There is lots more to read about this debate. You can start with Kyle Kingsbury’s (@aphyr) Jepsen testing of Elasticsearch, and move on to this piece about Solr. The problems that show up with Elasticsearch there would matter if Elasticsearch were your primary data store, which I have indeed heard people recommend. If your data lives elsewhere, like in a database, and Elasticsearch is just a search engine, you should be fine.

For me, though, none of that made any difference, because all my data was going to fit in a single shard. The biggest reason I chose Elasticsearch was simply that there is a solid Clojurewerkz library for it called Elastisch. As I started learning more about Elasticsearch, I realized that it has other features that make it an interesting choice in its own right. Which brings us to my project…

The problem

The goal was fairly simple. We have around 100,000 documents, web pages from the past 15 years or so. We also have a list of about 12,000 keywords that we are interested in, and we would like to get an idea of how their importance has evolved over the years. We also know that the results won’t be perfect, both because matching keywords to texts is no simple task and because we aren’t ready to get into advanced keyword-detection algorithms at this point.

The basic approach is pretty simple:

  • Index the 100,000 documents.

  • Search them using the keywords.

  • Generate a report with the totals for each month over the past 15 years.

Looping through each month since 2000 and running a search for each keyword seems ungainly and possibly too slow: roughly 180 months times 12,000 keywords is over two million searches (although who knows, with a data set this size). This is where some of Elasticsearch’s features become really useful.

Getting started

This isn’t a tutorial, at all. I will mention some of the difficulties I had along the way though.

I first installed Elasticsearch from Elasticsearch’s apt repository. The installation was simple enough, but Elasticsearch would silently die whenever I started it. /etc/init.d/elasticsearch stop would report that a PID was found but that there was no running process. Is this because Elasticsearch runs as java? I don’t know. I moved on to the manual “install” which, at least for development purposes, basically amounts to untarring the tarball and running bin/elasticsearch.

Interaction usually happens via HTTP, and the examples in the documentation all use curl. Since I am doing this in Clojure, using Elastisch, the Clojure REPL is also a great way to interact with the search engine. For Emacs people, there is also es-mode, which is really nice because you can have a file full of different Elasticsearch requests and then type C-c C-c to send one of them to the server. Highly recommended!
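Just for flavor, here is the kind of REPL sanity check Elastisch makes easy (a sketch: the URL assumes a default local install, and “ourindex” is the index that shows up throughout this article):

;; Connect to a default local Elasticsearch over HTTP (REST client).
(require '[clojurewerkz.elastisch.rest :as esr]
         '[clojurewerkz.elastisch.rest.index :as esri])

(def conn (esr/connect "http://127.0.0.1:9200"))

;; Does our index exist yet?
(esri/exists? conn "ourindex")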

Keyword mechanics

Finding keywords

Identifying keywords in texts is obviously a huge problem. The keywords I had to deal with were more than just simple words. Here is a sample:

  • Sciences humaines

  • Sémiotique des médias

  • Proust, Marcel

  • Vigny (de), Alfred

  • Empire

  • revue Culture et musées

Exact matches would not necessarily be useful. What if someone writes “`empires`” or “`une science humaine`” or “`les sciences, mêmes humaines`” (which means something like “sciences, even if they are human sciences”)? I actually wrote an initial version of this project that used straight-up string matching. It worked, in the sense that it produced somewhat meaningful results for some keywords. Others would disappear completely. In the list above, something like “revue Culture et musées” might not work well, because the first word, which means “journal”, might often be omitted.

These are classic human-language parsing problems. The nice thing about using a tool like Lucene is that a lot of this has been dealt with already. As you may have noticed, these keywords are in French. Lucene knows how to index French; you just tell Elasticsearch that a field is in French:

{ :body        {:type "string", :store "yes", :analyzer "french"}
  :title       {:type "string", :store "yes", :analyzer "french"}}

(Note that I’m writing this in Clojure. Classic Elasticsearch is in JSON, so the Clojure keywords get converted into strings.)
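For context, those field definitions live inside the index mapping. Creating the index with them through Elastisch’s native client would look roughly like this (a sketch, not my actual setup code; the :date field is my assumption here, since the date aggregations further down need one):

;; Sketch: create the index with the French-analyzed fields.
;; "ourindex" and the "page" mapping type match the names used below.
(require '[clojurewerkz.elastisch.native.index :as esni])

(esni/create es-conn "ourindex"
             :mappings {:page
                        {:properties
                         {:body  {:type "string", :store "yes", :analyzer "french"}
                          :title {:type "string", :store "yes", :analyzer "french"}
                          :date  {:type "date"}}}})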

Fast forward: we’ve pulled the 100,000 documents out of their MySQL table and indexed them. I can run searches against them, I can find stuff, all very cool. But search, as such, is not the primary goal here. At this point I was reassured to be moving forward with Elasticsearch but unsure about how I was going to get my keyword data.

In comes the Percolator.

The Elasticsearch percolator reverses the typical relationship between a query and the documents it finds. With the percolator, queries are stored ahead of time, as documents. Then we can send a document, I mean an actual document, and see which of the stored queries match it. It’s like a backwards version of Google, where you would type in a URL and get back a zillion pages of queries that would have matched that web page. You use a document to query the queries.

The idea is to see which of our keywords might pertain in some way to the documents. We are using a document to search for keywords.

The basic setup is simple enough. Here is my function for installing an individual keyword as a percolator document:

(defn register-tag-percolator
  "Store one keyword as a percolator query document in ourindex."
  [es-conn tag tagnorm id]
  (nperc/register-query
    es-conn
    "ourindex"
    (str "tag-" tagnorm "-" id)
    :query (t/query-from-tag tag)))

nperc refers to the clojurewerkz.elastisch.native.percolation namespace. (The native part means that we aren’t using the RESTful interface but the native Java interface. This is a (small) advantage of Clojure’s Java interop.) “ourindex” is the index where all of our data is already stored. The argument after that, (str "tag-" tagnorm "-" id), is the name of the percolator query document that we are storing. Our keywords have a normalized form and a numeric id, so we are just assembling a human-readable string; this will turn out to be important later. The t/query-from-tag function is our own. It builds an Elasticsearch query object from the actual tag string; more about that soon.
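To make that concrete, here is a deliberately naive sketch of what a first query-from-tag could be, nothing more than a match query against the body field (our real version grew more elaborate, for reasons covered below):

;; Naive sketch: match the raw tag string against the analyzed body field.
(defn query-from-tag [tag]
  {:match {:body tag}})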

So we loop over our 12K keywords with this function, along the lines of the sketch below.
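(all-tags is a hypothetical name here, standing in for however the keyword records get pulled out of the database.)

;; Register every keyword as a stored percolator query.
(doseq [{:keys [tag tagnorm id]} all-tags]
  (register-tag-percolator es-conn tag tagnorm id))

Now we can throw a document at them and see what happens. Here is our function for doing that: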

(defn percolate-doc
  [es-conn doc]
  (nperc/percolate es-conn "ourindex" "page" {:doc doc}))

I agree, there isn’t much to see here. I’m showing this function just to point out that we can percolate any arbitrary document, even if it isn’t in the index yet. The doc argument can be any JSON object with the same fields as the indexed documents. This is very important, because it allows me to grab the keywords before I index a document and then include the results of that work in the document itself.
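For instance, a page that has never been indexed can be percolated as a plain map, as long as the field names line up with the mapping (the values here are invented for illustration):

;; Percolate an ad-hoc, never-indexed document. The field names match the
;; "page" mapping; the content is made up.
(percolate-doc es-conn
               {:title "Les sciences humaines au musée"
                :body  "Texte intégral de la page..."
                :date  "2010-09-15"})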

A call like that comes back with an answer like this:

{:count 117,
 :matches
 ("tag-histoire-sociale-12043"
  "tag-acteur-6817"
  "tag-sciences-humaines-7597"
  "tag-pacte-social-8139"
  "tag-theme-6502"
  "tag-siecle-14361"

... etc.)}

And at this point (assuming I’m satisfied with these results, but I’ll get back to that), I can take these 117 keyword matches and make them part of the document before I index it. Here is the function that does it:

(fn [r]
  (let [page (page-to-es r)
        {tags :matches} (percolate-doc es-conn page)
        clean-tags (mapv (fn [t]
                           (last (str/split t #"-"))) tags)]
    (esnd/create es-conn "ourindex" "page"
      (assoc page :autotags clean-tags))))

This function is the :row-fn argument to a clojure.java.jdbc query call. That means that r is a row in an SQL result set.

So what’s going on here? page-to-es formats the database row as a page document for the Elasticsearch index that we’ve already defined. The next line, with {tags :matches}, sends the document to the percolator and destructures the list of matching keywords out of the response, like we just saw. We then map over that list to grab just the numeric part of each tag name, and finally we index the document, assoc-ing in a vector of keyword ids as the :autotags field.
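For completeness, the surrounding query call looks more or less like this (a sketch: db-spec and the SELECT are placeholders, and index-page! stands for the anonymous row function shown above):

(require '[clojure.java.jdbc :as jdbc])

;; :row-fn runs index-page! on each row of the result set, so every page
;; gets percolated and indexed as it streams out of MySQL.
(jdbc/query db-spec
            ["SELECT id, title, body, date FROM pages"]
            :row-fn index-page!)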

Aggregating it all back together

The keyword lists that are now part of each indexed document are not going to improve search results, since they already are search results. The reason we are storing them this way is that we want to look at the entire collection at once, so that we can graph the frequency of different keywords over time.

This is where Elasticsearch’s aggregations come in. Here is how we get our report back:

GET /ourindex/page/_search?search_type=count
{
  "query": {
    "match": {"autotags": "8442" }
  },

  "aggs":
  {
    "tags":
    {
      "date_histogram" : {
        "field" : "date",
        "interval" : "month",
        "format" : "YYYY-mm-dd"
      }
    }
  }
}

(You’ll notice that I’ve switched to JSON notation here: no particular reason except that I haven’t coded this part into my application yet.)
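When this part does make it into the application, I expect it to come out as something like the following with Elastisch (untested; it assumes the :aggregations option is passed straight through to the search body):

(require '[clojurewerkz.elastisch.rest.document :as esd])

;; Untested sketch of the same request through Elastisch.
(esd/search conn "ourindex" "page"
            :search_type "count"
            :query {:match {:autotags "8442"}}
            :aggregations {:tags {:date_histogram {:field    "date"
                                                   :interval "month"
                                                   :format   "yyyy-MM-dd"}}})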

Either way, running the query produces something like this:

{
  "took":13435,
  "timed_out":false,
  "_shards": {
    "total":5,
    "successful":5,
    "failed":0},
  "hits": {
    "total":3692,
    "max_score":0.0,
    "hits":[]},
  "aggregations": {
    "tags":
    {"buckets":
     [{"key_as_string":"2010-00-01","key":1283299200000,"doc_count":24},
      {"key_as_string":"2010-00-01","key":1285891200000,"doc_count":103},
      {"key_as_string":"2010-00-01","key":1288569600000,"doc_count":87}
      // etc.
      ]
    }
  }
}

That little bit of query code was enough to give us a doc_count for each month. The rest, as they say, is histograms!
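Once that response lands in Clojure, pulling out chart-ready rows should be a one-liner (assuming Elastisch’s keywordized response maps, with response standing for the result of the search call):

;; From the aggregation response to [month doc-count] pairs for charting.
(->> (get-in response [:aggregations :tags :buckets])
     (map (juxt :key_as_string :doc_count)))
;; => (["2010-09-01" 24] ["2010-10-01" 103] ["2010-11-01" 87] ...)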

It works but does it mean anything?

So far I’ve basically been telling the story of a proof of concept. It was very reassuring to see that Elasticsearch’s features could get us from our raw data to the output we needed in very few steps, and without caching anything, that is, without building up an intermediate index somewhere from which we would have generated our statistics. Once the keywords and the documents are indexed, report generation is live. From a nuts-and-bolts perspective, this part was very satisfying.

But…​ does it mean anything? How good are the results?

This is what I’ve learned so far when working with “medium data”: it is easy to draw a nice-looking chart, but how do you know that you are measuring something meaningful? In this case, the question boils down to the quality of the matches in the percolation phase. Are those 117 keywords relevant to the document they matched?

I’ve glossed over some important details in telling this part of the story, and now I want to come back to some of the other problems that I had to solve along the way. In the percolating function that I showed above, there was a query-from-tag function. Its job is to take a string like “Vigny (de), Alfred” and turn it into a useful Elasticsearch query.

There are several problems here. Most of the time, this nineteenth-century poet’s name will be written as just “Vigny”. The first problem is that, because of the way Elasticsearch (via Lucene) analyzes the documents it indexes, “Vigny”, a very distinctive proper noun, can also find documents containing vigne (grapevine), presumably because the French analyzer stems both down to the same root. As far as I can tell, the question of whether to analyze a text field or not is an index-level decision. Since I want full-text search on the text body of each page document, the field has to be analyzed; that means the search query will be analyzed too, and so vigne will be considered equivalent to “Vigny”. There might be a way to get around this, but I did not find one.

So what about using “Alfred de Vigny” instead? A query could easily be written where “alfred” and “vigny” would both have to be present to get a hit. Except that would mean missing all the occurrences of “Vigny” by itself. So we can’t do that.

The next logical approach would be to give a higher score to matches on the complete name and then perhaps eliminate the lower-scoring hits. This is where we run into what is, in my opinion, one of the major limitations of the Elasticsearch percolator for this kind of problem: the results of a percolator query, i.e. throwing a document at 12,000 queries to see which ones stick, are not scored the way ordinary search results are. It is binary: hit or miss.

On my first try, most of the documents were getting between 700 and 1,000 keyword hits. By delving into the way Elasticsearch queries work, I was able to fine-tune the matching to a certain degree, ending up in the range of 100 to 150 hits per page. If a person were to manually assign keywords, it would probably be more like 10 to 30 per document. I don’t know much about machine learning, but something tells me that a discrepancy of about an order of magnitude between human and machine keyword guessing might not be so bad, considering how little work actually went into setting this up.

I’m sure there are technical reasons why percolator matches can’t be scored. A percolator query is a strange thing, and this version of the percolator has only been around for a short time. The results can still be useful, though. As you can see in the sample output, the document got hits from about 1% of the available keywords. This is enough to give a rough idea of the evolution of certain concepts over time, but we have to remember that the results are rough.

If we look back at the list of sample keywords I gave at the beginning, there was this one: “`revue Culture et musées`”. It is the name of a journal. It will receive an incredible number of false positives, because all three of its terms are going to be present, in various configurations, in a lot of documents, most of the time not referring to the journal in question.

For our purposes this is not a major issue. The idea is to get approximate results with as little human intervention as possible. If we really needed higher quality keyword matching, one solution would be to simply invest more time in the curation of the keyword queries, possibly writing some or all of them more or less by hand, or adding metadata so that we can tell the system that the word “`revue`” needs special treatment in the query. Elasticsearch provides a fairly elaborate query DSL that would permit this.
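To give a hypothetical example of the kind of special treatment I mean, a hand-curated query for that journal might insist on the distinctive part of the name as a phrase and leave “revue” out of the requirements entirely:

;; Hypothetical hand-tuned query for "revue Culture et musées". Because
;; percolation is binary hit-or-miss, the useful move is to require the
;; distinctive phrase itself rather than letting the three terms match
;; independently anywhere in a document.
{:bool {:must [{:match_phrase {:body "culture et musées"}}]}}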

And as a surprise bonus…

One positive result of this work is that, thanks to the machine generated keywords, we are going to be able to provide a limited set of keyword suggestions for our users, who will be able to manually tag pages. In my opinion, this alone might make the whole project worthwhile.

Not to mention that we could even use this thing as a search engine if we had to.
