Tuesday, July 15, 2014

Getting to know your Solr users with LucidWorks Banana

I am taking a break from the IDOL to Solr Migration lessons learned because I wanted to share with you something kind of cool that I’ve been playing with. These days, I generally look at new technology mostly as more and more incompatible crap, however every once in a while I stumble on something with potential and get excited about it. In this case, it is Banana, a Kibana port from LucidWorks, plus a little Python parser, which together give me better than usual insight into my users: data elements such as paging, sorting, faceting and filtering, as well as specific term lists.

What Kibana and Banana allow you to do is run reports on the contents of your index by specifying the fields you want to query and how you would like to display the data. For example, you can easily generate a histogram with a bar for every unique value of a field, or even a pie chart if there aren't many unique values. If you have timestamped event data, such as log file events, you can plot that data over time as well. So what’s the big deal here? Well, the big deal is that you can have a number of these panels set up in your dashboard and that each element in the dashboard is actionable. Another big deal is that these reports can be run on almost any kind of data that you can feed to Solr.

For example, if you are plotting time series data, let’s say for the past 90 days, and spot some interesting activity, you can simply select that range in the chart and a new filter will be automatically added to the report, at which point all data will refresh and only include your selected criteria. This goes for other elements too; for example, if you have a few elements in bar graphs, you can click on one of the bars and it will be added to the filter criteria. This allows you to dynamically filter and refine the report as needed without having to write SQL queries or dig through various tables for data. I mean the whole thing is so easy even I could figure it out.

So this awesomeness, combined with the absence of any interesting plans for the three day 4th of July weekend, inspired me to write a quick Python script to parse Solr log files and index the data elements into Solr for Banana analysis. Before long I was exploring our users’ activity on the site; the dashboard ended up looking something like this (several data elements are removed and pixelation added on purpose).

General log parsing can only provide some rudimentary analysis; query latency and query counts can be pulled from so many sources, and it is so trivial, it's not even worth writing about. What I wanted to see was which categories users searched in, what terms they used, how they sorted their results and whether they used any other filtering criteria.

So in the Python code I massaged the data a bit. For example, I broke the fq criteria out into separate data elements: I append each fq field name from the query to a general fq doc field and create a new doc field for each field in fq. So if the user were filtering on a field called foo with a value of bar, the query would contain fq=foo:bar and it would be indexed as:

fq: foo
fq_foo: bar

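A minimal sketch of that fq handling (the helper name and the exact log/query format are my assumptions; the `fq`/`fq_foo` field naming follows the convention above):

```python
from urllib.parse import parse_qs

def expand_fq(query_string):
    """Break each fq=field:value parameter out into a general 'fq' doc
    field (the field names users filter on) plus a per-field
    'fq_<field>' doc field holding the value, as described above."""
    params = parse_qs(query_string)
    doc = {"fq": []}
    for fq in params.get("fq", []):
        field, _, value = fq.partition(":")
        doc["fq"].append(field)        # which field was filtered on
        doc["fq_" + field] = value     # the value used for that field
    return doc

# e.g. expand_fq("q=paper&fq=foo:bar") -> {"fq": ["foo"], "fq_foo": "bar"}
```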
This allowed me to set up a terms panel for the fq field in Banana to see what fields users filtered on; then, for interesting fields, I set up a separate terms panel on a specific fq field (i.e. fq_foo from the previous example) to examine the values used for filtering. Naturally, most of the values were standard, since most of these are facets on the front end, but they mapped directly to clicks and provided a great bird's-eye view of what categories users searched in.

Then I decided to get a bit more info on how users sorted their searches in different categories. Now that is quite a bit of data and would be hard to visualize. Thankfully, Banana came with a pretty slick heat map panel that allowed me to visualize just that. With the heat map, I can easily see which sorts were used in which categories. Keep in mind that this one uses facet pivots, and if you have a collection with multiple shards, it isn’t going to work.
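The heat map panel is driven by Solr's facet.pivot parameter; a query along these lines returns the nested counts it plots. The field names `category` and `sort_used`, the core name, and the URL are illustrative assumptions, not from my actual setup:

```python
from urllib.parse import urlencode

# Hypothetical field names; substitute your own category and sort fields.
params = urlencode({
    "q": "*:*",
    "rows": 0,                             # we only want facet counts
    "facet": "true",
    "facet.pivot": "category,sort_used",   # nested counts: sorts per category
    "wt": "json",
})
url = "http://localhost:8983/solr/logs/select?" + params
# Fetch with urllib.request.urlopen(url); the facet_pivot section of the
# response holds the category -> sort counts the heat map visualizes.
```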

Now, let’s take it a step further and see what we can do about detecting paging for different searches. We have the rows and start parameters in each query: rows tells you the number of documents to return, and start is the offset from the beginning of the search results. Most apps return a fixed number of results per page, which makes the data pretty neat. These can be plotted as bar graphs in Banana terms panels.
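Turning start and rows into a page number is simple arithmetic, assuming the fixed page size mentioned above (the helper is my own sketch, not lifted from the script):

```python
def page_number(start, rows):
    """Which results page a query asked for: start is the offset into
    the results, rows is the page size; start=0 is page 1."""
    if rows <= 0:
        return 1           # defensive default for odd queries
    return start // rows + 1

# start=0 -> page 1; start=20 with rows=20 -> page 2; and so on.
```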

This shows you the number of paged queries. Why is that cool? Well, if users frequently have to navigate to the second and third page of your search results, it means they are not finding what they are looking for on the first one, so relevance may need to be tweaked.

So what else is cool? Well, I decided to tokenize the q argument on whitespace (and remove other characters via Python). This does a really cool thing because now, instead of just looking at the most frequent queries, I can look at individual terms. Furthermore, I can filter down on them, so let’s explore what happens here… I see a list of terms in the dashboard, I filter down on a term such as “paper”, and as the dashboard refreshes, all the panels start showing me results only for queries that contain paper in q, including the terms panel. At this point I can evaluate frequent category filters, sorts, other fq criteria and the number of queries over time for this one term. The terms panel also now lets me look at other terms that were used in combination with paper, such as towel, toilet, copier, clip, etc… which is pretty cool. The actual queries are still available; there is a panel in Banana that will show you the raw data from Solr, so if necessary, you can drill down and see the un-tokenized version of the query.
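The tokenization itself is just a whitespace split after stripping the other characters; something along these lines (the exact cleanup in my script differs a bit):

```python
import re

def tokenize_q(q):
    """Lowercase the query, replace punctuation with spaces, and split
    on whitespace, so 'Paper Towel!' becomes ['paper', 'towel']."""
    cleaned = re.sub(r"[^\w\s]", " ", q.lower())
    return cleaned.split()
```

Each resulting token is indexed as its own value in a multivalued field, which is what lets the terms panel count individual words instead of whole query strings.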

Another cool thing the dashboard can be used for is detecting spidering activity. Generally, spidering occurs from multiple IPs, and unless they are sending malformed URLs they are pretty hard to detect unless there is a noticeable spike in query load. Even when they are noticed, they usually take a bit of grepping to fully understand. However, they usually page quite a bit and do the same “kind” of searches.

Banana is great for this because you can very easily and quickly filter down on seemingly unknown data and look for patterns, spikes in volume, weird queries or deep paging. Any weirdness or possible crawling activity can be quickly identified and visually confirmed. Huge plus.
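The same signals can be checked in code outside the dashboard. A rough heuristic, entirely my own sketch with arbitrary thresholds, might flag IPs that query heavily and page deeply:

```python
from collections import defaultdict

def flag_spiders(events, min_queries=100, min_deep_pages=20):
    """events: iterable of (ip, start) tuples parsed from the logs.
    Flags IPs combining heavy query volume with lots of deep paging,
    the spider pattern described above. Thresholds are guesses; tune
    them against your own traffic."""
    queries = defaultdict(int)
    deep = defaultdict(int)
    for ip, start in events:
        queries[ip] += 1
        if start >= 100:       # "deep": offset well past the first pages
            deep[ip] += 1
    return [ip for ip in queries
            if queries[ip] >= min_queries and deep[ip] >= min_deep_pages]
```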

In case you couldn’t tell, I am pretty excited about this thing. The really exciting part is that this can essentially be run on any kind of data, not just logs. Just shove it in Solr and make some dashboards. The easy-to-use UI makes it rather approachable, even for non-technical users. I am excited about future functionality as well; this can be a pretty powerful analysis tool.

If you wanted to set up something like this yourself, you can download Banana here:

My spaghetti code is available here:

Keep in mind it is a quick script, and since there is a large variation in the log formats Solr can be configured to produce, it may not work with your log files. I am working on making it more production-ready and able to parse more types of Solr logs. If you have any feedback, please let me know. If you’d like additional details on implementation, let me know as well; I would be more than happy to write up a post with implementation instructions.