I am taking a break from the IDOL to Solr Migration lessons learned because I wanted to share something kind of cool that I’ve been playing with. These days I generally look at new technology as more and more incompatible crap, but every once in a while I stumble on something with potential and get excited about it. In this case, it is a Kibana port called Banana from LucidWorks and a little Python parser that together give me better-than-usual insight into what my users are doing: paging, sorting, faceting and filtering, as well as specific term lists.
What Kibana and Banana allow you to do is run reports on the contents of your index by specifying the fields you want to query and how you would like to display the data. For example, you can easily generate a histogram with a bar for every unique value of a field, or even a pie chart if there aren't many unique values. If you have time and event data, such as log file events, you can plot that data over time as well. So what’s the big deal here? Well, the big deal is that you can have a number of these panels set up in your dashboard, and each element in the dashboard is actionable. Another big deal is that these reports can be run on almost any kind of data that you can feed to Solr.
For example, if you are plotting time series data, let's say for the past 90 days, and spot some interesting activity, you can simply select that range in the chart and a new filter will be automatically added to the report, at which point all data will refresh and only include your selected criteria. This goes for other elements too: if you have a few elements in bar graphs, you can click on one of the bars and it will be added to the filter criteria. This allows you to dynamically filter and refine the report as needed without having to write SQL queries or look in various tables for data. I mean, the whole thing is so easy even I could figure it out.
So this awesomeness, combined with the absence of any interesting plans for the three-day 4th of July weekend, inspired me to write a quick Python script to parse Solr log files and index the data elements into Solr for analysis in Banana. Before long I was exploring our users’ activity on the site. The dashboard ended up looking something like this (several data elements are removed and pixelation added on purpose).
General log parsing can only provide some rudimentary analysis. Query latency and number of queries can be pulled from so many sources, and are so trivial, that they're not even worth writing about. What I wanted to see is what categories users searched in, what terms they used, how they sorted their results and whether they used any other filtering criteria.
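To give a feel for the first step, here is a minimal sketch of pulling the request parameters out of a single log line. This is not the actual script; the log layout, the sample line and the names LINE_RE and parse_log_line are all assumptions for illustration, and your Solr logging configuration may well produce something different.

```python
import re
from urllib.parse import parse_qs

# Assumed layout: "... path=/select params={q=...&fq=...} hits=42 status=0 QTime=5"
LINE_RE = re.compile(
    r"path=(?P<path>\S+)\s+params=\{(?P<params>.*)\}\s+hits=(?P<hits>\d+)"
    r"\s+status=(?P<status>\d+)\s+QTime=(?P<qtime>\d+)"
)

def parse_log_line(line):
    """Return a dict of request details, or None if the line is not a query."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # parse_qs decodes '+' and %XX escapes and keeps repeated keys
    # (such as several fq parameters) as lists.
    params = parse_qs(match.group("params"), keep_blank_values=True)
    return {
        "path": match.group("path"),
        "params": params,
        "hits": int(match.group("hits")),
        "qtime": int(match.group("qtime")),
    }

# Made-up sample line in the assumed layout.
sample = (
    "INFO  - 2014-07-04 10:15:00.123; org.apache.solr.core.SolrCore; "
    "[collection1] webapp=/solr path=/select "
    "params={q=paper+towel&fq=category:office&sort=price+asc&start=10&rows=10} "
    "hits=42 status=0 QTime=5"
)
record = parse_log_line(sample)
print(record["params"]["q"], record["params"]["sort"])
```

Every parameter comes back as a list of strings, which is handy for fq since a single query can carry several of them.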
So in the Python code I massaged the data a bit. For example, I broke out the fq criteria into separate data elements: I would append each field name from the query's fq parameters to a general fq doc field, and create a new doc field for each field in fq. So if the user were filtering on a field called foo with a value of bar, the query would contain fq=foo:bar and it would be indexed as:
fq: foo
fq_foo: bar
This allowed me to set up a terms panel for the fq field in Banana to see what fields users filtered on. Then, for interesting fields, I set up a separate terms panel on the specific fq field (i.e. fq_foo from the previous example) to examine the values used for filtering. Naturally, most of the values were standard, since most of these are facets on the front end. However, they mapped directly to clicks, which provided a great bird's-eye view of what categories users searched in.
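The fq breakout described above can be sketched roughly like this. The function name expand_fq is made up for illustration, and the sketch only handles simple field:value filters; anything fancier (local params, boolean expressions inside fq) would need more work.

```python
def expand_fq(fq_values):
    """Turn ['foo:bar'] into {'fq': ['foo'], 'fq_foo': ['bar']}."""
    doc = {"fq": []}
    for raw in fq_values:
        # Split on the first colon only; values may contain colons themselves.
        field, sep, value = raw.partition(":")
        if not sep:
            continue  # not a simple field:value filter, skip it
        field = field.strip()
        doc["fq"].append(field)
        # Strip surrounding quotes so "bar" and bar land in the same bucket.
        doc.setdefault("fq_" + field, []).append(value.strip().strip('"'))
    return doc

print(expand_fq(["foo:bar"]))
```

Lists are used for every field so that a query with several filters on the same field keeps all of its values.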
Then I decided to get a bit more info on how the users sorted their searches in different categories. Now that is quite a bit of data and would be hard to visualize. Thankfully, Banana comes with a pretty slick heat map panel that allowed me to visualize just that. With the heat map, I can easily see what sorts were used in different categories. Keep in mind that this panel uses facet pivot, and if you have a collection with multiple shards, it isn’t going to work.
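For reference, pivot faceting in Solr is driven by the facet.pivot request parameter, which takes a comma-separated list of fields. The sketch below just builds such a query string so you can try it against your own index; the field names fq_category and sort are hypothetical, and I am not claiming this is exactly the request the heat map panel sends.

```python
from urllib.parse import urlencode

def pivot_query(row_field, col_field):
    """Build a Solr query string that pivots one field against another."""
    params = [
        ("q", "*:*"),
        ("rows", 0),          # only the facet counts are needed, not documents
        ("wt", "json"),
        ("facet", "true"),
        ("facet.pivot", row_field + "," + col_field),
    ]
    return urlencode(params)

# Hypothetical field names: category filter values against the sort used.
print(pivot_query("fq_category", "sort"))
```

Appending the result to your collection's select URL is a quick way to check whether pivot faceting works on your setup before building a panel on it.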
Now, let’s take it a step further and see what we can do about detecting paging for different searches. We have rows and start parameters in each query: rows tells you the number of documents to return and start is the offset from the beginning of the search results. Most apps return a fixed number of results per page, so the data is pretty neat. These can be plotted as terms bar charts in Banana.
This shows you the number of paged queries. Why is that cool? Well, if users frequently have to navigate to the second and third page of your search results, it means they are not finding what they are looking for on the first one, so relevance may need to be tweaked.
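If you would rather index a page number than raw start values, it can be derived from start and rows. This is a small sketch of my own, not something from the script; page_number is a made-up name, and it assumes the app uses a fixed page size so that start is a multiple of rows.

```python
def page_number(start, rows):
    """Return the 1-based page for a query, or None if rows is not positive."""
    start, rows = int(start), int(rows)
    if rows <= 0:
        return None  # rows=0 queries (e.g. facet-only) have no page
    return start // rows + 1

print(page_number("0", "10"), page_number("10", "10"), page_number("20", "10"))
```

Anything above page 1 is a paged query, and very large values are a good hint of deep paging.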
So what else is cool? Well, I decided to tokenize the q argument on whitespace (and remove other characters via Python). This does a really cool thing, because now, instead of just looking at the most frequent queries, I can look at individual terms. Furthermore, I can filter down on them, so let’s explore what happens here. I see a list of terms in the dashboard, I filter down on a term such as “paper”, and as the dashboard refreshes, all the panels start showing me results only for queries that contain paper in q, including the terms panel. At this point I can evaluate frequent category filters, sorts, other fq criteria and the number of queries over time for this one term. The terms panel also now lets me look at other terms that were used in combination with paper, such as towel, toilet, copier, clip, etc., which is pretty cool. The actual queries are still available: there is a panel in Banana that will show you the raw data from Solr, so if necessary you can drill down and see the un-tokenized version of the query.
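The tokenizing step is roughly the following. Treat it as a sketch: the exact set of characters to strip is a judgment call, and the lowercasing is my assumption here so that “Paper” and “paper” count as the same term.

```python
import re

def tokenize_query(q):
    """Split a raw q value into lowercase terms, dropping punctuation."""
    # Replace anything that is not a word character (letter, digit,
    # underscore) or whitespace with a space, then split on whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", q.lower())
    return cleaned.split()

print(tokenize_query('Paper "towel" +copier'))
```

Note that operators such as AND and OR survive as ordinary terms in this version, so you may want to filter them out before indexing.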
Another cool thing the dashboard can be used for is detection of spidering activity. Generally, spidering comes from multiple IPs, and unless the spiders are sending malformed URLs, they are pretty hard to detect without a noticeable spike in query load. Even when they are noticed, it usually takes a bit of grepping to fully understand them. However, they usually page quite a bit and do the same “kind” of searches.
Banana is great for this because you can very easily and quickly filter down on seemingly unknown data and look for patterns: spikes in volume, weird queries or deep paging. Any weirdness or possible crawling activity can be quickly identified and visually confirmed. Huge plus.
In case you couldn’t tell, I am pretty excited about this thing. The really exciting part is that this can essentially be run on any kind of data, not just logs. Just shove it in Solr and make some dashboards. The easy-to-use UI makes it rather approachable even for non-technical users. I am excited about future functionality as well; this can be a pretty powerful analysis tool.
If you wanted to set up something like this yourself, you
can download Banana here:
My spaghetti code is available here:
Keep in mind it is a quick script, and since there is a large variation in the log formats that can be configured in Solr, it may not work with your log files. I am working on making it more production ready and able to parse more types of Solr logs. If you have any feedback, please let me know. If you’d like additional details on the implementation, let me know as well; I would be more than happy to write up a post with implementation instructions.