Wednesday, July 9, 2014

IDOL To Solr Migration Lessons Learned - Part 4: Stuff I Miss About IDOL

There is going to be much more good info on Solr coming up, especially next week's post, which will cover our process for relevance tuning and the things we learned. But in the meantime, I wanted to take a stroll down memory lane and mention some of the things I miss about IDOL. These items, although not major, still haunt me now, just as they did during the deployment. Oh, and I promise, I am not getting any money from HP for this.

Let’s start off with an easy one: documentation. When you buy IDOL, you get a plethora of manuals, so many, in fact, that you could club a baby seal to death with them. There is a getting started guide to ease you into IDOL, Server Admin documentation, and plenty of reference material that tells you everything you need to know about every single type of call you can make to IDOL. It seems intimidating at first, but it is truly priceless. It is up to date, consistent and informative.

With Solr, we were pretty much at the mercy of online forums, some blog posts and the wiki pages. The wiki pages provided some good information, but some areas definitely need improvement. Some parameters had only a sentence or a few words of description, which drove me nuts. Forum support was good, but then again, you have to provide quite a bit of information about what you are trying to do, and the reasons may be hidden in some old business requirements that don't always make sense. Naturally, that information has to be disclosed, because the first ten replies will echo a message of “Well, whachya doing that for?”

We procured support for Solr and it came in handy; we couldn't have done it without those guys. However, at times I found myself reaching out to them with a question that had a perfectly valid answer that was not documented anywhere. A bit frustrating, but workable.

Another thing is logging. IDOL writes out several distinct logs, and they are marvelous: they tell me exactly what it is doing. There is one for queries, another for the application, another for indexing, and a few others. They are all neat, informative and concise, plus I can tune the logging detail per type of log.

With Solr, there are several logging modules that can be configured to your liking, and the log level for different kinds of items is very adjustable. However, I failed to find a good balance between getting the right information and keeping log volume down. Maybe I need to spend a bit more time on it, but this process is much simpler in IDOL.

Aside from general log upkeep, there is another point of IDOL that I terribly miss. IDOL logs tell you exactly what is happening. With Solr, I found it to be much, much more obscure. For example, if the core is performing some internal indexing operation, there is absolutely no status reporting. It’s the equivalent of kicking off an action and just waiting: the log file version of that old school Windows hourglass that just spun round and round. If indexing actions weren't bad enough, wait until a core goes down, recovery kicks in and the customer asks you how long it will take.

I also miss some of IDOL’s terms functionality. When I first started working with IDOL, I read the sales material and thought to myself, “yeah, right...”. Years of sweating (or shivering) in server rooms and having to deliver on salespeople’s imagination made me a natural skeptic when it comes to these kinds of claims.

However, after a while, I was able to validate these claims of conceptual understanding myself. With IDOL, you are able to scan a piece of text for the terms that best represent it in comparison to other documents in the index. It sounds like TF-IDF, but it is much, much more than that. I am not sure how it works, but it came in handy when I was trying to group similar types of content together.
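For anyone who has not run into TF-IDF, here is a minimal sketch of the plain idea in Python. To be clear, this is not how IDOL does it (I don't know how it does); the `top_terms` function, the smoothing in the IDF formula and the tiny corpus are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real analyzer would also stem and strip punctuation.
    return text.lower().split()


def top_terms(doc, corpus, k=3):
    """Return the k terms of `doc` with the highest TF-IDF weight relative to `corpus`."""
    doc_token_sets = [set(tokenize(d)) for d in corpus]
    n = len(corpus)
    term_freqs = Counter(tokenize(doc))
    scores = {}
    for term, freq in term_freqs.items():
        # Document frequency: how many documents in the corpus contain the term.
        df = sum(1 for tokens in doc_token_sets if term in tokens)
        # Smoothed IDF (an assumption of this sketch): terms found everywhere score 0.
        scores[term] = freq * math.log((n + 1) / (df + 1))
    # Highest score first; ties are broken alphabetically to keep the output stable.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


corpus = [
    "the server logs the query",
    "the index stores the documents",
    "the stemming query uses stemming rules",
]
best = top_terms(corpus[2], corpus, k=2)
print(best)
```

A word like "the" appears in every document and drops to the bottom, while a word that is frequent in one document and rare elsewhere floats to the top. Whatever IDOL does on top of this is where the "much more" comes in.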

While we are talking about terms, I should mention IDOL stemming, which works better than Porter and Snowball. In my book, those two either do too much or not enough. IDOL’s stemming is still the best I’ve seen.

The last and very important point is that there are a few relevance configurations that I miss. First of all, in IDOL, document scores in the results are relative, so you can run multiple queries and compare the scores of the results with each other. In Solr you can't: because of queryNorm, there is no meaningful way to compare two sets of results. Now, I know what you are thinking... Why would anyone want to do that? Well, when it comes down to grouping related content or finding conceptually similar stuff, this is really important.
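Here is a rough Python sketch of why raw scores from two different queries are hard to compare. It uses a simplified version of the classic Lucene TF-IDF formula as I understand it, with coord, boosts and field norms all fixed at 1 (my assumption, to keep it short). The document counts are invented.

```python
import math


def idf(num_docs, doc_freq):
    # Classic Lucene-style inverse document frequency.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def query_norm(term_idfs):
    # 1 / sqrt(sumOfSquaredWeights), with every boost assumed to be 1.
    return 1.0 / math.sqrt(sum(w * w for w in term_idfs))


def score(term_freqs, term_idfs):
    # Simplified score: queryNorm * sum(tf * idf^2), where tf = sqrt(frequency).
    qn = query_norm(term_idfs)
    return qn * sum(math.sqrt(tf) * w * w for tf, w in zip(term_freqs, term_idfs))


NUM_DOCS = 1000  # hypothetical index size

# Two single-term queries: one on a rare term, one on a common term.
rare_query = [idf(NUM_DOCS, 4)]
common_query = [idf(NUM_DOCS, 499)]

# In both cases the matching document contains the query term exactly once.
rare_score = score([1], rare_query)
common_score = score([1], common_query)
print(round(rare_score, 3), round(common_score, 3))
```

Both documents match their query equally well (one occurrence of the one term asked for), yet the two scores land far apart because the scale depends on the query itself. A score from one result set says very little about how good a hit is compared to a score from another.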

Sorry for the trip down memory lane; I just couldn’t resist. There are a few other items on my wishlist for Solr, and I will cover them in upcoming posts. I hope you guys are getting some value from this. Leave a comment or two to let me know what you think.