Sunday, July 27, 2014

IDOL To Solr Migration Lessons Learned - Part 5 Performance Tests Results

In this post, I will cover some lessons learned and key differences in performance of the two systems. Throughout the project, we’ve literally ran hundreds of tests to gauge stability and performance of both solutions under various conditions they would live in and the results were not always clear cut. Overall, it is hard to say which product performs better, but after some tweaking, we were able to get Solr to perform as good or better, in some cases, than IDOL; but we had to compromise on some functionality. 

First, let’s loop back to the requirements and cover architectures of both solutions. We have two servers, each one holds 4 content engines and a DAH. The content engines on the second server are a mirror copy of the data on the first server. This was replicated in SolrCloud, we deployed 4 shards with two replicas for each one. We also had another DAH that pointed to the two lower DAHs for queries in a round-robin fashion, so basically, all queries got load balanced across a set of four content engines on either box. This provided good performance as well as high availability. One point to re-iterate was that we were using IDOL 7, which is a few years behind. With SolrCloud, we didn’t have DAHs or DIHs, queries were went from the client to any of the replicas and they sorted out the result set and returned it back to the client. 

Now, let’s lay out our testing scenarios. We had a list of somewhere around 2,000 most frequent user queries that we used to hammer the systems. Our front end also supports multiple types of searches and each search can generate a few additional calls to Solr, all these were included in tests and although each call was separate, the times for execution were combined for related requests to get a bigger picture. Each test consisted of several load levels, most tests started at around 10,000 queries an hour and increased by 5,000 or 10,000 each 30 minutes. The maximum capacity we tested was approximately 60,000 queries per hour. 

Since our performance goal was to do as good as IDOL, we started with defining IDOL performance by running a few tests. IDOL performance was stellar for each individual query, however because in IDOL facet requests are a separate call, we had to combine the time for both requests. This added some overhead, but combined they were still was less than half a second.

While IDOL tests were running, we were monitoring query performance as well as system utilization. The system utilization of RAM and CPU was essentially linear, each content engine was given about 10GB of RAM for caching and internal workings and it never surpassed that. The CPU utilization on these 128 core boxes was almost always below 25%, even at peak load. 

As we pushed the system beyond our initial design requirements, we noticed a sudden increase in query lag. This was due to our specified thread count for each DAH, the threads were maxed out and as such searches had to wait a little while to be executed. Naturally, if this was a production environment we would deploy a few more DAHs to support increase in search, but since it wasn’t our testing objective and we were short on time, we decided to just call it a day. The main element here is that the IDOL content engines weren’t really breaking a sweat throughout most of these tests. 

We repeated all of these tests with Solr, with necessary adjustment for internal workings on the system. However, it is worth to note that all business rules, fields, weights and etc… were preserved and replicated in Solr as close we as could match them. 

As we begun to test Solr, with our limited experience in the system, we immediately started to run into problems. The initial and most critical problem was system stability; Solr kept crashing during our tests. The culprit ended up being Java Garbage Collection pauses which caused SolrCloud to time out from ZooKeeper. Our handy consultants were able to help us address this issue, however the documentation on this kind of problem is minimal at best. We would never have been able to track this as a root cause on our own. 

If you are running into stability issues with Solr, Garbage Collection is definitely worth investigating. One of confirming symptoms is that there is some significant pauses in the log file with no activity whatsoever followed by a time out from zookeeper and a few funny error messages. 

Once the stability issues were addressed we had to examine query performance. We were getting some very mixed results on some queries and it took us a little while to determine the exact cause. We were seeing generally good performance, better than IDOL, but on some queries the system would just stall out and take between 3 and 5 seconds to respond. 

This is where we had to sacrifice some functionality. We have a unique identifier that is shared between several documents; when searching, we generally only want to see the data related to the unique identifier, however data from individual items is searchable as well. In both systems, this kind of grouping it is a piece of cake since we can easily group by a field value, however the culprit was getting facet counts based on this grouping. For some of the broader terms, the result sets returned were pretty large, somewhere around a million and Solr had to calculate facet counts based on groups which caused this delay. We had to get customer’s approval to turn off this functionality and were able to improve Solr’s performance at a slight loss in functionality. 

We also measured system performance during Solr’s tests. I think it was worse than IDOL’s, the CPU utilization was never stable and resembled what I would imagine to be a crack head’s EKG chart.  JVM heap was configured at 32GB as well, totaling less than IDOL. I wish we could give it some more RAM, but due to garbage collection problems we were advised against it. 

In the end, we were able to shave off about ¼ of a second for almost all of our searches on average, which is a pretty tangible improvement in my opinion. However, we never went back to test IDOL’s performance without counting up facets based on groups, so I can’t say of certain which one is faster.
So here is my take on it. As far as performance goes, I think Solr is faster than IDOL 7, mainly because you can retrieve facets and results in the same query. However, as far as performance bang for the buck, I think IDOL does a much better job here. It’s hardware resource utilization is much more controlled and predictable. 

Additionally, I my personal belief is that SolrCloud automates too much which turns into a lack of flexibility and control. For example, in IDOL, we had direct control over which content servers will be processing requests. While in SolrCloud, requests for each shard are distributed among replicas in an unpredictable manner. Don’t get me wrong, I am all in favor of automation, but the control freak in me is screaming “NOOOOOOOO!!!!”. 

I know that there are a lot of elements missing here since query performance and system utilization is heavily dependent on data and types of queries and I didn’t share that information with you yet, but don’t worry, more info will be coming soon.