In this post, I will cover some lessons learned and key differences in performance between the two systems. Throughout the project, we ran hundreds of tests to gauge the stability and performance of both solutions under the conditions they would live in, and the results were not always clear cut. Overall, it is hard to say which product performs better, but after some tweaking we were able to get Solr to perform as well as, or in some cases better than, IDOL; we did have to compromise on some functionality to get there.
First, let’s loop back to the requirements and cover the architectures of both solutions. We have two servers; each one holds 4 content engines and a DAH. The content engines on the second server are a mirror copy of the data on the first server. We replicated this in SolrCloud by deploying 4 shards with two replicas each. On the IDOL side, we also had another DAH that pointed to the two lower DAHs and round-robined queries between them, so all queries were load balanced across a set of four content engines on either box. This provided good performance as well as high availability. One point to reiterate is that we were using IDOL 7, which is a few years behind. With SolrCloud, we didn’t have DAHs or DIHs; queries went from the client to any of the replicas, which sorted out the result set and returned it to the client.
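For reference, a layout like this can be stood up with Solr’s Collections API. Here is a minimal sketch; the host, collection name, and configset name are placeholders, not our actual setup:

```python
# Sketch: create a 4-shard, 2-replica SolrCloud collection via the Collections API.
# The host, collection name, and configset name below are placeholders.
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "action": "CREATE",
    "name": "products",                        # placeholder collection name
    "numShards": 4,                            # mirrors the 4 content engines per box
    "replicationFactor": 2,                    # mirrors the mirrored second server
    "collection.configName": "products_conf",  # placeholder configset
})
url = "http://solr-node1:8983/solr/admin/collections?" + params
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))
```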
Now, let’s lay out our testing scenarios. We had a list of roughly 2,000 of our most frequent user queries that we used to hammer the systems. Our front end also supports multiple types of searches, and each search can generate a few additional calls to Solr; all of these were included in the tests, and although each call was separate, the execution times for related requests were combined to get a bigger picture. Each test consisted of several load levels: most tests started at around 10,000 queries per hour and increased by 5,000 or 10,000 every 30 minutes. The maximum load we tested was approximately 60,000 queries per hour.
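To give a feel for the ramp-up, here is a rough sketch of that kind of stepped driver. The endpoint, query file, and step sizes are illustrative rather than our actual harness, and this single-threaded version only shows the shape of the ramp; the real test ran many concurrent workers:

```python
# Sketch of a stepped load test: ramp from 10,000 to 60,000 queries/hour
# in 30-minute steps. Endpoint and query file are placeholders; a real harness
# would use concurrent workers to actually sustain the higher rates.
import time
import urllib.parse
import urllib.request

SOLR_URL = "http://solr-node1:8983/solr/products/select"   # placeholder
with open("top_queries.txt") as f:                          # placeholder query list
    queries = [line.strip() for line in f if line.strip()]

step_qph = [10_000, 20_000, 30_000, 40_000, 50_000, 60_000]  # queries per hour
step_seconds = 30 * 60

for qph in step_qph:
    delay = 3600.0 / qph                      # target spacing between queries
    step_end = time.time() + step_seconds
    i = 0
    while time.time() < step_end:
        q = urllib.parse.quote(queries[i % len(queries)])
        start = time.time()
        urllib.request.urlopen(f"{SOLR_URL}?q={q}&rows=10").read()
        print(f"{qph} qph\t{time.time() - start:.3f}s")
        i += 1
        time.sleep(max(0.0, delay - (time.time() - start)))
```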
Since our performance goal was to do as well as IDOL, we started by establishing an IDOL baseline with a few test runs. IDOL performance was stellar for each individual query; however, because facet requests in IDOL are a separate call, we had to combine the times for both requests. This added some overhead, but the combined time was still less than half a second.
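Conceptually, the combined measurement looked something like the sketch below. The host, port, action names, and parameters are illustrative assumptions, not lifted from our configuration:

```python
# Sketch: measure the combined wall-clock time of the two IDOL calls that
# together make up one "search" (result query plus a separate facet call).
# Host, port, action names, and field names are illustrative placeholders.
import time
import urllib.parse
import urllib.request

DAH = "http://dah-host:9000"   # placeholder top-level DAH

def timed_get(url):
    start = time.time()
    urllib.request.urlopen(url).read()
    return time.time() - start

text = urllib.parse.quote("example user query")
t_results = timed_get(f"{DAH}/?action=Query&Text={text}&MaxResults=10")
t_facets = timed_get(f"{DAH}/?action=GetQueryTagValues&Text={text}&FieldName=CATEGORY")

print(f"results: {t_results:.3f}s  facets: {t_facets:.3f}s  "
      f"combined: {t_results + t_facets:.3f}s")
```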
While the IDOL tests were running, we monitored query performance as well as system utilization. RAM and CPU utilization grew essentially linearly with load; each content engine was given about 10GB of RAM for caching and internal workings and never exceeded it. CPU utilization on these 128-core boxes was almost always below 25%, even at peak load.
As we pushed the system beyond our initial design requirements, we noticed a sudden increase in query lag. This was due to the thread count we had specified for each DAH: the threads were maxed out, so searches had to wait a little while to be executed. Naturally, if this were a production environment we would deploy a few more DAHs to support the increase in search traffic, but since that wasn’t our testing objective and we were short on time, we decided to call it a day. The main takeaway here is that the IDOL content engines weren’t really breaking a sweat throughout most of these tests.
We repeated all of these tests with Solr, with the necessary adjustments for the internal workings of that system. It is worth noting, however, that all business rules, fields, weights, etc. were preserved and replicated in Solr as closely as we could match them.
As we began to test Solr, with our limited experience with the system, we immediately started running into problems. The first and most critical problem was system stability: Solr kept crashing during our tests. The culprit turned out to be Java garbage collection pauses, which caused SolrCloud to time out from ZooKeeper. Our handy consultants were able to help us address this issue, but the documentation on this kind of problem is minimal at best; we would never have been able to track down this root cause on our own.
If you are running into stability issues with Solr, garbage collection is definitely worth investigating. One confirming symptom is a significant pause in the log file with no activity whatsoever, followed by a timeout from ZooKeeper and a few funny error messages.
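A quick way to spot this is to scan the log for long gaps between consecutive entries. A rough sketch, assuming a typical timestamp-prefixed log line (adjust the pattern and path to whatever your Solr version writes):

```python
# Sketch: flag long gaps between consecutive log entries, which is how the
# GC pauses showed up for us -- dead air in the log, then a ZooKeeper timeout.
# The timestamp format and file path below are assumptions.
import re
from datetime import datetime

TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})")
THRESHOLD_SECONDS = 5.0
LOG_FILE = "solr.log"   # placeholder path

prev = None
with open(LOG_FILE) as f:
    for line in f:
        m = TS_PATTERN.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1).replace("T", " "), "%Y-%m-%d %H:%M:%S")
        if prev and (ts - prev).total_seconds() > THRESHOLD_SECONDS:
            print(f"gap of {(ts - prev).total_seconds():.0f}s before: {line.rstrip()}")
        prev = ts
```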
Once the stability issues were addressed, we had to examine query performance. We were getting very mixed results on some queries, and it took us a little while to determine the exact cause. We were seeing generally good performance, better than IDOL, but on some queries the system would just stall and take between 3 and 5 seconds to respond.
This is where we had to sacrifice some functionality. We have a unique identifier that is shared between several documents; when searching, we generally only want to see the data related to that identifier, although data from the individual items is searchable as well. In both systems this kind of grouping is a piece of cake, since we can easily group by a field value; the culprit was getting facet counts based on this grouping. For some of the broader terms, the result sets returned were pretty large, somewhere around a million documents, and Solr had to calculate facet counts per group, which caused the delay. We had to get the customer’s approval to turn off this functionality, and in doing so were able to improve Solr’s performance at a slight loss of functionality.
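In Solr terms, the expensive combination was result grouping plus per-group facet counts; dropping the per-group counting, while keeping the grouping itself, is what bought back the speed. A sketch of the query shape, with made-up field names and endpoint:

```python
# Sketch: the expensive query shape -- grouping on a shared identifier and
# asking Solr to compute facet counts per group rather than per document.
# Field names, facet fields, and the endpoint are placeholders.
import urllib.parse
import urllib.request

SOLR_URL = "http://solr-node1:8983/solr/products/select"   # placeholder

params = {
    "q": "broad term",
    "group": "true",
    "group.field": "shared_id",     # the identifier shared across documents
    "facet": "true",
    "facet.field": "category",
    # This is the piece we ended up turning off: per-group facet counts were
    # what stalled the broad queries with ~1M-document result sets.
    "group.facet": "true",
}
url = SOLR_URL + "?" + urllib.parse.urlencode(params)
print(urllib.request.urlopen(url).read().decode("utf-8"))
```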
We also measured system performance during the Solr tests. I think it was worse than IDOL’s: CPU utilization was never stable and resembled what I would imagine to be a crack head’s EKG chart. The JVM heap was configured at 32GB on each server as well, which totals less than what the IDOL engines got. I wish we could have given it more RAM, but due to the garbage collection problems we were advised against it.
In the end, we were able to shave off about a quarter of a second on average for almost all of our searches, which is a pretty tangible improvement in my opinion. However, we never went back to test IDOL’s performance without facet counts based on groups, so I can’t say for certain which one is faster.
So here is my take on it. As far as performance goes, I think Solr is faster than IDOL 7, mainly because you can retrieve facets and results in the same query. However, as far as performance bang for the buck, I think IDOL does a much better job. Its hardware resource utilization is much more controlled and predictable.
Additionally, my personal belief is that SolrCloud automates too much, which translates into a lack of flexibility and control. For example, in IDOL we had direct control over which content servers would process requests, while in SolrCloud requests for each shard are distributed among replicas in an unpredictable manner. Don’t get me wrong, I am all in favor of automation, but the control freak in me is screaming “NOOOOOOOO!!!!”.
I know there are a lot of elements missing here, since query performance and system utilization are heavily dependent on the data and the types of queries, and I haven’t shared that information with you yet, but don’t worry, more info will be coming soon.