Wednesday, June 25, 2014

IDOL To Solr Migration Lessons Learned - Part 2 Indexing Data


Since the data we were indexing into IDOL was plain old XML and we had a rather custom rig for indexing content, we were easily able to modify it to generate JSON files in a format that Solr understands. The only tweak we had to make was to flatten our data structure, since Solr didn't support nested elements the way IDOL does. All in all, not a big loss, and in retrospect this kind of simplified things. However, the fact that it wasn't supported seemed kind of odd at first. Let's examine how both products index content and some of my favorite things about each one.
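As a rough illustration of the flattening step, here is a minimal Python sketch. The field names, the sample product and the underscore separator are made up for the example and are not taken from our actual schema; the idea is only that nested elements become top-level fields with prefixed names, and the result is serialized as a JSON array of documents.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested elements and hoist them to the top level.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists of scalars are kept unchanged.
            flat[full_key] = value
    return flat


# Hypothetical nested record, similar in spirit to a product entry.
product = {
    "id": "sku-1001",
    "name": "Example widget",
    "price": {"amount": 19.99, "currency": "USD"},
    "categories": ["tools", "hardware"],
}

solr_doc = flatten(product)
print(json.dumps([solr_doc], indent=2))
```

A list of scalars such as "categories" is left alone here on the assumption that it maps onto a multi-valued field in the schema.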

In the IDOL world, we index content through the DIH, which distributes the data. Each content engine then indexes the data, and it sits in the queue until the content engine performs a DRESYNC operation to commit it to the index. The search performance drag of DRESYNC was unacceptable in regular production operation: it would take forever and searches would slow to a crawl. However, since we had a mirrored set-up, we could easily mark one of the content servers offline at the DAH (where queries come from), DRESYNC it while its counterpart responded to searches, then bring it back up and perform the same operation on the other content engine. Other maintenance activities like DRECOMPACT were executed in a similar manner. Naturally, it was all automated and didn't give us much trouble.
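The rolling routine described above can be sketched in a few lines of Python. This is not our actual automation and it makes no real IDOL calls: set_online and run_task are hypothetical stand-ins for "mark the engine online/offline at the DAH" and "run DRESYNC or DRECOMPACT", and the engine names are invented.

```python
def rolling_maintenance(engines, set_online, run_task):
    """Run `run_task` on each engine in turn, keeping the others in rotation."""
    for engine in engines:
        set_online(engine, False)  # stop routing queries to this engine
        try:
            run_task(engine)  # the commit or compaction step
        finally:
            set_online(engine, True)  # always put it back, even on failure


# Stand-in implementations that only record what would happen.
log = []
online = {"content1": True, "content2": True}


def set_online(engine, state):
    online[engine] = state
    log.append((engine, "online" if state else "offline"))


def run_task(engine):
    # At least one other engine must still be answering searches.
    assert any(up for name, up in online.items() if name != engine)
    log.append((engine, "maintenance"))


rolling_maintenance(["content1", "content2"], set_online, run_task)
print(log)
```

The try/finally is the part worth keeping in any real version: an engine that fails its maintenance step should still be returned to rotation (or explicitly flagged) rather than silently left offline.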

IDOL is generally used to index unstructured content: binary documents such as PDF, MS-Word, Excel, HTML, etc. Through processing, it will extract all document-level metadata and content and index it in IDOL. Additionally, indexing is a distributed process; the server does not do ALL the heavy lifting. Generally, a repository-specific connector (there are about 400 of them) will pick up the file and send it to the Connector Framework Server (CFS). CFS will process the document and run things like text extraction, pre/post-processing tasks, and custom Lua scripts to massage the data. Once everything is complete, it will send the data to IDOL for final indexing. As soon as IDOL receives the file, the client's (CFS's) job is done and it is freed back into the wild for additional indexing tasks, even if the file is not indexed yet. IDOL would then index the content, but not commit it to the index until DRESYNC is run.

With Solr, we were pleasantly surprised: the indexing and committing operations did not drag down search performance, and indexing was also a lot faster than with IDOL. We used a very aggressive soft and hard commit policy that would commit all content shortly after indexing. With Solr we no longer needed the offline maintenance routine described above and could index content throughout the day without noticeably impacting performance. This allowed us to process large quantities of updates and changes during the day if there was a large backlog.

The way Solr commits the data to the index is pretty cool. In a nutshell, it will open a new segment of the index and write data to it without impacting the existing index. The frequency of writing indexed data from memory to disk is controlled by the hard commit interval. The frequency of making newly indexed data searchable is controlled by the soft commit interval. We were able to get optimal performance with 15 second hard commits and 5 minute soft commits. Once the soft commit is triggered, Solr will do whatever it needs to do with the data and open a new searcher that is able to search the entire index: the old segments and the newly indexed data. Since indexing does not directly modify the searcher that is currently serving queries, search requests are not impacted by indexing.

Where IDOL Wins:
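For reference, the two intervals correspond to the maxTime settings (in milliseconds) of the autoCommit and autoSoftCommit elements in solrconfig.xml. The Python sketch below builds the equivalent payload for the Config API found in later Solr releases; this is an assumption for illustration, not how we configured it at the time, and setting openSearcher to false on hard commits is a common choice rather than something stated above.

```python
import json


def commit_policy(hard_commit_seconds, soft_commit_seconds):
    """Build a Config API style payload; Solr expects times in milliseconds."""
    return {
        "set-property": {
            # Hard commit: flush indexed data to disk.
            "updateHandler.autoCommit.maxTime": hard_commit_seconds * 1000,
            # Assumption: hard commits do not open a new searcher.
            "updateHandler.autoCommit.openSearcher": False,
            # Soft commit: make new data visible to searches.
            "updateHandler.autoSoftCommit.maxTime": soft_commit_seconds * 1000,
        }
    }


# 15 second hard commits and 5 minute soft commits, as described above.
payload = commit_policy(15, 5 * 60)
print(json.dumps(payload, indent=2))
```

With these values, a document is on disk within roughly 15 seconds of being indexed and searchable within roughly 5 minutes.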
When a Solr client POSTs files to the server, the indexing begins immediately, while the client connection is still open. This is not something I particularly like, and I think this is one of the things that IDOL got right. With IDOL you are able to send all the data to the server and let it process the data when convenient. This frees up the client for additional work.

Additionally, IDOL allows you to configure multiple connector and CFS instances to distribute the indexing load across many systems. This is critical for some of the larger implementations where you are dealing with several terabytes of data, and indexing all data can take weeks or months.

Another win for IDOL is the amount of custom development required to index data. With IDOL, you can deploy, install and start indexing without any custom development, right out of the box. With Solr, you will need to write something to format the data in a specific JSON format and send it to Solr for indexing. For binary formats, Apache Tika provides a text extraction library that can extract the text so you can include it with your content. After this, you only need to develop something that crawls a repository or a filesystem location for new/changed files and indexes them into Solr.
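To give a sense of how small the "send it to Solr" piece can be, here is a minimal Python sketch that builds a POST to a collection's update handler using only the standard library. The host, port, collection name and document fields are assumptions for the example, and the request is built but deliberately not sent, so the sketch runs without a Solr server.

```python
import json
import urllib.request


def build_update_request(base_url, collection, docs):
    """Build a POST that sends a JSON array of documents to /update."""
    url = f"{base_url}/{collection}/update"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; in practice these come from the crawler.
docs = [{"id": "sku-1001", "name": "Example widget"}]
request = build_update_request("http://localhost:8983/solr", "products", docs)

print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send it.
```

Note that the call blocks until Solr has accepted the batch, which is the client-side cost mentioned above.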

Where Solr Wins:
Solr scores a few points in this category. First of all, indexing was a lot faster than with IDOL. Indexing all of our data took approximately 6 hours in IDOL, while with Solr we were able to cut that time down to about 4 hours.

Second, the index disk sizes were a lot smaller, somewhere in the range of 60% of what IDOL had to use.

Third, and I will mention this more later, Solr provides a lot of flexibility with data processing through different tokenizers and analyzers.

As always, I try not to play favorites here. I truly believe each product does what it does extremely well and, depending on your specific requirements, one may work better than the other. Stay tuned: in the next post I will explore what each product allows you to do with the data once it is indexed.

Thursday, June 19, 2014

IDOL To Solr Migration Lessons Learned - Part 1 Requirements



Recently my organization completed a full IDOL 7 to Solr migration for an e-commerce application and I wanted to share some of the lessons learned with you, as well as my take on Solr and IDOL. If you’ve been reading my blog, you know that I am very keen on requirements, so let’s start with that.

  • Document Count: Around 40 million
  • Churn: Anywhere between 10,000 and 200,000 a night, must be capable of indexing/changing a million products over a weekend.
  • Performance: 75% of searches in IDOL were under half a second, the next 20% were under ¾ of a second, and the rest took a bit longer. Generally, we see anywhere between 200,000 and 300,000 searches per day. They peak between 9am and 4pm, at which point we see between 20,000 and 30,000 an hour, with a historical maximum of around 46,000. Solr was to perform better than IDOL.
  • Relatively heavy use of grouping, faceting and sorting.
  • Hardware: Two servers (for HA), each one with 128 cores and 315 GB of RAM. We also had dedicated storage for each collection/IDOL content engine.
Naturally, there were other requirements, but for the purpose of this series we will stick with this.
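
To put the requirements above into per-second terms, here is a quick back-of-the-envelope calculation in Python. The 48-hour weekend window is my assumption; the other figures come straight from the list.

```python
PEAK_SEARCHES_PER_HOUR = 46_000  # historical maximum from the list above
WEEKEND_UPDATES = 1_000_000      # products to index/change over a weekend
WEEKEND_HOURS = 48               # assumption: the full weekend is available

SECONDS_PER_HOUR = 3600

# Average query rate during the busiest hour on record.
peak_qps = PEAK_SEARCHES_PER_HOUR / SECONDS_PER_HOUR

# Sustained indexing rate needed to finish within the weekend.
weekend_docs_per_second = WEEKEND_UPDATES / (WEEKEND_HOURS * SECONDS_PER_HOUR)

print(f"Peak search rate: {peak_qps:.1f} queries/second")
print(f"Weekend indexing rate: {weekend_docs_per_second:.1f} documents/second")
```

These are averages over the hour and over the weekend, so real bursts will be higher than either number.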

Monday, June 16, 2014

IDOL OnDemand



Autonomy (now HP) Intelligent Data Operating Layer (IDOL) has been a very interesting piece of software to work with over the last few years. Naturally, I was a bit skeptical about its advertised capabilities when I first started my current job, but I’ve been pleasantly surprised. Now, HP has made it possible for people to try out IDOL functionality in the wild for FREE! What’s even cooler is that they have some abstract APIs available for developers to use in building custom apps. Now, IDOL OnDemand is not the actual IDOLServer that folks can buy, at least not yet, but regardless, the APIs are backed by the same IDOL functionality. Give it a shot here!

https://www.idolondemand.com/