Wednesday, June 25, 2014

IDOL To Solr Migration Lessons Learned - Part 2 Indexing Data

Since the data we were indexing into IDOL was plain old XML, and we had a rather custom rig for indexing content, we were easily able to modify it to generate JSON files in a format that Solr understands. The only tweak we had to make was to flatten our data structure, since Solr didn't support nested elements the way IDOL did. All in all, not a big loss, and in retrospect this actually simplified things. However, the lack of support seemed odd at first. Let's examine how both products index content and some of my favorite things about each one.
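The flattening step is straightforward in code. Here is a minimal sketch of the idea; the record and field names are made up for illustration, and the underscore-joined key convention is just one reasonable choice:

```python
import json

def flatten(node, prefix="", out=None):
    """Recursively flatten a nested dict into a single-level
    Solr-style document, joining nested keys with underscores."""
    if out is None:
        out = {}
    for key, value in node.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flatten(value, name, out)
        else:
            out[name] = value
    return out

# Hypothetical record as it might have looked in IDOL's nested XML,
# already parsed into a dict:
record = {
    "title": "Quarterly Report",
    "author": {"name": "J. Smith", "dept": "Finance"},
}

doc = flatten(record)
print(json.dumps(doc))
# {"title": "Quarterly Report", "author_name": "J. Smith", "author_dept": "Finance"}
```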

In the IDOL world, we index content through the DIH (Distributed Index Handler), which distributes the data; each content engine then indexes the data, and it sits in a queue until the content engine performs a DRESYNC operation to commit it to the index. The search performance drag of DRESYNC was unacceptable in regular production operation: it would take forever, and searches would slow to a crawl. However, since we had a mirrored set-up, we could easily mark one of the content servers offline at the DAH (where queries come from), DRESYNC it while its counterpart responds to searches, then bring it back up and perform the same operation on the other content engine. Other maintenance activities like DRECOMPACT were executed in a similar manner. Naturally, it was all automated and didn't give us much trouble.
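The rotation logic itself is simple. Below is a sketch of the ordering only: the host names are hypothetical, and `send()` is a stub standing in for the real HTTP calls to the DAH and to each content engine, which will vary by deployment:

```python
# Sketch of the rolling-maintenance loop described above.
CONTENT_ENGINES = ["content1", "content2"]  # hypothetical mirrored pair

def send(host, action):
    # In the real rig this issued an HTTP request (e.g. an index
    # action like DRESYNC to the engine); here we just record the
    # step so the ordering is visible.
    return (host, action)

def rolling_maintenance(action="DRESYNC"):
    steps = []
    for engine in CONTENT_ENGINES:
        steps.append(send("dah", f"disable:{engine}"))  # stop routing queries to it
        steps.append(send(engine, action))              # commit/compact while offline
        steps.append(send("dah", f"enable:{engine}"))   # put it back into rotation
    return steps

for host, action in rolling_maintenance():
    print(host, action)
```

The point of the loop is that at every moment exactly one mirror is serving queries, so the expensive commit never touches a live engine.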

IDOL is generally used to index unstructured content: binary documents such as PDF, MS Word, Excel, HTML, etc. Through processing, it will extract all document-level metadata and content and index it in IDOL. Additionally, indexing is a distributed process; the server does not do ALL the heavy lifting. Generally, a repository-specific connector (there are about 400 of them) will pick up the file and send it to Connector Framework Server (CFS). CFS will process the document, running things like text extraction, pre/post-processing tasks, and custom Lua scripts to massage the data. Once everything is complete, it will send the data to IDOL for final indexing. As soon as IDOL receives the file, the client's (CFS's) job is done, and it is freed back into the wild for additional indexing tasks, even if the file is not indexed yet. IDOL would then index the content, but not commit it to the index until DRESYNC is run.

With Solr, we were pleasantly surprised: the indexing and committing operations did not drag down search performance, and indexing was also a lot faster than in IDOL. We used an aggressive soft and hard commit policy that made new content searchable shortly after indexing. With Solr we no longer required the maintenance rotation described above and could index content throughout the day without noticeably impacting performance. This allowed us to process large quantities of updates and changes during the day whenever there was a large backlog.

The way Solr commits data to the index is pretty cool. In a nutshell, it opens a new segment of the index and writes data to it without touching the existing segments. The frequency of flushing indexed data from memory to disk is controlled by the hard commit interval; the frequency of making newly indexed data searchable is controlled by the soft commit interval. We were able to get optimal performance with 15-second hard commits and 5-minute soft commits. Once a soft commit is triggered, Solr does whatever it needs to do with the data and opens a new Searcher that can see the entire index: the old segments plus the newly indexed data. Since indexing does not directly modify the active Searcher, search requests are not impacted by indexing.
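In Solr this cadence is configured in `solrconfig.xml` under the update handler. A fragment matching the intervals mentioned above might look like this (values in milliseconds):

```xml
<!-- solrconfig.xml: hard commit every 15s (flush to disk, no new
     searcher), soft commit every 5 min (make new docs searchable) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

`openSearcher=false` on the hard commit is what keeps the durability flush from paying the cost of opening a new Searcher; only the soft commit does that.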
Where IDOL Wins:
When a Solr client POSTs files to the server, indexing begins immediately, while the client connection is still open. This is not something I particularly like, and I think it is one of the things IDOL got right: with IDOL, you can send all the data to the server and let it process the data when convenient, which frees up the client for additional work.
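For reference, a Solr update POST looks roughly like this. The host and collection name are placeholders; the sketch only builds the request (the actual send is commented out, since it blocks the client until Solr has ingested the batch):

```python
import json
from urllib import request

def build_update_request(docs, base_url="http://localhost:8983/solr/mycollection"):
    """Build (but don't send) a Solr JSON update request."""
    body = json.dumps(docs).encode("utf-8")
    # commitWithin asks Solr to make the docs searchable within 10s
    # without the client issuing an explicit commit.
    url = f"{base_url}/update?commitWithin=10000"
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = build_update_request([{"id": "doc1", "title": "Hello"}])
# Sending holds the connection open while indexing begins:
# with request.urlopen(req) as resp:
#     print(resp.status)
```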

Additionally, IDOL allows you to configure multiple connector and CFS instances to distribute the indexing load across many systems. This is critical for some of the larger implementations where you are dealing with several terabytes of data, and indexing all data can take weeks or months.

Another win for IDOL is the amount of custom development required to index data. With IDOL, you can deploy, install, and start indexing without any custom development, right out of the box. With Solr, you will need to write something that formats the data into the JSON structure Solr expects and sends it to Solr for indexing. For binary formats, Apache Tika provides a text extraction library that can pull the text out and include it with your content. On top of that, you need to develop something that crawls a repository or a filesystem location for new/changed files and indexes them into Solr.

Where Solr Wins:
Solr scores a few points in this category. First of all, indexing was a lot faster than in IDOL: indexing all of our data took approximately 6 hours in IDOL, while with Solr we were able to cut that time down to about 4 hours.

Second, the index sizes on disk were a lot smaller, somewhere in the range of 60% of what IDOL had to use.

Third, and I will mention this more later, Solr provides a lot of flexibility with data processing through different tokenizers and analyzers.
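To give a taste of that flexibility, a Solr field type chains a tokenizer and filters in the schema. A typical example (the field type name is arbitrary; the factory classes are standard Solr ones):

```xml
<!-- schema.xml: split on standard word boundaries, lowercase,
     then stem English terms -->
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Swapping any link in that chain changes how text is indexed and matched, per field, without touching application code.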

As always, I try not to play favorites here; I truly believe each product does what it does extremely well, and depending on your specific requirements, one may work better than the other. Stay tuned: in the next post I will explore what each product allows you to do with the data once it is indexed.