Thursday, July 3, 2014

IDOL To Solr Migration Lessons Learned - Part 3 Data Processing



This post is a quick follow-up to the last one, in which I discussed several differences between IDOL and Solr when it comes to indexing data. In this post we will explore how the two products process the data and what they allow you to do with it.

With IDOL we were able to flag certain fields for various types of searches, such as IndexFields for text indexing or parametric fields for other types of searches. Any kind of data massaging had to happen before indexing to IDOL, via custom development or a Lua script in CFS. This certainly got the job done, but it didn’t come close to the flexibility that Solr offers. With Solr, we customized processing for each field using Solr’s built-in tokenizers and analyzers. Each field is treated independently and tokenized (broken into individual searchable elements) according to its own rules. After tokenization, we run several analyzers (token filters) on each field. Analyzers allow you to massage the data a bit before it is indexed; things like stemming, synonyms, and minimum token length filtering occur at this stage.
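To make this concrete, here is a minimal sketch of what such a field type could look like in schema.xml. The field type name and the stopword file are illustrative placeholders, not taken from our actual configuration:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- break the raw text into individual tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lowercase everything so Pump and pump match -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop common words that carry no search value -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- reduce words to their stems, e.g. "pumping" to "pump" -->
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- throw away tokens shorter than two characters -->
    <filter class="solr.LengthFilterFactory" min="2" max="256"/>
  </analyzer>
</fieldType>

Any field declared with this type gets the whole chain applied automatically.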

For example, we have a field for manufacturer part number. The data comes in a variety of formats, and no single rule would accommodate every manufacturer and its preferred part number format. However, we were able to take it up a notch by using the Word Delimiter Filter to massage each token in the manufacturer part number field. The word delimiter filter is configured to break up a token at each transition from digits to letters and vice versa. This splits up patterns such as FLV12345X; a length filter is then applied to drop anything shorter than two characters, so this value would be indexed as FLV12345X (the original term is kept), FLV, and 12345. If a user searches for the part number in any of these forms, it will hit one or more of those three terms in this one field. Since the field is weighted highly, a match there almost guarantees that the product will be displayed. This came in handy, since we don’t have control over our data and get even the same part numbers in different formats from each vendor.
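Here is a sketch of how such a part number field might be configured; the exact attribute values are assumptions chosen to reproduce the behavior described above, not our production settings:

<fieldType name="part_number" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split on letter/digit transitions, keeping the original token too -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnNumerics="1"
            preserveOriginal="1"/>
    <!-- drop fragments shorter than two characters (the lone trailing X) -->
    <filter class="solr.LengthFilterFactory" min="2" max="64"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Run FLV12345X through this chain and you end up with flv12345x, flv, and 12345 in the index, exactly the three terms mentioned above.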

An even cooler part is that you can specify two analysis chains for each field: one for index time and one for query time. This allows you to run a specific process on data when you are indexing it, and a slightly different process when you are querying it. It came in handy for synonyms: we didn’t want to expand queries with synonyms, since some of them are abbreviations and could dilute the search results, so instead we applied them only at index time and indexed a few extra terms for each matched synonym.
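In schema.xml this is expressed as two analyzer blocks on the same field type. A sketch of applying synonyms only at index time (synonyms.txt is a placeholder file name):

<fieldType name="text_syn" class="solr.TextField">
  <!-- index-time chain: expand synonyms into extra indexed terms -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <!-- query-time chain: identical, minus the synonym expansion -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>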

Solr’s list of tokenizers and analyzers can be found here: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Don’t worry if the list looks intimidating at first; Solr’s admin interface lets you test how various strings will look with different field configurations. This allows you to easily try out any field-level processing without having to drum up dummy content, index it, and then search for it until you figure out what is what. The Analysis tab in the Solr UI shows you exactly what will happen to your data when you index it.
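If you would rather script this than click through the UI, the same analysis can be requested over HTTP, assuming the standard /analysis/field handler is enabled in solrconfig.xml (the core name here is a placeholder):

curl "http://localhost:8983/solr/mycore/analysis/field?analysis.fieldtype=part_number&analysis.fieldvalue=FLV12345X&wt=json"

The response lists the token stream after each tokenizer and filter step, which is the same information the Analysis tab renders graphically.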

To sum up and keep it short: Solr is definitely the winner in this category. IDOL has some other cool functionality, especially when dealing with conceptual understanding of various terms, but it’s not in the same league as the awesomeness of Solr’s field-level processing.

I hope you found this post informative. I think field processing is by far the coolest thing about Solr, though there are several other features pretty close behind it. If you liked this post and want to get future posts from me, remember to sign up in the top right.

2 comments:

  1. The real list of analyzers is even longer than the one on the wiki. :-) http://www.solr-start.com/info/analyzers/
