This post is a quick follow-up to the last one, in which I discussed several differences between IDOL and Solr when it comes to indexing data. In this post we will explore how the two products maintain the data and what they allow you to do with it.
With IDOL, we were able to designate certain fields for particular kinds of searches, such as IndexFields for text indexing or parametric fields for other types of searches. Any kind of data massaging had to happen before indexing into IDOL, via custom development or a Lua script in CFS. This certainly got the job done, but it didn’t come close to the flexibility that Solr offers. With Solr, we customized processing for each field using Solr’s built-in analyzers and tokenizers. Each field is treated independently and tokenized (broken into individual searchable elements) on its own. After tokenization, a chain of token filters runs on each field. These filters allow you to massage the data a bit before it is indexed; things like stemming, synonyms, and minimum token length filtering happen at this step.
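To make this concrete, here is a minimal sketch of what such a per-field configuration looks like in Solr’s schema.xml (the field type name and filter choices here are just an illustration, not our actual schema):

```xml
<!-- A text field type: one tokenizer followed by a chain of token filters. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stemming: "running" and "runs" are both indexed as "run". -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```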
For example, we have a field for manufacturer part number. The data comes in a variety of formats, and no single rule would accommodate every manufacturer’s preferred part number format. However, we were able to take it up a notch by using the Word Delimiter Filter to massage each token in the manufacturer part number field. The word delimiter filter is configured to break up a token at the transition from numbers to letters and vice versa. This breaks up patterns such as FLV12345X; then the length filter is applied to drop anything shorter than two characters, so this field would be indexed as FLV12345X (the original term is kept), FLV, and 12345. If a user searches for a part number exactly as it was indexed, all three terms in that field will hit. Since the field is weighted highly, that almost guarantees that the product will be displayed. This came in handy, since we don’t have control over our data and receive even the same part numbers in different formats from each vendor.
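For the curious, a field type along these lines might look like this in schema.xml (a sketch with approximate attribute values, not a copy of our configuration):

```xml
<!-- Hypothetical part number field: split at letter/digit transitions,
     keep the original token, and drop tokens shorter than two characters. -->
<fieldType name="part_number" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="1"
            generateWordParts="1"
            generateNumberParts="1"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="512"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, FLV12345X is indexed as flv12345x, flv, and 12345 after lowercasing.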
An even cooler part is that you can specify two analysis chains for each field: one for index time and one for query time. This allows you to run a specific process on the data when you are indexing it and a slightly different one when you are querying it. This came in handy for synonyms: we didn’t want to expand queries with synonyms, since some of them are abbreviations that could dilute the search results, so instead we applied them at index time and indexed a few extra terms for each matched synonym.
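A sketch of how that looks in schema.xml (the field type name and synonyms.txt are placeholders, not our actual setup):

```xml
<!-- Separate index-time and query-time chains; synonyms apply only at index time. -->
<fieldType name="text_synonyms" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expand="true" indexes the synonym terms alongside the original. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <!-- No synonym filter here, so queries are not expanded. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```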
Solr’s list of tokenizers and analyzers can be found here: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Don’t worry if the list looks intimidating at first; Solr’s admin interface lets you test how various strings will be processed under different field configurations. This lets you easily try out any field-level processing without having to drum up dummy content, index it, and then search for it until you figure out what is what. The Analysis tab in the Solr UI shows you exactly what will happen to your data when you index it.
To sum up and keep it short: Solr is definitely the winner in this category. IDOL has some other cool functionality, especially when it comes to conceptual understanding of terms, but it is not the same as the awesomeness of Solr’s field-level processing.
I hope you found this post informative. I think field processing is by far the coolest thing about Solr, but there are several other features close behind it. If you liked this post and want to get future posts from me, remember to sign up in the top right.