Text casing and Examine

by Aaron Powell 23. August 2010 15:41

A few times I’ve seen questions posted on the Umbraco forums asking how to deal with case-insensitive text in Examine, and it’s also something that we’ve had to handle a few times within our own company.

Here’s a scenario:

  • You have a site search
  • You use Examine
  • You want to show the results looking exactly the same as they did before they went into Examine

If you’re running a standard install you’ll notice that the content always ends up lowercased!

This is a bit of a problem: page titles will be lowercase, body content will be lowercase, etc. Part of this is due to a mistake in Examine, and part of it is due to the design of Lucene.

In this article I’ll have a look at what you need to do to make it work as you’d expect.

First, some background

Before we dive directly into what to do to fix it you really should understand what is happening. If you don’t care feel free to skip over this bit though :P.

Searching is a tricky thing, and when searching, the statement "Examine" == "examine" is false. To get around this, searching is best done in a case-insensitive manner. To make this work, Examine did a forced lowercase of the content before it was pushed into Lucene.Net. This ensured that everything was in the same case when it was searched against.
In hindsight this is not really a great idea; it really should be the responsibility of the Lucene Analyzer to handle this for you.
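
To make the comparison problem concrete, here is a minimal sketch using plain .NET strings only. The CasingDemo class and its method names are made up for illustration and are not part of Examine or Lucene.Net.

```csharp
using System;

// Hypothetical helper for illustration; not part of Examine or Lucene.Net.
public static class CasingDemo
{
    // The == operator on strings is an ordinal, case-sensitive comparison,
    // so "Examine" and "examine" are treated as different values.
    public static bool ExactMatch(string indexed, string query)
    {
        return indexed == query;
    }

    // Lowercasing both sides before comparing makes the match case insensitive.
    // The cost is that the lowercased copy no longer has the original casing.
    public static bool LowercasedMatch(string indexed, string query)
    {
        return indexed.ToLowerInvariant() == query.ToLowerInvariant();
    }
}
```

Calling CasingDemo.ExactMatch("Examine", "examine") gives false, while CasingDemo.LowercasedMatch("Examine", "examine") gives true.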

Many of the common Lucene.Net analyzers actually do automatic lowercasing of content. These analyzers are:

  • StandardAnalyzer
  • StopAnalyzer
  • SimpleAnalyzer

So if you’re using the standard Examine config you’ll find yourself using the StandardAnalyzer and still have your content lowercased.

This means that there’s no need for Lucene to concern itself with case sensitivity when searching: everything is parsed by the analyzer (field terms and queries) and you’ll get more matches.
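
As a rough illustration of why running both the field text and the query through the same lowercasing step gives more matches, here is a toy model. It is not Lucene.Net code: ToyAnalyzer is a hypothetical stand-in that only splits on whitespace and lowercases, whereas the real analyzers do considerably more.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for a lowercasing analyzer; an assumption for this sketch,
// not the Lucene.Net API.
public static class ToyAnalyzer
{
    // Split on whitespace and lowercase every token.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            terms.Add(token.ToLowerInvariant());
        }
        return terms;
    }

    // The query matches when every analyzed query term is also
    // one of the analyzed field terms.
    public static bool Matches(string fieldText, string query)
    {
        var fieldTerms = new HashSet<string>(Analyze(fieldText));
        foreach (var term in Analyze(query))
        {
            if (!fieldTerms.Contains(term))
            {
                return false;
            }
        }
        return true;
    }
}
```

With this model, ToyAnalyzer.Matches("Text casing and Examine", "EXAMINE casing") succeeds even though neither word has the same case as in the field text.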

So how do I get around this?

Now that we’ve seen why all your content is generally lower case, how can we work with it in the original format and display it back to the UI?

Well we need some way in which we can have the field data stored without the analyzer screwing around with it.

Note: This doesn’t need to be done if you’re using an analyzer which doesn’t have a LowerCaseTokenizer or LowerCaseFilter. If you’re using a different analyzer, like KeywordAnalyzer, then this post won’t cover what you’re after (since the KeywordAnalyzer doesn’t lowercase, seeing lowercased content means you’re actually using an out-dated version of Examine, and I recommend you grab the latest release :)). More information on Analyzers can be found at http://www.aaron-powell.com/lucene-analyzer

Luckily we’ve got some hooks into Examine to allow us to do what we need here, in the form of an event on the Examine.LuceneEngine.Providers.LuceneIndexer called DocumentWriting. Note that this event is on the LuceneIndexer, not the BaseIndexProvider. The event is Lucene.Net specific, so it wouldn’t make sense on the base class, which is agnostic of any underlying framework.

What we can do with this event is interact directly with Lucene.Net while Examine is working with it.
You’ll need to have a bit of an understanding of how to work with a Lucene.Net Document (for that I’d recommend having a read of this article from me: http://www.aaron-powell.com/documents-in-lucene-net), because what you’re able to do is play with Lucene.Net directly… Feel the power!

So we can attach the event handler the same way as you would do any other event in Umbraco, using an Action Handler:

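using Examine;
using Examine.LuceneEngine.Providers;
using umbraco.BusinessLogic;

public class UmbracoEvents : ApplicationBase
{
    public UmbracoEvents()
    {
        // cast to the Lucene-specific indexer, as DocumentWriting is not on the base provider
        var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["DefaultIndexer"];

        // indexer_DocumentWriting needs to be a method on this class
        indexer.DocumentWriting += indexer_DocumentWriting;
    }
}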

To do this we’ve got to cast the indexer so we’ve got the Lucene version to work with, then we’re attaching our event handler. Let’s have a look at the event handler:

//needs: using Examine.LuceneEngine; using Lucene.Net.Documents;
void indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
{
    //grab our Lucene document from the event arguments
    var doc = e.Document;

    //the e.Fields dictionary is all the fields which are about to be inserted into Lucene.Net
    //we'll grab out the "bodyContent" one, if there is one to be indexed
    if (e.Fields.ContainsKey("bodyContent"))
    {
        string content = e.Fields["bodyContent"];
        //Give the field a name which you'll be able to easily remember
        //also, we're telling Lucene to just put this data in, nothing more
        doc.Add(new Field("__bodyContent", content, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
}

And that’s how you can push data in. I’d recommend that you do a conditional check to ensure that the property you’re looking for does exist in the Fields property of the event args, unless you’re 100% sure that it appears on all the objects which you’re indexing.

Lastly we need to display that on the UI. Well, it’s easy: rather than accessing the bodyContent field of the SearchResult, use __bodyContent and you’ll get your unanalyzed version.
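
A small sketch of that lookup follows. To keep it self-contained, the fields of a single search result are modelled as a plain dictionary, which is an assumption of this example; ResultDisplay and GetBodyForDisplay are hypothetical names, not Examine API.

```csharp
using System.Collections.Generic;

// Hypothetical helper; 'fields' stands in for the field collection of one
// search result, modelled here as a plain dictionary (an assumption).
public static class ResultDisplay
{
    public static string GetBodyForDisplay(IDictionary<string, string> fields)
    {
        string value;

        // Prefer the unanalyzed copy written in the DocumentWriting handler.
        if (fields.TryGetValue("__bodyContent", out value))
        {
            return value;
        }

        // Fall back to the regular field if the copy was never indexed,
        // e.g. for documents indexed before the handler was attached.
        if (fields.TryGetValue("bodyContent", out value))
        {
            return value;
        }

        return string.Empty;
    }
}
```

The fallback is a design choice for this sketch: documents that were indexed before the event handler existed won’t have the __bodyContent field until they’re re-indexed.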

Conclusion

Here we’ve looked at how we can use the Examine events to interact with the Lucene.Net Document. We’ve decided that we want to push in unanalyzed text, but you could use this idea to really tweak your Lucene.Net document. But really playing with the Document is not recommended unless you *really* know what you’re doing ;).

Categories: .Net | Examine | Umbraco

Comments

8/24/2010 12:55:20 PM #

In Umbraco 4.5 this example doesn't work. This is exactly what I'm looking for but I just can't get any of it to work in Umbraco 4.5 :(

RoboDog United Kingdom

8/24/2010 5:32:17 PM #

Never mind got it working now lol :)

RoboDog United Kingdom

8/27/2010 7:24:36 PM #

Hi,

First of all, Examine rocks!
However I can't get the above example to work with Umbraco 4.5 Updated to Examine RC3.

The code I'm using:

void indexer_DocumentWriting(object sender, Examine.LuceneEngine.DocumentWritingEventArgs e)
{
    var doc = e.Document;

    foreach (var f in e.Fields)
    {
        doc.Add(
            new Field(
                string.Concat("u_", f.Key),
                f.Value,
                Field.Store.YES,
                Field.Index.NOT_ANALYZED
            )
        );
    }
}

The event gets triggered but there are no changes to the search results.
Any help would be much appreciated!

Kind regards

Ron Brouwer

Ron Brouwer Netherlands

8/30/2010 2:48:05 AM #

Are you using the custom field as the display property for your search results? Is the custom field returning in your SearchResult objects?

AaronPowell United States

8/30/2010 3:17:00 AM #

What term are you searching on (and what case is it) and what Analyzer are you using?

ShannonDeminick Australia

8/30/2010 11:34:50 AM #

The results that I get are correct so I don't have a problem with the search itself. I expect the extra fields that are added using the above event to be available in my searchresult. When I look at the individual searchresults, the fields [string.Concat("u_", f.Key)] are not in the fields collection.

Ron Brouwer Netherlands

8/30/2010 12:02:19 PM #

Have you tried inspecting the Lucene index with Luke (http://code.google.com/p/luke/)? This will tell you if the custom field you added is actually in the index.

When a document is being 'hydrated' out of the Lucene store it will read each field which is available for that document and then add it to the search results object.

You should have each field twice then, assuming that the data has actually made it into Lucene.

AaronPowell Australia

8/30/2010 12:35:27 PM #

Examine returns:

Count = 11
    [0]: {[name, Dummy]}
    [1]: {[id, 1105]}
    [2]: {[nodeName, 115f58e9-bbe3-4a21-9428-592901dfa010]}
    [3]: {[updateDate, 20100827171303000]}
    [4]: {[writerName, Administrator]}
    [5]: {[path, -1,1105]}
    [6]: {[nodeTypeAlias, image]}
    [7]: {[parentID, -1]}
    [8]: {[__NodeId, 1105]}
    [9]: {[__IndexType, content]}
    [10]: {[__Path, -1,1105]}

Luke returns:

stored/uncompressed,indexed,omitNorms<__IndexType:content>
stored/uncompressed,indexed,omitNorms<__NodeId:1105>
stored/uncompressed,indexed,omitNorms<__Path:-1,1105>
stored/uncompressed,indexed,termVector,omitNorms<createDate:20100826155341000>
stored/uncompressed,omitNorms<creatorID:0>
stored/uncompressed,indexed,termVector<creatorName:Administrator>
stored/uncompressed,indexed,termVector<id:1105>
stored/uncompressed,omitNorms<level:1>
stored/uncompressed,indexed,tokenized,termVector<name:Dummy>
stored/uncompressed,indexed,tokenized,termVector<nodeName:115f58e9-bbe3-4a21-9428-592901dfa010>
stored/uncompressed,indexed,termVector<nodeType:1064>
stored/uncompressed,indexed,termVector<nodeTypeAlias:image>
stored/uncompressed,omitNorms<parentID:-1>
stored/uncompressed,indexed,termVector<path:-1,1105>
stored/uncompressed,omitNorms<sortOrder:2>
stored/uncompressed,indexed,termVector<template:0>
stored/uncompressed,indexed<u___IndexType:content>
stored/uncompressed,indexed<u___NodeId:1105>
stored/uncompressed,indexed<u___Path:-1,1105>
stored/uncompressed,indexed<u_createDate:2010-08-26T15:53:41>
stored/uncompressed,indexed<u_creatorID:0>
stored/uncompressed,indexed<u_creatorName:Administrator>
stored/uncompressed,indexed<u_id:1105>
stored/uncompressed,indexed<u_level:1>
stored/uncompressed,indexed<u_name:Dummy>
stored/uncompressed,indexed<u_nodeName:115f58e9-bbe3-4a21-9428-592901dfa010>
stored/uncompressed,indexed<u_nodeType:1064>
stored/uncompressed,indexed<u_nodeTypeAlias:image>
stored/uncompressed,indexed<u_parentID:-1>
stored/uncompressed,indexed<u_path:-1,1105>
stored/uncompressed,indexed<u_sortOrder:2>
stored/uncompressed,indexed<u_template:0>
stored/uncompressed,indexed<u_updateDate:2010-08-27T17:13:03>
stored/uncompressed,indexed<u_urlName:115f58e9-bbe3-4a21-9428-592901dfa010>
stored/uncompressed,indexed<u_writerID:0>
stored/uncompressed,indexed<u_writerName:Administrator>
stored/uncompressed,indexed,termVector,omitNorms<updateDate:20100827171303000>
stored/uncompressed,indexed,termVector<urlName:115f58e9-bbe3-4a21-9428-592901dfa010>
stored/uncompressed,omitNorms<writerID:0>
stored/uncompressed,indexed,termVector<writerName:Administrator>


It seems that examine does not return all fields.
I'm also wondering if it is possible to store the html markup in the index because it seems to be removed.

Thanks.

Ron Brouwer Netherlands

8/30/2010 2:04:35 PM #

I changed the event to:

void indexer_DocumentWriting(object sender, Examine.LuceneEngine.DocumentWritingEventArgs e)
{
    var lucDocument = e.Document;
    var umbDocument = new umbraco.cms.businesslogic.web.Document(Convert.ToInt32(e.Fields["__NodeId"]));

    foreach (var f in e.Fields)
    {
        var prop = umbDocument.getProperty(f.Key);
        if (prop != null)
        {
            lucDocument.Add(
                new Field(
                    string.Concat("prop_", f.Key),
                    prop.Value.ToString(),
                    Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS
                )
            );
        }
    }
}

I don't like to load the document but I think it's the only way to get the HTML syntax. So now the HTML is stored but Examine still does not return the fields.

Any suggestions?

Ron Brouwer Netherlands

8/30/2010 5:38:47 PM #

I found the problem.

            var searchProvider = ExamineManager.Instance
                .SearchProviderCollection["MySearcher"];

            ISearchCriteria sc = searchProvider.CreateSearchCriteria(IndexTypes.Content,BooleanOperation.Or);

            IBooleanOperation query = null;
            
            // create query

            var result = searchProvider.Search(query.Compile());

Instead of using "var result = searchProvider.Search(query.Compile());"
I used "ExamineManager.Instance.Search(query.Compile());" which uses the default index.

Thanks for all the help.

Ron Brouwer Netherlands

8/31/2010 2:59:03 AM #

I'm glad you've figured it out :)

ShannonDeminick Australia

9/3/2010 12:06:04 PM #

Hi,

Many thanks for your post.

Will this code work against Umbraco 4.5.2 or has the version of Examine changed since this was written?

I'm having problems as the following line doesn't compile, any advice?

var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["DefaultIndexer"];

Does anyone have a code sample working under 4.5.2?

Many thanks

Rich

Rich United Kingdom

9/3/2010 12:32:02 PM #

Yes, the code will work fine in 4.5.2.

As for your compile error (without having any idea what the error you're getting is, as that should give you direction as to what needs fixing) ensure that you have the correct assemblies referenced in your project, and that you have the appropriate namespaces in your code file.

AaronPowell Australia

9/3/2010 1:26:25 PM #

Thanks Aaron,

Indeed the error is due to missing namespace

"The type or namespace name 'LuceneIndexer' could not be found (are you missing a using directive or an assembly reference?)"

Could you please advise which namespace needs to be referenced as I can't figure it out.

I've added a reference to Examine.dll, UmbracoExamine.dll & Lucene.Net

I've tried referencing all the following assemblies without success

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using umbraco.cms.businesslogic.web;
using umbraco.BusinessLogic;
using System.Configuration.Assemblies;

using Examine;
using Examine.Config;
using Examine.Providers;
using Examine.SearchCriteria;

using Lucene;
using Lucene.Net;
using Lucene.Net.Index;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Messages;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

using UmbracoExamine;
using UmbracoExamine.Config;
using UmbracoExamine.SearchCriteria;

Many thanks!

Rich

Rich United Kingdom

9/6/2010 2:44:11 AM #

In Umbraco 4.5.2 you need to use the class LuceneExamineIndexer, not LuceneIndexer. That was a mistake by me as I was using the next version of Examine which separates Examine for Umbraco, refer to this post: farmcode.org/.../...arch-with-ANY-data-source.aspx

AaronPowell Australia

2/22/2011 9:53:08 PM #

I just noticed that in Umbraco Examine RC1 this is no longer an issue. You can get your text out of the SearchResult object in its original case with no further hassle.

Danny United States

2/23/2011 3:06:57 AM #

Examine RC1 is old, please upgrade to the RTM release from CodePlex:
http://examine.codeplex.com

But note this is not entirely true; it depends on what analyzer you are using and whether or not it's a case-sensitive analyzer.

ShannonDeminick United States

2/25/2011 2:52:03 PM #

If you want to add the html to lucene you could also use the GatheringNodeData event. See this topic: our.umbraco.org/.../17737-Examine-returns-wrong-markup

Jeroen

Jeroen Breuer Netherlands

3/1/2011 2:51:24 AM #

Yes, but you would also need to add that field to your config. Using the DocumentWriting event allows you to directly manipulate the Lucene index and store the information in Lucene differently if you wanted to (i.e. Tokenized vs Non-Tokenized, with Norms, etc...)

ShannonDeminick Australia