2015-04-01

Sitecore 7 ContentSearch Tips

This is a WIP: I have a couple of new features I want to write about here, and I want to clean up the examples a bit.

Recently I’ve been trying to implement Sitecore 7’s new ContentSearch features for a client and had to figure a lot of things out. Documentation has been sparse, misleading, and often incomplete or obscure.

This article aims to list out the hard-won techniques I learned during the implementation phase, in hopes of alleviating the pain others are likely experiencing, and as a reminder to myself should I need to revisit this in the future.

We’re going to walk through setting up a hypothetical blog search feature, where we have the following information hierarchy:

Blog
- Title (text)
- Posted Date (Date)
- Body (rich text)
- Keywords (text)
- Associated Product (drop-down link to products)
- Tags (multi list)
Product
- Name (text)
- Description (rich text)
- Instructions (rich text)
Tags
- Name

Terminology

document - a single entity stored by Lucene: a product, a blog post, etc
field - a property of a stored document
index - a set of documents stored by Lucene
stored field - a field whose value is actually saved within Lucene
unstored field - a field that Lucene indexes but whose value it does not store. So Lucene would know that the field contained the word “foobar” (and possibly where in the string it was found) but could not give you back the whole field
tokenized field - a field that is broken up into terms (words). Use this if you want to search for documents with a field containing a term. The field value “foo bar baz” would be indexed as the list of “foo”, “bar” and “baz”
untokenized field - a field that is indexed as one single term. Use this when the field doesn’t make sense to split up, for example prices, dates, cities
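
To make the tokenized/untokenized distinction concrete, here is a rough sketch in plain C#. It does not touch Lucene at all: the whitespace split and lowercasing only stand in for what a real analyzer does, and the TermSketch class and its method names are made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of Sitecore or Lucene.
public static class TermSketch
{
    // Roughly what a tokenized field ends up as: one term per word.
    // Assumption: a real analyzer also deals with punctuation, stop words, etc.
    public static string[] Tokenized(string value)
    {
        return value
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => word.ToLowerInvariant())
            .ToArray();
    }

    // An untokenized field is indexed as one single term, spaces and all.
    public static string[] Untokenized(string value)
    {
        return new[] { value };
    }
}
```

With this sketch, Tokenized("foo bar baz") gives three terms while Untokenized("foo bar baz") gives exactly one, which is why a search for “bar” can only hit the tokenized version.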

Configuration

Install Luke for index viewing

You’re going to want to get Luke, the Lucene Index Toolbox, which is hosted on Google Code (at least until that gets shuttered). It lets you test queries, see what values are being indexed, and check their tokenization status.

Initial setup

Make sure the ~/App_Config/Include/Sitecore.ContentSearch.* files are added to Visual Studio.

Create a new config called ~/App_Config/Include/Sitecore.ContentSearch.Lucene.BlogPosts.config

I’m going to paste the config in and then elaborate on parts of it. You can adapt it to your needs.

Sitecore.ContentSearch.Lucene.BlogPosts.config

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration type="Sitecore.ContentSearch.ContentSearchConfiguration, Sitecore.ContentSearch">
        <indexes hint="list:AddIndex">
          <index id="blog_posts" type="Sitecore.ContentSearch.LuceneProvider.LuceneIndex, Sitecore.ContentSearch.LuceneProvider">
            <param desc="name">$(id)</param>
            <param desc="folder">$(id)</param>
            <!-- This initializes index property store. Id has to be set to the index id -->
            <param desc="propertyStore" ref="contentSearch/databasePropertyStore" param1="$(id)" />
            <configuration ref="contentSearch/indexConfigurations/blogIndexConfiguration" />
            <strategies hint="list:AddStrategy">
              <!-- NOTE: the order of these controls the execution order -->
              <strategy ref="contentSearch/indexUpdateStrategies/onPublishEndAsync" />
              <strategy ref="contentSearch/indexUpdateStrategies/intervalAsyncCore">
                <param desc="database">web</param>
                <param desc="interval">01:00:00</param><!-- re-index changed items every hour -->
              </strategy>
            </strategies>
            <commitPolicyExecutor type="Sitecore.ContentSearch.CommitPolicyExecutor, Sitecore.ContentSearch">
              <policies hint="list:AddCommitPolicy">
                <policy type="Sitecore.ContentSearch.TimeIntervalCommitPolicy, Sitecore.ContentSearch" />
              </policies>
            </commitPolicyExecutor>
            <locations hint="list:AddCrawler">
              <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
                <Database>web</Database>
                <Root>/sitecore/content/BlogPosts</Root>
              </crawler>
            </locations>
          </index>
        </indexes>
      </configuration>
      <indexConfigurations>
        <blogIndexConfiguration type="Sitecore.ContentSearch.LuceneProvider.LuceneIndexConfiguration, Sitecore.ContentSearch.LuceneProvider">
          <indexAllFields>false</indexAllFields>
          <initializeOnAdd>true</initializeOnAdd>
          <analyzer ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/analyzer" />
          <fieldMap type="Sitecore.ContentSearch.FieldMap, Sitecore.ContentSearch">
            <fieldNames hint="raw:AddFieldByFieldName">

              <!-- you must have _uniqueid or you won't be able to update the document later -->
              <field fieldName="_uniqueid"            storageType="YES" indexType="TOKENIZED"    vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
                <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />
              </field>

              <field fieldName="title"         storageType="YES" indexType="TOKENIZED"    vectorType="YES" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
              <field fieldName="body"          storageType="YES" indexType="TOKENIZED"    vectorType="YES" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
              <field fieldName="keywords"      storageType="YES" indexType="TOKENIZED"    vectorType="YES" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
              <field fieldName="tags"          storageType="YES" indexType="TOKENIZED"    vectorType="YES" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
              <field fieldName="posted date"   storageType="YES" indexType="UNTOKENIZED"  vectorType="YES" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
              <field fieldName="sorting_title" storageType="YES" indexType="UNTOKENIZED"  vectorType="YES" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />

            </fieldNames>
          </fieldMap>
          <fields hint="raw:AddComputedIndexField">
            <field fieldName="product_name">MyApp.Indexing.ComputedFields.ProductName, MyApp</field>
            <field fieldName="product_description">MyApp.Indexing.ComputedFields.ProductDescription, MyApp</field>
            <field fieldName="product_instructions">MyApp.Indexing.ComputedFields.ProductInstructions, MyApp</field>
            <field fieldName="tag_names">MyApp.Indexing.ComputedFields.TagNames, MyApp</field>
            <field fieldName="sorting_title">MyApp.Indexing.ComputedFields.SortingTitle, MyApp</field>
          </fields>
          <fieldReaders ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/fieldReaders"/>
          <indexFieldStorageValueFormatter ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/indexFieldStorageValueFormatter"/>
          <indexDocumentPropertyMapper ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/indexDocumentPropertyMapper"/>
          <include hint="list:IncludeTemplate">
            <BlogPostTemplate>{7FAFEDF6-9438-4CAD-9E04-3FCD89206D2F}</BlogPostTemplate>
          </include>
          <include hint="list:IncludeField">
            <!-- title -->
            <fieldId>{83BCB10F-AC40-4E94-8913-363916D2411B}</fieldId>
            <!-- body -->
            <fieldId>{A6C091C8-9544-493E-AA80-B6DF5504F2FD}</fieldId>
            <!-- keywords -->
            <fieldId>{FDAC4C72-967C-456E-A44F-C29BAA1BC007}</fieldId>
            <!-- posted date -->
            <fieldId>{FF68B450-E5FE-4AA2-83B9-695B260A21A0}</fieldId>
            <!-- tags guids -->
            <fieldId>{070D64BB-EF7A-432C-A6E3-416B21886F06}</fieldId>
          </include>
        </blogIndexConfiguration>
      </indexConfigurations>
    </contentSearch>
  </sitecore>
</configuration>

strategies

These elements inform Sitecore how to keep your index up to date.

The onPublishEndAsync strategy handles the event fired when a publish finishes. Items changed by the publish are batched and re-indexed.

The intervalAsyncCore strategy re-indexes changed documents on a schedule. I’ve changed the parameters to focus on a different database and to use a different schedule.

https://www.sitecore.net/learn/blogs/technical-blogs/john-west-sitecore-blog/posts/2013/04/sitecore-7-index-update-strategies.aspx

database and directory

Specify the database and root path you want crawled, using the Database and Root elements of the crawler node. You can add more than one crawler node with different databases or root paths.

raw:AddFieldByFieldName

When using <indexAllFields>false</indexAllFields> as I have here, you need to configure how you want each field to be indexed.

In this example, I’ve stored all the fields, because that makes it possible to view them in Luke. I’ve stored posted date as an untokenized field because you’re only ever interested in the full date.

To reference a Sitecore field, use the name of the field in lower case. So to index the “Posted Date” field, I have referenced “posted date” in the config. Within Lucene, it will actually be “posted_date” because Lucene doesn’t allow spaces in field names without having to escape them every time.

sorting_title

In order to sort by a field, you need to index it untokenized so that Lucene can access it as one whole term. A ComputedField (discussed below) is used to populate it.

include hint="list:IncludeTemplate"

List the template IDs that you want indexed here. Sitecore will crawl the directories listed in the locations node and index any item with the specified templates.

Convert multi list items into a space separated list of names

For example, with the blog tags, Sitecore will by default index a multi list as a space separated list of lowercase guids (Guid.ToString("N")). This is useful for filtering when you know the guid ahead of time, but not when responding to a user’s free form search field.

In those situations, it can be useful to also denormalize the relationship and store the tag names alongside the document. A Computed Field, discussed below, can iterate over the related Tag items and turn them into a single field containing their names.

So instead of just indexing

97b12832db3a498b8851232fb086676b 5be56cc8dc304a078f689ed1516e6736 614ae46d35f445eab7e97761e589031a

You would wind up storing

asp.net c-sharp sitecore

which is more searchable.

Note you still probably want to keep the guids around for cases when you want to limit the results to just a particular tag. In those cases you know the guid ahead of time since you used it to render the drop-down or whatever.
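
When you do filter on a known tag, the value you query with has to match the guid format in the index. A minimal sketch, assuming the index holds the Guid.ToString("N") form described above; the TagGuidSketch class is hypothetical and only wraps the standard .NET call.

```csharp
using System;

// Hypothetical helper, not part of Sitecore.
public static class TagGuidSketch
{
    // "N" format: 32 lowercase hex digits, no braces and no dashes,
    // which is the shape of the guids shown in the index above.
    public static string ToIndexTerm(Guid id)
    {
        return id.ToString("N");
    }
}
```

For example, the guid {97B12832-DB3A-498B-8851-232FB086676B} becomes 97b12832db3a498b8851232fb086676b, which is the first term in the indexed example above.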

Computed Fields

Often you need to massage data before putting it into the index. Maybe you want to fetch a related record or add default values.

This is pretty easy: you just create a class that implements IComputedIndexField and implement the ComputeFieldValue method.

ComputedFields.cs

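using System;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data;
using Sitecore.Data.Fields;

namespace MyApp.Indexing.ComputedFields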
{
    public abstract class ComputedField : IComputedIndexField
    {
        public abstract object ComputeFieldValue(IIndexable indexable);
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        private readonly Lazy<Database> _webDatabase = new Lazy<Database>(() => Database.GetDatabase("web"));
        protected Database WebDatabase
        {
            get { return _webDatabase.Value; }
        }
    }

    public class ProductName : ComputedField
    {
        public override object ComputeFieldValue(IIndexable indexable)
        {
            var field = indexable.GetFieldByName("Associated Product");
            var product = WebDatabase.GetItem(ID.Parse(field.Value));
            return product["Name"];
        }
    }

    // ProductDescription and ProductInstructions elided: they're basically the same

    public class TagNames : ComputedField
    {
        public override object ComputeFieldValue(IIndexable indexable)
        {
            var blog = WebDatabase.GetItem(ID.Parse(indexable.Id));

            MultilistField f = blog.Fields["Tags"];
            if (f != null)
            {
                var tags = f.GetItems();
                if (tags == null || tags.Length == 0)
                    return null;

                return string.Join(" ", tags.Select(t => t["Name"]));
            }

            return null;
        }
    }

    public class SortingTitle : ComputedField
    {
        public override object ComputeFieldValue(IIndexable indexable)
        {
            return indexable.GetFieldByName("Title").Value;
        }
    }
}

Note: tons of error checking elided. You should check pretty much every API call for null, and return null from ComputeFieldValue if you don’t find anything to index.

Dates as yyyyMMdd format by default

Know that dates are stored in yyyyMMdd format (i.e. “20150401”). Query accordingly.
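
If you are typing a raw query into Luke, or logging what you expect a query to look like, it helps to produce that same string from a DateTime. A small sketch in plain .NET; the IndexDateSketch class is a made-up name, and it assumes the index really uses the yyyyMMdd form described above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Sitecore.
public static class IndexDateSketch
{
    // Formats a date as yyyyMMdd. The invariant culture keeps the
    // output stable regardless of the server's regional settings.
    public static string ToIndexTerm(DateTime date)
    {
        return date.ToString("yyyyMMdd", CultureInfo.InvariantCulture);
    }
}
```

Passing new DateTime(2015, 4, 1) gives “20150401”, so a range clause in Luke would look like posted_date:[20150401 TO 99991231].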

Querying

Sitecore ships a LINQ provider so, similar to Linq2Sql or Entity Framework, you can query Lucene without getting bogged down in BooleanQuery and string keys.

Mapping to entity

Sitecore doesn’t really provide a way to drop down to raw Lucene.net, so you have to use their LINQ API.

Create an entity that inherits from Sitecore.ContentSearch.SearchTypes.SearchResultItem. This brings in some nice useful pre-mapped properties.

Map the other properties you might want to search on using IndexFieldAttribute.

BlogSearchResultItem.cs

public class BlogSearchResultItem : SearchResultItem
{
    [IndexField("title")]
    public string Title { get; set; }

    [IndexField("body")]
    public string Body { get; set; }

    [IndexField("keywords")]
    public string Keywords { get; set; }

    [IndexField("product_description")]
    public string ProductDescription { get; set; }

    [IndexField("product_name")]
    public string ProductName { get; set; }

    [IndexField("product_instructions")]
    public string ProductInstructions { get; set; }

    [IndexField("posted_date")]
    public DateTime PostedDate { get; set; }

    [IndexField("tags")]
    public string TagGuids { get; set; }

    [IndexField("tag_names")]
    public string TagNames { get; set; }

    /// <summary>
    /// Stored as untokenized so we can sort by it meaningfully
    /// </summary>
    [IndexField("sorting_title")]
    public string SortingTitle { get; set; }
}

Couple of extra attributes to be aware of:

PredefinedQueryExample.cs

[PredefinedQuery("TemplateID", ComparisonType.Equal, "{7FAFEDF6-9438-4CAD-9E04-3FCD89206D2F}", typeof(ID))]
public class Example: SearchResultItem
{
        [IndexField("property")]
        public string Property { get; set; }

        [IndexField("content")]
        [IgnoreIndexField]
        public string Content { get; set; }
}

[PredefinedQuery] - Put this on the class and you can define a clause in the query that will be added to every search. Useful for limiting your entity to certain templates. Note that the property name is the name of the property on the class, not the name of the underlying Lucene or Sitecore field. That’s why it’s using “TemplateID” above instead of “_template”. You can stack them too.

This will always yield a +_template:7fafedf694384cad9e043fcd89206d2f clause in the query.

[IgnoreIndexField] - Skip mapping this field back from the Index. Sitecore will still yield the clause in the query, but when hydrating the Example object after the query, this property will not be set. This is useful if you aren’t storing the field or if you want to skip the (rather negligible) performance hit of loading the field back from the index.

Note that some write-ups on PredefinedQuery say (incorrectly, as their commenters point out) that the field name in the attribute is the Lucene field name. It is actually the name of the property on the entity class.

LINQ operators

Sitecore gives some extra special LINQ operators in Sitecore.ContentSearch.Linq.MethodExtensions

Create a context and get to querying:

Query.cs

using (var context = ContentSearchManager.GetIndex("blog_posts").CreateSearchContext())
{
     var query = context.GetQueryable<BlogSearchResultItem>();

     // posted_date:[20150401 TO 99991231]
     query.Where(p => p.PostedDate.Between(DateTime.Today, DateTime.MaxValue, Inclusion.Both));

     // posted_date:[* TO 20150401]
     query.Where(p => p.PostedDate < DateTime.Today);

     // posted_date:[20150401 TO *]
     query.Where(p => p.PostedDate > DateTime.Today);

     // title: foo
     query.Where(p => p.Title == "foo");
     query.Where(p => p.Title.Equals("foo"));

     //title : foo^2.0
     query.Where(p => p.Title.Equals("foo").Boost(2.0f));

     //title: b?y (match boy bay etc)
     query.Where(p => p.Title.MatchesWildcard("b?y"));


     // title: johnson~0.75
     // Similarity of 0.75 - will match things like "johnson" "johnston" etc
     // really cool :)
     query.Where(p => p.Title.Like("johnson", 0.75f));

     query.OrderBy(p => p.SortingTitle);
     query.OrderByDescending(p => p.SortingTitle);

     // title: *foo*
     query.Where(p => p.Title.Contains("foo"));

     // Paginate. Pages are 0 indexed
     int page = 0, perPage = 10;
     query.Page(page, perPage);

     // Use GetResults to get an object with the total matches and an enumerable
     // you can work with
     return query.GetResults();
}

Remarks on Queries

.OrderBy and .OrderByDescending need to use untokenized fields.
== and .Equals are the typical search query. They do not mean “find a document where the field is equal to my word” but rather “find a document that has a field with a term equal to my word”.
.Contains is a wildcard search. Not very performant.
.Boost must go on .Equals or another method call, not after an operator like ==. For example: .Where(p => p.Title == "foo".Boost(2f)) will not work.
Boosting provides signals to Lucene on what is more relevant. For example, finding a user’s search term in the title or keywords of a document is probably more relevant than finding it only in the body. Judicious use of boosting can noticeably improve result relevance, and with it the user experience.

Predicate builder for all but the most simple of queries

Use the PredicateBuilder for all but the simplest of queries. This will allow you to build better nested queries. For example: we need documents that are posted after a certain date, have a certain tag, and have a search term matching in any of a number of fields. An example is given at the end of this article.

Initialize with false for OR queries and true for AND queries
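
The reason for those seed values is easier to see with ordinary delegates. The sketch below is only an analogy in plain C#: it does not use Sitecore's PredicateBuilder, and the SeedSketch class with its All and Any methods is made up for illustration.

```csharp
using System;

// Hypothetical helper showing why the seed value matters.
public static class SeedSketch
{
    // AND chain: seed with "always true" so the seed never rejects anything.
    // Seeding with false would make the whole chain match nothing.
    public static Func<T, bool> All<T>(params Func<T, bool>[] clauses)
    {
        Func<T, bool> result = item => true;
        foreach (var clause in clauses)
        {
            var previous = result;
            var current = clause;
            result = item => previous(item) && current(item);
        }
        return result;
    }

    // OR chain: seed with "always false" so the seed never accepts anything.
    // Seeding with true would make the whole chain match everything.
    public static Func<T, bool> Any<T>(params Func<T, bool>[] clauses)
    {
        Func<T, bool> result = item => false;
        foreach (var clause in clauses)
        {
            var previous = result;
            var current = clause;
            result = item => previous(item) || current(item);
        }
        return result;
    }
}
```

In other words, true is the neutral starting point for AND and false is the neutral starting point for OR, so the seed itself never changes which documents match.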

Split user input into terms

Split the user’s search phrase into terms and iterate over them adding them one by one to build a query that is relevant when at least one of the terms matches. Lucene will calculate a relevance score based on the distribution of search terms in the index.

Lowercase user input

Most of the time Sitecore configures Lucene with a lowercase analyser. I like to normalize user input so that we actually only query with lower case search terms.
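
Splitting and lowercasing can be done in one small step before any predicate is built. A minimal sketch in plain C#; SearchInputSketch and ToTerms are hypothetical names, and splitting on single spaces is an assumption (you may want to handle tabs or punctuation too).

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of Sitecore.
public static class SearchInputSketch
{
    // Turns raw user input into lowercase search terms.
    // Returns an empty array for null, empty, or whitespace-only input.
    public static string[] ToTerms(string userQuery)
    {
        if (string.IsNullOrWhiteSpace(userQuery))
            return new string[0];

        return userQuery
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(term => term.ToLowerInvariant())
            .ToArray();
    }
}
```

Each returned term can then be fed into the OR predicate, one iteration per term.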

Use Page and GetResults for paging

.Page(page, perPage) helps Lucene and Sitecore page your results.

Call GetResults to get back an object with a TotalSearchResults property and a Hits property you can map into your real domain object.
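
The total match count is what you need to render a pager. A small sketch of that arithmetic in plain C#; PagingSketch and PageCount are made-up names, and the total is assumed to be the count reported by GetResults.

```csharp
using System;

// Hypothetical helper, not part of Sitecore.
public static class PagingSketch
{
    // Number of pages needed to show totalResults items, perPage at a time.
    // Integer ceiling division: a partially filled last page still counts.
    public static int PageCount(int totalResults, int perPage)
    {
        if (perPage <= 0)
            throw new ArgumentOutOfRangeException("perPage");
        if (totalResults <= 0)
            return 0;

        return (totalResults + perPage - 1) / perPage;
    }
}
```

Since pages are 0 indexed, the last valid page number passed to .Page is PageCount(total, perPage) - 1.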

Putting it all together

Search for published posts (posted date is earlier than or equal to today) that have a certain tag and match the user’s search term. There is a sorting parameter that should be used as well.

SearchingAllTheThings.cs

public enum SortOrder
{
    Relevance,
    NameAsc,
    NameDesc
}

public class BlogSearchParams
{
   public SortOrder Sort;
   public string UserQuery;
   public Guid? TagId;
   public int Page;
   public int PageSize;
}

public IEnumerable<BlogPost> SearchBlog(BlogSearchParams parameters, out int totalResults)
{
    Debug.Assert(!string.IsNullOrEmpty(parameters.UserQuery));

    using (var context = ContentSearchManager.GetIndex("blog_posts").CreateSearchContext())
    {

        // use True because we'll be ANDing this clause together
        var filterPredicate = PredicateBuilder.True<BlogSearchResultItem>()
            .And(p => p.PostedDate <= DateTime.Today);

        if (parameters.TagId != null)
            filterPredicate = filterPredicate.And(p => p.TagGuids.Equals(parameters.TagId.Value.ToString("N")));

        var terms = parameters.UserQuery.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // use False because we'll be ORing this clause together
        var termPredicate = PredicateBuilder.False<BlogSearchResultItem>();

        foreach (var term in terms)
        {
            termPredicate = termPredicate
                .Or(p => p.Title.Like(term, 0.75f).Boost(2.0f))
                .Or(p => p.Keywords.Equals(term).Boost(1.5f))
                .Or(p => p.Body.Equals(term))
                .Or(p => p.ProductName.Equals(term))
                .Or(p => p.ProductDescription.Equals(term))
                .Or(p => p.ProductInstructions.Equals(term));
        }

        // This applies some nesting so you wind up with
        // +posted_date:[* TO 20150401] +tags:97b12832db3a498b8851232fb086676b +(title:foo keywords:foo)
        // i.e. must have PostDate && Tag && (Title || Keywords || ProductName ... )
        var predicate = filterPredicate.And(termPredicate);
        var query = context.GetQueryable<BlogSearchResultItem>().Where(predicate);

        // Apply the appropriate sorting. If we don't call a OrderBy method,
        // the default will be sorting by Lucene's relevance score
        switch (parameters.Sort)
        {
            case SortOrder.NameAsc:
                query = query.OrderBy(p => p.SortingTitle);
                break;
            case SortOrder.NameDesc:
                query = query.OrderByDescending(p => p.SortingTitle);
                break;
            //default is relevance
        }

        var results = query.Page(parameters.Page, parameters.PageSize).GetResults();

        totalResults = results.TotalSearchResults;
        return results.Hits.Select(h => GetBlog(h.Document.ItemId)).ToArray();
    }
}

public BlogPost GetBlog(ID id)
{
    // fetch a Blog item from Sitecore and
    // map to a BlogPost domain object
    throw new NotImplementedException();
}

Map back to domain object… or not?

If you have everything in the index that you need to display, then it can be more performant to just return BlogSearchResultItem and use that in the view.

I don’t particularly like this approach as I feel that BlogSearchResultItem is an implementation detail of the search API and I would generally prefer not to leak it into the presentation layer. Doing so can also increase your index size: you need to store fields that you aren’t going to search on but will need for display. And in some cases it’s not practical: for example if each Blog had an image that needed to show on the results page, you’d still need to hit the DB for the image.

I’ve found that Sitecore query performance is really quite excellent when loading by ID, as done here. I can search Lucene and hydrate 20 BlogPosts in 10-12ms.

Conclusion

I don’t have one. Go get to searching!