This is a work in progress: I have a couple of new features I want to write about here, and I want to clean up the examples a bit.
Recently I’ve been trying to implement Sitecore 7’s new ContentSearch features for a client and had to figure a lot of things out. Documentation has been sparse, misleading, and often incomplete or obscure.
This article aims to list out the hard-won techniques I learned during the implementation phase, in hopes of alleviating the pain others are likely experiencing, and as a reminder to myself should I need to revisit this in the future.
We’re going to walk through setting up a hypothetical blog search feature, where we have the following information hierarchy:
- Title (text)
- Posted Date (date)
- Body (rich text)
- Keywords (text)
- Associated Product (drop-down link to products)
- Tags (multi list)
- Name (text)
- Description (rich text)
- Instructions (rich text)
document: a single entity stored by Lucene, such as a product or a blog post
field: a property of a stored document
index: a set of documents stored by Lucene
stored field: a field whose value is actually saved within Lucene
unstored field: a field that Lucene indexes but whose value it does not store. Lucene would know that the field contained the word “foobar” (and possibly where in the string it was found) but could not give you back the whole field
tokenized field: a field that is broken up into terms (words). Use this if you want to search for documents with a field containing a term. The field value “foo bar baz” would be indexed as the list of terms “foo”, “bar” and “baz”
untokenized field: a field that is stored and indexed as one single term. Use this when the field doesn’t make sense to split up, for example prices, dates, or cities
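To make the tokenized versus untokenized distinction concrete, here is a minimal sketch in plain C#. It does not use Lucene at all: the `TokenizationSketch` class and its methods are hypothetical names, and the whitespace split is only a rough stand-in for what a real analyzer does (real analyzers may also drop stop words and punctuation).

```csharp
using System;
using System.Linq;

public static class TokenizationSketch
{
    // Rough stand-in for a tokenized field: the value becomes a list of lowercase terms.
    public static string[] Tokenize(string value)
    {
        return value
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(term => term.ToLowerInvariant())
            .ToArray();
    }

    // A tokenized field matches when any single term equals the search term.
    public static bool MatchesTokenized(string fieldValue, string searchTerm)
    {
        return Tokenize(fieldValue).Contains(searchTerm.ToLowerInvariant());
    }

    // An untokenized field is one single term, so only the whole value matches.
    public static bool MatchesUntokenized(string fieldValue, string searchTerm)
    {
        return string.Equals(fieldValue, searchTerm, StringComparison.Ordinal);
    }
}
```

Searching “foo bar baz” for “bar” matches in the tokenized case but not in the untokenized one, which is why a date or price is better kept as a single term.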
You’re going to want to get Luke, the Lucene Index Toolbox, which is hosted on Google Code, at least until that gets shuttered. It lets you test queries and see which values are being indexed and whether they are tokenized.
Make sure the
~/App_Config/Include/Sitecore.ContentSearch.* files are added to
Create a new config called
I’m going to paste the config in and then elaborate on parts of it. You can adapt it to your needs.
These elements inform Sitecore how to keep your index up to date.
onPublishEndAsync strategy handles the event fired when a publish
finishes. Changed items are batched and re-indexed.
intervalAsyncCore strategy re-indexes changed documents on a schedule.
I’ve changed the parameters to focus on a different database and to use a
Specify the database and directory you want crawled. You can add more than one crawler node with different databases or directories.
If you set <indexAllFields>false</indexAllFields> as I have here, you need to
configure how you want each field to be indexed.
In this example, I’ve stored all the fields, because that makes it possible to
view them in Luke. I’ve stored
posted date as an untokenized field because
you’re only ever interested in the full date.
To reference a Sitecore field, use the name of the field in lower case. So to index the “Posted Date” field, I have referenced “posted date” in the config. Within Lucene, it will actually be “posted_date” because Lucene doesn’t allow spaces in field names without having to escape them every time.
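The naming convention described above can be sketched in a few lines of plain C#. This only mirrors the lower-case and underscore behaviour mentioned here; `FieldNames.ToIndexFieldName` is a hypothetical helper, not a Sitecore API, and Sitecore’s own translation may handle more cases than this.

```csharp
public static class FieldNames
{
    // Mirrors the convention in the text: lower-case the Sitecore field name
    // and replace spaces with underscores to get the Lucene field name.
    public static string ToIndexFieldName(string sitecoreFieldName)
    {
        return sitecoreFieldName.Trim().ToLowerInvariant().Replace(' ', '_');
    }
}
```

Under that assumption, `FieldNames.ToIndexFieldName("Posted Date")` gives the name to look for in Luke.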
In order to sort by a field, you need to store it untokenized so that Lucene can access it as one whole term. A ComputedField (discussed below) is used to find it.
Here, list the template IDs that you want indexed. Sitecore will crawl the directories listed in the locations node and index any item with the specified templates.
For example, with the blog tags, Sitecore will by default index a multi list as a space-separated list of lowercase guids (Guid.ToString("N")). This is useful for filtering when you know the guid ahead of time, but not when responding to a user’s free-form search field.
In those situations, it can be useful to also denormalize the relationship and store the tag names alongside the document. A Computed Field, discussed below, can iterate over the related Tag items and turn them into a single field containing their names.
So instead of just indexing
97b12832db3a498b8851232fb086676b 5be56cc8dc304a078f689ed1516e6736 614ae46d35f445eab7e97761e589031a
You would wind up storing
asp.net c-sharp sitecore
which is more searchable.
Note you still probably want to keep the guids around for cases when you want to limit the results to just a particular tag. In those cases you know the guid ahead of time since you used it to render the drop-down or whatever.
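As a small illustration of the two representations, here is a sketch in plain C# with no Sitecore dependency. `TagIndexValues` and its methods are hypothetical helpers: one produces the guid form you would filter on, the other produces the denormalized name form that suits free-form search.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagIndexValues
{
    // "N" format: 32 hex digits with no braces or dashes; .NET emits them in lowercase.
    public static string ToIndexGuid(Guid id)
    {
        return id.ToString("N");
    }

    // The shape of the default multi list value: guids separated by spaces.
    public static string JoinGuids(IEnumerable<Guid> ids)
    {
        return string.Join(" ", ids.Select(ToIndexGuid));
    }

    // The denormalized alternative: lowercase tag names separated by spaces.
    public static string JoinNames(IEnumerable<string> names)
    {
        return string.Join(" ", names.Select(name => name.Trim().ToLowerInvariant()));
    }
}
```

When filtering by a tag chosen from a drop-down, convert the selected guid with `ToIndexGuid` before comparing it to the indexed value.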
Often you need to massage data before putting it into the index. Maybe you want to fetch a related record or add default values.
This is pretty easy: you just create a class that implements
Note: tons of error checking elided. You should check pretty much every API call for null, and return null from
ComputeFieldValue if you don’t find anything to index.
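Here is a minimal sketch of a computed field that denormalizes tag names, assuming the `IComputedIndexField` interface from `Sitecore.ContentSearch.ComputedFields`. The class name and the “Tags” and “Name” field names are assumptions for this hypothetical blog, and the code needs the Sitecore assemblies referenced to compile. It still has to be registered in your index configuration before Sitecore will call it.

```csharp
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Fields;
using Sitecore.Data.Items;

// Hypothetical computed field: stores the names of a post's tags as one searchable string.
public class TagNamesComputedField : IComputedIndexField
{
    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
        // Only Sitecore items are of interest; anything else yields nothing to index.
        Item item = indexable as SitecoreIndexableItem;
        if (item == null) return null;

        // "Tags" is the assumed name of the multi list field on the blog post.
        MultilistField tags = item.Fields["Tags"];
        if (tags == null) return null;

        // "Name" is the assumed name of the text field on each Tag item.
        var names = tags.GetItems()
            .Where(tag => tag != null)
            .Select(tag => tag["Name"])
            .Where(name => !string.IsNullOrWhiteSpace(name))
            .Select(name => name.ToLowerInvariant())
            .ToArray();

        // Return null rather than an empty string when there is nothing to index.
        return names.Length == 0 ? null : string.Join(" ", names);
    }
}
```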
Know that dates are stored in
yyyyMMdd format (e.g. “20150401”).
Query
Sitecore ships a LINQ provider so, similar to Linq2Sql or Entity Framework, you can query Lucene without getting bogged down in BooleanQuery and string keys.
Sitecore doesn’t really provide a way to drop down to raw Lucene.net, so you have to use their LINQ API.
Create an entity that inherits from
Sitecore.ContentSearch.SearchTypes.SearchResultItem. This brings in some nice
useful pre-mapped properties.
Map the other properties you might want to search on using
Couple of extra attributes to be aware of:
[PredefinedQuery] - Put this on the class and you can define a clause in the
query that will be added to every search. Useful for limiting your entity to
certain templates. Note that the property name is the name of the property on
the class, not the name of the underlying Lucene or Sitecore field. That’s why
it’s using “TemplateID” above instead of “_template”. You can stack them too.
This will always yield a
in the query.
[IgnoreIndexField] - Skip mapping this field back from the Index.
Sitecore will still yield the clause in the query, but when hydrating the
Example object after the query, this property will not be set. This is useful
if you aren’t storing the field or if you want to skip the (rather negligible)
performance hit of loading the field back from the index.
Neither of these sources is correct (as some commenters point out) in saying that the field name in a PredefinedQuery attribute is the Lucene field: it is actually the name of the property on the entity class.
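A minimal sketch of such an entity, assuming the `[IndexField]` and `[IgnoreIndexField]` attributes from the `Sitecore.ContentSearch` namespace. The property names and index field names are assumptions for the hypothetical blog post template, and `[PredefinedQuery]` is left out here because its exact arguments depend on your template IDs.

```csharp
using System;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical entity for the blog post template. The index field names
// follow the lower-case, underscore convention and are assumptions.
public class BlogPostSearchResultItem : SearchResultItem
{
    [IndexField("title")]
    public string Title { get; set; }

    [IndexField("keywords")]
    public string Keywords { get; set; }

    // Stored untokenized in the index so it can be sorted on.
    [IndexField("posted_date")]
    public DateTime PostedDate { get; set; }

    // Can still appear in a query clause, but is not read back
    // from the index when results are hydrated.
    [IgnoreIndexField]
    public string Body { get; set; }
}
```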
Sitecore gives some extra special LINQ operators in
Create a context and get to querying:
- .OrderByDescending needs to use untokenized fields.
- .Equals clauses are the typical search query. They do not mean “find a document where the field is equal to my word” but rather “find a document that has a field with a term equal to my word”.
- .Contains is a wild-card search. Not very performant.
- .Boost must go on .Equals or another method, not after an operator like ==. For example, .Where(p => p.Title == "foo".Boost(2f)) will not work.
- Boosting provides signals to Lucene on what is more relevant. For example, finding a user’s search term in the title or keywords of a document is probably more relevant than one where the term is only in the body. Judicious use of boosting can noticeably improve the user experience.
Use the PredicateBuilder for all but the simplest of queries. This will allow you to build better nested queries. For example: we need documents that are posted after a certain date, have a certain tag, and have a search term matching in any of a number of fields. An example is given at the end of this article.
Initialize with false for OR queries and true for AND queries.
Split the user’s search phrase into terms and iterate over them, adding them one by one to build a query that is relevant when at least one of the terms matches. Lucene will calculate a relevance score based on the distribution of search terms in the index.
Most of the time Sitecore configures Lucene with a lowercase analyser. I like to normalize user input so that we actually only query with lower case search terms.
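The splitting and normalizing step is plain C# and can be sketched without any Sitecore types. `SearchTerms.Normalize` is a hypothetical helper: it lowercases the phrase, splits on whitespace, and drops duplicates so each term is only added to the predicate once.

```csharp
using System;
using System.Linq;

public static class SearchTerms
{
    // Turns a raw user phrase into distinct lowercase terms, in first-seen order.
    public static string[] Normalize(string phrase)
    {
        if (string.IsNullOrWhiteSpace(phrase)) return new string[0];

        return phrase
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // null means "split on whitespace"
            .Select(term => term.ToLowerInvariant())
            .Distinct()
            .ToArray();
    }
}
```

Each returned term would then be OR-ed into the predicate in a loop. An empty array means the user typed nothing useful, in which case you probably want to skip the term clause entirely.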
.Page(page, perPage) helps Lucene and Sitecore page your results.
Use GetResults to get back an object with a
TotalSearchResults property and a
Hits property you can map into your real domain object.
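The arithmetic around paging is easy to get subtly wrong, so here is a small sketch in plain C#. `Paging` is a hypothetical helper, and it assumes that the page index passed to `.Page` is zero-based, which you should verify against your Sitecore version.

```csharp
using System;

public static class Paging
{
    // Number of pages needed to show totalResults at perPage items each (rounds up).
    public static int TotalPages(int totalResults, int perPage)
    {
        if (perPage <= 0) throw new ArgumentOutOfRangeException("perPage");
        return (totalResults + perPage - 1) / perPage;
    }

    // Converts a 1-based page number (as shown in a URL) to a 0-based page index.
    public static int ToZeroBasedPage(int pageNumber)
    {
        return Math.Max(pageNumber, 1) - 1;
    }
}
```

The total result count from the search feeds `TotalPages`, and the page number from the query string goes through `ToZeroBasedPage` before being handed to the query.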
Search for published posts (post date is earlier than or equal to today) with a certain tag that match the user’s search term. There is a sorting parameter that should be used as well.
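What follows is a sketch of how such a search could be put together, not a definitive implementation. It assumes the Sitecore 7 ContentSearch LINQ API (`ContentSearchManager`, `PredicateBuilder`, `Boost`, `Page`, `GetResults`) and needs the Sitecore assemblies to compile. The `BlogSearch` class, the index name parameter, the boost weights, the field names, and mapping the tags field as a plain string are all assumptions. A compact entity is included so the sketch stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.Linq;
using Sitecore.ContentSearch.Linq.Utilities;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical entity; property and index field names are assumptions.
public class BlogPostSearchResultItem : SearchResultItem
{
    [IndexField("title")] public string Title { get; set; }
    [IndexField("body")] public string Body { get; set; }
    [IndexField("keywords")] public string Keywords { get; set; }
    [IndexField("tags")] public string Tags { get; set; }
    [IndexField("posted_date")] public DateTime PostedDate { get; set; }
}

public static class BlogSearch
{
    // page is assumed to be zero-based; total receives the full match count.
    public static List<BlogPostSearchResultItem> Search(
        string indexName, string phrase, Guid tagId,
        bool newestFirst, int page, int perPage, out int total)
    {
        using (var context = ContentSearchManager.GetIndex(indexName).CreateSearchContext())
        {
            // Capture values in locals so the expressions only reference simple variables.
            var today = DateTime.Today;
            var tag = tagId.ToString("N"); // multi list guids are indexed in lowercase "N" format

            // AND group (starts from true): published posts with the given tag.
            var query = PredicateBuilder.True<BlogPostSearchResultItem>();
            query = query.And(p => p.PostedDate <= today);
            query = query.And(p => p.Tags.Equals(tag));

            // OR group (starts from false): any term in any searched field.
            var terms = (phrase ?? string.Empty).ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (terms.Length > 0)
            {
                var anyTerm = PredicateBuilder.False<BlogPostSearchResultItem>();
                foreach (var t in terms)
                {
                    var term = t; // local copy so each clause closes over its own term
                    anyTerm = anyTerm
                        .Or(p => p.Title.Equals(term).Boost(3f))    // title hits weigh most
                        .Or(p => p.Keywords.Equals(term).Boost(2f))
                        .Or(p => p.Body.Equals(term));
                }
                query = query.And(anyTerm);
            }

            var posts = context.GetQueryable<BlogPostSearchResultItem>().Where(query);

            // Sorting relies on posted_date being stored untokenized.
            posts = newestFirst
                ? posts.OrderByDescending(p => p.PostedDate)
                : posts.OrderBy(p => p.PostedDate);

            var results = posts.Page(page, perPage).GetResults();
            total = results.TotalSearchResults;

            // Materialize inside the using block, before the context is disposed.
            return results.Hits.Select(hit => hit.Document).ToList();
        }
    }
}
```

Applying an explicit sort replaces relevance ordering, so the boosts only affect ordering if you skip the `OrderBy` calls when the user asks for “most relevant”.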
If you have everything in the index that you need to display, then it can be
more performant to just return
BlogPostSearchResultItem and use that in the
I don’t particularly like this approach as I feel that
BlogPostSearchResultItem is an implementation detail of the search API and I
would generally prefer not to leak it into the presentation layer. Doing so can
also increase your index size: you need to store fields that you aren’t going
to search on but will need for display. And in some cases it’s not practical:
for example if each Blog had an image that needed to show on the results page,
you’d still need to hit the DB for the image.
I’ve found that Sitecore query performance is really quite excellent when loading by ID, as done here. I can search Lucene and hydrate 20 BlogPosts in 10-12ms.
I don’t have one. Go get to searching!