Result grouping or Field Collapsing with Solr

From a user's perspective, the first pages of search results are sometimes full of documents that look very similar. For instance, a search for a specific car brand may return a full page of the same car model, where only the edition differs. What is actually desired is to show only the different models. Then, and only when a user is interested in a certain model, the user can view all the editions of that model by clicking on the result. We simply want to group our search results based on some criterion. Although this is not supported out of the box by Lucene/Solr, it is possible using a patch that I’ve created and contributed to Solr. This blog entry explains what result grouping (also known as field collapsing) is and how you can start using it in your own projects.

Result grouping allows you to group results by a predefined field (e.g. a model field). Only the most relevant documents per distinct value of that field are kept in the result. The specified sort determines the relevance of each document. By default Solr sorts by score, but the sort can also be on a field value or a computed value such as distance. In the Solr community result grouping is better known as field collapsing.

Assume we are searching for books, once with field collapsing and once without. The image compares the two results.
[Image: fieldcollapse]
As illustrated in the image, documents with similar values are removed from the result; only the most relevant documents are kept in the result set.

Field collapsing can in some ways be compared with the SQL GROUP BY statement. Although you cannot yet use functions like sum() or avg() to gather statistics, it does remove the less relevant documents and keeps a count of how many documents were removed per distinct field value. In the most recent version of the patch it is possible to collect the field values of the collapsed documents. This allows you to execute your own function on the collapsed documents.
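
To make the idea concrete, here is a minimal sketch in plain Python of what collapsing does to a result list that is already sorted by relevance. It is not the patch's implementation; the function name collapse and the sample books are made up for illustration.

```python
def collapse(docs, field):
    """Keep the first (most relevant) document per distinct field value.

    `docs` must already be sorted by relevance, best first.
    Returns (group_heads, collapse_counts).
    """
    heads = {}   # field value -> most relevant document (the group head)
    counts = {}  # field value -> number of documents that were collapsed
    for doc in docs:
        value = doc[field]
        if value not in heads:
            heads[value] = doc
            counts[value] = 0
        else:
            counts[value] += 1
    return list(heads.values()), counts


# Hypothetical sample data, sorted by score (best first).
books = [
    {"id": 1, "author": "Tolkien", "score": 2.1},
    {"id": 2, "author": "Tolkien", "score": 1.9},
    {"id": 3, "author": "Herbert", "score": 1.5},
    {"id": 4, "author": "Tolkien", "score": 1.2},
]

heads, counts = collapse(books, "author")
print([d["id"] for d in heads])  # [1, 3]
print(counts)                    # {'Tolkien': 2, 'Herbert': 0}
```

The counts play the role of the collapse counts mentioned above: they tell you how many documents were hidden behind each group head.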

Setting up Field Collapsing

Unfortunately, Solr does not support field collapsing out of the box yet. The functionality is still under development, but it can already be used, and many people have done so successfully. If you browse to the Jira issue SOLR-236 you can see the current status of the field collapsing functionality. Download the latest patch, apply it to the latest Solr Subversion trunk and you are good to go.

Configuring Field Collapsing

Field collapsing is currently implemented in Solr as a SearchComponent and thus must be configured in the solrconfig.xml. The following line adds the field collapse component to Solr:

<searchComponent name="query"
class="org.apache.solr.handler.component.CollapseComponent" />

The QueryComponent is by default configured implicitly under the name query. Adding the CollapseComponent under the name query makes sure that the request handlers automatically use the CollapseComponent instead of the default QueryComponent.

It is also important to know up front which field you want to collapse on. It is not possible to collapse on all types of fields. Currently, if you collapse on a field that is tokenized or multivalued, an exception is thrown and the search is aborted.

I usually create dedicated field collapse fields in my schema.xml with a collapse_ prefix. I think this is good practice, and it emphasizes what that particular field is used for. You can use any type of field you want (as long as it is not tokenized and not multivalued); the non-analyzed field types like StrField and IntField are good candidates.

Group your results

Now that you have configured field collapsing you can actually group your search results. To enable field collapsing you need to specify the collapse.field parameter in your request to Solr. Assume we want to group results by author, using a field named ‘collapse_author’. This would result in the following URL:
http://localhost:8080/solr/select?q=*:*&collapse.field=collapse_author
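
If you build the request from code, a small helper keeps the parameters in one place. The sketch below uses only the Python standard library; the host, port and path are copied from the URL above, and the helper name collapse_url is an assumption made for this example.

```python
from urllib.parse import urlencode

# Base URL taken from the example above; adjust to your own Solr instance.
BASE_URL = "http://localhost:8080/solr/select"


def collapse_url(query, collapse_field, extra=None):
    """Build a select URL with field collapsing enabled.

    `extra` is an optional dict of additional request parameters.
    """
    params = {"q": query, "collapse.field": collapse_field}
    if extra:
        params.update(extra)
    # Keep '*' and ':' readable so that q=*:* is not percent-encoded.
    return BASE_URL + "?" + urlencode(params, safe="*:")


url = collapse_url("*:*", "collapse_author")
print(url)
# http://localhost:8080/solr/select?q=*:*&collapse.field=collapse_author
```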

When the request returns, the search result looks similar to the following (this example was collapsed on a collapse_city field):

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">117</int>
.....
<lst name="collapse_counts">
<str name="field">collapse_city</str>
<lst name="doc">
<int name="190810">48</int>
<int name="192224">9</int>
...
</lst>
<lst name="count">
<int name="Amsterdam">48</int>
<int name="Rotterdam">9</int>
...
</lst>
</lst>
<result name="response" numFound="26" start="0" maxScore="1.9735361">
<doc>
<str name="city">Amsterdam</str>
<str name="id">190810</str>
...
</doc>
...
</result>
</response>

There are two differences between this response and a response without field-collapsing:

  1. A list with the name collapse_counts is added to the response with the collapse counts per field value and per document identifier. The document identifiers in the collapse_counts are referring to the documents in the normal response.
  2. The response only contains the most relevant documents per group also known as the group heads. The term ‘group’ here means all documents with the same field value.

The collapse_counts list contains two other lists: the doc list and the count list. Both contain the collapse counts for the search result. The doc list associates the collapse counts with the result set by using the identifiers of the group head documents as pointers, whereas the count list uses the field values to associate the collapse counts with the result set. It is important to know that both lists refer to documents or field values in the current result page only and not to documents beyond that.
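
As a rough illustration of how a client could read these two lists, the following sketch parses a response with Python's standard xml.etree.ElementTree module. The SAMPLE document is a trimmed, hand-completed version of the excerpt above (the elided parts are simply left out), and parse_collapse_counts is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hand-made sample modelled on the response excerpt above.
SAMPLE = """<response>
<lst name="responseHeader">
  <int name="status">0</int>
</lst>
<lst name="collapse_counts">
  <str name="field">collapse_city</str>
  <lst name="doc">
    <int name="190810">48</int>
    <int name="192224">9</int>
  </lst>
  <lst name="count">
    <int name="Amsterdam">48</int>
    <int name="Rotterdam">9</int>
  </lst>
</lst>
<result name="response" numFound="26" start="0">
  <doc><str name="city">Amsterdam</str><str name="id">190810</str></doc>
  <doc><str name="city">Rotterdam</str><str name="id">192224</str></doc>
</result>
</response>"""


def parse_collapse_counts(xml_text):
    """Return (collapse field, counts by document id, counts by field value)."""
    root = ET.fromstring(xml_text)
    section = root.find("lst[@name='collapse_counts']")
    field = section.findtext("str[@name='field']")
    by_doc = {e.get("name"): int(e.text) for e in section.find("lst[@name='doc']")}
    by_value = {e.get("name"): int(e.text) for e in section.find("lst[@name='count']")}
    return field, by_doc, by_value


field, by_doc, by_value = parse_collapse_counts(SAMPLE)

# Attach the collapse count to each group head on the current page.
for doc in ET.fromstring(SAMPLE).find("result"):
    doc_id = doc.findtext("str[@name='id']")
    print(doc_id, "collapsed documents:", by_doc.get(doc_id, 0))
```

Looking up by document id is handy when you render the result list, while the lookup by field value is handy when you already have the value at hand.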

Besides the collapse.field parameter, there are more parameters that you can specify to tweak the groups in your result. They are described on the Field Collapsing page on the Solr wiki.

Collapsing Algorithms

There are two distinct ways of collapsing your search results:

  1. Adjacent field collapsing. As the word adjacent implies, this only collapses documents with the same field value that appear next to each other in the non collapsed result set.
  2. Non adjacent field collapsing, also known as normal field collapsing. This collapse algorithm collapses as described in the beginning of this blog entry and is the default collapsing algorithm.

The type of field collapsing can be controlled with the collapse.type parameter. When the value adjacent is specified the adjacent algorithm kicks in and when the value normal is specified the normal algorithm kicks in.
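
The difference between the two algorithms is easiest to see on a small ranked list. The sketch below is a plain Python simulation, not the patch's code; the function names and the sample values are made up.

```python
from itertools import groupby


def collapse_normal(values):
    """Keep the first occurrence of each value, wherever it appears."""
    seen = set()
    kept = []
    for value in values:
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return kept


def collapse_adjacent(values):
    """Collapse only runs of equal values that sit next to each other."""
    return [key for key, _run in groupby(values)]


# Field values of a hypothetical result list, in ranking order.
ranked = ["A", "A", "B", "A", "C", "C", "B"]

print(collapse_normal(ranked))    # ['A', 'B', 'C']
print(collapse_adjacent(ranked))  # ['A', 'B', 'A', 'C', 'B']
```

With adjacent collapsing a value can show up more than once in the result, as long as its occurrences are separated by documents with a different value.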

Including Collapsed Results

On some occasions it is handy to know specific field values of the collapsed documents. In the most recent versions of the field collapse patch it is possible to include collapsed results. This can be achieved by using the collapse.includeCollapsedDocs.fl parameter. The patch expects a comma-separated list of field names to include, or a star (*) that instructs field collapsing to include all fields.

When the search has completed, a collapsed documents result similar to the following is returned:

<lst name="collapsedDocs">
<result name="Amsterdam" numFound="48" start="0">
<doc>
<str name="id">191178</str>
...
</doc>
...
</result>
<result name="Rotterdam" numFound="9" start="0">
...
</result>
</lst>

The collapsedDocs list is part of the collapse_counts response and, as you can see, the collapsed documents are grouped under their distinct field value.

Using SolrJ

If you are using SolrJ to integrate with your Solr instance you can use the added field collapse methods.
On the SolrQuery class I have added two methods:

  1. enableFieldCollapsing(String) which accepts a field name as argument.
  2. includeCollapsedDocuments(String...) which accepts zero or more field names. When no field names are given all fields are returned, otherwise only the specified field names are returned.

On the QueryResponse class one method is added:

  1. getFieldCollapseResponse() which returns the FieldCollapseResponse. This object contains all the field collapse information.

The FieldCollapseResponse has four getter methods:

  1. getCollapseField() returns the name of the field that was used for collapsing.
  2. getFieldValueCollapseCounts() returns a list of FieldValueCollapseCount, each of which contains a field value with a collapse count.
  3. getDocumentIdCollapseCounts() returns a list of DocumentIdCollapseCount, each of which contains a document id with a collapse count.
  4. getCollapsedDocuments() returns a map with the field value as key and a SolrDocumentList with the collapsed documents as value.

These methods can ease development when using field collapsing while integrating with a front-end system.

Field Collapsing and Facets

Combining field collapsing with facets can be confusing at first. The reason is that faceting can be performed on either the ‘collapsed’ or the ‘non collapsed’ result set. The facet counts on the collapsed result set are usually lower than the facet counts on the non collapsed result set. Which one you want is up to you, because you can influence this behavior. The parameter collapse.facet determines which result set is used for faceting. This parameter can have the value facet.before to facet on the non collapsed result set or facet.after to facet on the collapsed result set. The default behavior is to facet on the collapsed result set. From the field collapse perspective, the performance of faceting on either the collapsed or the non collapsed result set is the same.
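
A small simulation shows why the two sets of facet counts differ. This is plain Python with made-up documents and a hypothetical genre facet field; it only mimics the effect and does not call Solr.

```python
from collections import Counter

# Hypothetical documents in ranking order (best first).
docs = [
    {"id": 1, "collapse_author": "Tolkien", "genre": "fantasy"},
    {"id": 2, "collapse_author": "Tolkien", "genre": "fantasy"},
    {"id": 3, "collapse_author": "Herbert", "genre": "scifi"},
    {"id": 4, "collapse_author": "Tolkien", "genre": "poetry"},
]


def group_heads(docs, field):
    """Return the most relevant document per distinct field value."""
    seen = set()
    heads = []
    for doc in docs:
        if doc[field] not in seen:
            seen.add(doc[field])
            heads.append(doc)
    return heads


# Faceting on the non collapsed result set: every matching document counts.
facet_before = Counter(d["genre"] for d in docs)

# Faceting on the collapsed result set: only the group heads count.
facet_after = Counter(d["genre"] for d in group_heads(docs, "collapse_author"))

print(dict(facet_before))  # {'fantasy': 2, 'scifi': 1, 'poetry': 1}
print(dict(facet_after))   # {'fantasy': 1, 'scifi': 1}
```

Note that a facet value can disappear entirely after collapsing (poetry in this example) when all documents carrying it were collapsed away.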

Field Collapsing and Performance

Unfortunately field collapsing does influence the search time in a negative way. When doing a search with field collapsing enabled the search time can be 5 to 10 times slower than doing a search without field collapsing enabled. There are more things that can make your search time even worse:

  • Using Adjacent collapsing as collapse type. Adjacent collapsing can be an order of magnitude slower than non adjacent field collapsing. I have seen cases where performance dropped by more than nine times compared to normal field collapsing.
  • Using a collapse threshold higher than 1 in combination with normal collapsing. This has to do with the way the normal collapsing algorithm processes the documents that may be kept in the result. For a collapse threshold higher than 1 in combination with adjacent collapsing the performance will not worsen.
  • Including collapsed documents in the response. How much this feature increases the search time depends on how many documents are being collapsed and how many are being returned in the response. The latter decreases performance the most, because the returned documents have to be read from the index and be sent over the wire. If for example, 8000 documents were collapsed for a specific field value, you can imagine how enormous the increase in response time will be.

Recent versions of the patch include the following improvements:

  • Performance improvement with the normal field collapse algorithm.
  • Performance improvement when faceting on the non collapsed result set.
  • The ability to include documents that have been collapsed.
  • Improved code quality by adding unit and integration tests, and a redesign of the code that resulted in cleaner and thus more maintainable code.
  • Extended the SolrJ API to allow easy integration when using field collapsing.

original post can be found at http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/


It's All About Conversation on the Web

Recently I was researching web 2.0 social marketing to find some ideas for the design of a marketing tool. I am not a marketing guy but rather a tech guy.

The trend says that web 2.0 social marketing is all about conversation. The idea I took from this is that conversations are scattered across the web, and one needs to find the conversations relevant to his target, analyze them, and take part in them to introduce whatever he wants to promote.

Hmm… so how do you take part in a conversation? That you will come to know once you are following one.

I just want to concentrate on providing a service or tool that combines all related conversations based on a keyword search and gives the user an analytical report as well as an easy way to join the conversation. Users also need to be able to keep track easily of what they are working on.

Well, I am not alone in this boat; many companies are implementing such a service or have already done so. I found one that really impressed me. They have their demo site at http://www.ubervu.com/

Now my plan is to get some ideas from this site as well as other similar sites and blend them with my own ideas to make my plan.

Alright, now it's time to do some work. I will come up with my plans one by one in my next posts.

So don’t forget to bookmark this blog and … … … stay tuned.
