Apache Solr 1.5 on the move with more “functionality”

The paint is barely dry on Apache Solr 1.4 and the community is already on the move for Solr 1.5 (which may actually be Solr 2.0, but for now let’s call it 1.5).

I’m particularly excited about a few things:

  1. Massive scalability capabilities via distributed search, indexing and shard management – Up until now, Solr has scaled pretty well on the search side (I’ve seen billion+ document instances and we’ve benchmarked it at that level too), but the work underway in Solr 1.5 will take it to a whole new level, thanks to the integration of Apache ZooKeeper and other distributed technologies.  For those interested, check out the “cloud” branch in SVN.
  2. Functions, functions, functions!  We’ve already added a bunch of functions (see my earlier post) and I see more on the horizon.  Additionally, I see great value in adding, for lack of a better phrase, aggregating functions to the mix (via SOLR-1622).  This will allow application designers to do much more sophisticated math across a search result set than what is currently available via the StatsComponent.  In some ways, this can empower business intelligence applications on top of Solr (I realize it is just a small piece of the BI pie) as well as more sophisticated mathematical applications.
  3. Spatial Search!  It’s funny: a lot of people want spatial search, and Solr could have simply harnessed a really nice existing package (LocalSolr), just as many already do.  But by stepping back and looking at spatial in the context of the bigger picture of things that would be nice to have in Solr (see SOLR-773), the community will not only be able to implement spatial search (leveraging key pieces of LocalSolr where appropriate), but will also get a whole bevy of other features, including:
    1. Sort By Function – Instead of a one off that sorts solely by distance, why not enable Solr users to sort by any arbitrary function? I just committed this tonight via SOLR-1297.
    2. “Poly” Field Types – Thanks to SOLR-1131, Solr’s FieldType mechanism can be used to represent multiple underlying fields.  This is especially useful for representing things like points in an n-dimensional space, Cartesian Tiers (zoom levels) and other cool things.  Moreover, it shows the types of abstractions Solr can overlay on the already powerful Apache Lucene to provide even more functionality.
    3. Facet By Function – Sure, it’s great to put your distances into buckets, but why not put the result of any function into buckets?  See SOLR-1581.
    4. Spatial Query Parsers – aka geocoding – Parse things like street addresses, etc. and get back appropriate Query instances. See SOLR-1568 and SOLR-1578.
    5. Several different distance functions, including haversine (great circle), Manhattan and Euclidean (Solr actually now supports all p-norms as distance functions).  See SOLR-1302.  I even added in the ability to do String distance calculations using Levenshtein (edit), Jaro-Winkler and n-gram (basically all of the Lucene spellchecker distance measures), as well as any user defined String Distance calculation.
    6. “pseudo” fields – Instead of just hacking the ability to put a distance calculation into the result, why not allow the response to stream out “fields” based on things like functions or other user defined values?  See SOLR-1298.
  4. Field Collapsing – I haven’t had time to work on it, but I suspect Field Collapsing will finally make it into 1.5.  Field Collapsing allows Solr to “roll-up” similar results much like you see on many Internet search sites that indent results from the same domain.
  5. Payload and Span Query support – Solr’s been able to index payloads for some time now, but it still requires a user to hook in their own query parser support.  It would also be really great to see functions that can work on payloads, too. See SOLR-1337 and SOLR-1485.
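
To make the sort-by-function and facet-by-function ideas from item 3 a bit more concrete, here is a minimal Python sketch of the concept only. It is not Solr code and says nothing about how SOLR-1297 or SOLR-1581 are implemented: the documents, the `distance_from_origin` function and the bucket width are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical documents; in Solr these would be values of indexed fields.
docs = [
    {"id": "a", "x": 3.0, "y": 4.0},
    {"id": "b", "x": 1.0, "y": 1.0},
    {"id": "c", "x": 6.0, "y": 8.0},
    {"id": "d", "x": 0.0, "y": 2.0},
]

def distance_from_origin(doc):
    """An arbitrary function of a document's field values (illustrative only)."""
    return math.hypot(doc["x"], doc["y"])

def sort_by_function(docs, fn):
    """Order documents by the value the function yields for each one."""
    return sorted(docs, key=fn)

def facet_by_function(docs, fn, width):
    """Count documents per bucket of size `width` over the function's value."""
    counts = Counter(int(fn(doc) // width) * width for doc in docs)
    return dict(sorted(counts.items()))

sorted_ids = [d["id"] for d in sort_by_function(docs, distance_from_origin)]
buckets = facet_by_function(docs, distance_from_origin, width=5)
```

The point is that nothing here is specific to distance: swap in any other function of the fields and the same sorting and bucketing still apply.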

Of course, as I always say, “in open source, you never know where the next good idea is going to come from”, so I have total faith that the Solr community will come up with a plethora of other great new features, as well as the usual bug fixes, etc.

Fun with Solr Functions


For a long time now, Solr has had a good chunk of functions available for use to boost relevance based on the content of a field, but I’ve always been on the user side of them and never on the writing side.  At least, that is, until recently.  This week I have been putting the finishing touches on an article on using Lucene and Solr for spatial search.  As part of the article, I had a chance to write up a bunch of new functions for Solr, including several for calculating distance between two points.  These functions are:

  1. hsin and ghhsin (geohash based haversine) – Calculate the Haversine distance (Great Circle) between two points on a sphere
  2. Lp norm – i.e. the 1-norm (Manhattan distance), 2-norm (Euclidean), etc.
  3. radian/degree converter
  4. Geohash converter – convert latitude/longitude pair to a geohash value.
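
For readers who want to see the math behind the first two items, here is a rough Python sketch of what a haversine and an Lp-norm distance calculate. This is not the Solr source; the function names and argument order are my own assumptions for illustration, and the haversine version expects latitude and longitude in radians.

```python
import math

def haversine(lat1, lon1, lat2, lon2, radius):
    """Great-circle distance between two points on a sphere (inputs in radians)."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

def p_norm_distance(p, a, b):
    """Lp distance between two equal-length vectors (p=1 Manhattan, p=2 Euclidean)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# A quarter of the way around a unit sphere along the equator.
quarter_turn = haversine(0.0, 0.0, 0.0, math.pi / 2, 1.0)

# The same pair of points under two different norms.
manhattan = p_norm_distance(1, (0, 0), (3, 4))
euclidean = p_norm_distance(2, (0, 0), (3, 4))
```

Passing the radius in as a parameter keeps the function unit-agnostic: hand it a radius in miles and you get miles back, kilometers and you get kilometers.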

On top of that, Yonik added in support for pretty much all of java.lang.Math, including cosine, sine, tangent, etc., so now Solr really has some very significant mathematical capabilities, all of which can be powered by the values selected by doing a search.  Not to mention, they can be used for filtering too, using the FunctionRangeQParser (frange for short).

As an example, here are a few Solr requests I generated for my article:

http://localhost:8983/solr/select/?q={!func}recip(hsin(0.78, -1.6, lat_rad, lon_rad, 3963.205), 1, 1, 0)

http://localhost:8983/solr/select/?q=*:*&fq={!frange l=0 u=10}dist(2, 32, -79, lat, lon)

In the first example, I’m scoring all my docs based on the great circle distance from a specific point on the globe, while in the second, I’m filtering docs by distance, keeping only those whose distance from the given point falls between 0 and 10.
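
The requests above are written out for readability, with raw spaces and braces in them. If you are sending them from code rather than pasting them into a browser, you will probably want to URL-encode the parameters. Here is a small sketch using only the Python standard library; the `solr_url` helper is a hypothetical name of my own, and the base URL is simply the one from the examples.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL taken from the example requests; adjust for your own install.
SOLR_SELECT = "http://localhost:8983/solr/select/"

def solr_url(params):
    """Build a select URL with properly encoded query parameters."""
    return SOLR_SELECT + "?" + urlencode(params)

# Score every document by a function of its great circle distance.
score_by_distance = solr_url({
    "q": "{!func}recip(hsin(0.78, -1.6, lat_rad, lon_rad, 3963.205), 1, 1, 0)",
})

# Match everything, then keep only docs whose distance is between 0 and 10.
filter_by_distance = solr_url({
    "q": "*:*",
    "fq": "{!frange l=0 u=10}dist(2, 32, -79, lat, lon)",
})

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(urlsplit(filter_by_distance).query)
```

A side note on the first request: recip(x, m, a, b) computes a/(m*x + b), so with the arguments 1, 1, 0 it works out to 1 over the distance, which is why closer documents score higher.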

I’ve also seen functions used in all sorts of different ways.  One customer was even using them to simulate alternate scoring models by using the ExternalFileValueSource.  Additionally, one of the nice things they can also do is give you much higher precision document/field boosts.

Up until now, I had never realized how dead-simple they are to write.  Basically, take in a ValueSource, convert it to a DocValues instance, which gives you the value of the field for each document, and then do your calculation.  Finally, register it in the solrconfig.xml.  For more on this, see the SolrPlugins wiki page.
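
The shape of that recipe can be modeled in a few lines. To be clear, the sketch below is plain Python and not the Solr or Lucene API: `FieldSource`, `Rad2Deg`, `REGISTRY` and the toy `index` are all hypothetical stand-ins, meant only to show the flow of wrapping a source, getting per-document values from it, and applying a calculation.

```python
import math

class FieldSource:
    """Hypothetical stand-in for a value source that reads a single field."""
    def __init__(self, field):
        self.field = field

    def values(self, index):
        # Return a per-document lookup: doc id -> value of the field.
        return lambda doc_id: index[doc_id][self.field]

class Rad2Deg:
    """Hypothetical function that wraps another source and converts to degrees."""
    def __init__(self, source):
        self.source = source

    def values(self, index):
        inner = self.source.values(index)
        # The calculation happens per document, on top of the wrapped values.
        return lambda doc_id: math.degrees(inner(doc_id))

# Stand-in for the registration step (done in solrconfig.xml in real Solr).
REGISTRY = {"deg": Rad2Deg}

# A toy "index": list position plays the role of the document id.
index = [{"lat_rad": 0.0}, {"lat_rad": math.pi / 2}]

fn = REGISTRY["deg"](FieldSource("lat_rad"))
per_doc = fn.values(index)
degrees = [per_doc(doc_id) for doc_id in range(len(index))]
```

Because a function takes a source and is itself a source, functions nest naturally, which is what makes expressions like recip(hsin(...)) possible.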

Finally, keep an eye on Solr 1.5, as I have an inkling to add what I’m tentatively calling aggregating functions, which I think can be used to power more business intelligence applications.  These aggregating functions would be a significant extension of Solr’s StatsComponent and allow for much more expressive calculations over result sets.