zen of coding

Let’s talk about the “Search”…

Not sure how to get this post started exactly, but let me first say that: “I am not a Doctor… Err… Search Engine specialist”…

Now, I think, it is an interesting topic that a few people might find helpful and perhaps (and hopefully) some could even provide a little further insight in the comments as my dear readers often do…

So does your app need a search feature?.. More than often — it does.
(Do I need to mention that there are about 92308.6147 solutions that exist out there?)

I am going to go over some of the findings, headaches and success we’ve had while implementing a super-cool-and-robust search feature.

However, before proceeding, let’s take a few things into consideration:

  1. The search feature has to be fast and flexible; a simple “LIKE” is not going to cut it in this case.
    If this is beyond your current needs you might stop reading here :)
  2. MySQL’s, MyISAM engine has a lovely full-text search capability, but it is lacking in some areas. (Unfortunately too many details to list, but they are out there).
  3. MySQL’s, InnoDB does not have a lovely full-text search capability, but it does offer better performance (i.e. no table locks) and is the only supported option (at least at the time of this writing) on Amazon’s RDS.
  4. No need to try and beat Google at this game.

Given all the potential options, as always, it is extremely important to use the right tool for the right job. (So anything below may be a guiding light for your project or a complete dead-end, but at least, and hopefully, you’ll get to that realization sooner than later).

Let’s plow on…

I should share some of the high-level specifics of the app where the super-duper search engine was required.

In general, it is a rather active (yet simple) forum application (CakePHP, of course) with over a million comments, and with a new post or comment coming in approximately every few seconds or so (Decent LAMP stack powered by AWS).

Therefore we had to evaluate some options to make sure (as mentioned) we use right tool for the right job.

  1. MyISAM full-text search. Excellent for simple needs, not powerful enough for our requirements. Let me say a few quick points about it… Default index is four characters, and we actually had a business requirement to have at least a two character search, under any circumstances. Yes, you can index with 2 chars, but the performance comes to a grinding halt (OK, I am being a bit dramatic here, but it does nothing to improve what is inherently “not so great”). Not to mention that after moving to InnoDB we simply could not use MyISAM anymore.
  2. Google’s custom search. Why not let the best in business handle your needs?
    Well, to fully utilize the best of the best, you need to have a business account (to allow for customized branding, no-ads, etc. which is very important in some cases). You can certainly look-up some of the pros and cons and options at the Google’s custom search pages. In most cases I would stop right here and suggest that people give it a shot. (But then the post would be too boring)…
  3. Google Appliance. One of the issues that we faced right off the bat, was the fact that in order to access the app (and consequently the search feature) you had to be authenticated. Which, from my research (albeit somewhat limited) could only be done by hosting the google appliance. Obviously having your own little piece of google is cool, but the price and overhead of maintenance are certain drawbacks, which we weren’t ready to face, at least just yet.
  4. Lucene. A very popular and quite powerful tool for database indexing and searching. There are a few implementations out there… The most important for us, cake-people, is the Zend PHP version. While the tool is great (and is actually used by CakePHP’s own manual Update: CakePHP manual is now powered by Sphinx) it has a few extremely strange behavioral problems. Without going into much detail, there are 3 magic numbers (options), which control the auto-optimization of your index.
    Again, I am not a Lucene expert, but after trying out Lucene in a real-world application the gravity of attempts to properly adjust the auto-optimization options in order to keep the application performing smoothly seemed liked trying to control a magic carpet flight.
    Alrighty, to be fair, if you have a rather static site (similar to cake’s manual) there should be no issue with using Lucene. Needless to say, there are a few CakePHP-specific ways, which show you how to integrate Lucene into your environment. However, for a highly dynamic site (i.e. the forum) the auto-optimization of the index would bring down the application and make everyone cry, not to mention the indexing of the existing content would take hours upon initial installation.
  5. Sphinx. While hesitant at first, we’ve decided to give it a shot… One issue is that you have to install sphinx as as service (daemon) running on your web server. In many cases (smaller applications or shared hosting) it could be a showstopper. That being said, the installation is quite simple, and very little effort is required as far as any maintenance. (It also has a very low foot-print and so far, after a few months of real-world testing, it has not impacted server performance in any way).

Well… now you can probably guess that Sphinx was our final candidate and ultimately the chosen solution. A few things that made my colleagues and I very impressed was that the indexing is extremely fast. Compared to Lucene the initial indexing only took minutes for about one million records.
As well, and quite importantly, updating the index for any new posts/comments is also very fast.

By default the results are filtered by relevance as well as the date of the post (well, at least in our app it had an importance).
For example, most relevant and recent posts would be at the top of the resultset of a search, while older, yet still relevant results would appear lower.
(Word highlighting and other “neat” features are also available, but justly so… with other search tools as well).

Now then, what about the actual implementation?

Once the decision has been made to try out Sphinx, it was actually rather simple…

First of all a HUGE “thank you”, to the creator of this excellent behavior.
It has performed flawlessly in both CakePHP 1.2 and 1.3… (certainly some adjustments might be required for anyone’s specific needs, but the foundation, which has been laid down, is outstanding).

Once the behavior is properly attached to the required models, the Sphinx configuration couldn’t be easier:
(I will skip over the defaults and just point out what was required to get this thing off the ground)

source main
This is where the overall configuration of the Sphinx search engine is stored, as well as our initialization query to get thing up and running:

[cc language=”sql”]
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO forum_counter SELECT 1, MAX(id) FROM forum_comments
sql_query = SELECT id, category_id, topic_id, user_id, body, UNIX_TIMESTAMP(created) AS created FROM forum_comments WHERE active = 1

sql_attr_uint = topic_id
sql_attr_timestamp = created

sql_query_info = SELECT * FROM forum_comments WHERE id = $id
[/cc]

Yep, besides any server-side defaults this is all that was custom-tailored and needed to get things going.
I hope you see how simple the queries are and can utilize the setup in your app.
(Notice, that we are using cake’s excellent counter cache here).

source delta : main
[cc language=”sql”]
{
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, category_id, topic_id, user_id, body, UNIX_TIMESTAMP(created) AS created FROM forum_comments WHERE active = 1 AND id > ( SELECT max_doc_id FROM forum_counter WHERE counter_id = 1)
}
[/cc]

This little snippet controls the “delta”, i.e. the difference between the original index and, well, any new additions to the forum.
Notice the max_doc_id, which is referring to the Sphinx index.

Again, besides the defaults (and attaching the above-mentioned behavior), this is all that was needed to be done to get a really great search engine working in our app.

I know that this has become a rather long post already, so I’d like to cut it short right about now…

  1. Please ask any specific questions in the comments. I do realize that everything above is quite a generic overview.
  2. Sphinx doesn’t play well with UUID’s, hopefully someone will prove me wrong ;)
  3. Default search return is 1,000 results… in most cases this is more than plenty, but hopefully someone can show us a cake-way to bring back even more.

Well then, if you’ve made it this far, next round of beers is on me ;)

  • Christopher Vrooman

    Doesn’t Sphinx still require that everything you index have a unique integer? So UUIDs are out. And if you were indexing more than 1 table, how did you resolve the necessity of keeping the IDs unique when Cake’s tables historically are auto-increment?
    I need to add search capability to a highly dynamic site, and I looked at Sphinx, but those two points were showstoppers when I wanted to index information from many tables and not all tables where initially planned to be auto-increment.
    I know you can bump the auto-increment +2, +3, +5, etc, but eventually the indices will duplicate between tables and then, to quote from Sphinx’s site: “… and a 300 pound gorilla will climb out of your monitor and throw barrels at you…” unless extreme care and extra precautions are taken.

  • Interesting article. I wrote a Zend_Search_Lucene datasource for CakePHP. While it was a great exercise and taught me a lot about both Cake and Zend, in the end I’m a little disappointed because, as you pointed out, it tends to be slow. I usually only index once per day via a cron job, and the performance still isn’t great. I often just find myself using MySQL instead. Maybe I haven’t investigated the optimization methods enough.

    But I think I will give Sphinx a try, since, as you said, MySQL is pretty limited, especially if you want a full-site search and not just one model (e.g. products). The biggest hurdle for me will be setting up Sphinx, since all of the sites I develop are in a shared hosting environment.

  • jblotus

    Interesting post on a subject I have yet to work on. In my latest site I need to search rental listings by three criteria:

    1) location (city, state, or similar)
    2) searching a description for the text (if someone looked for Folly Field, it might be in a listing for Charleston or Hilton Head Island)
    3) searching tags or keywords

    So far I have been using a city/state search only that uses autocomplete to help find the right city. That works ok for my site but it would be cool to have a more advanced search option.

  • Pingback: Are there any other PC2Phone reliable services like VoIP Raider? | Custom oil painting()

  • jblotus

    Also I have heard some good things about Yahoo BOSS (build your own site search). Anyone know how that stacks up?

    • Don’t rely on BOSS. It will probably be canned before long. You can search for Save BOSS, but I don’t think the campaign will succeed.

  • You’re right when you say Lucene is hard to configure.
    Nevertheless, it’s worth a try using SolR, which is based on Lucene and totally rocks.
    Zend_Lucene_Search is painfully slow, SolR with the SolR PHP client is quite good.
    With the dataimport functionnality of SolR, you can index millions of rows from a database in less than 5 minutes.

  • @Guillaume

    Thanks for that suggestion as well.

  • Sorry for the very late comment, It’s my first time reading your article. Anyway, when you said the CakePHP docs moved from Lucene to Sphinx, you mean that the new CakePHP docs are built using the “Sphinx Python Documentation Generator” and “Sphinx Search Server”?

    Reading on Mark Story’s article, it looks like he was referring to the former and for your example above, your were referring to the latter. Mark’s article did not mention that the CakePHP docs used the Search Server, though. Doing some search on the Interwebz, both products are different from each other (and seems like they are not related, and there’s also a Sphinx by CMU, but that’s another story).

    Just added in my thoughts for clarification. Thanks!

  • teknoid

    @OJ Tibi

    Thanks for point this out, I don’t know the specifics of CakePHP installation, I just mentioned it since Lucene is out of the picture now for sure. The article summarizes my personal experience with Sphinx server.

  • Thx @teknoid

    After a year from the original post – just wondering if you’d still recommend Sphynx for implementing site search, or whether there are new kids on the block?

    I’m building an app based around a friend network (yeah, me too) and I need to be able to return results only where there is an established connection between two users. Ideally queries would work in a Cake-style find / paginate method so I can do this in a neat way.

    I’m researching options after implementing the CDC search plugin but then realising the app needs fulltext style searches rather than basic ‘LIKE’ queries.

    Either way – I’ve been looking into search plugins and found this comparison really useful.

  • teknoid

    @Phil Sheard

    Yes, it would still be on top of the list for me.
    However… there are some new players in the field in the form SaaS providers.
    If you don’t mind paying extra for the search capability.

    I don’t have personal experience with them, but you can check out:
    http://www.searchblox.com/
    and more:
    http://stackoverflow.com/questions/2168634/hosted-full-text-search-solutions

    The other option is Solr, which is mentioned in the comments… but I have not had personal experience with it yet.

  • Thanks @teknoid

    I decided to try Sphinx. I made a couple of mistakes in setup on Ubuntu but now it’s running I’m delighted.

    The integration with the behaviour you linked to is perfect = you get access to the power of the sphinx complex search methods, but then the results come back as a Containable, well formed result set. This works for me, as I can then process these results based on matching conditions in the model assoiciations.

    IMHO Sphinx + Xumix’s behaviour should be the default search solution for anyone who has access to the server and is able to compile and run the daemon.

    Thanks again

  • Solr is only good for static sites, to update the documents you need to delete and insert them again…

    • We have used Solr quite a bit for some larger projects, and it’s very flexible. It’s dynamic fields feature, which can map to CakePHP’s internal data types makes it very easy to use an IndexableBehavior on models which will automatically index selected fields or whole documents on create/update/delete. It’s coming with UUID support and is really fast for large data sets.

  • I’ve been working to implement Sphinx into my website, so far so good. The only problem I’ve come across is that the result set does not fetch 3rd level models (defined with the Containable behavior):

    Business
    Branch
    City

    City is not fetched anymore. Any thoughts or ideas?

    • HI, Mauro! it’s been 3 years since you posted it… I wonder if you have solved the problem? i’ve come across this issue too. the only difference is that cake does not fetch any related model. If you have solved it could you help me ? please…. this would make my day!!

%d bloggers like this: