Making The Most Of Verity

Search functionality has become a standard feature of all major Web sites. The typical search box and button found on home pages across the Net is considered the ultimate in user-friendly design: users type in what they're looking for, and the search engine finds it quickly and easily.

By applying the tips and tricks illustrated in this article, developers can augment the Verity search engine that's packaged with ColdFusion to create a more robust - and scalable - search engine. All it costs is a little time and ingenuity.

ColdFusion Server comes packaged with the Verity search engine, a tool that makes short work of indexing, searching and retrieving information stored in virtually any format on Web and file servers. Yet the version of Verity included with ColdFusion Server provides only a limited subset of the functionality and features that are part of Verity's enterprise-level "Information Server."

This article explores some novel ways CFML and Verity can be implemented to build a more scalable search engine - and in several cases overcome some of the limitations imposed by the built-in Verity search engine.

Background Overview
All Verity functions can be performed through CFML templates using built-in ColdFusion tags. These tags are well documented in the CFML Language Reference Guide included with ColdFusion Studio. For reference, we recap them here:

  • To create, delete, map, repair or optimize a Verity collection:
    <cfcollection action="action"
    collection="collection"
    path="implementation directory"
    language="language">

  • To update, add or delete keys from a collection or to purge or refresh a collection:
    <cfindex collection="collection"
    action="action"
    type="type"
    title="title"
    key="id"
    body="body"
    custom1="custom1"
    custom2="custom2"
    urlpath="url"
    extensions="file_extensions"
    query="query_name"
    recurse="yes/no"
    external="yes/no"
    language="language">

  • To perform searches on a Verity collection:
    <cfsearch name="search_name"
    collection="collection_name"
    type="criteria"
    criteria="search_expression"
    maxrows="number"
    startrow="row_number"
    external="yes/no"
    language="language">

    Types of Verity Collections
    Three types of collections can be created using Verity:

  • File: An index of one particular file
  • Path: An index of all files of a specified extension within a specified directory
  • Custom: An index of database data

    Certain situations arise where it's not clear which of the three types a collection should be. Sometimes a collection needs to be a mixture of different types of data. (The implementation of such a scenario will be discussed later.) Other caveats that occur with the file-type collection are discussed in the Allaire knowledge base, www.allaire.com/Handlers/index.cfm?ID=1600&Method=Full, and they're worth reading.

    Types of Verity Searches
    Verity searches come in two types - simple and explicit. Depending on the functionality required from your search-engine implementation, one type may be preferred over the other.

  • Simple: These types of queries allow for simple word or phrase searches using comma-delimited strings and/or wildcards. When using commas, each one is treated as a logical OR; if commas are omitted, the string is treated as a phrase. In addition, simple operators such as AND, OR and NOT can be used in simple queries.
  • Explicit: These types of queries allow for more refined searches using operators and modifiers. Operators include: <, <=, =, >, >=, ACCRUE, AND, CONTAINS, ENDS, MATCHES, NEAR, NEAR/N, OR, PARAGRAPH, PHRASE, SENTENCE, STARTS, STEM, SUBSTRING, WILDCARD and WORD.

    The "Developing Web Applications with ColdFusion" section of the online docs included with ColdFusion Studio provides excellent documentation on the types of searches that can be performed on a Verity search engine.
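
    As a rough illustration of the two search types, the sketch below runs one simple and one explicit query against the same collection. It assumes a collection named image_collection (the name used in the examples further on); the query names and search terms are hypothetical.

    ```cfml
    <!--- Simple query: the comma acts as a logical OR (dog or puppy) --->
    <cfsearch name="simple_results"
        collection="image_collection"
        type="simple"
        criteria="dog, puppy">

    <!--- Explicit query: an operator restricts the match to the title field --->
    <cfsearch name="explicit_results"
        collection="image_collection"
        type="explicit"
        criteria="CF_TITLE <CONTAINS> dog">

    <cfoutput>
    Simple matches: #simple_results.recordcount#<br>
    Explicit matches: #explicit_results.recordcount#
    </cfoutput>
    ```

    Running both against a test collection and comparing the record counts is a quick way to see which type suits a given search form.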

    Implementation Techniques
    This section details implementation techniques that can be used to improve your Verity search engine code and even bypass the apparent limitations set by the watered-down version of Verity. All these examples work under a Windows NT environment with a Microsoft SQL Server 7.0 DBMS, but can be modified to work under any other environment.

    Overcoming Two Custom Fields per Collection
    First we address the limitation of having only two custom fields per collection. Some situations call for indexing more than two. For example, you may want to index a database table that has more than four fields worth indexing (four is the limit within a Verity collection because only the body, title, custom1 and custom2 fields can hold custom information). A simple solution is to combine several fields into one, separating the values with a chosen delimiter. For this to work you must be certain that the data in any combined field will never contain that delimiter. An example of how to create such a collection is located in Listing 1.
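
    The idea can be sketched as follows. This is not the article's listing, only a minimal illustration: the datasource image_dsn and every column except image_group_id are hypothetical, and the pound sign is assumed to be a delimiter that never occurs in the data. The fields are joined in SQL because, when <cfindex> is given a query, each attribute names a query column.

    ```cfml
    <!--- Hypothetical datasource and columns. Inside cfquery a literal pound
          sign is written as ##, so '##' reaches the database as '#'. --->
    <cfquery name="get_images" datasource="image_dsn">
        SELECT image_id,
               image_name,
               image_description,
               image_keywords,
               CAST(image_group_id AS varchar(10)) + '##' + image_photographer
                   + '##' + image_location AS combined_fields
        FROM tbl_image
    </cfquery>

    <!--- type="custom" indexes query data; each attribute names a column --->
    <cfindex collection="image_collection"
        action="update"
        type="custom"
        query="get_images"
        key="image_id"
        title="image_name"
        body="image_description"
        custom1="image_keywords"
        custom2="combined_fields">

    <!--- At search time, split the combined field back into its parts --->
    <cfsearch name="search_images" collection="image_collection"
        type="simple" criteria="dog">
    <cfoutput query="search_images">
        Group #listgetat(custom2, 1, chr(35))#,
        photographer #listgetat(custom2, 2, chr(35))#<br>
    </cfoutput>
    ```

    One caveat with this sketch: ColdFusion list functions skip empty elements, so a blank value in the middle of the combined field shifts the positions of the values after it. Storing a placeholder for blank values avoids that.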

    Combining Database and File Data
    Under certain circumstances you may want to create a collection that's a combination of database data and file data. For example, imagine that tbl_image from Figure 1 had an additional attribute, image_text, that represented the filename of a text file that contained information associated with that image. If we wanted to create a collection that included the text in the file specified by the attribute image_text, we'd first have to query the database for the image information, then create a collection of type "file." ColdFusion's <cfindex> tag does the rest by automatically looping through the query to index the files from the paths specified in the query. Listing 2 gives an example of how this would be done.

    In Listing 2 we've dynamically created a Verity file collection that includes database information as well as data from a text file. This operation isn't limited to text files and can be performed with other types of files that Verity supports. Look closely at the code: you'll notice that when the URL attribute is created by the get_images query, an extra space is appended at the end. At first glance this may seem like a mistake, but it's deliberate and there's a good reason for it.

    When Verity performs searches on a collection, such as the one created in Listing 2, the value of the URL attribute returned by the search is a concatenation of the URL specified when the collection was indexed and the filename searched; that is, if we specified www.foobar.com as the URL, a search might return a result with the URL attribute something like www.foobar.com/file1.txt.

    ColdFusion sites that access content through URL parameters may not want files that are indexed to show up in the URL field of the returned-search query from Verity. This is where the space at the end of the URL attribute comes into play. It serves as a delimiter so that, when searches are performed, you can get the proper URL sans filename by simply applying the listfirst() function on the URL value returned. For example:

    <cfsearch collection="image_collection" name="search_images"
    type="SIMPLE" criteria="dog image" language="English">
    <!--- output all the URL values minus the concatenated filename added by Verity --->
    <cfoutput query="search_images">
    #listfirst(url, " ")#
    </cfoutput>

    Certain modifications can be made to Verity searches to make them more efficient. For instance, if you want to perform searches only on a particular image_group_id, you could use the following code:

    <cfsearch collection="image_collection" name="search_images" type="SIMPLE"
    criteria="(CF_CUSTOM2<starts>#url.image_group_id##chr(35)#)<AND>(#url.search_criteria#
    <OR>CF_TITLE<substring>#url.search_criteria#<OR>CF_CUSTOM1<substring>
    #url.search_criteria#<OR>CF_CUSTOM2<substring>#url.search_criteria#)"
    maxrows="1000" language="English">

    With this type of search in place Verity filters out all images that aren't of the image_group_id specified by the url.image_group_id parameter.

    Searches can be sped up by periodically optimizing Verity collections. Optimization can be performed either programmatically or through the ColdFusion Server Administrator. It's a good idea to create a template that programmatically optimizes your collections and to use the ColdFusion Scheduler to run it every night. A sample template would look like this:

    <cfcollection action="OPTIMIZE" collection="image_collection">
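
    The nightly run can also be registered from code rather than through the Administrator. The sketch below assumes the optimizing template is saved as optimize_collections.cfm under a hypothetical admin directory; the task name, URL, start date and start time are placeholders.

    ```cfml
    <!--- Registers (or updates) a task that requests the optimizing template
          once a day. Task name, URL, date and time are hypothetical. --->
    <cfschedule action="update"
        task="optimize_image_collection"
        operation="HTTPRequest"
        url="http://www.foobar.com/admin/optimize_collections.cfm"
        startdate="01/01/2001"
        starttime="3:00 AM"
        interval="daily"
        requesttimeout="600">
    ```

    Choosing a start time when traffic is lowest keeps the optimization from competing with live searches.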

    Finally, when performing searches on a Verity collection, certain words and characters in the search phrase will cause the search to fail with an error. To avoid this you can "clean" any search strings before you send them to Verity. A simple way to do this is to delete the offending characters and/or words. A utility that does this already exists in the form of a custom tag named <cf_verityclean>, which can be downloaded for free from Allaire's Developer's Exchange Site at www.allaire.com/developer/gallery.cfm.
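
    For readers who prefer to roll their own, a minimal cleaner might look like the sketch below. It is not the <cf_verityclean> implementation. It assumes the raw input arrives in url.search_criteria and takes the conservative approach of keeping only letters, digits, spaces and commas, which also discards accented characters.

    ```cfml
    <!--- Hypothetical cleaner; assumes url.search_criteria holds raw input --->
    <cfparam name="url.search_criteria" default="">

    <!--- Replace everything except letters, digits, spaces and commas --->
    <cfset clean_criteria = REReplace(url.search_criteria, "[^A-Za-z0-9 ,]", " ", "ALL")>

    <!--- Collapse runs of spaces and trim the ends --->
    <cfset clean_criteria = Trim(REReplace(clean_criteria, " +", " ", "ALL"))>

    <cfif Len(clean_criteria)>
        <cfsearch name="search_images"
            collection="image_collection"
            type="simple"
            criteria="#clean_criteria#">
    <cfelse>
        <!--- Nothing searchable is left, so skip the call to Verity --->
        <cfset no_criteria = "yes">
    </cfif>
    ```

    This sketch handles characters only. Words that Verity treats as operators would need a separate pass if they cause problems in your collection.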

    Scaling Verity for Clustered Web Server Environments
    Traditional implementations of Verity aren't ideal within clustered-server environments. Under clustered NT environments, collections could be stored on a separate file server that all the Web servers access via UNC paths or SMB mapped drives, but in our experience making network calls to collections via UNC paths is a slow process. A further problem with such an implementation is that the Web servers themselves are doing the searching; that is, the local Verity engine on each ColdFusion Web server takes up that server's CPU time to perform searches and updates on the various collections. Clearly the main function of a Web server should be to serve Web pages, and any CPU time taken for other tasks is highly undesirable. This situation is analogous to placing a DBMS on each server, then having each one serve Web pages and perform database queries.

    A more scalable and robust solution to this problem is to designate one server as a Verity server. This server will then take Verity search-and-update requests from all the Web servers through HTTP calls. To accomplish this without purchasing the full-scale version of Verity, CFML client and server templates must be implemented. The client template will reside on each of the Web servers and will be called when a Verity search or update is performed. Subsequently the client template will call the server template residing on the dedicated Verity server via an HTTP call. Each client HTTP call posts requests to the server template, and once the server template receives a request, it performs the desired action and returns results to the client template. The client template can then use this data in any fashion desired.

    Listing 3 provides a sample client template, and Listing 4 illustrates a complementary server template. Figure 2 gives an overview of the entire client/server model.
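
    To make the server side of the model concrete, here is a minimal sketch of what such a template could look like. It is not the article's listing: the form field names and the tab-separated output format are assumptions chosen for illustration.

    ```cfml
    <!--- Hypothetical server template on the dedicated Verity server.
          Form field names and the output format are assumptions. --->
    <cfsetting enablecfoutputonly="yes">
    <cfparam name="form.collection" default="">
    <cfparam name="form.criteria" default="">

    <cfif Len(form.collection) and Len(form.criteria)>
        <cfsearch name="results"
            collection="#form.collection#"
            type="simple"
            criteria="#form.criteria#"
            maxrows="100">

        <!--- One result per line: key, score and title separated by tabs --->
        <cfoutput query="results">#key##chr(9)##score##chr(9)##title##chr(10)#</cfoutput>
    </cfif>
    ```

    Because this template searches whatever collection name it is sent, it should be reachable only from the Web servers in the cluster.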

    Note: Although the purpose of the Verity client/server template is to make Verity scale better, calling the <cfhttp> tag is a potential bottleneck that limits the scalability of this implementation. Because of problems encountered with the single-threaded nature of the <cfhttp> tag, it's good programming practice to place a lock around all calls to it. This locking mechanism, which is responsible for the consequent scalability limit, causes multiple instances of templates that call <cfhttp> to wait for the release of the lock before execution.
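
    The locking described in the note can be sketched as follows. This is an illustration rather than the article's client listing: the server URL, the lock name and the form field names are hypothetical, and the response is assumed to hold one result per line with key, score and title separated by tabs.

    ```cfml
    <!--- Hypothetical client template; URL, lock name and field names are
          assumptions and must match the server template actually in use --->
    <cfparam name="url.search_criteria" default="">

    <!--- The named exclusive lock serializes cfhttp calls on this Web server --->
    <cflock name="verity_http" type="exclusive" timeout="30">
        <cfhttp url="http://verityserver.foobar.com/verity_server.cfm" method="post">
            <cfhttpparam type="formfield" name="collection" value="image_collection">
            <cfhttpparam type="formfield" name="criteria" value="#url.search_criteria#">
        </cfhttp>
        <!--- Copy the response while the lock is still held --->
        <cfset raw_results = cfhttp.filecontent>
    </cflock>

    <!--- Each line is one result; skip lines that lack all three fields --->
    <cfoutput>
    <cfloop list="#raw_results#" index="result_line" delimiters="#chr(10)#">
        <cfif ListLen(result_line, chr(9)) gte 3>
            #listgetat(result_line, 3, chr(9))#
            (score #listgetat(result_line, 2, chr(9))#)<br>
        </cfif>
    </cfloop>
    </cfoutput>
    ```

    Keeping only the <cfhttp> call and the copy of its result inside the lock keeps the lock short. Parsing and output happen after it is released.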

    Conclusion
    The built-in Verity search engine packaged with ColdFusion can be augmented by implementing the tips and tricks illustrated in this article. The result is a more robust and scalable tool, developed in a relatively short amount of time. Best of all, the scalability attained is free. It's a combination of features any developer will appreciate.

    More Stories By Bryan Murphy

    Bryan Murphy is the owner of GuardianLogic, Inc. (www.guardianlogic.com), an information security firm that provides application and network vulnerability assessments and hardening. He is also one of the authors of Metazoa (www.metazoa.ca), a security-enhanced content management system; Membrane, an application-level firewall; and MetaGuard, a CFC that provides role-based login, authentication, and access control. Bryan has been an ethical hacker since the old-school BBS days. Visit his blog at www.downgrade.org.

    More Stories By Shahriyar Neman

    Shahriyar Neman is CTO of the Next Network, an ASP that delivers total computing packages to small- and medium-sized businesses through the Internet. He holds a BA in computer science from NYU and is currently pursuing his master's degree.
