Build A Web Spider In 40 Minutes

Using ColdFusion to glue together Verity 97 and a popular offline browser results in a powerful but low-cost and easy-to-create searchable Web spider. Past issues of CFDJ have laid down a great foundation for creating and optimizing Verity collections.

This article expands on CFDJ's Verity-related articles by demonstrating how to create an Internet Web spider not unlike Yahoo. Since Verity and the offline browser utility do most of the work, simple ColdFusion code is used to create the searchable Web spider.

Background
According to whatis.com, a spider is "a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index." A spider is only one piece of a Web search engine. We also need a tool to create a Web site search index (or collection) and a tool to search against the index with the user's search criterion. We'll be using a tool called Teleport Pro for the spider and Verity and ColdFusion to create and search against the collection.

You may be asking, "Why would I want to build a searchable Web spider?" The Verity 97 engine that's included with ColdFusion Server is designed to index content stored on your file system or in your database. The Verity engine's inability to build indexes by spidering your Web site creates a few problems, especially for complex, dynamically generated content sites. Significant differences can exist between raw content that's stored on your server and the way this content is presented to your customers. Another issue is the inability to include external links from your Web site in the search index.

There are many options for creating a searchable Web spider. Searchable Web spider tools can cost more than $30,000. There are also several application service providers that offer Web spider search engines at a significant cost. Either of these two expensive options is great for well-funded projects, but what are the rest of us supposed to use?

You may be thinking that you should just write your spider in C or, even worse, write it in ColdFusion using the CFHTTP tag. But why write something that probably already exists? Besides, your valuable time should be used to improve your Web site. The real power of ColdFusion is its ability to glue together powerful widgets like Teleport and Verity to create a more powerful tool than either component alone.

Introducing the Spider, Teleport Pro
Teleport is designed to download Web sites to your hard drive for offline browsing, but we're going to use it to help create our searchable Web spider. When configured correctly, Teleport creates a mirror of Web sites to the file system while leaving paths, filenames, and URL parameters intact, for the most part. To create our searchable Web spider we're going to build a Verity collection from a locally stored mirror of a Web site. There are some issues with URLs containing characters that are not valid filenames, but we'll discuss them later.

Teleport can spider Web sites that use frames, image maps, HTML 4.0, CSS 2.0, DHTML, and XML. Teleport is multi-threaded and uses multiple virtual browsers to simultaneously download a Web site. There are different versions of Teleport, including one that's command-line driven, but we'll be using Teleport Pro, which costs about $40 for the full version and can be evaluated for free. Teleport Pro includes an easy-to-use GUI and a Web site download scheduler.

Setting Up the Spider
The first step in creating our searchable Web spider is to download and install Teleport Pro. Teleport Pro v1.29 supports Windows 95, 98, NT, and 2000, and is available in multiple languages. The evaluation version of Teleport can be downloaded at www.tenmax.com/teleport/pro/. Even though the evaluation version is limited to spidering 500 links, it'll be enough for this tutorial.

Now that Teleport is downloaded and installed on your development computer, we'll use it to create a mirror of the Web site we want to search against on the local hard drive. For this example, we're going to mirror HTML Goodies, a popular Web site for learning HTML.

From the Teleport Pro program, select File > New Project Wizard. Select the "Duplicate a Web site..." option and click the "Next" button. Enter "http://htmlgoodies.earthweb.com/" as the starting address. It can take a long time to download an entire Web site, so we'll limit the link search depth to two links from the starting point. Click on "Next." We only need text for our Verity collection, so select the "Just text" option and click "Next." Click "Finish" and save your Teleport project to "C:\SpiderData\HTMLGoodies.tpp".

To download all Web site file extensions and to prevent Teleport from adding .htm to the downloaded files, we must visit the Project Properties window. Select Project > Project Properties, then the "File Retrieval" tab. Select the "Retrieve all files..." option and uncheck all the "Retrieval Modes" as seen in Figure 1. Now select the "Browsing/Mirroring" tab and uncheck everything except "Replicate the directory..." as in Figure 2. Click on "OK" and save the project again by selecting File > Save Project.

Now we're ready to download the Web site. If you have a low bandwidth connection to the Internet, spidering and downloading can take several minutes. If you're behind a proxy server, or want to use a connection other than your Windows default Internet connection, you'll have to visit the File > Proxy Server or File > Connection windows. When you're ready, start your project by selecting Project > Start.

After the project has finished downloading the Web site, use Windows Explorer to browse the directory structure under "C:\SpiderData\". You'll discover the directory structure and files that Teleport Pro resolved by evaluating links from the HTML Goodies Web site.

Building the Verity Collection
Using the code in Listing 1, we can now create a new Verity collection that indexes the mirror of the HTML Goodies Web site that resides on the local file system. Our first task is to build the collection, which will act as a home to our Verity index.

<CFCOLLECTION
ACTION="CREATE"
COLLECTION="HTMLGoodies"
PATH="C:\CFUSION\Verity\Collections\">

The ACTION="CREATE" parameter creates the directory structure for the index and various support files. The COLLECTION parameter specifies the name for the new collection. PATH indicates the location where the collection indexes and supporting files will be stored.

After defining the collection, we need to populate it with the CFINDEX tag as follows:

<CFINDEX COLLECTION="HTMLGoodies"
ACTION="REFRESH"
TYPE="PATH"
KEY="C:\SpiderData\HTMLGoodies"
RECURSE="YES"
URLPATH="http://"
EXTENSIONS=".htm,.html">

The COLLECTION parameter specifies the collection name that was created earlier by CFCOLLECTION. ACTION="REFRESH" tells Verity that we'll be adding to our existing collection, which at this point is empty. When the TYPE parameter is set to "PATH", the KEY parameter specifies the parent directory that the collection will be populated from. RECURSE="YES" tells Verity to include data from all the subdirectories below the parent directory specified in the KEY parameter.

The string in URLPATH is prepended to the CFSEARCH URL attribute. In our case this results in "http://" being concatenated with our CFSEARCH results such as "htmlgoodies.earthweb.com/index.htm". The EXTENSIONS parameter filters Verity indexing on a comma-separated list of file extensions, so be sure to include all the possible extensions that may exist for the Web site that's being searched against. In our case, HTML Goodies consists mainly of static HTML files, so .htm and .html filtering is appropriate. More detailed information about creating and optimizing Verity indexes is available from previous CFDJ articles.

Creating the Web Spider Search Templates
Now that we have a Verity collection populated with files mirrored from the HTML Goodies Web site, we need to create the templates to receive the user's search criterion, execute it against the HTML Goodies collection, and display the search results.

The template Search.cfm (see Listing 2) is a standard form that allows users to input a search criterion and decide if they want to see a search summary. Search.cfm posts to Search_Action.cfm where the real work occurs.
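Listing 2 is not reproduced here, so the following is only a minimal sketch of what such a form might look like. The field name Search matches the FORM.Search variable used later in Search_Action.cfm; the checkbox name ShowSummary is a hypothetical name chosen for illustration.

```cfml
<!--- Search.cfm: minimal search form (sketch, not the original Listing 2) --->
<FORM ACTION="Search_Action.cfm" METHOD="POST">
    Search for:
    <INPUT TYPE="text" NAME="Search" SIZE="30">
    <!--- ShowSummary is a hypothetical field name --->
    <INPUT TYPE="checkbox" NAME="ShowSummary" VALUE="Yes" CHECKED> Show summary
    <INPUT TYPE="submit" VALUE="Search">
</FORM>
```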

Search_Action.cfm executes the search criterion against the HTML Goodies Verity collection and displays the formatted search results as seen in Figure 3.

The code for Search_Action.cfm is in Listing 3 and unlike the Search.cfm form, some of the code deserves explanation.

<CFSET TickBegin = GETTICKCOUNT()>
<CFSEARCH COLLECTION="HTMLGoodies"
NAME="Search_Results"
CRITERIA="#FORM.Search#"
TYPE="SIMPLE">
<CFSET loopTime = GETTICKCOUNT() - TickBegin>

The above CFSEARCH tag executes the search criterion entered by the user, #FORM.Search#, against the HTML Goodies collection and returns the resulting hits. It's important to note that CFSEARCH has a serious limitation: it throws an error if the search results exceed 64K in size. CFSEARCH will return a 64K limit error even if the CFSEARCH MAXROWS parameter is set to a low number. CFSEARCH may also throw an error if an invalid search criterion is entered. There are various custom tags available to filter bad search criteria, and TRY/CATCH blocks should be used around all CFSEARCH tags to trap error conditions.
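As a rough illustration of the TRY/CATCH advice, the CFSEARCH call could be wrapped as follows. This is a sketch rather than the code from Listing 3; the error message text is made up, and catching TYPE="Any" is a deliberately broad choice that you may want to narrow.

```cfml
<CFTRY>
    <CFSEARCH COLLECTION="HTMLGoodies"
        NAME="Search_Results"
        CRITERIA="#FORM.Search#"
        TYPE="SIMPLE">

    <!--- Trap any CFSEARCH failure, such as bad criteria or oversized results --->
    <CFCATCH TYPE="Any">
        <CFOUTPUT>
        Sorry, that search could not be run.
        Please simplify your search terms and try again.
        </CFOUTPUT>
        <!--- Stop here so the result-display code below never sees a missing query --->
        <CFABORT>
    </CFCATCH>
</CFTRY>
```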

Those of you who have done ColdFusion performance tuning will recognize the CFSETs around the CFSEARCH tag. GETTICKCOUNT() is a useful function because it returns the clock counter value in milliseconds (ms). In this case I'm using it to determine the execution time for CFSEARCH, but the same code can be used to optimize your code by timing sections of your CFML.

The next part of the code in Listing 3 simply displays some summary information about the CFSEARCH that just occurred. Search_Results.RECORDCOUNT is the total number of records returned from the CFSEARCH.
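Since Listing 3 is not shown here, the summary line might look something like the sketch below. It reuses the Search_Results query and the loopTime variable set above; the wording of the message is an assumption, and HTMLEDITFORMAT is added so that the user's input is displayed safely.

```cfml
<!--- Summary of the search that just ran (sketch, not the original listing) --->
<CFOUTPUT>
Your search for "#HTMLEDITFORMAT(FORM.Search)#" returned
#Search_Results.RECORDCOUNT# hits in #loopTime# ms.
</CFOUTPUT>
```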

<CFOUTPUT QUERY="Search_Results"
MAXROWS="#MaxLinks#">

The CFOUTPUT tag loops through the CFSEARCH results query and builds a new table row for each search hit. The MAXROWS parameter of CFOUTPUT limits the displayed results to 10 rows in this example because we've set MaxLinks = 10.
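A stripped-down version of such a loop is sketched below. The table layout is an assumption and not the article's actual markup; TITLE, SCORE, and URL are columns that CFSEARCH returns in its result query.

```cfml
<CFSET MaxLinks = 10>

<TABLE>
<!--- One table row per search hit, stopping after MaxLinks rows --->
<CFOUTPUT QUERY="Search_Results" MAXROWS="#MaxLinks#">
<TR>
    <TD>#Search_Results.CURRENTROW#</TD>
    <TD><A HREF="#Search_Results.URL#">#Search_Results.TITLE#</A></TD>
    <TD>#Search_Results.SCORE#</TD>
</TR>
</CFOUTPUT>
</TABLE>
```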

<CFSET Link= REPLACELIST
(Search_Results.URL,
"\index.htm", " ")>

There are quite a few interesting pieces to this seemingly simple REPLACELIST function, so we'll take it one step at a time. Search_Results.URL is fundamentally the relative file path to the search hit under C:\SpiderData\HTMLGoodies with the backslashes reversed to forward slashes to make it a valid URL. Search_Results.URL is also the result of prefixing the slash-corrected string with the string value specified earlier by the URLPATH parameter in the CFINDEX tag (see Listing 1). In our case, Search_Results.URL equals the URLPATH "http://" prepended to a string such as "htmlgoodies.earthweb.com/beyond/intbrowtest.html". This results in the fully qualified URL "http://htmlgoodies.earthweb.com/beyond/intbrowtest.html".

When the Teleport Pro spider encounters a link such as http://htmlgoodies.earthweb.com/primers/, it can't determine the default template name so it assumes the name is index.htm. The actual URL could be http://htmlgoodies.earthweb.com/primers/default.cfm, but if the link to this page doesn't include the filename, Teleport Pro will mirror it to the local file system as htmlgoodies.earthweb.com\primers\index.htm. The REPLACELIST function replaces any "index.htm" strings with a space to eliminate this issue.

REPLACELIST allows the replacement of multiple substrings with different values. You might be asking yourself, since we're replacing only one string, why use REPLACELIST instead of the REPLACE function? It was done in anticipation that you may want to replace "-" with "?" for Web sites that use URL parameters. Teleport Pro downloads a Web site to the local file system, and there are a few characters that are valid in a URL but are not legal as file system characters, including the question mark character used for URL parameter passing. Teleport Pro converts all illegal file system characters found in the URL to "-". For example, Teleport Pro stores the URL http://www.semiconbay.com/go.cfm?CID=17987 on the file system as www.semiconbay.com\go.cfm-CID=17987 because the question mark is not a valid file character. Modifying REPLACELIST to (Search_Results.URL, "\index.htm,-", " ,?") will put the question mark back into the URL, repairing the change made by Teleport Pro.
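To make the effect of the modified call concrete, here is a small sketch that uses a hard-coded string in place of Search_Results.URL. The variable name StoredURL is hypothetical, and the stored form of the URL assumes Teleport Pro replaced the question mark with "-" as described above.

```cfml
<!--- StoredURL stands in for Search_Results.URL (hypothetical value) --->
<CFSET StoredURL = "http://www.semiconbay.com/go.cfm-CID=17987">

<!--- "\index.htm" becomes a space and "-" becomes "?" --->
<CFSET Link = REPLACELIST(StoredURL, "\index.htm,-", " ,?")>

<!--- Displays http://www.semiconbay.com/go.cfm?CID=17987 --->
<CFOUTPUT>#Link#</CFOUTPUT>
```

One caveat with this approach: every "-" in the string is replaced, so a URL that legitimately contains a hyphen in its host or path would also be altered. It is probably best reserved for sites where hyphens appear only as stand-ins for the question mark.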

The following code displays the summary results for each search hit and highlights any keywords that were discovered.

<CFSET SearchWord =
LISTFIRST(FORM.Search,' ')>

<CFSET Context = REPLACENOCASE
(Search_Results.SUMMARY,
"#SearchWord#","<SPAN STYLE=
'background:##FFFF00'>
<B>#UCASE(SearchWord)#</B>
</SPAN>",'ALL')>

#Context#<BR>

The search criterion entered by the user may include logical operators and multiple search terms so the LISTFIRST function is used to find the first word of the search criterion. If the user enters "browser AND test", SearchWord will equal "browser".

The next CFSET highlights the SearchWord if it exists in the summary results. Search_Results.Summary is a field that's returned by CFSEARCH. When CFINDEX is called, Verity "automagically" guesses which three sentences are most important to the document being indexed and stores up to 500 characters by default. The number of sentences stored and the maximum character size can be changed by editing "style.prm" in the file/style directory of any Verity collection. A good paper on this and other Verity details is available at www.allaire.com/handlers/index.cfm?ID=18429.

As seen from Figure 3, the summary results generated by Verity don't necessarily include the keyword we want to highlight. If we wanted to ensure that we showed the keyword context, CFFILE could be used to retrieve the HTML for each hit. We could then remove the HTML tags using REREPLACE, locate the first occurrence of the keyword, and then show the keyword highlighted in context. The first version of the Web spider used this method, but execution times increase by an order of magnitude. With search engines, speed is important so the trade-off was probably a good one.
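The slower CFFILE approach described above might be sketched as follows. This is not the code from the first version of the spider. It assumes the fragment sits inside the CFOUTPUT QUERY="Search_Results" loop, that SearchWord has already been set, and that Search_Results.KEY holds the full local path of the mirrored file (which is my understanding of path-type collections, but verify it on your server). The variable names PageHTML, PageText, Pos, and StartPos are hypothetical.

```cfml
<!--- Read the mirrored file for this hit from the local file system --->
<CFFILE ACTION="READ"
    FILE="#Search_Results.KEY#"
    VARIABLE="PageHTML">

<!--- Crude tag stripping: replace anything that looks like a tag with a space --->
<CFSET PageText = REREPLACE(PageHTML, "<[^>]*>", " ", "ALL")>

<!--- Find the first occurrence of the keyword, ignoring case --->
<CFSET Pos = FINDNOCASE(SearchWord, PageText)>

<CFIF Pos GT 0>
    <!--- Take up to 250 characters, starting at most 100 before the keyword --->
    <CFSET StartPos = MAX(Pos - 100, 1)>
    <CFSET Context = MID(PageText, StartPos, 250)>
<CFELSE>
    <!--- Fall back to the Verity summary when the keyword is not found --->
    <CFSET Context = Search_Results.SUMMARY>
</CFIF>
```

The resulting Context string can then be passed through the same REPLACENOCASE highlighting shown earlier. Reading and stripping a file for every hit is what makes this variant so much slower than relying on the stored summary.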

Conclusion
Quite a few improvements can be made to our simple Web spider. By mapping a Web server virtual directory to C:\SpiderData\, we could allow users to view cached versions of the Web pages similar to the way Google does. Scalability issues with Verity might be improved by distributing the load across multiple servers.

Adding Web sites to the Teleport spider can be done more powerfully by using the command-line driven version of Teleport or by replacing it with an even better offline browser. Downloading Web sites and updating and optimizing the Verity collections can be fully automated using the Teleport and/or the ColdFusion scheduler. This article should be a helpful starting point for building your own low-cost searchable Web spider. To view my implemented version, go to www.semiconbay.com, enter your search criterion into the "Search the Site" box in the top left, and click "go."
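As one possible way to automate the collection refresh with the ColdFusion scheduler, a task could be registered with CFSCHEDULE along the lines below. This is a sketch under several assumptions: BuildCollection.cfm is a hypothetical template containing the CFINDEX call from Listing 1, and the task name, URL, start date, and start time are placeholders to adapt to your own server.

```cfml
<!--- Register (or update) a daily task that re-runs the indexing template --->
<CFSCHEDULE ACTION="UPDATE"
    TASK="RefreshHTMLGoodies"
    OPERATION="HTTPRequest"
    URL="http://127.0.0.1/spider/BuildCollection.cfm"
    STARTDATE="01/01/2002"
    STARTTIME="02:00 AM"
    INTERVAL="Daily">
```

The Teleport Pro download would still need to finish before this task runs, so the two schedules should be staggered accordingly.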

Editor's note: This article pertains to CF 4.5 or lower. Creating a Web spider using CF 5 will be covered in a future article.

About Michael Barr

Michael Barr is the Director of Internet Technologies at the Silicon Valley company semiconbay (www.semiconbay.com). Michael is an Allaire Certified Professional and is an active participant in the Bay Area ColdFusion User Group (BACFUG).
