Virtual Arachnophobia

There once was an emporium that sold the finest treasures in all the world. Their service was second to none, and their prices were more than reasonable.

But they had a problem. The problem was that only a few knew where the emporium was, and most didn't even know it existed. Sadly, the emporium failed because their potential customers couldn't find them.

A similar tale of woe could be told of many would-be-great Web sites: sites destined to fail because they ignore the first rule of retail success, "Location, location, location." What is true in the physical world is even truer of the virtual world. If your customers can't find you, they're not your customers.

This story was also true for me. I had created a dynamic, database-driven Web site for Blue Star Training, the company I work for. I took great pains to get it listed with all the best search engines. So I was shocked to learn that the "meat" of my Web site, the dynamic pages, wasn't indexed by the search engines' spiders.

The Problem with Spiders
Web spiders are systems designed to crawl through a Web site and index the content and its links. Many search engines launch a spider when a request is made to index a site. Most spiders will happily follow all links except those that contain a question mark. Did you catch that exception? That exception means that pages with URLs that pass parameters are not indexed. This is a serious problem for ColdFusion sites because dynamic URLs are the lifeblood of a dynamic site.

I Don't Know What's Worse, the Bugs or the Spiders
I was suffering from the exact problem stated above. I had designed a catalog of courses for our IIS-hosted ColdFusion Web site. It had two main pages: one that displayed short summaries of each course in the catalog and one that displayed the details of a given course.

The summary page had a link for each course that took you to the detail page. The link for a given course looked something like this:

<A HREF="CourseInfo.cfm?COURSE_ID=HT1">

Did you notice the question mark? The pages that I needed indexed the most, the course information pages, were not indexed by the spiders. The summary page was indexed, but that's not enough to ensure a good location in the search engine listings.

Ben Forta Offers a Solution
Like any smart ColdFusion developer, I went to the Developer Support Forums on Allaire's Web site to see if anyone had already solved my problem. Eureka! I found a solution written by Ben Forta (http://forums.allaire.com/DevConf/Index.cfm?Message_ID=18401).

Ben's solution was to replace the question mark with a forward slash and list any parameter values like this:

<A HREF="CourseInfo.cfm/HT1">

This would cause the spiders to see the URL as a static link and follow it. The next step would be to parse CGI.PATH_INFO to extract the value being passed.
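
A minimal sketch of that parsing step might look like the following. It is not taken from Ben Forta's post; the variable name CourseID is hypothetical, and the extra check assumes that some server configurations report the script name itself in CGI.PATH_INFO.

```cfml
<!--- Sketch only: pull the value out of a URL such as CourseInfo.cfm/HT1.
      CourseID is a hypothetical variable name. --->
<CFSET CourseID = ListLast(CGI.PATH_INFO, "/")>

<!--- Assumption: some servers put the script name in PATH_INFO when no
      extra path was supplied; treat that case as "no value passed". --->
<CFIF CourseID IS ListLast(CGI.SCRIPT_NAME, "/")>
    <CFSET CourseID = "">
</CFIF>
```

Taking the last "/"-delimited element gives the value whether or not the script name comes first in the path.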

I Still Have Problems
Ben's solution was good, and it worked, but it presented a few problems for me. The first problem was that my image tags like this

<IMG SRC="images/Logo.gif">

looked for the image in the path /CourseInfo.cfm/images/, which of course didn't exist. This behavior was caused by the browser mistaking CourseInfo.cfm for a directory. This was easily overcome by the use of <BASE HREF="http://www.bluestarcorp.com/">.

The second problem wasn't so easy. I had my IIS server set to check for the existence of every .cfm file before passing the page request to ColdFusion. I did this so I could catch "404 Not Found" errors and display a custom error page.

In case you didn't know it, IIS passes requests for all .cfm files to the ColdFusion server, and it's ColdFusion that displays the error if the file can't be found. The error message that ColdFusion displays is an ugly and unfriendly one that you can't modify. To prevent this you need to tell IIS to check whether the file exists before the request gets passed to ColdFusion, and let IIS route browsers to a custom error page.

Trapping the 404 Errors
Let me tell you how I trap the "404 Not Found" error, because it'll be important later. In the Microsoft Management Console for IIS, right-click your site in the left panel and select "Properties". Next, click the "Home Directory" tab and then the "Configuration" button (see Figure 1). Locate and select the .cfm application mapping in the list (see Figure 2). Click the "Edit" button to display the mapping properties. Finally, check the "Check that file exists" checkbox (see Figure 3) and click the "OK" button.

Now IIS will check for 404 errors, but you still need to tell IIS what page to display when a 404 error occurs. Back in the Web site properties, click the "Custom Errors" tab. Locate and select the 404-error entry in the list (see Figure 4) and click the "Edit Properties" button. In the "Message Type" drop-down list, select URL. In the "URL:" text input, type the URL for your custom 404-error page (see Figure 5). Finally, click the "OK" button to close the dialog box, and then click the "OK" button to close the properties window.

How I Found a Solution by Solving Another Problem
I wasn't having any luck finding a solution to the dynamic URL problem, so I began working on another problem I had. The problem was that I had converted www.bluestarcorp.com from HTML to CFML, and I had old entries in the search engine listings that still pointed to the old HTML pages. I still wanted to get the traffic from the old listings, but I didn't want to have to do a lot of work.

Helping 'Broken' URLs Find the Way Home
That's when the thought hit me: all traffic to my site that is for a nonexistent page goes through the custom 404-error page. Since I made my custom error page a ColdFusion template, I could check what page the browser was requesting and route it to the corresponding CFML page. If the browser requested calendar.html, I'd route them to calendar.cfm.

So I wrote a little code to test if the requested file name ended in .html (see Listing 1). If it did, I replaced the .html with .cfm and used <CFLOCATION> to take the browser to the correct page; if it did not, I displayed the error message.
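
A check along these lines might look like the following sketch, which is not the original listing. It assumes that IIS calls the custom error URL with a query string of the form "404;http://www.bluestarcorp.com/calendar.html"; the variable names RequestedURL and NewURL are hypothetical.

```cfml
<!--- Assumption: CGI.QUERY_STRING looks like "404;http://host/calendar.html".
      Drop the leading "404" element to get the originally requested URL. --->
<CFSET RequestedURL = ListRest(CGI.QUERY_STRING, ";")>

<CFIF Len(RequestedURL) GT 5 AND Right(RequestedURL, 5) IS ".html">
    <!--- Swap the .html extension for .cfm and send the browser there --->
    <CFSET NewURL = Left(RequestedURL, Len(RequestedURL) - 5) & ".cfm">
    <CFLOCATION URL="#NewURL#" ADDTOKEN="No">
<CFELSE>
    <!--- Anything else gets the friendly error message --->
    <H1>Page Not Found</H1>
    <P>Sorry, the page you requested could not be found.</P>
</CFIF>
```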

You might be wondering, What if a request is made for an HTML file that doesn't have a corresponding CFM file? Well, in that case the code would route the browser to a nonexistent CFM page and would end up back on the 404-error page and finally get the 404-error message. I was examining my 404 handling code one day when the second idea hit me.

The Plan of Attack
I reasoned that if I could parse the URL and check for .html from the 404-error page, I could parse the URL for anything. Here's the idea: create a link to a page that doesn't exist, then parse the URL to get to the desired page and pass any parameters. The transformed URL would look like:

<A HREF="DYN_HT1.cfm">

Once more I wrote a little code, this time to check whether the requested file name started with "DYN_" (see Listing 2). This is my signal that the URL is not real but must be parsed. To parse out the parameter (in this case my course code), I locate the first character after the "DYN_" marker and the last character before the .cfm extension and store all characters between these points in a local variable. Following the example so far, that would mean that I would store HT1 (my course code) in that local variable.
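
The parsing described above could be sketched as follows. This is an illustration, not the original listing: it assumes the same "404;http://host/DYN_HT1.cfm" query-string format from IIS, and RequestedFile and CourseID are hypothetical variable names.

```cfml
<!--- Assumption: IIS passes the original URL as "404;http://host/DYN_HT1.cfm".
      Keep only the file name, e.g. "DYN_HT1.cfm". --->
<CFSET RequestedFile = ListLast(ListRest(CGI.QUERY_STRING, ";"), "/")>
<CFSET CourseID = "">

<!--- "DYN_" is 4 characters and ".cfm" is 4 characters, so a usable
      file name must be longer than 8 characters in total. --->
<CFIF Len(RequestedFile) GT 8
      AND Left(RequestedFile, 4) IS "DYN_"
      AND Right(RequestedFile, 4) IS ".cfm">
    <!--- Everything between the marker and the extension: "HT1" --->
    <CFSET CourseID = Mid(RequestedFile, 5, Len(RequestedFile) - 8)>
</CFIF>
```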

Now the plan was to use <CFLOCATION> to route the browser to the appropriate page along with the question mark and the parameters. Well, it worked, but the spiders don't look just at the original link address; they also look at the address reported in the resulting page's HTTP header. Because I used a <CFLOCATION>, the URL in the HTTP header would be "CourseInfo.cfm?COURSE_ID=HT1". The spider once again refused to index my page.

What the Spider Doesn't Know....
I was at the end of my rope when I remembered one of ColdFusion's greatest strengths, the <CFINCLUDE> tag.

My thinking was, if I CFINCLUDE the contents of the desired page (the details page in this case) into the 404-error page, the spider would never know it. Now you need to know that when the 404-error page is sent back to the browser, the originally requested URL is the one written to the HTTP headers. That means that the spider will see the desired page but will "think" it's coming from the static URL.

One Last Hurdle
I went back and modified my code to CFINCLUDE the "CourseInfo.cfm" template along with the desired course ID (see Listing 3). Guess what? You guessed it: it didn't work. CFINCLUDE will not take URL parameters. This makes sense, because the template in question is not being requested as a page of its own but is being merged into the 404-error template.

I had come too far to let this stop me. I knew that you could create form variables on the fly. So, I reasoned, why not URL variables?

I went back yet again and modified my code, this time to create a URL variable and then CFINCLUDE the "CourseInfo.cfm" template (see Listing 4).
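
In outline, the working version comes down to two tags. The sketch below is not the original listing: it uses a fixed stand-in for the parsed course code and assumes CourseInfo.cfm sits in the same directory as the 404-error template.

```cfml
<!--- Stand-in value; in the real handler this comes from parsing
      the requested file name (e.g. "DYN_HT1.cfm" gives "HT1"). --->
<CFSET CourseID = "HT1">

<!--- Create the URL variable that CourseInfo.cfm expects to read --->
<CFSET URL.COURSE_ID = CourseID>

<!--- Merge the detail page into the 404-error template. No redirect is
      sent, so the browser or spider never sees a second URL. --->
<CFINCLUDE TEMPLATE="CourseInfo.cfm">
```

Because the value comes straight from the requested URL, the included template should validate it before using it in a database query.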

Eureka once again! It worked! I could now use the URL "DYN_HT1.cfm" and get to the same place as if I had used "CourseInfo.cfm?COURSE_ID=HT1".

I went back and modified my summary page to generate links in the DYN_COURSEID format for each course as it was read from my database. I tested several of my links. Then, with a little bit of trepidation, I submitted the catalog summary page to several search engines.
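
Generating those links on the summary page might look something like the sketch below. The datasource, table, and column names (CourseCatalog, Courses, TITLE, SUMMARY) are hypothetical; only COURSE_ID appears in the article.

```cfml
<!--- Hypothetical datasource, table, and column names --->
<CFQUERY NAME="GetCourses" DATASOURCE="CourseCatalog">
    SELECT COURSE_ID, TITLE, SUMMARY
    FROM Courses
    ORDER BY TITLE
</CFQUERY>

<!--- One spider-friendly link per course, e.g. DYN_HT1.cfm --->
<CFOUTPUT QUERY="GetCourses">
    <P>
        <A HREF="DYN_#COURSE_ID#.cfm">#HTMLEditFormat(TITLE)#</A><BR>
        #HTMLEditFormat(SUMMARY)#
    </P>
</CFOUTPUT>
```

This assumes the course codes contain only characters that are safe in a file name.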

I'm happy to say that my pages were indexed by every search engine I submitted them to. The spiders never knew that the pages they indexed were dynamically generated.

How It All Works
Let's recap how this technique, which I call "Faking a URL," works (see Figure 6). The page request comes from the browser in the form "DYN_HT1.cfm". IIS tests whether the file "DYN_HT1.cfm" exists, which it doesn't. Because it doesn't, IIS routes the browser to the 404-error page. The page "sees" that the file name starts with "DYN_", parses out the course ID parameter (HT1 in this case), and stores it in the URL variable URL.COURSE_ID. The 404-error page code then CFINCLUDEs the CourseInfo.cfm template. The CourseInfo.cfm template code reads the value in URL.COURSE_ID and displays the correct content for the HT1 course.

The Right Tool for the Right Job
You shouldn't use the "fake URL" technique to trick the spider for just any dynamic page. It's best suited for dynamic pages that display mostly database output, like my course catalog. The catalog information doesn't change very often, but often enough that I wouldn't want to "hard-code" my pages. If I make major additions or changes to my course catalog, the spider will pick them up on the next visit.

Keep in mind that you still have the flexibility of linking to the dynamic page with the old method via a URL like "CourseInfo.cfm?COURSE_ID=HT1".

This was a huge timesaver for us because we have links to our courses all over the Web site. But we only had to change the links on the catalog page to account for the spiders. We don't care that all the other dynamic links are ignored, because to spider them would be redundant.

Extending the 'Fake URL' Technique
You can add additional blocks of code to the 404-error template to process different prefixes. We have a second prefix, "BIO_", that we use for instructors. You can add as many prefixes as you need.
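
A handler with two prefixes could be sketched like this. It is an illustration only: it assumes IIS passes the original URL as "404;http://host/BIO_123.cfm", and InstructorBio.cfm and INSTRUCTOR_ID are hypothetical names for the instructor page and its parameter.

```cfml
<!--- Assumption: IIS passes the original URL as "404;http://host/file.cfm" --->
<CFSET RequestedFile = ListLast(ListRest(CGI.QUERY_STRING, ";"), "/")>

<!--- Split e.g. "DYN_HT1.cfm" into the prefix "DYN_" and the value "HT1" --->
<CFSET Prefix = "">
<CFSET ParamValue = "">
<CFIF Len(RequestedFile) GT 8 AND Right(RequestedFile, 4) IS ".cfm">
    <CFSET Prefix = Left(RequestedFile, 4)>
    <CFSET ParamValue = Mid(RequestedFile, 5, Len(RequestedFile) - 8)>
</CFIF>

<CFIF Prefix IS "DYN_">
    <CFSET URL.COURSE_ID = ParamValue>
    <CFINCLUDE TEMPLATE="CourseInfo.cfm">
<CFELSEIF Prefix IS "BIO_">
    <!--- InstructorBio.cfm and INSTRUCTOR_ID are hypothetical names --->
    <CFSET URL.INSTRUCTOR_ID = ParamValue>
    <CFINCLUDE TEMPLATE="InstructorBio.cfm">
<CFELSE>
    <!--- No known prefix: show the normal error message --->
    <H1>Page Not Found</H1>
</CFIF>
```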

You can also extend the code to accept multiple parameters. Thus, if you wanted to have one page for sale items costing less than $10, you could use a URL that looks like "ITM_CATEGORYID-10.cfm" to display the list of "bargain items". The best way would be to treat the parameter list like a ColdFusion list, with a dash as the delimiter, and then use CFLOOP to "break out" each of the parameters for use.
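
One way to break out a dash-delimited parameter list is sketched below. The parameter names CATEGORY_ID and MAX_PRICE, the sample value "BOOKS-10", and the template ItemList.cfm are all hypothetical.

```cfml
<!--- Stand-in for the text between "ITM_" and ".cfm",
      e.g. from a request for ITM_BOOKS-10.cfm --->
<CFSET ParamList = "BOOKS-10">

<!--- Hypothetical parameter names, in the order the values appear --->
<CFSET ParamNames = "CATEGORY_ID,MAX_PRICE">

<!--- Create one URL variable per value: URL.CATEGORY_ID, URL.MAX_PRICE --->
<CFLOOP INDEX="i" FROM="1" TO="#ListLen(ParamNames)#">
    <CFIF i LTE ListLen(ParamList, "-")>
        <CFSET Temp = SetVariable("URL.#ListGetAt(ParamNames, i)#",
                                  ListGetAt(ParamList, i, "-"))>
    </CFIF>
</CFLOOP>

<CFINCLUDE TEMPLATE="ItemList.cfm">
```

This only works if the values themselves never contain a dash; otherwise a different delimiter would be needed.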

Use your imagination and see what you can come up with. If you find an interesting way to extend the fake URL system, I'd love to hear from you.

More Stories By John Morgan

John Morgan writes courseware at Blue Star Training when he's not busy training programmers, Web developers and database developers. He also speaks at conferences, workshops and the San Diego ColdFusion Users Group, which he hosts.
