Back to Basics

I read somewhere that when faced with a task that takes one hour to do manually, or one hour to automate, a good programmer will choose to automate the process. As ColdFusion developers, we often face this decision when we need to programmatically use data contained in a text file.

There are two ways to access this data - automating the process by parsing the text file or manually inputting the data. This article presents some basic techniques that can help you make the choice to be a "good programmer" by automating the process.

Bulk upload and data import are the two primary types of functionality in which text file parsing is used in Web application development. The main difference between these two is the format of the text file being parsed. Bulk upload functionality parses text files designed specifically to be processed programmatically, while data import functionality tries to parse files created for human consumption. To handle these differences, slightly different techniques are required.

Bulk Upload
Figure 1 shows an example of a file created for bulk upload. It contains a listing of new employees who need to be added to an employee database. Because this file was created specifically for computer processing, it was created using a table structure. The first row contains the column names, and each subsequent row contains one employee record with name, phone, and e-mail address columns.

The table structure makes parsing this file pretty easy with either the cfloop tag or the cfhttp tag. To use the cfloop tag, first read the file into a variable (str_Content in the example below) using the cffile tag and then loop over the content using the line delimiter as the list delimiter (carriage return/line feed - chr(13)chr(10) - on Microsoft Windows and line feed - chr(10) - on Unix). ColdFusion treats each character in the delimiters attribute as a separate delimiter, so the value used below splits on either character and handles both line-ending styles. Use the listGetAt() function to access specific fields in each line.

<cfset bln_FirstLine = true>
<cfloop list="#str_Content#"
        delimiters="#chr(13)##chr(10)#"
        index="str_Line">

    <!--- Ignore the column name line --->
    <cfif bln_FirstLine is false>
        <cfset str_Name = listGetAt(str_Line, 1, ",")>
        <cfset str_Phone = listGetAt(str_Line, 2, ",")>
        <cfset str_Email = listGetAt(str_Line, 3, ",")>
    <cfelse>
        <cfset bln_FirstLine = false>
    </cfif>
</cfloop>
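
The loop above assumes str_Content has already been filled. A minimal sketch of that step might look like the following; the file name and location are hypothetical, so substitute wherever your upload actually lands.

```cfml
<!--- Hypothetical path: a file named bulkupload.txt next to the current template --->
<cfset str_FilePath = expandPath("./bulkupload.txt")>

<!--- Read the whole file into str_Content, the variable the cfloop consumes --->
<cffile action="read"
        file="#str_FilePath#"
        variable="str_Content">
```

Reading the entire file into one variable is fine for small uploads; for very large files you would probably want a different approach.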

For rigidly formatted documents accessible via HTTP, the cfhttp tag makes parsing the document even easier. Given the right parameters, this tag will automatically parse the content and return a query object containing the results. The following example produces the output shown in Figure 2:

<cfhttp method="GET"
url="#dir_URL#/bulkupdate.txt"
name="qry_Contents"
delimiter=","
textQualifier="">
</cfhttp>

<cfdump var="#qry_Contents#">

Data Import
Figure 3 shows an example of a file that was not created to be used programmatically. It is an employee directory designed to be viewed by humans. Parsing files like this one is more complicated because each line cannot be processed in the same way. The logic must first determine what information is contained in the current line and then process the line accordingly. One way to do this is to track the current line type with a variable. After each line is processed, you should be able to infer the type of line that will be processed next, based on the current line type, and set the variable accordingly.

For example, when processing the line containing the employee's name, you know the next line will contain the employee's phone number. Therefore, after processing the name line, you set the line type equal to "phone" and loop. On the next loop, the appropriate logic processes the phone line, sets the line type back to "name," and the process repeats.

<!--- The first line contains the employee's name --->
<cfset str_LineType = "name">

<cfloop list="#str_Content#"
        delimiters="#chr(13)##chr(10)#"
        index="str_Line">

    <cfif str_LineType is "name">
        <cfset str_Name = str_Line>
        <cfset str_LineType = "phone">
    <cfelseif str_LineType is "phone">
        <cfset str_Phone = listGetAt(str_Line, 2, ": ")>
        <cfset str_LineType = "name">
    </cfif>
</cfloop>

CFML Parsing Functionality Limitations
As the examples above show, ColdFusion has several powerful built-in functions and tags that you can use for parsing, such as listGetAt(), cfhttp, and cfloop. Unfortunately, these functions share a common limitation: they treat consecutive delimiters as one delimiter. For example, the built-in ColdFusion functions consider the line "Jim Doe,,jim@acompany.com" to have only two tokens ("Jim Doe" and "jim@acompany.com"), even though the strings are separated by two commas.

This was not a problem in the above examples because there were no empty fields in the data being processed. If the data did have empty fields, however, the example code would process it incorrectly. Consider the result of parsing the line "Jim Doe,,jim@acompany.com" with the bulk upload code in the first example. Since the listGetAt function treats consecutive delimiters as one delimiter, it sees only two elements in the line. The code would set str_Phone equal to "jim@acompany.com", and the request for the third element would throw an error instead of returning the e-mail address. Obviously, this is incorrect.
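
The behavior is easy to confirm with a short test page. The sketch below uses only the sample line from the text; the variable names are just for illustration.

```cfml
<cfset str_Line = "Jim Doe,,jim@acompany.com">

<cfoutput>
    <!--- The empty phone field is skipped, so only two elements are counted --->
    Element count: #listLen(str_Line, ",")#<br>
    <!--- Position 2 holds the e-mail address, not the empty phone number --->
    Second element: #listGetAt(str_Line, 2, ",")#<br>
</cfoutput>

<cftry>
    <!--- There is no third element as far as the list functions are concerned --->
    <cfset str_Email = listGetAt(str_Line, 3, ",")>
    <cfcatch type="any">
        <cfoutput>No third element: #cfcatch.message#</cfoutput>
    </cfcatch>
</cftry>
```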

My Solution
I have developed a ColdFusion tag to address this shortcoming: TextParse.cfm (Listing 1). The TextParse tag treats consecutive delimiters as delimiters surrounding an empty string. Therefore, it considers "Jim Doe,,jim@acompany.com" as having three tokens - "Jim Doe", "", and "jim@acompany.com" - rather than two.

The code implementing the TextParse tag is relatively straightforward. From a high level, it simply loops over the delimiters in the content, and extracts the strings that fall between the delimiters. This logic is actually done twice, once in the tag's body and once in the function TokenizeLine(). The tag body breaks the content into separate lines and TokenizeLine() then breaks each line into tokens.
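
To give a feel for that delimiter-walking logic, here is a rough sketch of one way to split a line while keeping empty tokens. It is not the code from Listing 1: the function name splitKeepEmpty and its internals are hypothetical and only illustrate the general idea.

```cfml
<cfscript>
// Hypothetical helper, not the author's TokenizeLine() from Listing 1.
// Splits str_Line on str_Delim and keeps empty tokens.
function splitKeepEmpty(str_Line, str_Delim) {
    var ar_Tokens = arrayNew(1);
    var int_Start = 1;
    var int_Pos = find(str_Delim, str_Line, int_Start);

    while (int_Pos gt 0) {
        if (int_Pos gt int_Start) {
            // Text between the previous delimiter and this one
            arrayAppend(ar_Tokens, mid(str_Line, int_Start, int_Pos - int_Start));
        } else {
            // Two delimiters in a row: record an empty token
            arrayAppend(ar_Tokens, "");
        }
        int_Start = int_Pos + len(str_Delim);
        int_Pos = find(str_Delim, str_Line, int_Start);
    }

    // Whatever follows the last delimiter is the final token
    if (int_Start lte len(str_Line)) {
        arrayAppend(ar_Tokens, mid(str_Line, int_Start, len(str_Line) - int_Start + 1));
    } else {
        arrayAppend(ar_Tokens, "");
    }
    return ar_Tokens;
}
</cfscript>

<cfset ar_Tokens = splitKeepEmpty("Jim Doe,,jim@acompany.com", ",")>
<cfdump var="#ar_Tokens#">
```

With the sample line, the dump should show three elements, the second of which is an empty string. Later ColdFusion releases (8 and up, as far as I know) also let listToArray() keep empty fields through an optional third argument, which may remove the need for a hand-written splitter.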

Like cfloop, the TextParse tag is used by placing the code to process each line between its start and end tags. The tag takes three attributes: str_Filename, str_LineDelimiter, and str_TokenDelimiter. The str_Filename attribute expects the full path of the file to be parsed. str_LineDelimiter specifies the delimiter that separates the lines in the file (by default "#chr(13)##chr(10)#"), and str_TokenDelimiter specifies the delimiter that separates the tokens in each line (by default ","). The TextParse tag returns two variables to the caller scope: TextParse.str_Line and TextParse.ar_Tokens. TextParse.str_Line is a string containing the complete current line and TextParse.ar_Tokens is an array containing the tokens of the current line.
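
As an illustration of the two optional attributes, the sketch below points the tag at a tab-delimited file with Unix line endings. The file path is hypothetical, and the snippet assumes TextParse.cfm is installed as a custom tag and behaves as described above.

```cfml
<!--- Hypothetical tab-delimited file with line feed line endings --->
<cf_TextParse
    str_Filename="/data/employees.tsv"
    str_LineDelimiter="#chr(10)#"
    str_TokenDelimiter="#chr(9)#">

    <cfoutput>
        #arrayLen(TextParse.ar_Tokens)# tokens in: #htmlEditFormat(TextParse.str_Line)#<br>
    </cfoutput>
</cf_TextParse>
```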

The following example demonstrates the use of the TextParse tag to parse BulkUpload.txt. Because text files often have inconsistent formats, I wrapped the token array accesses in a cftry/cfcatch statement. Without the error handling, the example code would generate an error when processing a line with fewer than expected fields. Although this seems to complicate things, it's good to know when a line doesn't have the right format, since your parsing logic might otherwise handle it incorrectly. The error handling allows you to flag the line for later examination, and continue processing.

<cf_TextParse str_Filename="c:\BulkUpload.txt">

    <cftry>
        <cfif TextParse.ar_Tokens[1] is not "Name">
            <cfset str_Name = TextParse.ar_Tokens[1]>
            <cfset str_Phone = TextParse.ar_Tokens[2]>
            <cfset str_Email = TextParse.ar_Tokens[3]>
        </cfif>

        <cfcatch>
            <!--- Handle error --->
        </cfcatch>
    </cftry>
</cf_TextParse>
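
One possible way to fill in the error handler is to collect the offending lines and review them after the run. The ar_BadLines array is a hypothetical name introduced for this sketch; everything else follows the example above.

```cfml
<!--- Hypothetical holder for lines that could not be processed --->
<cfset ar_BadLines = arrayNew(1)>

<cf_TextParse str_Filename="c:\BulkUpload.txt">
    <cftry>
        <cfif TextParse.ar_Tokens[1] is not "Name">
            <cfset str_Name = TextParse.ar_Tokens[1]>
            <cfset str_Phone = TextParse.ar_Tokens[2]>
            <cfset str_Email = TextParse.ar_Tokens[3]>
        </cfif>

        <cfcatch type="any">
            <!--- Keep the raw line so it can be examined later --->
            <cfset arrayAppend(ar_BadLines, TextParse.str_Line)>
        </cfcatch>
    </cftry>
</cf_TextParse>

<cfoutput>#arrayLen(ar_BadLines)# line(s) need review.</cfoutput>
```

Storing the raw line rather than the token array keeps the flagged record exactly as it appeared in the file, which tends to make manual review easier.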

Conclusion
ColdFusion developers can use text file parsing to import data meant for human consumption and to allow Web site users to make bulk uploads. Using text file parsing effectively requires knowledge of basic techniques as well as the limitations of ColdFusion's parsing functionality. Armed with the techniques presented above and the TextParse tag, the decision to automate data import and bulk upload processes should be an easier one. Having made the choice to automate, you can then be confident in your status as a "good programmer."

About Christian Thompson

Christian Thompson is a certified advanced Macromedia ColdFusion MX developer. He is a senior software engineer for Inserso, a technology consulting firm headquartered in Annandale, VA, where he has specialized in ColdFusion application development for over two years.

Most Recent Comments
DeUndre' Rushon 04/16/08 05:01:14 PM EDT

In the code below:

<cfset bln_FirstLine = true>

<!--- Ignore the column name
line --->
<cfif bln_FirstLine is false>
<cfset str_Name =
listGetAt(str_Line, 1, ",")>
<cfset str_Phone =
listGetAt(str_Line, 2, ",")>
<cfset str_Email =
listGetAt(str_Line, 3, ",")>

<cfset bln_FirstLine = false>

Is there a possibility that the "index" variable within the cfloop might not be utilized within other functions used inside the tag?