Welcome!

ColdFusion Authors: Yakov Fain, Pat Romanski, Liz McMillan, Maureen O'Gara, Greg Ness

Related Topics: ColdFusion

ColdFusion: Article

Thinking Outside the Table PART 1

Bulk inserts

A two-part series looks at techniques for shifting workload away from the application server onto the database by using "extra" database tables.

Most ColdFusion programmers understand that when it comes to bulk inserting into a database, it isn't good practice to loop over text files one line at a time with a <cfloop>. However, when faced with the realities of a data feed that needs preprocessing, comparing to existing data, and then postprocessing, good intentions sometimes fall by the wayside.

The theory is simple - if you have a large amount of data (typically tabbed or comma separated text) to import into your database, the best way to insert it is in a single import - either directly from SQL (with a bulk insert command for SQL Server), or else via an import utility (such as DTS for SQL Server).

In practice, it's a little more difficult than that, and far too often import routines are constructed in ColdFusion that read the complete text file, loop over it line by line, validate the data and format, and then either update an existing record or insert it into the database.

This technique is inefficient, slow, and prone to "falling over," but it's a technique that is used time and again, usually because of one of three types of problems:

1.  The imported data is incorrectly formatted and needs preprocessing: For instance, you may be importing a feed where the country names in addresses are free text and not ISO codes, or values such as "England" or "Hawaii." Often these errors are consistent and can be fixed with ColdFusion but can't be imported directly into a table with a foreign key constraint requiring strict two-character country codes.
2.  Existing records need updating and new records need adding (the data needs to be "matched and added"): Matching and adding often take place where data is supplied from an external system (e.g., user data). Existing users may have passwords or other data unique to the Web database that you don't want to overwrite. You could fudge this with one-to-one joins to separate tables (assuming you have a unique key to do it), but the neater way is to update existing data and add new data. A third stage in the process is often a "delete" - the removal of data no longer present in the feed. Ideally this stage should really be a "change status to expired," but "match, add, and delete" is easier on the tongue.

With a line-by-line import, the usual way of determining whether a record is already in the system is to query the database - meaning that for every line in the import you end up with two queries. It can be the most inefficient part of the whole process.
3.  Business rules need to be applied to the data (postprocessing): The final issue is the need to apply business rules after an import. Typically requiring large doses of flexible ColdFusion code, business rules are difficult to apply to data imported in a batch.

Examples of business rules that I've come across that were cited as preventing batch import include allocating free magazine subscriptions to new subscribers of a different magazine, and e-mailing all subscribers to a magazine telling them they had free access to a Web site. Both of these were far easier to achieve in ColdFusion than pure SQL, and still are, even though the import process itself now runs as a batch.

The Extra Table
All these issues can be dealt with without resorting to line-by-line imports by using a very simple technique - the addition of a separate import table to hold the data while it's processed.

Instead of importing data directly into its intended destination table, it's imported into a dedicated import table that's a copy of the destination table without any constraints, keys, triggers, or anything else that might prevent a bulk import.

Once the data has been imported, it's then processed to ensure it conforms to all the constraints on the destination table. The data can either be fixed if there are consistent structural problems with the feed, or else removed from the import process (depending on the situation you might simply delete records or, at the other end of the scale you might have a third table to store incorrect records for offline fixing).

After the preprocessing, the process of updating the live data can begin - the process of "match, add, and delete". The crucial difference between doing this with an import table and with line-by-line is that with an import table you can use a single SQL command to match all the existing records, a single command to add all the new records, and a third single command to delete (or change the status of) any records no longer in the feed.

Importing the Data
There are various techniques you can use for importing the data, and this article doesn't cover them. BULK INSERT in SQL Server works well with tabbed text. Running DTS via SQL commands was covered in CFDJ in September and October 2003, and is needed if you have field qualifiers in your feed (e.g., ""s around the fields). Whatever technique you choose, ensure that it imports the complete data set into an import table for preprocessing.

Checking for Errors
The next stage in the process is to check for data integrity. This might mean applying rules such as requiring e-mail addresses to be in the format %@%.%, or it might consist of "normalizing" values such as country names or categories. For instance, data may be imported with a category of "News," which needs to be translated to a numeric key value from a related table.

All this can be done with either SQL or ColdFusion. Either way, the key is to write SQL that acts on sets of data rather than one line at a time. What you do with records that contain errors depends on the needs of your application. You can delete the records or copy them to another text file. Whether you're attempting the process in SQL server or ColdFusion, the ideal method is to have some way of marking them in the database.

Either use a status field in the import table or else (as I prefer to do) have a separate error table: a copy of the import table with extra fields to indicate the type of error and the date it occurred.

The exact details of this stage will depend on your requirements, but the principle is always the same - after this process, all data in the import table is ready for inserting into or updating the main table.

Matching
The match part of the process is the easiest and roughly the same for all RDMSs. Simply update all necessary fields where the existing data matches the import data:


UPDATE  table1
SET     field1 = import_table.field1,
      ... fieldn = import_table.fieldn
FROM    table1, import_table
WHERE   table1.keyfield = import_table.keyfield

The key field doesn't necessarily have to be the primary key of the main table - it just has to be a unique set of data that can be used to identify the existing data. If you're importing user data, this is often the e-mail address. There should be a unique constraint on this field in the main table if it isn't the primary key (which is unlikely).

Adding New Records
Adding new data is not so simple. By far the easiest method (and best if you're using MySQL) is simply to delete all the matched records from the previous step and then insert any remaining records. The alternative is to use some advanced SQL to select the records that aren't already in the destination field.

If at all possible, avoid using a subquery such as:


INSERT INTO     table1 (field1,field2)
table1 (field1,field2)
SELECT          I.field1,I.field2
FROM            import_table I
WHERE			import_table.keyfield
NOT IN 		(SELECT keyfield FROM table1)

Although this is easy to follow, it's very inefficient and isn't suitable for large data sets. The most efficient solution is to use a join, for example:


INSERT INTO     table1 (field1,field2)
SELECT          I.field1,I.field2
FROM            import_table I
LEFT OUTER JOIN table1
ON              I.keyfield = table1.keyfield
WHERE           table1.keyfield IS NULL

The SELECT part of this statement returns all rows from the import table where there is no corresponding record in the destination table. This can be a little confusing, and if your RDMS supports VIEWS, it can sometimes be easier to do the outer join in the view and then query that.

In the following SQL, the keyfield from the destination table has been given an alias of "Existing_keyfield" to differentiate it from the keyfield in the import table. Existing_keyfield would then be NULL for new records.


INSERT INTO tablename (field1,...fieldn)
SELECT      field1,...fieldn
FROM        import_view
WHERE       import_view.existing_keyfield is NULL

This View can also be useful if you need to do any lookups while importing (i.e., to get ID numbers instead of name values from related tables).

If your RDMS doesn't support these techniques, you can simply delete all the matched records from the previous step before doing an insert.

Deleting Expired Records
The final stage of a complete import is the "delete." Of course as all good programmers know, it's best to never actually delete anything, instead you simply change its status. Using the reverse of the outer join used for the import, you can see which records in your system are no longer in the feed and update them accordingly. In SQL server, it's easy:


UPDATE           table1
/* your logic here to mark as expired or deleted */
SET              table1.status_id = -1 
FROM             import_table I
RIGHT OUTER JOIN table1
ON               I.keyfield = table1.keyfield
WHERE            I.keyfield IS NULL

If you don't like that syntax, you can again set up a view, or if you have small data sets you could use a subselect to get the keys of the records to be deleted. If you're using MySQL, you may find it easier to use ColdFusion to simulate a subselect by running one query to get the keys of the records to delete, and by using the ValueList() function to create an IN () clause.

Advanced SQL
No matter how complicated the import you need to do, there's usually a simple way to do it using an extra table of some description. One of the more complicated I've worked on involved users being imported into a database, and for every new user, a user record needed to be added and then another record entered into a table of subscriptions using the primary key generated by the insert. I did this using a SQL Server temporary table - populating it with the unique import key of any new records and then joining it to the users' table when it came to inserting the subscriptions. Another way would have been to create a UUID field in the import table and the users' table and join that.

Many developers have similar tales of long, slow, line-by-line procedures that they've eliminated using the simple concept of an extra table - either physical or temporary. In the second part of this series I'll look at another technique that also uses an extra table - reducing server load by storing search results in the database itself.

More Stories By Tom Peer

Tom Peer has been in electronic publishing of one sort or another for ten years, including a stint as manager of New Scientist Online (www.newscientist.com). He specializes in taking printed publications online and has recently completed the online edition of The World Handbook of Stock Exchanges (www.exchange-handbook.com).

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.