| By Tom Peer | Article Rating: |
|
| November 24, 2004 12:00 AM EST | Reads: |
11,045 |
A two-part series looks at techniques for shifting workload away from the application server onto the database by using "extra" database tables.
Most ColdFusion programmers understand that when it comes to bulk inserting into a database, it isn't good practice to loop over text files one line at a time with a <cfloop>. However, when faced with the realities of a data feed that needs preprocessing, comparing to existing data, and then postprocessing, good intentions sometimes fall by the wayside.
The theory is simple - if you have a large amount of data (typically tabbed or comma separated text) to import into your database, the best way to insert it is in a single import - either directly from SQL (with a bulk insert command for SQL Server), or else via an import utility (such as DTS for SQL Server).
In practice, it's a little more difficult than that, and far too often import routines are constructed in ColdFusion that read the complete text file, loop over it line by line, validate the data and format, and then either update an existing record or insert it into the database.
This technique is inefficient, slow, and prone to "falling over," but it's a technique that is used time and again, usually because of one of three types of problems:
1. The imported data is incorrectly formatted and needs preprocessing: For instance, you may be importing a feed where the country names in addresses are free text and not ISO codes, or values such as "England" or "Hawaii." Often these errors are consistent and can be fixed with ColdFusion but can't be imported directly into a table with a foreign key constraint requiring strict two-character country codes.
2. Existing records need updating and new records need adding (the data needs to be "matched and added"): Matching and adding often take place where data is supplied from an external system (e.g., user data). Existing users may have passwords or other data unique to the Web database that you don't want to overwrite. You could fudge this with one-to-one joins to separate tables (assuming you have a unique key to do it), but the neater way is to update existing data and add new data. A third stage in the process is often a "delete" - the removal of data no longer present in the feed. Ideally this stage should really be a "change status to expired," but "match, add, and delete" is easier on the tongue.
With a line-by-line import, the usual way of determining whether a record is already in the system is to query the database - meaning that for every line in the import you end up with two queries. It can be the most inefficient part of the whole process.
3. Business rules need to be applied to the data (postprocessing): The final issue is the need to apply business rules after an import. Typically requiring large doses of flexible ColdFusion code, business rules are difficult to apply to data imported in a batch.
Examples of business rules that I've come across that were cited as preventing batch import include allocating free magazine subscriptions to new subscribers of a different magazine, and e-mailing all subscribers to a magazine telling them they had free access to a Web site. Both of these were far easier to achieve in ColdFusion than pure SQL, and still are, even though the import process itself now runs as a batch.
The Extra Table
All these issues can be dealt with without resorting to line-by-line imports by using a very simple technique - the addition of a separate import table to hold the data while it's processed.
Instead of importing data directly into its intended destination table, it's imported into a dedicated import table that's a copy of the destination table without any constraints, keys, triggers, or anything else that might prevent a bulk import.
Once the data has been imported, it's then processed to ensure it conforms to all the constraints on the destination table. The data can either be fixed if there are consistent structural problems with the feed, or else removed from the import process (depending on the situation you might simply delete records or, at the other end of the scale you might have a third table to store incorrect records for offline fixing).
After the preprocessing, the process of updating the live data can begin - the process of "match, add, and delete". The crucial difference between doing this with an import table and with line-by-line is that with an import table you can use a single SQL command to match all the existing records, a single command to add all the new records, and a third single command to delete (or change the status of) any records no longer in the feed.
Importing the Data
There are various techniques you can use for importing the data, and this article doesn't cover them. BULK INSERT in SQL Server works well with tabbed text. Running DTS via SQL commands was covered in CFDJ in September and October 2003, and is needed if you have field qualifiers in your feed (e.g., ""s around the fields). Whatever technique you choose, ensure that it imports the complete data set into an import table for preprocessing.
Checking for Errors
The next stage in the process is to check for data integrity. This might mean applying rules such as requiring e-mail addresses to be in the format %@%.%, or it might consist of "normalizing" values such as country names or categories. For instance, data may be imported with a category of "News," which needs to be translated to a numeric key value from a related table.
All this can be done with either SQL or ColdFusion. Either way, the key is to write SQL that acts on sets of data rather than one line at a time. What you do with records that contain errors depends on the needs of your application. You can delete the records or copy them to another text file. Whether you're attempting the process in SQL server or ColdFusion, the ideal method is to have some way of marking them in the database.
Either use a status field in the import table or else (as I prefer to do) have a separate error table: a copy of the import table with extra fields to indicate the type of error and the date it occurred.
The exact details of this stage will depend on your requirements, but the principle is always the same - after this process, all data in the import table is ready for inserting into or updating the main table.
Matching
The match part of the process is the easiest and roughly the same for all RDMSs. Simply update all necessary fields where the existing data matches the import data:
UPDATE table1
SET field1 = import_table.field1,
... fieldn = import_table.fieldn
FROM table1, import_table
WHERE table1.keyfield = import_table.keyfield
The key field doesn't necessarily have to be the primary key of the main table - it just has to be a unique set of data that can be used to identify the existing data. If you're importing user data, this is often the e-mail address. There should be a unique constraint on this field in the main table if it isn't the primary key (which is unlikely).
Adding New Records
Adding new data is not so simple. By far the easiest method (and best if you're using MySQL) is simply to delete all the matched records from the previous step and then insert any remaining records. The alternative is to use some advanced SQL to select the records that aren't already in the destination field.
If at all possible, avoid using a subquery such as:
INSERT INTO table1 (field1,field2) table1 (field1,field2) SELECT I.field1,I.field2 FROM import_table I WHERE import_table.keyfield NOT IN (SELECT keyfield FROM table1)
Although this is easy to follow, it's very inefficient and isn't suitable for large data sets. The most efficient solution is to use a join, for example:
INSERT INTO table1 (field1,field2) SELECT I.field1,I.field2 FROM import_table I LEFT OUTER JOIN table1 ON I.keyfield = table1.keyfield WHERE table1.keyfield IS NULL
The SELECT part of this statement returns all rows from the import table where there is no corresponding record in the destination table. This can be a little confusing, and if your RDMS supports VIEWS, it can sometimes be easier to do the outer join in the view and then query that.
In the following SQL, the keyfield from the destination table has been given an alias of "Existing_keyfield" to differentiate it from the keyfield in the import table. Existing_keyfield would then be NULL for new records.
INSERT INTO tablename (field1,...fieldn) SELECT field1,...fieldn FROM import_view WHERE import_view.existing_keyfield is NULL
This View can also be useful if you need to do any lookups while importing (i.e., to get ID numbers instead of name values from related tables).
If your RDMS doesn't support these techniques, you can simply delete all the matched records from the previous step before doing an insert.
Deleting Expired Records
The final stage of a complete import is the "delete." Of course as all good programmers know, it's best to never actually delete anything, instead you simply change its status. Using the reverse of the outer join used for the import, you can see which records in your system are no longer in the feed and update them accordingly. In SQL server, it's easy:
UPDATE table1 /* your logic here to mark as expired or deleted */ SET table1.status_id = -1 FROM import_table I RIGHT OUTER JOIN table1 ON I.keyfield = table1.keyfield WHERE I.keyfield IS NULL
If you don't like that syntax, you can again set up a view, or if you have small data sets you could use a subselect to get the keys of the records to be deleted. If you're using MySQL, you may find it easier to use ColdFusion to simulate a subselect by running one query to get the keys of the records to delete, and by using the ValueList() function to create an IN () clause.
Advanced SQL
No matter how complicated the import you need to do, there's usually a simple way to do it using an extra table of some description. One of the more complicated I've worked on involved users being imported into a database, and for every new user, a user record needed to be added and then another record entered into a table of subscriptions using the primary key generated by the insert. I did this using a SQL Server temporary table - populating it with the unique import key of any new records and then joining it to the users' table when it came to inserting the subscriptions. Another way would have been to create a UUID field in the import table and the users' table and join that.
Many developers have similar tales of long, slow, line-by-line procedures that they've eliminated using the simple concept of an extra table - either physical or temporary. In the second part of this series I'll look at another technique that also uses an extra table - reducing server load by storing search results in the database itself.
Published November 24, 2004 Reads 11,045
Copyright © 2004 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Tom Peer
Tom Peer has been in electronic publishing of one sort or another for ten years, including a stint as manager of New Scientist Online (www.newscientist.com). He specializes in taking printed publications online and has recently completed the online edition of The World Handbook of Stock Exchanges (www.exchange-handbook.com).
- Adobe’s Aiming ColdFusion at Multiple Clouds
- Cloud Computing Journal: Adobe to Deliver ColdFusion in the Cloud
- Adobe Reader Sued
- Adobe May Cooperate with Apple to Transplant Flash Player to iPhone
- Adobe Flex Developer Earns $100K in New York City
- Adobe LiveCycle Enterprise Suite 2 for Cloud Computing
- Adobe Cans Another 9% of its Workforce
- Adobe Betas Target RIAs and Cloud Computing
- Adobe MAX 2009 Online
- Thinking of Flex in London
- Moyea DVD4Web Converter V2.0 Converts DVD to FLV Fast and Synchronously with Watermarks
- Adobe & Salesforce Cut Cloud Deal
- Adobe’s Aiming ColdFusion at Multiple Clouds
- Eval JavaScript in a Global Context
- Fig Leaf Software to Exhibit at Government IT Conference & Expo
- Is Microsoft as Free as Open Source?
- Cloud Computing Journal: Adobe to Deliver ColdFusion in the Cloud
- Adobe Reader Sued
- The Planet Named “Bronze Sponsor” of Cloud Computing Expo
- Microsoft Expression Web Has Got Game
- Adobe May Cooperate with Apple to Transplant Flash Player to iPhone
- Bruce Chizen Joins Voyager Capital as Venture Partner
- My Top Seven Wishes From Adobe MAX 2009
- Adobe Flex Developer Earns $100K in New York City
- The Next Programming Models, RIAs and Composite Applications
- Where Are RIA Technologies Headed in 2008?
- Constructing an Application with Flash Forms from the Ground Up
- AJAX World RIA Conference & Expo Kicks Off in New York City
- CFEclipse: The Developer's IDE, Eclipse For ColdFusion
- Personal Branding Checklist
- Adobe Flex 2: Advanced DataGrid
- Has the Technology Bounceback Begun?
- Building a Zip Code Proximity Search with ColdFusion
- i-Technology Viewpoint: We Need Not More Frameworks, But Better Programmers
- The Asynchronous CFML Gateway
- Web Services Using ColdFusion and Apache CXF



































