What Does an E-mail Address Actually Say?

Here's the situation. You're relaxing, reading the latest issue of CFDJ, when your boss, significant other, or the voice in the back of your head asks you to write an e-mail retrieval program. This could be for an e-mail feedback system, an error alert manager, or even to handle the huge volume of e-mail that you receive from mailing lists. When this happens, you're going to smile and think that you can just whip out the <CFPOP> tag and use it to retrieve the e-mail you want. That's when the rude awakening hits you. The information you receive from this tag is not in the nice, clean format you're used to in Outlook or Eudora.

Now What?
What you're looking at is the header information of the e-mail message. When we look at an e-mail, we're used to seeing the poster's name, e-mail address, the message itself, and a few other pieces of information. The header of the e-mail contains all this and more. The problem is that it's all in a format that's more easily readable by the computer than by people. Additionally, the format of the information sometimes changes in ways we don't expect.

The obvious answer is to parse out the information: find some way to define the structure of the information and write code to make it into something clean. This article is designed to do just that. We'll be examining one crucial piece of information stored in the header. This information contains the e-mail address of the sender and sometimes the sender's name. We want that information and we want it separated. To do this we'll use a technology called Regular Expressions (RegEx) that allows us to define a pattern and then look for that pattern inside a string. In this article we pay particular attention to the patterns used in e-mail addresses and how to find them.

Let's begin by using a small code sample from a spam catcher. This is an e-mail account that someone would use when posting to unknown locations. It's designed to catch spammers and limit the amount of spam you get to your "real" e-mail account. In order to use such a setup, it's important to look through the mail that it contains every now and again. The code fragment in Listing 1 allows us to read all of the mail messages in such an account and display them. For the purpose of this article, we're only going to get the headers of the messages and then show only the FROM addresses (see Figure 1).
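For readers without Listing 1 at hand, the retrieval step might look something like the following. This is only a sketch: the server name, user name, password, and the query name qHeaders are all placeholders invented for illustration, not values from the original listing.

```cfml
<!--- Hypothetical server and credentials; replace with your own. --->
<cfpop action="getHeaderOnly"
       name="qHeaders"
       server="mail.example.com"
       username="spamcatcher"
       password="secret">

<!--- One row per message. HTMLEditFormat() keeps the browser from
      swallowing the <address> part as if it were an HTML tag. --->
<cfoutput query="qHeaders">
    #HTMLEditFormat(qHeaders.from)#<br>
</cfoutput>
```

Using getHeaderOnly rather than getAll skips the message bodies, which is all we need when we only want the FROM column.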

As we can see from the results, the format of the FROM address varies quite a bit. In practice, though, there are only five basic formats for addresses in the mail header.

Name <[email protected]>
"Name" <[email protected]>
[email protected] (Name)
[email protected]
<[email protected]>

In the first three examples, a plain-text name is sent along with the e-mail address. In the last two, only an e-mail address is sent. This gives us a challenge: how to parse the full e-mail address to get the plain-text name and the actual e-mail address. The answer is actually easier than you'd think. The user-defined function (UDF) in Listing 2 will do this.

Rather Tight, Don't You Think?
Besides being tight, it's also totally incomprehensible to average programmers if they don't know Regular Expressions, UDFs, CFSCRIPT, or the syntax that goes along with them. Let's go through it line by line so we can understand what's going on and why:
1   <CFSCRIPT>: In ColdFusion 5 a UDF has to be written inside a CFSCRIPT block. In CFMX the new CFFUNCTION tag can be used as well. As many people are still using CF5, let's use the "old" method of writing a UDF. CFSCRIPT has to have an opening and closing tag, with all of the actual code written between them. The closing CFSCRIPT tag ends on line 35.
2   Function call: A UDF is defined by using the key word function, followed by the name of the UDF and then open and closed parentheses. Inside the parentheses are the arguments we want to pass to the function. Following the standard rule of using descriptive names, we call the function "ParseEmail" and the argument that it will accept "email". Note that when using this UDF on a page, you can't define a variable called ParseEmail. That name is taken by the UDF.
3   After declaring a UDF, you have to place all of the code for it inside curly braces. This is simply the open brace. Note that I place the brace in line with the function call. This is so it'll be easier to see where something starts. All code inside the braces will be tabbed in by one. Proper tabbing helps greatly in reading and debugging code. The closing brace of this section is on line 34.
5-7   These lines set up variables that will exist only inside the UDF. We do this so we can work with data and not worry about overwriting outside variables. A UDF should be specific in what it accepts and uses. For lines 5-7 all we're doing is creating the variables with a NULL value (i.e., a blank string). Line 9 is the first line where we do some actual work, and this one has to be heavily explained.
9   We're using a standard CF function called REFind(). The RE stands for Regular Expression (RegEx) and the Find means we're trying to find something using the RegEx. A RegEx is a pattern that will be applied to a string or variable. If this pattern is found within the string or variable, then something will be returned to indicate success.

The REFind() function takes from two to four arguments. The first argument is the RegEx itself, the pattern we're looking for. The second is the string (or variable) that the RegEx pattern will be applied to. If only these two arguments are used, then the function will return a number that represents the starting location of the match. If there is no match, a zero will be returned. The third argument is the character location in the string where we want to start. This usually isn't needed unless the fourth argument is used. The fourth argument says that rather than return the numeric location of the start of the match, the function should return a structure that contains the position of the match and its length. It also returns the position and length of any subexpressions (we'll deal with them soon). This fourth argument is a Boolean value.

Note: In the code example we'll be using 1 to represent true. In examples in other places you may see "True" or "Yes". They all mean the same thing.
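To make the difference between the two-argument and four-argument forms concrete, here is a small sketch. The sample string and the variable names (fromHeader, atPos, match) are made up for illustration.

```cfml
<cfscript>
// Hypothetical FROM value used only for illustration.
fromHeader = "Name <user@example.com>";

// Two arguments: a number comes back (the start of the match, or 0).
atPos = REFind("@", fromHeader);

// Four arguments: start at character 1 and return subexpressions (1 = true).
// A structure comes back, holding the Pos and Len arrays.
match = REFind("@", fromHeader, 1, 1);

WriteOutput("Two arguments: " & atPos & "<br>");
WriteOutput("Four arguments: Pos[1]=" & match.pos[1] & ", Len[1]=" & match.len[1]);
</cfscript>
```

In this sample the @ is the eleventh character, so both forms report position 11. The structure form also reports a length of 1.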

Looking at RegEx
Now let's examine the RegEx that we're looking for. If you haven't used RegEx before, it may look like your editor threw up on the screen, but it actually means something.

^"?([^"<]*)"? *<([^@]+@[^>]+)>

I'm not going to try and teach everything about RegEx here. Instead I'm going to describe what each piece does and why. By reading this step-by-step examination, you should learn the basics.

^ Start at the beginning of the string. If the pattern doesn't start at the beginning, it fails. For example, if there is a space before the e-mail address, then it isn't valid.

"? There may or may not be a double quote as the first character in the string. The question mark (?) says that the previous character is optional.

( Start a subexpression, a grouping together of a part of the RegEx so that it's "seen" as a single unit. When the REFind() function has the fourth argument set to true, this subexpression is returned as a separate piece of data in the result structure.

[ Start looking for a set of characters. Rather than look for one character to match, this will look for one of many characters.

^ Note that this caret is inside a set declaration (the square brackets). When placed here, it means that we're negating the character set, that is, match any character other than whatever follows.

"< These are the characters we're looking for (or not looking for, as the case may be). We want every character that is not a double quote or an open bracket.

] Close the character set. The character set is now complete and will match any character that is not a double quote (") or an open angle bracket (<).

* This is a modifier much like the question mark (?) above. It will take the last character, set, or group and change how many times it can exist in the match. Specifically, it says that the character set that we've defined should exist zero or more times, that is, it will match any number of characters that aren't a double quote (") or an open angle bracket (<).

) This closes the subexpression group. Our complete subexpression says to match any character that isn't a double quote or an open bracket and to match zero or more of those characters. It will continue to match until we run into one of the two characters we don't want.

"? Again, we're looking for a double quote that may or may not exist.

* Note that before that asterisk is a space. This says that zero or more spaces can exist at this point.

< In three of the e-mail address formats, the actual address is inside brackets.

( Again, we're doing a subexpression group.

[^@]+ As with the set above, we're looking for any character that isn't an at sign (@). The plus means that one or more characters have to exist, that is, there must be a character before the @ sign. Note that we are not validating the e-mail address. This function was written to parse e-mail that has already been sent and the address should already be valid. A future article will go over all of the pieces needed to validate an e-mail address.

@ After finding the character(s) before the @, we need to specify the @ sign. All e-mail addresses have one.

[^>]+ Our last set is any character that isn't a closing bracket. One or more have to exist and this will be the domain portion of the address.

) When we're done, we close the subexpression. The entire subexpression says match one or more characters that are not an @ sign, followed by an @ sign, followed by one or more characters that aren't a closing bracket.

> After the address there is a closing bracket.

This entire expression will match the following addresses

Name <[email protected]>
"Name" <[email protected]>
<[email protected]>

In the first case, there are no double quotes, so the '"?' takes care of that. There's a space before the bracket, so the '*' takes care of that. The second case uses the double quotes and the code takes that into account with the '"?' and the negative set containing a double quote '[^"<]'. The third case has no name at all, so the use of the ? (may or may not exist) and the * (may exist zero or more times) came in handy.
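A quick way to check this is to run the pattern against one sample of each of the five header formats. The sketch below assumes made-up sample addresses and variable names. The pattern is the one dissected above.

```cfml
<cfscript>
// Pattern from the walkthrough above. Single quotes delimit the CFML string
// so the double quotes inside the pattern need no escaping.
pattern = '^"?([^"<]*)"? *<([^@]+@[^>]+)>';

// Hypothetical sample values, one per header format, separated by pipes.
samples = ListToArray('Name <user@example.com>|"Name" <user@example.com>|user@example.com (Name)|user@example.com|<user@example.com>', "|");

for (i = 1; i lte ArrayLen(samples); i = i + 1) {
    match = REFind(pattern, samples[i], 1, 1);
    if (match.pos[1] gt 0) {
        verdict = "matches";
    } else {
        verdict = "no match";
    }
    // HTMLEditFormat() keeps the angle brackets visible in the browser.
    WriteOutput(HTMLEditFormat(samples[i]) & ": " & verdict & "<br>");
}
</cfscript>
```

The first, second, and fifth samples should report a match. The two samples without angle brackets should not, because the pattern requires a literal < before the address.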

Two to Go
Three cases down, but what about the other two? Well, if one of the other address formats existed, the RegEx would fail against it, that is, there would be no match. So how do we detect this?

When a REFind() function is used and return subexpressions is set to true (the fourth argument), then the result of the function is a structure. The structure has two keys in it, Pos and Len. Pos holds the start position of each match and Len holds its length. Each key contains an array, and the two arrays are always the same length, that is, if the Pos array has three items, then the Len array will have three as well. The first item is always the entire match. If there was no match, then Pos[1] will be 0 and Len[1] will be 0. If there was a match, then their values are the start of the match and its entire length. This is nice, but not what we want. We want to get the position and length of each of the subexpressions. In the example above, if the RegEx matched, then the [2] position of the arrays would contain the first subexpression results (the name) and the [3] position of the arrays would contain the second subexpression results (the address). Of course, the first subexpression results may be blank, but we'll deal with that later.

Note: No matter what, a REFind() function set up like this will return a structure.
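Here is a small sketch of what that structure looks like in practice, using the standard Mid() function to cut the subexpressions out of the string. The sample value and variable names are assumptions for illustration only.

```cfml
<cfscript>
// Hypothetical FROM value; the pattern is the one dissected above.
fromHeader = '"Name" <user@example.com>';
match = REFind('^"?([^"<]*)"? *<([^@]+@[^>]+)>', fromHeader, 1, 1);

if (match.pos[1] eq 0) {
    // No match: Pos[1] and Len[1] are both 0.
    WriteOutput("No match");
} else {
    // [1] is the whole match, [2] the name, [3] the address.
    WriteOutput("Items in Pos: " & ArrayLen(match.pos) & "<br>");
    WriteOutput("Name: " & Mid(fromHeader, match.pos[2], match.len[2]) & "<br>");
    WriteOutput("Address: " & Mid(fromHeader, match.pos[3], match.len[3]));
}
</cfscript>
```

For this sample the name starts at position 2 with a length of 4, and the address starts at position 9 with a length of 16.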

Now that we've defined the RegEx used in line 9, let's move forward.

Back to the Lines
11   We want to know if the subexpressions were returned properly or if we have to try something else. We do this by checking the value of the Pos[1]. If it's zero, then there was no match at all. If it's a positive number, then there was a match and the subexpressions have to exist.
14-15   Using the mid() function, we get the subexpression data and place it in the local variables. The mid() function takes three arguments: a string to work with, a position inside the string to start at, and a length of characters from that position to return. This function is perfect for getting subexpressions.

The first value we'll be getting from the RegEx will be the name portion. As we said earlier, line 7 sets the name variable as being local to the UDF so we don't have to worry about overwriting any other variable called name. The same can be said for the e-mail variable.

The first subexpression was written in such a way as to allow a blank record, that is, there's only an e-mail address but no name. Even if it's blank, there will be a record in the arrays to say that it was at least tried.
18   If the first RegEx didn't match, then the e-mail address has to be either the third or fourth type listed at the beginning of this article. Both start with the e-mail address without any brackets around it. The third also has the poster's name in parentheses after the e-mail address. To parse this out, we use the same steps as with the previous RegEx but with two small alterations.

  • We'll use the REFindNoCase() version of the function.
  • We'll use a different RegEx pattern.

    20   REFindNoCase() Above we used the REFind() function when dealing with the RegEx pattern because we weren't looking to match any specific letters. When we have to match letters, we have a small issue to deal with: the case of the character. In RegEx the case is important and an uppercase version of a character is different from a lowercase version. To help deal with this, a second version of the function exists: REFindNoCase(). This works the same as the REFind() function, but allows matching of characters regardless of case. The NoCase version is slightly slower (.01 ms or so), as it has to look at both the lower- and uppercase versions of a letter, but we're not really concerned with such a minuscule difference.

    ^([-a-z0-9_.]+@[^[:space:]]+) *\(?([^)]+)?\)?

    ^ Once again we start this match at the beginning of the string.

    [-a-z0-9_.]+ This set is loaded with a lot of interesting things. The first character is a dash (-). Normally in a set this is used to signify a range of characters. The character to the left of the dash is the start of the range and the character on the right is the end. To signify the dash as a character rather than a range indicator, we have to place it as the first character in the set. We next have a range of letters from a to z. Because this RegEx is inside the NoCase version of the function, it actually means the same as a-zA-Z, that is, all characters from lowercase a to lowercase z inclusive, as well as all characters from uppercase A to uppercase Z inclusive. That's a bit shorter to write than listing each character individually.

    Note: There are special character sets that could be used in place of the a-z, 0-9, or both together. I've left them out and used the pattern as is to make it easier to learn and understand.

    Included in this set are the underbar (_) and the period (.). Usually in RegEx the period is a wild card character, that is, it matches any single character. This is true everywhere but within a set. Within a set most special characters such as the period aren't considered special and are instead seen as their literal character. This means that our set will match a character from a-z, A-Z, 0-9, the dash (-), the underbar (_), and the period (.). Because of the plus (+) after the set, one of these characters must exist and may exist as many times as possible. That is, until it comes to a character not in the set.

    @    One such character not in the set is the at sign (@). In any e-mail address pattern, one of these must exist. If it doesn't, then there's a big problem with the e-mail address and it may be forged or something else.

    [^[:space:]]+ Previously I mentioned special character sets. This is one such set, the space set. When used, it represents any character that can separate text but can't be seen: spaces, tabs, new lines, vertical tabs, form feeds, and carriage returns. Usually these special sets are shown as [[:space:]] to say that it's a set and it's special. We're doing something a little special here by saying that we want any character that is not one of those within the special set, so we add the caret (^) inside the first set of brackets. This does the same thing we've done before by negating the set. It just looks strange the way it's presented. Many programmers don't realize this can be done.

    ^([-a-z0-9_.]+@[^[:space:]]+) This entire match is placed within parentheses so that it will be returned as a subexpression. This will be the e-mail portion of the string. It's possible that there's a name portion as well, so we add some additional code to test for that.

    *    There is a space before the asterisk, which means a space can come after our subexpression and more than one can exist. If there is no name portion in this match, then the asterisk allows us not to have a space. If there is one, there will be a space between the e-mail portion and the name. We could use a question mark (?) rather than an asterisk to say that a space may or may not be needed there, but there may be mail programs that use more than one space, so it never hurts to look to the future.
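To see the pieces covered so far working together, the sketch below runs only the address half of the pattern. The sample address is invented to exercise uppercase letters, a period, and a dash before the @ sign. The variable names are illustrative too.

```cfml
<cfscript>
// Hypothetical address chosen to exercise uppercase letters, a period, and a dash.
sample = "First.Last-1@Example.com (Name)";

// Only the address subexpression, followed by the optional spaces.
match = REFindNoCase("^([-a-z0-9_.]+@[^[:space:]]+) *", sample, 1, 1);

address = "";
if (match.pos[1] gt 0) {
    // Item [2] of the arrays describes the first (and only) subexpression.
    address = Mid(sample, match.pos[2], match.len[2]);
}
WriteOutput("Address: " & address);
</cfscript>
```

The subexpression stops at the first space, so the address comes back without the parenthesized name. Swapping REFindNoCase() for REFind() here should make the match fail on the uppercase F, since a-z alone covers only lowercase letters.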

    Another Confusing Pattern
    We're about to start another one of those really confusing-looking RegEx patterns. Let's take this one really slow:

    \(?   The backslash (\) before the open parenthesis says that we're looking for a literal parenthesis. The parenthesis is normally the start of a subexpression, but because we're escaping it using the backslash, it's treated like the actual character open parenthesis. The question mark after it says that it's optional. This means that an open parenthesis may or may not exist at this point.

    ([^)]+)?   We're setting up another subexpression here. Because there's a question mark at the end of it, we're saying that the entire expression may or may not exist in our match. The expression itself is a simple set, where we're looking for one or more of any character that isn't a closed parenthesis.

    \)?   Finally, we're going to match with the closed parenthesis that may or may not exist.

    \(?([^)]+)?\)? The entire section will match the end of our third e-mail example: an open parenthesis, some text, and a closing parenthesis. Both parentheses are optional, as is the character string within them. Ugly to look at, but effective for what we want.
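Here is the complete second pattern applied to the two bracket-free formats. As before, this is a sketch with made-up sample values and variable names. The check on the size of the Len array is there because the optional name group may be missing from the result altogether.

```cfml
<cfscript>
pattern = "^([-a-z0-9_.]+@[^[:space:]]+) *\(?([^)]+)?\)?";

// Hypothetical sample values: one with a name in parentheses, one without.
samples = ListToArray("user@example.com (Name)|user@example.com", "|");

for (i = 1; i lte ArrayLen(samples); i = i + 1) {
    match = REFindNoCase(pattern, samples[i], 1, 1);
    address = "";
    name = "";
    if (match.pos[1] gt 0) {
        address = Mid(samples[i], match.pos[2], match.len[2]);
        // The name group is optional: make sure the third array item exists
        // and has a nonzero length before handing it to Mid().
        if (ArrayLen(match.len) gte 3) {
            if (match.len[3] gt 0) {
                name = Mid(samples[i], match.pos[3], match.len[3]);
            }
        }
    }
    WriteOutput("Address: " & address & ", Name: [" & name & "]<br>");
}
</cfscript>
```

The first sample should yield both an address and a name. The second should yield the address with an empty name.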
    21   Again we test the return from the RegEx function to see if it has a value within the arrays. If it fails at this point, the data passed to this UDF did not contain an e-mail address in one of the expected formats.
    23-26   At this point we have a successful match and will grab the subexpressions using the mid() function as described above. The use of mid() to get the e-mail is rather straightforward, but the use of it to get the name is a little trickier. In ColdFusion 5, because of the way we structured the last part of the RegEx pattern, the second subexpression (which is the third item in the array) is totally optional. In CFMX it will always exist even though it's set to optional with the question mark modifier (?). Therefore, we have to see how long one of the arrays is and what it contains before using mid() on that position.
    25   Here we're just testing whether the ArrayLen() of the Len array is 3 and whether its third item holds a value other than zero. If so, then the pattern matched a name as well as an e-mail address, and that name should be set to the name variable.
    30-31   It's possible at this point that the name variable is still blank even though a proper e-mail address was passed to this UDF. If so, we're going to set the name variable to a value of a single space. This will be important later when we want to display the data.
    33   At this point we have a name and an address variable. If no properly formatted e-mail address was passed to this UDF, then both variables are loaded with their default values, which are NULL (a blank string). Otherwise, one or both of the variables have actual text in them. The Return key word within a UDF will take whatever variable or value that's after it and pass it back to where the UDF was called. In this case we're going to cheat. A UDF should generally return only a single piece of data. Since the name and the e-mail address are so tightly interwoven, we're going to return both. To do so, we'll set them up as a list with a comma separating the two pieces of data. The first value will be the name; the second, the e-mail address.

    Let's say that this UDF was called to get the e-mail only. To do that, all you need to do is place it within a ListLast() function to get the last value in the list, which will always be the e-mail address. If you want the name, you can use the ListFirst() function. When there's no name portion in the e-mail, ListFirst() returns the single space set on lines 30-31. Without that placeholder, ColdFusion would ignore the empty list entry and hand back the e-mail address as the first item.
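Pulling the walkthrough together, a UDF along these lines might look like the sketch below. It is a reconstruction from the description above, not the original Listing 2, so its line numbers won't match the ones referenced in the text. The local variable names result and address are assumptions, and the Trim() call is an addition of this sketch, since the first format would otherwise leave a trailing space on the name.

```cfml
<cfscript>
// Reconstruction sketch of a ParseEmail-style UDF; not the original listing.
function ParseEmail(email) {
    // Local variables, declared first so they stay private to the UDF.
    var result = "";
    var address = "";
    var name = "";

    // First try: Name <addr>, "Name" <addr>, or <addr>.
    result = REFind('^"?([^"<]*)"? *<([^@]+@[^>]+)>', email, 1, 1);
    if (result.pos[1] gt 0) {
        if (result.len[2] gt 0) {
            // Trim() is an addition in this sketch (see the note above).
            name = Trim(Mid(email, result.pos[2], result.len[2]));
        }
        address = Mid(email, result.pos[3], result.len[3]);
    } else {
        // Second try: addr (Name) or a bare addr.
        result = REFindNoCase("^([-a-z0-9_.]+@[^[:space:]]+) *\(?([^)]+)?\)?", email, 1, 1);
        if (result.pos[1] gt 0) {
            address = Mid(email, result.pos[2], result.len[2]);
            // The name group is optional, so check that it exists and is non-empty.
            if (ArrayLen(result.len) gte 3) {
                if (result.len[3] gt 0) {
                    name = Mid(email, result.pos[3], result.len[3]);
                }
            }
        }
    }

    // Empty list elements are ignored, so pad a missing name with a space.
    if (Len(name) eq 0) {
        name = " ";
    }

    // Name first, address second, separated by a comma.
    return name & "," & address;
}

// Usage with a hypothetical header value.
parsed = ParseEmail('"Name" <user@example.com>');
WriteOutput("Name: " & ListFirst(parsed) & "<br>");
WriteOutput("Address: " & ListLast(parsed));
</cfscript>
```

One caveat of the comma-delimited return value: a name that itself contains a comma would be split by ListFirst(), while ListLast() would still return the address.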

    Now let's run our original code but with this UDF added. To make the example cleaner, we're going to save the above UDF in a separate file and use a CFINCLUDE to add it to the code in Listing 3.

    The FROM column in Figure 2 contains the original e-mail addresses from the header as returned by the CFPOP tag. The Name column contains the parsed name, and the Address column contains the parsed e-mail address. Nice and clean. This information could be added to a database or used in many different ways.

    This UDF would be an important part of any ColdFusion-based e-mail application. It's used literally hundreds of times a day in the House of Fusion list archives (www.houseoffusion.com/cf_lists) and runs both fast and smoothly. You can add it to any application using <CFPOP> to allow you to format the results for cleaner display and storage. The only limit to it is that it's a UDF and can be used only with ColdFusion v5 or better. You could remove it from the UDF "wrapper" and add it into older versions of ColdFusion, but versions earlier than CF4 may have some differences in RegEx support.

    As you can see from this article, Regular Expressions are extremely powerful even if they look painful. In reality, there are only about two dozen actual commands and they can be picked up easily.

    Sources
    You can learn more about Regular Expressions at the following locations:

  • Regular Expressions in ColdFusion: www.houseoffusion.com/RegEx.ppt
  • Regular Expression Power Tips (from Cfun 02): www.cfconf.org/cfun-02/talks/regularexpressions.ppt and www.cfconf.org/cfun-02/talks/regularexpressionscode.zip
  • Macromedia documentation: http://livedocs.macromedia.com/cf50docs/CFML_Reference/Functions194.jsp, http://livedocs.macromedia.com/cf50docs/CFML_Reference/Functions195.jsp, and http://livedocs.macromedia.com/cf50docs/Developing_ColdFusion_Applications/regexp.jsp

    In addition, a new version of the RegEx bible has been released:

  • Friedl, J.E.F., and Oram, A. (ed) (2002). Mastering Regular Expressions 2nd ed. O'Reilly.