By Chip Temm
December 4, 2005 06:15 PM EST
It's a great problem to have: thousands of people like yourself providing daily updates from the leading edge of your craft. If you want to learn where things are going in the world of Web programming, you need to read blogs and lots of them.
Aggregators help make that easier, but only by reducing the amount of running around to find new sources. The quantity of content flowing through these channels is daunting and the quality highly variable. Somewhere in this big stack of articles, sometimes hundreds a day, are things you need to read to stay ahead of the curve.
How do you find them? Maybe you use Google's deskbar and load it up with a couple of aggregators or use Firefox's live bookmarks to do the same thing. Now you have to scan the headlines flowing in every couple of hours and see if something interesting has come up. Hopefully the author gave the item a good headline. The problem is pretty clear: we need something to help us filter the flow and highlight items we would be interested in.
Some folks are looking at collaborative filtering techniques. They want to get readers to rate blog items. This could be a good approach if you can get readers to click on the rating link; then you have cheating or malicious behavior to worry about. Messy. I'd like something more empirical. Maybe measuring cross-references to aid in weighting a la Google. Complicated.
What I want is a system that watches what I read and can predict from that data what I would be interested in reading in the future. Better yet, maybe I could teach the system how to categorize items in addition to filtering out uninteresting stuff. Spam filters do this kind of thing and, with a bit of adaptation, we can bend the concepts behind spam filtering into something useful for blog filtering and build ourselves a flog - a filtered Web log.
Most spam filters use some kind of optimized statistical algorithm to learn what you think is spam. You train the filter by marking items as spam as they come into your inbox or moving items out of the quarantine area if the filter got it wrong. This data builds up over time and your filter becomes more and more accurate. I will discuss a very simple naïve Bayesian algorithm I found while doing research for this article ("Build Your Own Bayesian Spam Filter" by John Graham-Cumming, 2005), which we will wrap up into a CFC so that we can put it to all kinds of interesting uses. We won't do a lot of optimizations here for the sake of a simple code sample - you can check out http://anthrologik.net/flogger if you want to watch this concept evolve over time or better yet to contribute to it!
Our system will contain the following major concepts: the text we want to categorize (an article), the body of knowledge accumulated so far (the corpus), and the categories that articles can belong to. To add to the body of knowledge, we train the system by telling it that a given article belongs in a given category. Once we are confident that our training has taught the system enough, we can have it make recommendations to us by classifying articles that we pass to it. Eventually, we will have the system classify all incoming articles, and we will either accept its recommendations or, where it is wrong, change the categories of the problematic articles. We then feed those corrections back into the system as training data. (The source code for this article can be downloaded from http://cfdj.sys-con.com.)
We know from this brief analysis that we need at least two methods (train and classify) and one persistent object (corpus). Our corpus will consist of a struct that looks like this:
corpus.category.word = count
where count is the number of times the word has appeared in all articles trained into that category of the corpus to date (see Table 1).
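As a rough illustration of that shape, here is a minimal sketch in Python rather than the CFML used in the article's download. It assumes the corpus is a struct of categories, each holding a struct of words and their counts. The category names, the numbers, and the word_count helper are made up for the example and are not the values from Table 1.

```python
# Hypothetical corpus: category -> word -> count.
# The names and numbers are illustrative only, not taken from Table 1.
corpus = {
    "coldfusion": {"component": 12, "cfc": 7},
    "flex": {"component": 3, "mxml": 9},
}


def word_count(corpus, category, word):
    """Return how often `word` was trained into `category` (0 if never seen)."""
    return corpus.get(category, {}).get(word, 0)


print(word_count(corpus, "coldfusion", "component"))  # 12
print(word_count(corpus, "flex", "cfc"))              # 0
```

Treating a missing word or category as a count of zero keeps later lookups from failing on words the corpus has not seen yet.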
When we call train(inputText, category), our new component will perform a word frequency count for our input text and add the data to the appropriate section of the corpus. In the example above, if we train a new article into the coldfusion category and the article contains the word "component" five times, our corpus would now show (see Table 2):
In Listing 1, we have an argument parsedTextStruct instead of just a string. This refers to a method we use to parse incoming text and create a struct containing the unique words in the text as keys and their frequency of occurrence as values. We'll skip over that for the sake of brevity, but the method is included in the download for this article. Since it is off in its own method, we can experiment independently with different ways of parsing articles, such as filtering out "stop words" like prepositions, conjunctions, and pronouns, and see how each one affects parsing speed and system accuracy.
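The parsing method itself is not shown here, so the following is only a guess at what such a step could look like, written as a Python sketch. The function name parse_text, the tokenizing rule, and the very short stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "it", "is", "we", "you"}


def parse_text(text, stop_words=STOP_WORDS):
    """Return a dict mapping each unique word in `text` to its frequency."""
    # Lowercase, then treat runs of letters, digits and apostrophes as words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return dict(Counter(w for w in words if w not in stop_words))


parsed = parse_text("The component calls the other component.")
print(parsed)  # {'component': 2, 'calls': 1, 'other': 1}
```

Because the stop-word set is a parameter, passing an empty set makes it easy to compare results with and without filtering.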
You can see that this is really very simple. If the word "coldfusion" occurs five times in an article and we have in our corpus a "CF" category containing an entry for "coldfusion" with a value of "200" (meaning that coldfusion has occurred 200 times in all of the text classified to date into the CF category), training our corpus by placing our new article into the "CF" category will bring the value of CF.coldfusion to 205. This means that for all of the articles categorized thus far into the CF category, the word "coldfusion" occurred 205 times. See? Training is easy. The magic is in the classification method.
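The update described above can be sketched in a few lines. This is a Python illustration under the assumption that the corpus is a nested mapping of category to word to count; it is not the article's Listing 1, and the function signature is hypothetical.

```python
def train(corpus, parsed_text, category):
    """Add the word counts in `parsed_text` to `category` in `corpus`."""
    # Create the category the first time an article is trained into it.
    words = corpus.setdefault(category, {})
    for word, count in parsed_text.items():
        words[word] = words.get(word, 0) + count
    return corpus


# The example from the text: "coldfusion" starts at 200 and occurs 5 more times.
corpus = {"CF": {"coldfusion": 200}}
train(corpus, {"coldfusion": 5, "component": 2}, "CF")
print(corpus["CF"]["coldfusion"])  # 205
print(corpus["CF"]["component"])   # 2
```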
When we call classify(inputText), we are asking the system to tell us how likely it is that our article falls into each of the currently defined categories, or to be more accurate, the relative likelihood. This prediction uses a naïve Bayesian classifier based on Bayes' Theorem, which states that the probability of A happening given B happening is equal to the probability of B happening given A times the probability of A divided by the probability of B. This is written as P(A|B) = P(B|A)P(A) / P(B). For a more in-depth discussion of Bayes' Theorem and naïve Bayesian classifiers, see Wikipedia (http://en.wikipedia.org/wiki/Bayes_rule). If you are wondering, the "naïve" means that we will make the assumption that the words in our articles have no relationship to each other (that is, they occur independently). While this is clearly a false assumption, it simplifies the math considerably and in practice produces results that are acceptable for our purposes; it is not uncommon to see a moderately trained classifier reach percentages of accuracy in the high nineties.
In our case, the system is looking for the probability of an article (A) being placed into one of our categories (Ci): P(Ci|A). To get that, we need to calculate the probability that the set of words in the article appears in category Ci: P(A|Ci). We then multiply that by P(Ci), the probability that any given word belongs to Ci, which is the word count of Ci divided by the total word count for all categories. Now we have:
P(Ci|A) = P(A|Ci) * P(Ci)
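Strictly, Bayes' Theorem also divides the right-hand side by P(A), the probability of the article occurring. We drop P(A) from our calculation because it is unknown and because we are only interested in the relative values of P(Ci|A) for our categories. P(A) is the same constant across all of our calculations P(C1|A) ... P(Cn|A), so leaving it out changes the size of each result but not the order in which the categories rank.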
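To see why a shared constant does not matter, consider this small Python sketch. The scores are invented numbers standing in for P(A|Ci) * P(Ci), and the relative_shares helper is hypothetical. Dividing every score by the same total rescales the scores without changing which category comes out on top.

```python
def relative_shares(scores):
    """Rescale raw category scores so that they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {category: 0.0 for category in scores}
    return {category: score / total for category, score in scores.items()}


# Invented raw scores for two categories.
raw = {"CF": 0.125, "Flex": 0.025}
shares = relative_shares(raw)

# The winning category is the same before and after rescaling.
print(max(raw, key=raw.get), max(shares, key=shares.get))
```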
P(A|Ci) can be expanded as being the product of the probabilities for each word (W) in the article appearing in the category:
P(A|Ci) = P(W0|Ci) * P(W1|Ci) * ... * P(Wn-1|Ci)
Each of these is just the number of occurrences of the word in the category divided by the total number of words in the category. That's not too hard.
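Putting the formulas above together, a classifier could be sketched as follows. This is a Python illustration, not the article's CFC, and it makes two assumptions the text does not state. First, a word that occurs several times in the article contributes one factor per occurrence. Second, a word that has never been seen in a category gets a small floor probability (the hypothetical unseen parameter) so that a single new word does not force the whole product to zero.

```python
def classify(corpus, parsed_text, unseen=1e-6):
    """Return a relative score P(A|Ci) * P(Ci) for every category Ci."""
    # Total number of words trained into each category, and into all of them.
    totals = {cat: sum(words.values()) for cat, words in corpus.items()}
    grand_total = sum(totals.values())

    scores = {}
    for cat, words in corpus.items():
        if totals[cat] == 0:
            scores[cat] = 0.0  # nothing has been trained into this category
            continue
        # P(Ci): word count of Ci divided by the word count of all categories.
        score = totals[cat] / grand_total
        for word, times in parsed_text.items():
            count = words.get(word, 0)
            # P(W|Ci): occurrences of W in Ci divided by all words in Ci.
            # `unseen` is an assumed floor for words the category lacks.
            p_word = count / totals[cat] if count else unseen
            score *= p_word ** times
        scores[cat] = score
    return scores


# Hypothetical corpus and article, with counts chosen for easy arithmetic.
corpus = {
    "CF": {"coldfusion": 6, "component": 2},
    "Flex": {"flex": 3, "component": 1},
}
scores = classify(corpus, {"coldfusion": 1, "component": 1})
print(max(scores, key=scores.get))  # CF
```

For the CF category the score works out to (8/12) * (6/8) * (2/8) = 0.125. Multiplying many small probabilities can underflow on long articles, so summing logarithms instead is a common variation.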
Copyright © 2005 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Chip Temm
Over the past decade, Chip Temm moved from North America to Europe and on to Africa where his company anthroLogik solutions provided analysis and development services to non-governmental organizations across seven timezones. He is currently back in Washington, DC where "remote development" means working from home and "wildlife" means raccoon.