Forget Blogs, Make Way For "Flogs" – Filtered Weblogs

How To Create an Adaptive Blog Filter

Now multiply that by P(Ci), that is, the number of words in our category Ci divided by the total number of words in the whole corpus.

P(Ci|A) = P(W0|Ci) * P(W1|Ci) * ... * P(Wn-1|Ci) * P(Ci)

What if the word we are looking at doesn't appear in the category we are assessing against? Let's use a tiny number (0.1) to represent that we don't really know the probability, but that there is still a small chance, one that varies with the size of the category's word count.
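The scoring described so far can be sketched roughly in Python (this is not the ColdFusion of Listing 2). The names `training`, `word_probability`, `prior`, and `product_score`, and the toy word counts, are hypothetical. Dividing the 0.1 stand-in by the category's word total is an assumed reading of the fallback described above.

```python
from math import prod

# Hypothetical training data: word counts per category.
training = {
    "coldfusion": {"cfml": 4, "server": 3, "component": 3},
    "flash": {"flash": 5, "movie": 3, "actionscript": 2},
}

UNSEEN = 0.1  # stand-in count for a word never seen in a category (assumed usage)


def category_total(category):
    """Total number of words recorded for one category."""
    return sum(training[category].values())


def word_probability(word, category):
    """P(W|Ci): word count over category total, with the unseen fallback."""
    count = training[category].get(word, UNSEEN)
    return count / category_total(category)


def prior(category):
    """P(Ci): category word total over corpus word total."""
    corpus_total = sum(category_total(c) for c in training)
    return category_total(category) / corpus_total


def product_score(words, category):
    """P(W0|Ci) * ... * P(Wn-1|Ci) * P(Ci), a relative score only."""
    return prod(word_probability(w, category) for w in words) * prior(category)


article = ["flash", "movie", "server"]
scores = {c: product_score(article, c) for c in training}
```

With these toy counts the "flash" category scores higher than "coldfusion", because two of the three article words were seen in the Flash training data.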

Now we have the core of our classify method: an equation that produces relative scores for the relevance of an article to a category. Remember that dropping the P(A) denominator has turned this into a ranking exercise, not a measure of probability. That also frees us to use the log() function, which preserves the ranking because log() always increases as its argument increases. Taking logs reduces the size of the numbers we are operating on and increases performance by turning the multiplication of our terms into addition, since log(X * Y) = log(X) + log(Y).
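A short Python sketch (with made-up probabilities, not data from this article) shows one practical reason for working with logs. A long product of small probabilities underflows to zero in floating point, while the sum of their logs stays finite, and the order of two scores is unchanged by taking logs.

```python
import math

# 400 hypothetical word probabilities of 0.001 each.
probabilities = [0.001] * 400

product = 1.0
for p in probabilities:
    product *= p  # underflows to 0.0 long before the loop ends

# The equivalent sum of logs remains an ordinary finite number.
log_sum = sum(math.log(p) for p in probabilities)

# Ranking is preserved: log() is increasing, so a < b implies log(a) < log(b).
a, b = 1.5e-5, 7.5e-4
same_order = (a < b) == (math.log(a) < math.log(b))
```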

log(P(Ci|A)) = log(P(W0|Ci)) + log(P(W1|Ci)) + ... + log(P(Wn-1|Ci)) + log(P(Ci))
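The log form of the equation might look like the following minimal Python sketch. It is not the Listing 2 implementation. The `training` counts and the `classify` signature are hypothetical, and treating 0.1 as a stand-in count divided by the category total is an assumption.

```python
import math

# Hypothetical word counts per category (not the article's training data).
training = {
    "coldfusion": {"cfml": 4, "server": 3, "component": 3},
    "flash": {"flash": 5, "movie": 3, "actionscript": 2},
}

UNSEEN = 0.1  # assumed stand-in count for words missing from a category


def classify(words):
    """Return {category: log score}; the highest score ranks first."""
    corpus_total = sum(sum(counts.values()) for counts in training.values())
    recommendations = {}
    for category, counts in training.items():
        total = sum(counts.values())
        score = math.log(total / corpus_total)  # log(P(Ci))
        for word in words:
            # log(P(W|Ci)), using the fallback count for unseen words
            score += math.log(counts.get(word, UNSEEN) / total)
        recommendations[category] = score
    return recommendations


recommendations = classify(["flash", "movie", "server"])
```

The scores are negative because every term is the log of a number below one. The category whose score is closest to zero is the best match.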

The implementation in Listing 2 is translated from the Perl included in John Graham-Cumming's excellent article "Build Your Own Bayesian Spam Filter."

Now we can do a small example (see Listing 3), which dumps out to:

coldfusion -13.0321979407
flash -8.93316952915

By comparing the numerical values, we can see that our article is more likely to be about Flash than ColdFusion (flash > coldfusion). What if we remove the word "flash" from our testdata article? Now we get:

coldfusion -5.37759054744
flash -6.070737728

ColdFusion is suddenly more relevant. If we want a best-guess recommended categorization, we can pass the returned recommendations struct to the structSort function, which gives back the category names ordered by score:

sortedRecommendations = structSort(recommendations,'numeric','desc');
recommended = sortedRecommendations[1]; // currently equal to "coldfusion"
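For readers following along outside ColdFusion, the same best-guess step can be sketched in Python using the two scores from the second run above. The variable names are illustrative only.

```python
# Scores from the second run in the text above.
recommendations = {"coldfusion": -5.37759054744, "flash": -6.070737728}

# Category names ordered by score, highest first (a descending numeric sort).
sorted_recommendations = sorted(
    recommendations, key=recommendations.get, reverse=True
)

# Python lists are 0-indexed, unlike ColdFusion arrays, which start at 1.
recommended = sorted_recommendations[0]
```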

From here it is fairly short work to envision a system that reads feeds or an aggregator and presents something like Figure 1, allowing users to move items from the title list to the appropriate category to train the system.

After several weeks of training the system with several hundred articles, you could trust it to classify incoming articles and re-present them as filtered feeds. By the time this article reaches you, I should have such a system operational. Stay tuned at http://anthrologik.net/flogger.

This system has two things going for it. First, the concepts are neither new nor revolutionary: the math is about 250 years old, and the concepts driving information-retrieval technology are of similar vintage. Second, they are not complicated: the classify method is about 40 lines of code. These two properties usually make for a nice, stable library. Hopefully this library can be leveraged to make the blogosphere a more manageable source of information.

More Stories By Chip Temm

Over the past decade, Chip Temm moved from North America to Europe and on to Africa where his company anthroLogik solutions provided analysis and development services to non-governmental organizations across seven timezones. He is currently back in Washington, DC where "remote development" means working from home and "wildlife" means raccoon.
