By Chip Temm | December 4, 2005 06:15 PM EST | Reads: 18,488
Now multiply that by P(Ci), the number of words in our category Ci divided by the total number of words in the whole corpus.
P(Ci|A) = P(W0|Ci) * P(W1|Ci) * ... * P(Wn-1|Ci) * P(Ci)
What if the word we are looking at doesn't appear in the category we are assessing against? A probability of zero would wipe out the whole product, so let's use a tiny number (0.1) instead. It represents that we don't really know the probability, but that there is some chance, one that varies with the size of the category's word count.
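As a rough illustration of the product above, here is a minimal sketch in Python (used purely for illustration; it is not the article's CFML listing). The word counts and the names `counts`, `word_probability`, and `product_score` are hypothetical, and dividing the 0.1 placeholder by the category's word total is only an assumption about how the chance "varies with" the category size.

```python
# Hypothetical word counts per category; not taken from the article's listings.
counts = {
    "coldfusion": {"cfc": 3, "query": 2, "server": 1},
    "flash": {"movie": 2, "timeline": 2, "server": 1, "actionscript": 1},
}

UNSEEN = 0.1  # the article's "tiny number" for words a category has never seen


def word_probability(word, category):
    """P(W|Ci): the word's count in the category over the category's word total."""
    total = sum(counts[category].values())
    # Assumption: an unseen word contributes UNSEEN / total, so the fallback
    # shrinks as the category's word count grows.
    return counts[category].get(word, UNSEEN) / total


def product_score(words, category):
    """P(W0|Ci) * ... * P(Wn-1|Ci) * P(Ci), with the P(A) denominator dropped."""
    corpus_total = sum(sum(c.values()) for c in counts.values())
    score = sum(counts[category].values()) / corpus_total  # P(Ci)
    for word in words:
        score *= word_probability(word, category)
    return score


article = ["server", "timeline"]
for category in counts:
    print(category, product_score(article, category))
```

With these made-up counts, "timeline" never appears under coldfusion, so that category's score is pulled down by the fallback value rather than collapsing to zero.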
Now we have the core of our classify method: an equation that produces relative scores for the relevance of an article to a category. Remember that dropping the P(A) denominator has turned this into a ranking exercise, not a measure of probability. That also frees us to use the log() function, which preserves the ranking because it only ever increases as its argument increases. Since log(X * Y) = log(X) + log(Y), the multiplication of our terms becomes addition, which keeps the numbers we are operating on at a manageable size and improves performance.
log(P(Ci|A)) = log(P(W0|Ci)) + log(P(W1|Ci)) + ... + log(P(Wn-1|Ci)) + log(P(Ci))
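The log form can be sketched as follows, again in Python for illustration only and not as a translation of Listing 2. The training counts, the `classify` name, and its return shape are hypothetical, and the handling of unseen words (0.1 divided by the category's word total) is an assumption.

```python
import math

# Hypothetical training data: word counts per category (not from the article).
counts = {
    "coldfusion": {"cfc": 3, "query": 2, "server": 1},
    "flash": {"movie": 2, "timeline": 2, "server": 1, "actionscript": 1},
}
UNSEEN = 0.1  # the article's "tiny number" for unseen words


def classify(words):
    """Return {category: log score}; a higher (less negative) score ranks first."""
    corpus_total = sum(sum(c.values()) for c in counts.values())
    scores = {}
    for category, word_counts in counts.items():
        total = sum(word_counts.values())
        score = math.log(total / corpus_total)  # log(P(Ci))
        for word in words:
            # Assumption: unseen words get UNSEEN / total as their probability.
            score += math.log(word_counts.get(word, UNSEEN) / total)
        scores[category] = score
    return scores


recommendations = classify(["server", "timeline", "movie"])
for category, score in recommendations.items():
    print(category, score)
```

Every term is the log of a number below 1, so the scores come out negative, which is why the sample outputs in this article are negative and the value closest to zero wins.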
The implementation in Listing 2 is translated from the Perl included in John Graham-Cumming's excellent article "Build Your Own Bayesian Spam Filter."
Now we can do a small example (see Listing 3), which dumps out to:
coldfusion -13.0321979407
flash -8.93316952915
By comparing the numerical values, we can see that our article is more likely to be about Flash than ColdFusion (flash > coldfusion). What if we remove the word "flash" from our testdata article? Now we get:
coldfusion -5.37759054744
flash -6.070737728
ColdFusion is suddenly more relevant. If we want a best-guess recommended categorization, we can pass the returned recommendations struct through the structSort function, which gives us the category names ordered from highest score to lowest:
sortedRecommendations = structSort(recommendations, 'numeric', 'desc');
recommended = sortedRecommendations[1]; // currently equal to "coldfusion"
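For readers more at home outside CFML, the same selection might look like this in Python. This is only an illustrative equivalent; the scores are the ones printed above for the article with the word "flash" removed.

```python
# Scores from the second run above (the word "flash" removed from the article).
recommendations = {"coldfusion": -5.37759054744, "flash": -6.070737728}

# Category names ordered from highest score to lowest.
sorted_recommendations = sorted(recommendations, key=recommendations.get, reverse=True)

# Python lists are 0-indexed, whereas CFML arrays start at 1.
recommended = sorted_recommendations[0]
print(recommended)  # coldfusion
```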
From here it is fairly short work to envision a system that reads feeds or an aggregator and presents something like Figure 1, allowing users to move items from the title list to the appropriate category to train the system.
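The training step behind such an interface can be sketched as a word-count update per category. This is a minimal Python illustration under stated assumptions, not the article's implementation: the `train` and `tokenize` names, the simple letters-only tokenizer, and the sample titles are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# category name -> Counter mapping each word to how often it was seen in training
counts = defaultdict(Counter)


def tokenize(text):
    """Lowercase the text and split it into runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def train(category, text):
    """Record that an item with this text was filed under this category."""
    counts[category].update(tokenize(text))


# Each call stands in for a user moving a title into a category.
train("flash", "Building Flash forms with ActionScript")
train("coldfusion", "Caching query results in ColdFusion")
train("flash", "Flash timeline tips")

print(counts["flash"].most_common(3))
```

Counts accumulated this way are exactly what the per-category word probabilities and P(Ci) are computed from, so every item a user files sharpens later classifications.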
After several weeks of training the system with several hundred articles, you could trust it to classify incoming articles and re-present them as filtered feeds. By the time this article reaches you, I should have such a system operational. Stay tuned at http://anthrologik.net/flogger.
This system has two things going for it: the concepts are neither new (the math is about 250 years old, and the concepts driving information-retrieval technology are of similar vintage) nor complicated (the classify method is about 40 lines of code). These two properties usually make for a nice, stable library. Hopefully this library can be leveraged to make the blogosphere a more manageable source of information.
Resources
- Graham-Cumming, J. "Build your own Bayesian spam filter": www.jgc.org/antispam/05152005-92b9aec9357df88a2ce200056b18ee74.pdf
- Lewis, D. D. "Naïve (Bayes) at Forty: The Independence Assumption in Information Retrieval": http://citeseer.ist.psu.edu/cache/papers/cs/26985/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzjimmylinzSzpaperszSzLewis98.pdf/lewis98naive.pdf
- Wikipedia.org, "Bayes Rule": http://en.wikipedia.com/wiki/bayes_rul
Copyright © 2005 SYS-CON Media, Inc. — All Rights Reserved.
More Stories By Chip Temm
Over the past decade, Chip Temm moved from North America to Europe and on to Africa where his company anthroLogik solutions provided analysis and development services to non-governmental organizations across seven timezones. He is currently back in Washington, DC where "remote development" means working from home and "wildlife" means raccoon.