Playing with Mail and Leopard’s Latent Semantic Mapping
While clearing Mail.app’s junk mail folder, I might have accidentally deleted a non-spam message. I’ll never know for sure, but as a result I learned a bit more about Leopard’s cool new Latent Semantic Mapping framework that I’d been wondering about since the mailing list leaked back in November.
Mail stores its spam information in ~/Library/Mail/LSMMap2. In addition to the Latent Semantic Mapping framework, Apple also ships lsm, a command line utility that provides the same functionality (with a little better documentation, I might add). As described in the man page, you can run

    lsm dump ~/Library/Mail/LSMMap2

to get a list of all the words that Mail’s spam filter knows about. (Some words probably NSFW, of course!) The first column is how many times the word has appeared in a “Not Junk” message, and the second is the count in spam messages. The last line gives a few overall statistics: how many zero values there were versus total values, and a “Max Run” value I don’t understand.
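For poking at that dump outside of Mail, here is a minimal Python sketch. It assumes each word line looks like `<not-junk count> <junk count> <word>` separated by whitespace, which is my guess from the column description above and not something the man page promises. The sample rows are made up, and the "junk share" ranking is a naive ratio of my own, not what the Latent Semantic Mapping framework actually computes.

```python
def parse_dump(text):
    """Parse dump text into (word, not_junk_count, junk_count) tuples.

    Assumed layout (not verified against real lsm output):
        <not_junk_count> <junk_count> <word>
    Lines that do not fit, such as the trailing statistics line, are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            not_junk, junk = int(parts[0]), int(parts[1])
        except ValueError:
            continue
        rows.append((parts[2], not_junk, junk))
    return rows


def junk_share(row):
    """Fraction of a word's appearances that were in junk mail.

    Add-one smoothing keeps rarely seen words from scoring exactly 0 or 1.
    """
    _, not_junk, junk = row
    return (junk + 1) / (not_junk + junk + 2)


def spammiest(rows, n=3):
    """Return the n rows with the highest junk share."""
    return sorted(rows, key=junk_share, reverse=True)[:n]


# Hypothetical sample in the assumed layout; these are not real Mail counts.
SAMPLE = """\
12 0 meeting
0 9 pills
3 4 free
"""

rows = parse_dump(SAMPLE)
for word, not_junk, junk in spammiest(rows):
    print(word, not_junk, junk)
```

To try it on real data, redirect the output of the lsm dump command above into a file and pass that file’s contents to parse_dump. If the real layout differs from my assumption, the parser simply skips the lines it can’t read, so an empty result means the format guess was wrong.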
Between this and CFStringTokenizer and its language-guessing coolness, Leopard provides some fun tools for playing around with text analysis. Hopefully someday I’ll have a bit more time to dig into it.
Until then (or rather, for my future reference), here’s a bit more information on Latent Semantic Analysis: how it works and how it differs from Bayesian classification.
I’ve also uploaded a really quick and dirty “playground” for testing out the hypotheses the documentation left me with: lsmtest.m
Update: Came across a more “explainy” article about both Mail and Latent Semantic Analysis over on macdevcenter.com.