Book: Practical Common Lisp
Training the Filter
Since you wrote classify and train to take a string argument, you can test them easily at the REPL. If you haven't yet, you should switch to the package in which you've been writing this code by evaluating an IN-PACKAGE form at the REPL or using the SLIME shortcut change-package. To use the SLIME shortcut, type a comma at the REPL and then type the name at the prompt. Pressing Tab while typing the package name will autocomplete based on the packages your Lisp knows about. Now you can invoke any of the functions that are part of the spam application. You should first make sure the database is empty.
SPAM> (clear-database)
Now you can train the filter with some text.
SPAM> (train "Make money fast" 'spam)
And then see what the classifier thinks.
SPAM> (classify "Make money fast")
SPAM
SPAM> (classify "Want to go to the movies?")
UNSURE
While ultimately all you care about is the classification, it'd be nice to be able to see the raw score too. The easiest way to get both values without disturbing any other code is to change classification to return multiple values.
(defun classification (score)
  (values
   (cond
     ((<= score *max-ham-score*) 'ham)
     ((>= score *min-spam-score*) 'spam)
     (t 'unsure))
   score))
You can make this change and then recompile just this one function. Because classify returns whatever classification returns, it'll also now return two values. But since the primary return value is the same, callers of either function who expect only one value won't be affected. Now when you test classify, you can see exactly what score went into the classification.
SPAM> (classify "Make money fast")
SPAM
0.863677101854273D0
SPAM> (classify "Want to go to the movies?")
UNSURE
0.5D0
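To see why callers that expect a single value are unaffected, here is a small standalone sketch you could evaluate in a scratch package. The threshold values and the two callers, spam-p and verdict-and-score, are assumptions made up for this illustration; they aren't part of the spam application.

```lisp
;; Stand-in thresholds, assumed for this sketch only.
(defparameter *max-ham-score* 0.4)
(defparameter *min-spam-score* 0.6)

(defun classification (score)
  (values
   (cond
     ((<= score *max-ham-score*) 'ham)
     ((>= score *min-spam-score*) 'spam)
     (t 'unsure))
   score))

;; Hypothetical caller that uses only the primary value.
;; The secondary value (the score) is silently discarded.
(defun spam-p (score)
  (eq (classification score) 'spam))

;; Hypothetical caller that wants both values and binds them explicitly.
(defun verdict-and-score (score)
  (multiple-value-bind (verdict raw) (classification score)
    (list verdict raw)))
```

A call in an ordinary argument position, as in spam-p, sees only the primary value, which is why existing code keeps working. Code that wants the score has to ask for it with something like MULTIPLE-VALUE-BIND or MULTIPLE-VALUE-LIST.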
And now you can see what happens if you train the filter with some more ham text.
SPAM> (train "Do you have any money for the movies?" 'ham)
1
SPAM> (classify "Make money fast")
SPAM
0.7685351219857626D0
It's still spam but a bit less certain since money was seen in ham text.
SPAM> (classify "Want to go to the movies?")
HAM
0.17482223132078922D0
And now this is clearly recognizable ham thanks to the presence of the word movies, now a hammy feature.
However, you don't really want to train the filter by hand. What you'd really like is an easy way to point it at a bunch of files and train it on them. And if you want to test how well the filter actually works, you'd like to then use it to classify another set of files of known types and see how it does. So the last bit of code you'll write in this chapter will be a test harness that tests the filter on a corpus of messages of known types, using a certain fraction for training and then measuring how accurate the filter is when classifying the remainder.
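As a rough picture of what such a harness has to do, here is a minimal sketch. It is not the harness itself: it assumes the corpus is already a list of (text . type) pairs rather than files, it doesn't shuffle the corpus, and the names split-corpus and measure-accuracy are hypothetical. The training and classifying functions are passed in as arguments so the sketch stands on its own.

```lisp
(defun split-corpus (corpus training-fraction)
  "Return two values: the leading TRAINING-FRACTION of CORPUS and the rest."
  (let ((n (floor (* (length corpus) training-fraction))))
    (values (subseq corpus 0 n) (subseq corpus n))))

(defun measure-accuracy (corpus training-fraction train-fn classify-fn)
  "Train on the leading fraction of CORPUS, then return the fraction of
the remaining messages whose classification matches the known type."
  (multiple-value-bind (training testing)
      (split-corpus corpus training-fraction)
    ;; Train on the first part of the corpus.
    (loop for (text . type) in training
          do (funcall train-fn text type))
    ;; Classify the remainder and count the correct answers.
    ;; EQ sees only the primary value returned by CLASSIFY-FN.
    (if (null testing)
        0
        (/ (loop for (text . type) in testing
                 count (eq (funcall classify-fn text) type))
           (length testing)))))
```

With the filter from this chapter you might call it as (measure-accuracy corpus 1/2 #'train #'classify), assuming corpus holds pairs such as ("Make money fast" . spam). Counting only exact matches means a message classified as UNSURE is scored as a miss, which is a simplification.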