Book: Practical Common Lisp

Training the Filter

Since you wrote classify and train to take a string argument, you can test them easily at the REPL. If you haven't yet, you should switch to the package in which you've been writing this code, either by evaluating an IN-PACKAGE form at the REPL or by using the SLIME shortcut change-package. To use the SLIME shortcut, type a comma at the REPL, type change-package at the prompt, and then enter the name of the package. Pressing Tab while typing the package name will autocomplete based on the packages your Lisp knows about. Now you can invoke any of the functions that are part of the spam application. You should first make sure the database is empty.

SPAM> (clear-database)

Now you can train the filter with some text.

SPAM> (train "Make money fast" 'spam)

And then see what the classifier thinks.

SPAM> (classify "Make money fast")
SPAM
SPAM> (classify "Want to go to the movies?")
UNSURE

While ultimately all you care about is the classification, it'd be nice to be able to see the raw score too. The easiest way to get both values without disturbing any other code is to change classification to return multiple values.

(defun classification (score)
  (values
   (cond
     ((<= score *max-ham-score*) 'ham)
     ((>= score *min-spam-score*) 'spam)
     (t 'unsure))
   score))

You can make this change and then recompile just this one function. Because classify returns whatever classification returns, it'll also now return two values. But since the primary return value is the same, callers of either function who expect only one value won't be affected. Now when you test classify, you can see exactly what score went into the classification.

SPAM> (classify "Make money fast")
SPAM
0.863677101854273D0
SPAM> (classify "Want to go to the movies?")
UNSURE
0.5D0
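To see why existing callers are unaffected by the extra return value, it may help to try the function in isolation. The following is a minimal sketch: the threshold values 0.4 and 0.6 are assumptions made so that the snippet runs on its own, and spam-p and score-report are hypothetical callers invented for illustration, not part of the spam application.

```lisp
;; Assumed thresholds for this sketch; your own code may define different values.
(defparameter *max-ham-score* 0.4)
(defparameter *min-spam-score* 0.6)

(defun classification (score)
  (values
   (cond
     ((<= score *max-ham-score*) 'ham)
     ((>= score *min-spam-score*) 'spam)
     (t 'unsure))
   score))

;; Hypothetical caller that uses only the primary value.
;; The second value is silently discarded.
(defun spam-p (score)
  (eq (classification score) 'spam))

;; Hypothetical caller that wants both values and binds them explicitly.
(defun score-report (score)
  (multiple-value-bind (type raw-score) (classification score)
    (list type raw-score)))
```

A function such as spam-p would work the same way before and after the change, while score-report shows how MULTIPLE-VALUE-BIND picks up the score when you do want it. At the REPL you get both values printed without doing anything special.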

And now you can see what happens if you train the filter with some ham text.

SPAM> (train "Do you have any money for the movies?" 'ham)
1
SPAM> (classify "Make money fast")
SPAM
0.7685351219857626D0

It's still spam but a bit less certain since money was seen in ham text.

SPAM> (classify "Want to go to the movies?")
HAM
0.17482223132078922D0

And now this is clearly recognizable ham thanks to the presence of the word movies, now a hammy feature.

However, you don't really want to train the filter by hand. What you'd really like is an easy way to point it at a bunch of files and train it on them. And if you want to test how well the filter actually works, you'd like to then use it to classify another set of files of known types and see how it does. So the last bit of code you'll write in this chapter will be a test harness that tests the filter on a corpus of messages of known types, using a certain fraction for training and then measuring how accurate the filter is when classifying the remainder.
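The idea of training on one part of a corpus and measuring accuracy on the rest can be sketched as follows. This is only an illustration under stated assumptions, not the test harness itself: the corpus is assumed to be a list of (text . type) pairs already held in memory, split-corpus and measure-accuracy are hypothetical names, and the training and classifying functions are passed in as arguments so the sketch doesn't depend on the rest of the application.

```lisp
;; Hypothetical helper: split CORPUS (a list) into a training part and a
;; testing part. Returns the two parts as multiple values.
(defun split-corpus (corpus training-fraction)
  (let ((cut (floor (* training-fraction (length corpus)))))
    (values (subseq corpus 0 cut) (subseq corpus cut))))

;; Hypothetical helper: train on the first part of CORPUS, classify the
;; remainder, and return the fraction of messages classified correctly.
;; CORPUS is assumed to be a list of (text . type) pairs.
(defun measure-accuracy (corpus train-fn classify-fn
                         &key (training-fraction 1/2))
  (multiple-value-bind (training testing)
      (split-corpus corpus training-fraction)
    (loop for (text . type) in training
          do (funcall train-fn text type))
    (if (null testing)
        0
        (/ (loop for (text . type) in testing
                 ;; Only the primary value of CLASSIFY-FN is compared.
                 count (eq (funcall classify-fn text) type))
           (length testing)))))
```

With the functions from this chapter, a call might look like (measure-accuracy corpus #'train #'classify :training-fraction 0.7). Note that this sketch takes the training messages from the front of the list, so you'd probably want to shuffle the corpus first, and that a result of UNSURE simply counts as a miss here.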
