Book: Practical Common Lisp

Training the Filter

Now that you have a way to keep track of individual features, you're almost ready to implement score. But first you need to write the code you'll use to train the spam filter so score will have some data to use. You'll define a function, train, that takes some text and a symbol indicating what kind of message it is—ham or spam—and that increments either the ham count or the spam count of all the features present in the text as well as a global count of hams or spams processed. Again, you can take a top-down approach and implement it in terms of other functions that don't yet exist.

(defun train (text type)
  (dolist (feature (extract-features text))
    (increment-count feature type))
  (increment-total-count type))

You've already written extract-features, so next up is increment-count, which takes a word-feature and a message type and increments the appropriate slot of the feature. Since there's no reason to think that the logic of incrementing these counts is going to change for different kinds of objects, you can write this as a regular function.[252] Because you defined both ham-count and spam-count with an :accessor option, you can use INCF and the accessor functions created by DEFCLASS to increment the appropriate slot.

(defun increment-count (feature type)
  (ecase type
    (ham (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))
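
To see INCF working through an accessor, here is a small sketch you could try at the REPL. It assumes a stripped-down stand-in for the word-feature class with only the two count slots; the name demo-feature and the variable *demo* are hypothetical and exist only for this illustration.

```lisp
;; Hypothetical stand-in for word-feature: only the two counters.
(defclass demo-feature ()
  ((ham-count  :initform 0 :accessor ham-count)
   (spam-count :initform 0 :accessor spam-count)))

(defun increment-count (feature type)
  (ecase type
    (ham  (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))

(defparameter *demo* (make-instance 'demo-feature))

(increment-count *demo* 'ham)
(increment-count *demo* 'ham)
(increment-count *demo* 'spam)

;; INCF on an accessor reads the slot, adds 1, and stores the result back,
;; much as if you had written (setf (ham-count f) (+ (ham-count f) 1)).
(list (ham-count *demo*) (spam-count *demo*)) ; => (2 1)
```

Because the :accessor option defines both a reader and a SETF function, any place-modifying macro (INCF, DECF, PUSH, and so on) can be used on the slot this way.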

The ECASE construct is a variant of CASE, both of which are similar to case statements in Algol-derived languages (renamed switch in C and its progeny). They both evaluate their first argument—the key form—and then find the clause whose first element—the key—is the same value according to EQL. In this case, that means the variable type is evaluated, yielding whatever value was passed as the second argument to increment-count.

The keys aren't evaluated. In other words, the value of type will be compared to the literal objects read by the Lisp reader as part of the ECASE form. In this function, that means the keys are the symbols ham and spam, not the values of any variables named ham and spam. So, if increment-count is called like this:

(increment-count some-feature 'ham)

the value of type will be the symbol ham, and the first branch of the ECASE will be evaluated and the feature's ham count incremented. On the other hand, if it's called like this:

(increment-count some-feature 'spam)

then the second branch will run, incrementing the spam count. Note that the symbols ham and spam are quoted when calling increment-count since otherwise they'd be evaluated as the names of variables. But they're not quoted when they appear in ECASE since ECASE doesn't evaluate the keys.[253]
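
One way to convince yourself that the keys really aren't evaluated is a sketch like the following, in which variables named ham and spam exist but are deliberately bound to the "wrong" symbols. The function which-branch is hypothetical, and it uses plain CASE, which treats its keys the same way ECASE does.

```lisp
;; Hypothetical demo function: the variables HAM and SPAM exist here,
;; but CASE never looks at their values. It compares TYPE against the
;; literal symbols written in the clauses.
(defun which-branch (type)
  (let ((ham 'spam)
        (spam 'ham))
    (declare (ignorable ham spam))
    (case type
      (ham  :first-branch)
      (spam :second-branch))))

(which-branch 'ham)  ; => :FIRST-BRANCH
(which-branch 'spam) ; => :SECOND-BRANCH
```

If the keys were evaluated, the two results would be swapped.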

The E in ECASE stands for "exhaustive" or "error," meaning ECASE should signal an error if the key value is anything other than one of the keys listed. The regular CASE is looser, returning NIL if no matching clause is found.
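
The difference is easy to see side by side. In this sketch the two functions, classify-loosely and classify-strictly, are hypothetical names chosen for the comparison; the only thing that differs between them is CASE versus ECASE.

```lisp
(defun classify-loosely (type)
  (case type
    (ham  :good)
    (spam :bad)))

(defun classify-strictly (type)
  (ecase type
    (ham  :good)
    (spam :bad)))

;; No clause matches, so CASE quietly returns NIL.
(classify-loosely 'eggs) ; => NIL

;; ECASE signals a TYPE-ERROR instead. HANDLER-CASE catches it here so
;; the example returns a value rather than landing you in the debugger.
(handler-case (classify-strictly 'eggs)
  (type-error () :no-matching-clause)) ; => :NO-MATCHING-CLAUSE
```

For a function like increment-count, the strict behavior is what you want: a misspelled message type shows up immediately instead of silently doing nothing.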

To implement increment-total-count, you need to decide where to store the counts; for the moment, two more special variables, *total-spams* and *total-hams*, will do fine.

(defvar *total-spams* 0)
(defvar *total-hams* 0)

(defun increment-total-count (type)
  (ecase type
    (ham (incf *total-hams*))
    (spam (incf *total-spams*))))

You should use DEFVAR to define these two variables for the same reason you used it with *feature-database*—they'll hold data built up while you run the program that you don't necessarily want to throw away just because you happen to reload your code during development. But you'll want to reset those variables if you ever reset *feature-database*, so you should add a few lines to clear-database as shown here:
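
The reload behavior can be checked with a short experiment. The variable names *kept-count* and *reset-count* are hypothetical, and re-evaluating the two definitions stands in for reloading a source file.

```lisp
(defvar *kept-count* 0)
(defparameter *reset-count* 0)

;; Pretend the program ran for a while and accumulated some data.
(incf *kept-count* 5)
(incf *reset-count* 5)

;; Simulate reloading the source file by evaluating the definitions again.
(defvar *kept-count* 0)        ; already bound, so the 0 is ignored
(defparameter *reset-count* 0) ; always assigns, so the 5 is lost

(list *kept-count* *reset-count*) ; => (5 0)
```

This is why DEFVAR suits the counters: training data survives a reload, and you reset it only when you ask for that explicitly.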

(defun clear-database ()
  (setf
   *feature-database* (make-hash-table :test #'equal)
   *total-spams* 0
   *total-hams* 0))
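
To watch the pieces work together, here is a self-contained sketch that trains on two tiny messages and then clears the database. The definitions of word-feature, intern-feature, and extract-features are simplified stand-ins, not the versions written earlier in the chapter; in particular, this extract-features is an assumption that merely splits on spaces and counts each distinct word once per message.

```lisp
;; Everything down to EXTRACT-FEATURES is a simplified stand-in for code
;; written earlier in the chapter, included only so the sketch runs alone.
(defvar *feature-database* (make-hash-table :test #'equal))
(defvar *total-spams* 0)
(defvar *total-hams* 0)

(defclass word-feature ()
  ((word       :initarg :word :reader word)
   (ham-count  :initform 0 :accessor ham-count)
   (spam-count :initform 0 :accessor spam-count)))

(defun intern-feature (word)
  (or (gethash word *feature-database*)
      (setf (gethash word *feature-database*)
            (make-instance 'word-feature :word word))))

;; Assumption: split on spaces only, and keep each distinct word once.
(defun extract-features (text)
  (let ((words '())
        (start 0))
    (loop for end = (position #\Space text :start start)
          do (let ((word (subseq text start end)))
               (when (plusp (length word))
                 (pushnew word words :test #'string=)))
             (if end
                 (setf start (1+ end))
                 (return)))
    (mapcar #'intern-feature (nreverse words))))

;; The functions from this section, unchanged.
(defun increment-count (feature type)
  (ecase type
    (ham  (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))

(defun increment-total-count (type)
  (ecase type
    (ham  (incf *total-hams*))
    (spam (incf *total-spams*))))

(defun train (text type)
  (dolist (feature (extract-features text))
    (increment-count feature type))
  (increment-total-count type))

(defun clear-database ()
  (setf
   *feature-database* (make-hash-table :test #'equal)
   *total-spams* 0
   *total-hams* 0))

(clear-database)
(train "make money fast" 'spam)
(train "lunch money tomorrow" 'ham)

;; "money" appeared in one spam and one ham.
(let ((money (gethash "money" *feature-database*)))
  (list (ham-count money) (spam-count money) *total-hams* *total-spams*))
;; => (1 1 1 1)

;; Clearing resets the per-feature data and both totals together.
(clear-database)
(list (hash-table-count *feature-database*) *total-hams* *total-spams*)
;; => (0 0 0)
```

Resetting all three variables in one place keeps the per-feature counts and the message totals consistent, which matters because the scoring code divides one by the other.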
