Книга: Practical Common Lisp

The Heart of a Spam Filter

The Heart of a Spam Filter

In this chapter, you'll implement the core of a spam-filtering engine. You won't write a soup-to-nuts spam-filtering application; rather, you'll focus on the functions for classifying new messages and training the filter.

This application is going to be large enough that it's worth defining a new package to avoid name conflicts. For instance, in the source code you can download from this book's Web site, I use the package name COM.GIGAMONKEYS.SPAM, defining a package that uses both the standard COMMON-LISP package and the COM.GIGAMONKEYS.PATHNAMES package from Chapter 15, like this:

(defpackage :com.gigamonkeys.spam
(:use :common-lisp :com.gigamonkeys.pathnames))

Any file containing code for this application should start with this line:

(in-package :com.gigamonkeys.spam)

You can use the same package name or replace com.gigamonkeys with some domain you control.[248]

You can also type this same form at the REPL to switch to this package to test the functions you write. In SLIME this will change the prompt from CL-USER> to SPAM> like this:

CL-USER> (in-package :com.gigamonkeys.spam)
#<The COM.GIGAMONKEYS.SPAM package>
SPAM>

Once you have a package defined, you can start on the actual code. The main function you'll need to implement has a simple job—take the text of a message as an argument and classify the message as spam, ham, or unsure. You can easily implement this basic function by defining it in terms of other functions that you'll write in a moment.

(defun classify (text)
(classification (score (extract-features text))))

Reading from the inside out, the first step in classifying a message is to extract features to pass to the score function. In score you'll compute a value that can then be translated into one of three classifications—spam, ham, or unsure—by the function classification. Of the three functions, classification is the simplest. You can assume score will return a value near 1 if the message is a spam, near 0 if it's a ham, and near .5 if it's unclear.

Thus, you can implement classification like this:

(defparameter *max-ham-score* .4)
(defparameter *min-spam-score* .6)
(defun classification (score)
(cond
((<= score *max-ham-score*) 'ham)
((>= score *min-spam-score*) 'spam)
(t 'unsure)))

The extract-features function is almost as straightforward, though it requires a bit more code. For the moment, the features you'll extract will be the words appearing in the text. For each word, you need to keep track of the number of times it has been seen in a spam and the number of times it has been seen in a ham. A convenient way to keep those pieces of data together with the word itself is to define a class, word-feature, with three slots.

(defclass word-feature ()
((word
:initarg :word
:accessor word
:initform (error "Must supply :word")
:documentation "The word this feature represents.")
(spam-count
:initarg :spam-count
:accessor spam-count
:initform 0
:documentation "Number of spams we have seen this feature in.")
(ham-count
:initarg :ham-count
:accessor ham-count
:initform 0
:documentation "Number of hams we have seen this feature in.")))

You'll keep the database of features in a hash table so you can easily find the object representing a given feature. You can define a special variable, *feature-database*, to hold a reference to this hash table.

(defvar *feature-database* (make-hash-table :test #'equal))

You should use DEFVAR rather than DEFPARAMETER because you don't want *feature-database* to be reset if you happen to reload the file containing this definition during development—you might have data stored in *feature-database* that you don't want to lose. Of course, that means if you do want to clear out the feature database, you can't just reevaluate the DEFVAR form. So you should define a function clear-database.

(defun clear-database ()
(setf *feature-database* (make-hash-table :test #'equal)))

To find the features present in a given message, the code will need to extract the individual words and then look up the corresponding word-feature object in *feature-database*. If *feature-database* contains no such feature, it'll need to create a new word-feature to represent the word. You can encapsulate that bit of logic in a function, intern-feature, that takes a word and returns the appropriate feature, creating it if necessary.

(defun intern-feature (word)
(or (gethash word *feature-database*)
(setf (gethash word *feature-database*)
(make-instance 'word-feature :word word))))

You can extract the individual words from the message text using a regular expression. For example, using the Common Lisp Portable Perl-Compatible Regular Expression (CL-PPCRE) library written by Edi Weitz, you can write extract-words like this:[249]

(defun extract-words (text)
(delete-duplicates
(cl-ppcre:all-matches-as-strings "[a-zA-Z]{3,}" text)
:test #'string=))

Now all that remains to implement extract-features is to put extract-features and intern-feature together. Since extract-words returns a list of strings and you want a list with each string translated to the corresponding word-feature, this is a perfect time to use MAPCAR.

(defun extract-features (text)
(mapcar #'intern-feature (extract-words text)))

You can test these functions at the REPL like this:

SPAM> (extract-words "foo bar baz")
("foo" "bar" "baz")

And you can make sure the DELETE-DUPLICATES is working like this:

SPAM> (extract-words "foo bar baz foo bar")
("baz" "foo" "bar")

You can also test extract-features.

SPAM> (extract-features "foo bar baz foo bar")
(#<WORD-FEATURE @ #x71ef28da> #<WORD-FEATURE @ #x71e3809a>
#<WORD-FEATURE @ #x71ef28aa>)

However, as you can see, the default method for printing arbitrary objects isn't very informative. As you work on this program, it'll be useful to be able to print word-feature objects in a less opaque way. Luckily, as I mentioned in Chapter 17, the printing of all objects is implemented in terms of a generic function PRINT-OBJECT, so to change the way word-feature objects are printed, you just need to define a method on PRINT-OBJECT that specializes on word-feature. To make implementing such methods easier, Common Lisp provides the macro PRINT-UNREADABLE-OBJECT.[250]

The basic form of PRINT-UNREADABLE-OBJECT is as follows:

(print-unreadable-object (object stream-variable &key type identity)
body-form*)

The object argument is an expression that evaluates to the object to be printed. Within the body of PRINT-UNREADABLE-OBJECT, stream-variable is bound to a stream to which you can print anything you want. Whatever you print to that stream will be output by PRINT-UNREADABLE-OBJECT and enclosed in the standard syntax for unreadable objects, #<>.[251]

PRINT-UNREADABLE-OBJECT also lets you include the type of the object and an indication of the object's identity via the keyword parameters type and identity. If they're non-NIL, the output will start with the name of the object's class and end with an indication of the object's identity similar to what's printed by the default PRINT-OBJECT method for STANDARD-OBJECTs. For word-feature, you probably want to define a PRINT-OBJECT method that includes the type but not the identity along with the values of the word, ham-count, and spam-count slots. Such a method would look like this:

(defmethod print-object ((object word-feature) stream)
(print-unreadable-object (object stream :type t)
(with-slots (word ham-count spam-count) object
(format stream "~s :hams ~d :spams ~d" word ham-count spam-count))))

Now when you test extract-features at the REPL, you can see more clearly what features are being extracted.

SPAM> (extract-features "foo bar baz foo bar")
(#<WORD-FEATURE "baz" :hams 0 :spams 0>
#<WORD-FEATURE "foo" :hams 0 :spams 0>
#<WORD-FEATURE "bar" :hams 0 :spams 0>)

Оглавление книги


Генерация: 1.041. Запросов К БД/Cache: 3 / 1
поделиться
Вверх Вниз