Learning GNU Emacs, 3rd Edition: Basic operators

Regular expressions began as an idea in theoretical computer science, but they have found their way into many nooks and crannies of everyday, practical computing. The syntax used to represent them may vary, but the concepts are much the same everywhere. You probably already know a subset of regular expression notation: the wildcard characters used by the Unix shell or Windows command prompt to match filenames. The Emacs notation is a bit different; it is similar to the notations used by the language Perl, by editors like ed and vi, and by Unix software tools like lex and grep. So let's start with the Emacs regular expression operators that resemble Unix shell wildcard characters, which are listed in Table 11-5.

Table 11-5. Basic regular expression operators

Emacs operator   Shell equivalent   Function
.                ?                  Matches any character.
.*               *                  Matches any string.
[abc]            [abc]              Matches a, b, or c.
[a-z]            [a-z]              Matches any lowercase letter.

For example, to match all filenames beginning with program in the Unix shell, you would specify program*. In Emacs, you would say program.*. To match all filenames beginning with a through e in the shell, you would use [a-e]* or [abcde]*; in Emacs, it's [a-e].* or [abcde].*. In other words, the dash within the brackets specifies a range of characters.[78] We will provide more on ranges and bracketed character sets shortly.
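The shell-to-Emacs translations above can be tried out in Emacs Lisp. The following is a minimal sketch, not part of the original text: my-whole-match-p is a hypothetical helper, and it relies on two things the text has not introduced, the \` and \' anchors (start and end of string) and the case-fold-search variable, which is bound to nil so that [a-e] does not also match uppercase letters.

```elisp
;; Hypothetical helper (not from the book): non-nil when REGEXP
;; matches all of STRING, the way a shell wildcard must match a
;; whole filename.
(defun my-whole-match-p (regexp string)
  "Return t if REGEXP matches the whole of STRING, else nil."
  (let ((case-fold-search nil))          ; keep [a-e] from matching A-E
    (and (string-match-p (concat "\\`\\(?:" regexp "\\)\\'") string)
         t)))

(my-whole-match-p "program.*" "program.c")   ; => t    (shell: program*)
(my-whole-match-p "program.*" "program")     ; => t    (.* may match nothing)
(my-whole-match-p "[a-e].*" "emacs.txt")     ; => t    (shell: [a-e]*)
(my-whole-match-p "[a-e].*" "zebra.txt")     ; => nil  (starts with z)
(my-whole-match-p "file." "file1")           ; => t    (shell: file?)
```

You can evaluate each form with C-x C-e in the *scratch* buffer to see the result in the echo area.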

To specify a character that is used as a regular expression operator, you need to precede it with a double backslash, as in \\* to match an asterisk. Why a double backslash? The reason has to do with the way Emacs Lisp reads and decodes strings. When Emacs reads a string in a Lisp program, it decodes the backslash-escaped characters and thus turns double backslashes into single backslashes. If the string is being used as a regular expression (that is, if it is being passed to a function that expects a regular expression argument), that function uses the single backslash as part of the regular expression syntax. For example, given the following line of Lisp:

(replace-regexp "fred\\*" "bob*")

the Lisp interpreter decodes the string fred\\* as fred\* and passes it to the replace-regexp command. The replace-regexp command understands fred\* to mean fred followed by a (literal) asterisk. Notice, however, that the second argument to replace-regexp is not a regular expression, so there is no need to backslash-escape the asterisk in bob* at all. Also notice that if you were to invoke this as a user command, you would not need to double the backslash; that is, you would type M-x replace-regexp Enter followed by fred\* and bob*. Emacs decodes strings read from the minibuffer differently.
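The two layers of decoding can be observed directly. This is a small sketch rather than anything from the original text; it uses the standard functions length, string-match-p, regexp-quote, and replace-regexp-in-string, with made-up sample strings.

```elisp
;; The Lisp reader collapses the doubled backslash, so the string
;; holds six characters: f r e d \ *
(length "fred\\*")                        ; => 6

;; As a regular expression, fred\* means "fred" plus a literal asterisk.
(string-match-p "fred\\*" "fred*")        ; => 0   (match at position 0)
(string-match-p "fred\\*" "fredddd")      ; => nil (no asterisk present)

;; Without the backslash, * is an operator: "fre" plus zero or more d's.
(string-match-p "fred*" "fre")            ; => 0

;; regexp-quote adds the escaping for you.
(regexp-quote "fred*")                    ; => "fred\\*"

;; The replacement text is not a regular expression, so its asterisk
;; needs no escaping.  The final t keeps the replacement's case as is.
(replace-regexp-in-string "fred\\*" "bob*" "hello fred*" t)
                                          ; => "hello bob*"
```

When you build a regular expression from text that should be matched literally, regexp-quote is usually less error-prone than adding the backslashes by hand.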

The * regular expression operator in Emacs (by itself) actually means something different from the * in the Unix shell: it means "zero or more occurrences of whatever is before the *." Thus, because . matches any character, .* means "zero or more occurrences of any character," that is, any string at all, including the empty string. Anything can precede a *: for example, read* matches "rea" followed by zero or more d's; file[0-9]* matches "file" followed by zero or more digits.

Two operators are closely related to *. The first is +, which matches one or more occurrences of whatever precedes it. Thus, read+ matches "read" and "readdddd" but not "rea," and file[0-9]+ requires that there be at least one digit after "file." The second is ?, which matches zero or one occurrence of whatever precedes it (i.e., makes it optional). html? matches "htm" or "html," and file[0-9]? matches "file" followed by one optional digit.
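To see exactly how much text each of the three repetition operators consumes, it helps to look at the matched text rather than just whether a match occurred. The sketch below assumes a hypothetical helper, my-matched-text, built on the standard string-match and match-string functions; the sample strings are invented for illustration.

```elisp
;; Hypothetical helper (not from the book): return the text that the
;; first match of REGEXP covers in STRING, or nil if nothing matches.
(defun my-matched-text (regexp string)
  "Return the first substring of STRING matched by REGEXP, or nil."
  (let ((case-fold-search nil))
    (when (string-match regexp string)
      (match-string 0 string))))

;; * : zero or more of the preceding item
(my-matched-text "read*" "readddd")              ; => "readddd"
(my-matched-text "read*" "rea")                  ; => "rea"
(my-matched-text "file[0-9]*" "file2024.txt")    ; => "file2024"

;; + : one or more of the preceding item
(my-matched-text "read+" "readddd")              ; => "readddd"
(my-matched-text "read+" "rea")                  ; => nil
(my-matched-text "file[0-9]+" "file.txt")        ; => nil

;; ? : zero or one of the preceding item
(my-matched-text "html?" "index.htm")            ; => "htm"
(my-matched-text "html?" "index.html")           ; => "html"
(my-matched-text "file[0-9]?" "file2024.txt")    ; => "file2"
```

Note that each operator applies only to the single item before it (the d, the l, or the bracketed set), not to the whole word.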

Before we move on to other operators, a few more comments about character sets and ranges are in order. First, you can specify more than one range within a single character set. The set [A-Za-z] can thus be used to specify all alphabetic characters; this is better than the nonportable [A-z]. Combining ranges with lists of characters in sets is also possible; for example, [A-Za-z_] means all alphabetic characters plus underscore, that is, all characters that can begin an identifier in C. If you give ^ as the first character in a set, it acts as a "not" operator; the set matches all characters that aren't the characters after the ^. For example, [^A-Za-z] matches all nonalphabetic characters.
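The character sets just described can be exercised with a few short forms. This is an illustrative sketch only: my-strip-non-letters, my-first-word, and my-range-demo are hypothetical names, and the last example assumes that character codes in Emacs follow ASCII order, where several punctuation characters fall between Z and a.

```elisp
;; Hypothetical helpers (not from the book).
(defun my-strip-non-letters (string)
  "Return STRING with every character outside [A-Za-z] removed."
  (replace-regexp-in-string "[^A-Za-z]" "" string t t))

(defun my-first-word (string)
  "Return the first run of letters and underscores in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match "[A-Za-z_]+" string)
      (match-string 0 string))))

(defun my-range-demo ()
  "Compare [A-z] with [A-Za-z] on a caret character."
  (let ((case-fold-search nil))
    (list (string-match-p "[A-z]" "^")        ; caret lies between Z and a
          (string-match-p "[A-Za-z]" "^"))))  ; caret is not a letter

(my-strip-non-letters "Hello, World 42!")   ; => "HelloWorld"
(my-first-word "  my_var = 1")              ; => "my_var"
(my-range-demo)                             ; => (0 nil)
```

The last result shows one practical reason to prefer [A-Za-z]: the range [A-z] quietly takes in punctuation as well as letters.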

A ^ anywhere other than first in a character set has no special meaning; it's just the caret character. Conversely, - has no special meaning if it is given first in the set; the same is true for ]. However, we don't recommend that you use this shortcut; instead, you should double-backslash-escape these characters just to be on the safe side. A double backslash preceding a nonspecial character usually means just that character, but watch it! A few letters and punctuation characters are used as regular expression operators, some of which are covered in the following section. Later, we list the "booby trap" characters that become operators when double-backslash-escaped. The ^ character has a different meaning when used outside of character sets, as we'll see soon.
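The positional rules for ^, -, and ] inside a set can be checked with string-match-p, which returns the position of the match or nil. The function name my-set-position-demo is hypothetical and exists only to collect the results in one list; this is a sketch of the rules as stated above, not a recommendation to rely on them.

```elisp
;; Hypothetical demo function (not from the book).
(defun my-set-position-demo ()
  "Return match results showing how ^, - and ] behave inside a set."
  (let ((case-fold-search nil))
    (list
     (string-match-p "[a^]" "^")    ; ^ not first: an ordinary caret
     (string-match-p "[-a]" "-")    ; - first: an ordinary dash
     (string-match-p "[]a]" "]")    ; ] first: an ordinary bracket
     (string-match-p "[^a]" "^")    ; ^ first: "anything but a"
     (string-match-p "[^a]" "a"))))

(my-set-position-demo)              ; => (0 0 0 0 nil)
```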
