Book: Learning GNU Emacs, 3rd Edition
11.3.2.4 Retrieving portions of matches
There is still one piece missing: the ability to replace "program" with "module" while leaving any s or 's untouched. This leads to the final regular expression feature we will cover here: the ability to retrieve portions of the matched string for later use. The preceding regular expression is indeed the correct one to give as the search string for replace-regexp. As for the replace string, the answer is module\\1; in other words, the required Lisp code is:
(replace-regexp "\\<program\\('s\\|s\\)?\\>" "module\\1")
The \\1 means, in effect, "substitute the portion of the matched string that matched the subexpression inside the \\( and \\)." It is one of the few regular-expression-related operators that can be used in replacements (another is \\&, which stands for the entire match). In this case, it means to use 's in the replace string if the match was "program's," s if the match was "programs," or nothing if the match was just "program." The result is the correct substitution of "module" for "program," "modules" for "programs," and "module's" for "program's."
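The substitution can be tried on a plain string without touching a buffer. The following is a minimal sketch that uses the standard function replace-regexp-in-string; the helper name my-program-to-module is hypothetical and exists only for illustration.

```elisp
(defun my-program-to-module (text)
  "Return TEXT with program, programs, and program's renamed to module forms.
Hypothetical helper for illustration; not part of Emacs."
  ;; The final t (FIXEDCASE) asks Emacs not to adjust the case of the
  ;; replacement.  \\1 expands to whatever the \\( ... \\) group matched,
  ;; or to nothing when the optional group did not take part in the match.
  (replace-regexp-in-string "\\<program\\('s\\|s\\)?\\>" "module\\1" text t))

(my-program-to-module "program, programs, program's, programmer")
;; => "module, modules, module's, programmer"
```

Note that "programmer" is left alone, because \\> requires the match to end at a word boundary.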
Another example of this feature solves Example 4. To match filenames <filename>.c and replace them with <filename>.java, use the Lisp code:
(replace-regexp "\\([a-zA-Z0-9_]+\\)\\.c" "\\1.java")
Remember that \\. means a literal dot (.). Note also that the filename pattern (which matches a series of one or more alphanumerics or underscores) was surrounded by \\( and \\) in the search string for the sole purpose of retrieving it later with \\1.
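In Lisp programs, the Emacs documentation suggests a loop around re-search-forward and replace-match rather than calling replace-regexp, which is intended for interactive use. The following sketch applies the same search and replace strings that way inside a temporary buffer; the function name my-c-to-java is an assumption made for this example.

```elisp
(defun my-c-to-java (text)
  "Return TEXT with every NAME.c rewritten as NAME.java.
Hypothetical helper for illustration; not part of Emacs."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Search repeatedly; nil means no bound, t means return nil
    ;; instead of signaling an error when nothing more is found.
    (while (re-search-forward "\\([a-zA-Z0-9_]+\\)\\.c" nil t)
      ;; \\1 is the filename saved by the \\( ... \\) group.
      ;; The second argument t (FIXEDCASE) leaves the case alone.
      (replace-match "\\1.java" t))
    (buffer-string)))

(my-c-to-java "cc main.c util_2.c")
;; => "cc main.java util_2.java"
```

Because the pattern does not require anything after the c, a name such as foo.cpp would also match and become foo.javapp; a stricter pattern would be needed if such names can occur.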
Actually, the \\1 operator is only a special case of a more powerful facility (as you may have guessed). In general, if you surround a portion of a regular expression with \\( and \\), the string matching the parenthesized subexpression is saved. When you specify the replace string, you can retrieve the saved substrings with \\n, where n is the number of the parenthesized subexpression from left to right, starting with 1. Parenthesized expressions can be nested; their corresponding \\n numbers are assigned in order of their \\( delimiter from left to right.
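The numbering of nested groups can be checked with string-match and match-string. This is a small sketch; the pattern, the sample text, and the helper name my-nested-groups are made up for illustration.

```elisp
(defun my-nested-groups (text)
  "Return the substrings saved by groups 1, 2, and 3 for a range in TEXT.
Hypothetical helper for illustration; not part of Emacs."
  ;; Group 1 is the outer \\( ... \\) because its opening delimiter
  ;; comes first; groups 2 and 3 are the two numbers nested inside it.
  (when (string-match "\\(\\([0-9]+\\)-\\([0-9]+\\)\\)" text)
    (list (match-string 1 text)
          (match-string 2 text)
          (match-string 3 text))))

(my-nested-groups "pages 12-34")
;; => ("12-34" "12" "34")
```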
Lisp code that takes full advantage of this feature tends to contain complicated regular expressions. The best example of this in Emacs's own Lisp code is compilation-error-regexp-alist, the list of regular expressions the compile package (discussed in Chapter 9) uses to parse error messages from compilers. Here is an excerpt, adapted from the Emacs source code (it's become much too long to reproduce in its entirety; see below for some hints on how to find the actual file to study in its full glory):
(defvar compilation-error-regexp-alist
'(
;; NOTE! See also grep-regexp-alist, below.
;; 4.3BSD grep, cc, lint pass 1:
;; /usr/src/foo/foo.c(8): warning: w may be used before set
;; or GNU utilities:
;; foo.c:8: error message
;; or HP-UX 7.0 fc:
;; foo.f :16 some horrible error message
;; or GNU utilities with column (GNAT 1.82):
;; foo.adb:2:1: Unit name does not match file name
;; or with column and program name:
;; jade:dbcommon.dsl:133:17:E: missing argument for function call
;;
;; We'll insist that the number be followed by a colon or closing
;; paren, because otherwise this matches just about anything
;; containing a number with spaces around it.
;; We insist on a non-digit in the file name
;; so that we don't mistake the file name for a command name
;; and take the line number as the file name.
("\\([a-zA-Z][-a-zA-Z._0-9]+: ?\\)?\
\\([a-zA-Z]?:?[^:( \t\n]*[^:( \t\n0-9][^:( \t\n]*\\)[:(][ \t]*\\([0-9]+\\)\
\\([) \t]\\|:\\(\\([0-9]+:\\)\\|[0-9]*[^:0-9]\\)\\)" 2 3 6)
;; Microsoft C/C++:
;; keyboard.c(537) : warning C4005: 'min' : macro redefinition
;; d:\tmp\test.c(23) : error C2143: syntax error : missing ';' before 'if'
;; This used to be less selective and allow characters other than
;; parens around the line number, but that caused confusion for
;; GNU-style error messages.
;; This used to reject spaces and dashes in file names,
;; but they are valid now; so I made it more strict about the error
;; message that follows.
("\\(\\([a-zA-Z]:\\)?[^:(\t\n]+\\)(\\([0-9]+\\)) \
: \\(error\\|warning\\) C[0-9]+:" 1 3)
;; Caml compiler:
;; File "foobar.ml", lines 5-8, characters 20-155: blah blah
("^File \"\\([^,\" \n\t]+\\)\", lines? \\([0-9]+\\)[-0-9]*, characters? \
\\([0-9]+\\)" 1 2 3)
;; Cray C compiler error messages
("\\(cc\\| cft\\)-[0-9]+ c\\(c\\|f77\\): ERROR \\([^,\n]+, \\)* File = \
\\([^,\n]+\\), Line = \\([0-9]+\\)" 4 5)
;; Perl -w:
;; syntax error at automake line 922, near "':'"
;; Perl debugging traces
;; store::odrecall('File_A', 'x2') called at store.pm line 90
(".* at \\([^ \n]+\\) line \\([0-9]+\\)[,.\n]" 1 2)
;; See http://ant.apache.org/faq.html
;; Ant Java: works for jikes
("^\\s-*\\[[^]]*\\]\\s-*\\(.+\\):\\([0-9]+\\):\\([0-9]+\\):[0-9]+:[0-9]\
+:" 1 2 3)
;; Ant Java: works for javac
("^\\s-*\\[[^]]*\\]\\s-*\\(.+\\):\\([0-9]+\\):" 1 2)
This is a list of elements that have at least three parts each: a regular expression and two numbers. The regular expression matches error messages in the format used by a particular compiler or tool. The first number tells Emacs which of the matched subexpressions contains the filename in the error message; the second number designates which of the subexpressions contains the line number. (There can also be additional parts at the end: a third number giving the position of the column number of the error, if any, and any number of format strings used to generate the true filename from the piece found in the error message, if needed. For more details about these, look at the actual file, as described below.)
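To make the role of the two numbers concrete, here is a sketch of how one element could be applied to an error message by hand. The function my-parse-with-entry is hypothetical; the compile package does its own, more elaborate parsing. The entry used is the Microsoft C/C++ one, written on a single line here, with the sample message taken from its comment.

```elisp
(defun my-parse-with-entry (entry msg)
  "Apply ENTRY, a list (REGEXP FILE-INDEX LINE-INDEX ...), to MSG.
Return (FILE . LINE), or nil if REGEXP does not match MSG.
Hypothetical helper for illustration; not part of Emacs."
  (let ((regexp (nth 0 entry))
        (file-index (nth 1 entry))
        (line-index (nth 2 entry)))
    (when (string-match regexp msg)
      ;; The two numbers select which saved subexpressions to read.
      (cons (match-string file-index msg)
            (string-to-number (match-string line-index msg))))))

(my-parse-with-entry
 '("\\(\\([a-zA-Z]:\\)?[^:(\t\n]+\\)(\\([0-9]+\\)) : \\(error\\|warning\\) C[0-9]+:" 1 3)
 "keyboard.c(537) : warning C4005: 'min' : macro redefinition")
;; => ("keyboard.c" . 537)
```

The line number is subexpression 3 rather than 2 because the optional drive letter, nested inside the filename group, takes number 2.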
For example, the element in the list dealing with Perl contains the regular expression:
".* at \\([^ \n]+\\) line \\([0-9]+\\)[,.\n]"
followed by 1 and 2, meaning that the first parenthesized subexpression contains the filename and the second contains the line number. So if you have Perl's warnings turned on—you always do, of course—you might get an error message such as this:
syntax error at monthly_orders.pl line 1822, near "$"
The regular expression ignores everything up to at. Then it finds monthly_orders.pl, the filename, as the match to the first subexpression "[^ \n]+" (one or more nonblank, nonnewline characters), and it finds 1822, the line number, as the match to the second subexpression "[0-9]+" (one or more digits).
For the most part, these regular expressions are documented pretty well in their definitions. Understanding them in depth can still be a challenge, and writing them even more so! Suppose we want to tackle Example 5 by adding an element to this list for our new C++ compiler that prints error messages in German. In particular, it prints error messages like this:
Fehler auf Zeile linenum in filename: text of error message
Here is the element we would add to compilation-error-regexp-alist:
("Fehler auf Zeile \\([0-9]+\\) in \\([^: \t]+\\):" 2 1)
In this case, the second parenthesized subexpression matches the filename, and the first matches the line number.
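Before installing the new element, it can be worth checking the regular expression against a sample message. The sketch below does that with string-match; the function name my-parse-german-error and the sample message are invented for this example.

```elisp
(defun my-parse-german-error (msg)
  "Return (FILE . LINE) parsed from the German error message MSG, or nil.
Hypothetical helper for illustration; not part of Emacs."
  (when (string-match "Fehler auf Zeile \\([0-9]+\\) in \\([^: \t]+\\):" msg)
    ;; Subexpression 2 is the filename and subexpression 1 is the
    ;; line number, matching the 2 1 in the list element.
    (cons (match-string 2 msg)
          (string-to-number (match-string 1 msg)))))

(my-parse-german-error "Fehler auf Zeile 17 in orders.cc: Semikolon fehlt")
;; => ("orders.cc" . 17)
```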
To add this to compilation-error-regexp-alist, we need to put this line in .emacs:
(setq compilation-error-regexp-alist
      (cons '("Fehler auf Zeile \\([0-9]+\\) in \\([^: \t]+\\):" 2 1)
            compilation-error-regexp-alist))
Notice how this example resembles our example (from Chapter 9) of adding support for a new language mode to auto-mode-alist.
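One practical caveat: compilation-error-regexp-alist is defined by the compile package, which may not yet be loaded when the init file runs, in which case referring to the variable directly signals an error. A possible variant for Emacs 24.4 and later, sketched here under that assumption, defers the change until the package is loaded. The variable name my-german-error-entry is hypothetical.

```elisp
(defvar my-german-error-entry
  '("Fehler auf Zeile \\([0-9]+\\) in \\([^: \t]+\\):" 2 1)
  "Hypothetical name for the error-message element described in the text.")

;; Run the body once compile.el has been loaded (immediately if it
;; already is).  add-to-list skips the addition if an equal element
;; is already present, so reloading the init file is harmless.
(with-eval-after-load 'compile
  (add-to-list 'compilation-error-regexp-alist my-german-error-entry))
```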