
Peter Seibel

Book: Practical Common Lisp

String Types

The other kinds of primitive types that are ubiquitous in the ID3 format are strings. In the previous chapter I discussed some of the issues you have to consider when dealing with strings in binary files, such as the difference between character codes and character encodings.

ID3 uses two different character codes, ISO 8859-1 and Unicode. ISO 8859-1, also known as Latin-1, is an eight-bit character code that extends ASCII with characters used by the languages of Western Europe. In other words, the code points from 0-127 map to the same characters in ASCII and ISO 8859-1, but ISO 8859-1 also provides mappings for code points up to 255. Unicode is a character code designed to provide a code point for virtually every character of all the world's languages. Unicode is a superset of ISO 8859-1 in the same way that ISO 8859-1 is a superset of ASCII—the code points from 0-255 map to the same characters in both ISO 8859-1 and Unicode. (Thus, Unicode is also a superset of ASCII.)

Since ISO 8859-1 is an eight-bit character code, it's encoded using one byte per character. For Unicode strings, ID3 uses the UCS-2 encoding with a leading byte order mark. I'll discuss what a byte order mark is in a moment.

Reading and writing these two encodings isn't a problem—it's just a question of reading and writing unsigned integers in various formats, and you just finished writing the code to do that. The trick is how you translate those numeric values to Lisp character objects.

The Lisp implementation you're using probably uses either Unicode or ISO 8859-1 as its internal character code. And since all the values from 0-255 map to the same characters in both ISO 8859-1 and Unicode, you can use Lisp's CODE-CHAR and CHAR-CODE functions to translate those values in both character codes. However, if your Lisp supports only ISO 8859-1, then you'll be able to represent only the first 256 Unicode characters (code points 0-255) as Lisp characters. In other words, in such a Lisp implementation, if you try to process an ID3 tag that uses Unicode strings and if any of those strings contain characters with code points higher than 255, you'll get an error when you try to translate the code point to a Lisp character. For now I'll assume either you're using a Unicode-based Lisp or you won't process any files containing characters outside the ISO 8859-1 range.
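As a minimal sketch of that translation, using only standard Common Lisp: CODE-CHAR may return NIL for a code that has no character, and CHAR-CODE-LIMIT is the upper bound on the codes an implementation can represent. The helper name code-point-to-char is hypothetical and is not part of the library built in this chapter.

```lisp
;; Hypothetical helper: translate a numeric code point to a Lisp character,
;; signaling an error when the implementation cannot represent it.
(defun code-point-to-char (code)
  (or (and (< -1 code char-code-limit) ; range check before calling CODE-CHAR
           (code-char code))
      (error "Character code ~d not supported" code)))

;; Round trip through CHAR-CODE for a standard character.
(char= (code-point-to-char (char-code #\A)) #\A)

;; Whether code points above 255 are available depends on the implementation.
(> char-code-limit 256)
```

In a Lisp limited to ISO 8859-1, the last form would be false, and any code point above 255 would take the error branch.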

The other issue with encoding strings is how to know how many bytes to interpret as character data. ID3 uses two strategies I mentioned in the previous chapter—some strings are terminated with a null character, while other strings occur in positions where you can determine the number of bytes to read, either because the string at that position is always the same length or because the string is at the end of a composite structure whose overall size you know. Note, however, that the number of bytes isn't necessarily the same as the number of characters in the string.
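The two delimiting strategies can be sketched on a plain vector of bytes rather than a stream. This is only an illustration: both function names are hypothetical, the bytes are treated as ISO 8859-1 (so one byte is one character), and the results shown assume an implementation whose character codes agree with ASCII for these values.

```lisp
;; Sketch: two ways of knowing where a string's bytes end.

(defun fixed-length-string (bytes start length)
  "Decode exactly LENGTH bytes of BYTES beginning at START."
  (map 'string #'code-char (subseq bytes start (+ start length))))

(defun null-terminated-string (bytes start)
  "Decode bytes from START up to, but not including, the first zero byte."
  (let ((end (position 0 bytes :start start)))
    (unless end
      (error "No terminator found after position ~d" start))
    (map 'string #'code-char (subseq bytes start end))))

;; "TAG" as three bytes, then "hi" followed by a null terminator.
(defparameter *example-bytes* (vector 84 65 71 104 105 0))

(fixed-length-string *example-bytes* 0 3)   ; => "TAG"
(null-terminated-string *example-bytes* 3)  ; => "hi"
```

With a two-byte encoding such as UCS-2, the same three characters would occupy six bytes plus the byte order mark, which is why byte counts and character counts have to be kept apart.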

Putting all these variations together, the ID3 format uses four ways to read and write strings: two character encodings crossed with two ways of delimiting the string data.

Obviously, much of the logic of reading and writing strings will be quite similar. So, you can start by defining two binary types, one for reading strings of a specific length (in characters) and another for reading terminated strings. Both types take advantage of the fact that the type argument to read-value and write-value is just another piece of data; you can make the type of character to read a parameter of these types. This is a technique you'll use quite a few times in this chapter.

    (define-binary-type generic-string (length character-type)
      (:reader (in)
        (let ((string (make-string length)))
          (dotimes (i length)
            (setf (char string i) (read-value character-type in)))
          string))
      (:writer (out string)
        (dotimes (i length)
          (write-value character-type out (char string i)))))

    (define-binary-type generic-terminated-string (terminator character-type)
      (:reader (in)
        (with-output-to-string (s)
          (loop for char = (read-value character-type in)
                until (char= char terminator) do (write-char char s))))
      (:writer (out string)
        (loop for char across string
              do (write-value character-type out char)
              finally (write-value character-type out terminator))))

With these types available, there's not much to reading ISO 8859-1 strings. Because the character-type argument you pass to read-value and write-value of a generic-string must be the name of a binary type, you need to define an iso-8859-1-char binary type. This also gives you a good place to put a bit of sanity checking on the code points of characters you read and write.

    (define-binary-type iso-8859-1-char ()
      (:reader (in)
        (let ((code (read-byte in)))
          (or (code-char code)
              (error "Character code ~d not supported" code))))
      (:writer (out char)
        (let ((code (char-code char)))
          (if (<= 0 code #xff)
              (write-byte code out)
              (error "Illegal character for iso-8859-1 encoding: character: ~c with code: ~d" char code)))))

Now defining the ISO 8859-1 string types is trivial using the short form of define-binary-type as follows:

    (define-binary-type iso-8859-1-string (length)
      (generic-string :length length :character-type 'iso-8859-1-char))

    (define-binary-type iso-8859-1-terminated-string (terminator)
      (generic-terminated-string :terminator terminator
                                 :character-type 'iso-8859-1-char))

Reading UCS-2 strings is only slightly more complex. The complexity arises because you can encode a UCS-2 code point in two ways: most significant byte first (big-endian) or least significant byte first (little-endian). UCS-2 strings therefore start with two extra bytes, called the byte order mark, made up of the numeric value #xfeff encoded in either big-endian form or little-endian form. When reading a UCS-2 string, you read the byte order mark and then, depending on its value, read either big-endian or little-endian characters. Thus, you'll need two different UCS-2 character types. But you need only one version of the sanity-checking code, so you can define a parameterized binary type like this:

    (define-binary-type ucs-2-char (swap)
      (:reader (in)
        (let ((code (read-value 'u2 in)))
          (when swap (setf code (swap-bytes code)))
          (or (code-char code) (error "Character code ~d not supported" code))))
      (:writer (out char)
        (let ((code (char-code char)))
          (unless (<= 0 code #xffff)
            (error "Illegal character for ucs-2 encoding: ~c with char-code: ~d" char code))
          (when swap (setf code (swap-bytes code)))
          (write-value 'u2 out code))))

where the swap-bytes function can be defined as follows, taking advantage of LDB being SETFable and thus ROTATEFable:

    (defun swap-bytes (code)
      (assert (<= code #xffff))
      (rotatef (ldb (byte 8 0) code) (ldb (byte 8 8) code))
      code)
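To see what the swap does to concrete values, here is a hedged, non-mutating alternative that builds the result from the two 8-bit fields instead of rotating them in place. The name swap-bytes-by-shifting is hypothetical; it is meant only to illustrate the byte exchange, not to replace the definition above.

```lisp
;; Hypothetical alternative: compute the swapped 16-bit value directly.
(defun swap-bytes-by-shifting (code)
  (assert (<= 0 code #xffff))
  (logior (ash (ldb (byte 8 0) code) 8)   ; low byte moves to the high position
          (ldb (byte 8 8) code)))         ; high byte moves to the low position

(swap-bytes-by-shifting #xfeff) ; => 65534, which is #xfffe
(swap-bytes-by-shifting #x0041) ; => 16640, which is #x4100
```

The first result is the reason a byte order mark read in the wrong byte order shows up as #xfffe.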

Using ucs-2-char, you can define two character types that will be used as the character-type arguments to the generic string types.

    (define-binary-type ucs-2-char-big-endian ()
      (ucs-2-char :swap nil))

    (define-binary-type ucs-2-char-little-endian ()
      (ucs-2-char :swap t))

Then you need a function that returns the name of the character type to use based on the value of the byte order mark.

    (defun ucs-2-char-type (byte-order-mark)
      (ecase byte-order-mark
        (#xfeff 'ucs-2-char-big-endian)
        (#xfffe 'ucs-2-char-little-endian)))
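The same decision can be sketched at the level of the two raw bytes at the start of the string. This assumes the mark is combined most significant byte first, which is what the mapping of #xfeff to the big-endian type implies about how u2 reads; the function name and the keyword results are hypothetical.

```lisp
;; Sketch: classify the first two bytes of a UCS-2 string.
(defun byte-order-from-bytes (first-byte second-byte)
  ;; Combine the bytes as a big-endian 16-bit integer (assumed u2 behavior).
  (let ((mark (logior (ash first-byte 8) second-byte)))
    (ecase mark
      (#xfeff :big-endian)       ; bytes arrived as FE FF
      (#xfffe :little-endian)))) ; bytes arrived as FF FE

(byte-order-from-bytes #xfe #xff) ; => :BIG-ENDIAN
(byte-order-from-bytes #xff #xfe) ; => :LITTLE-ENDIAN
```

As with ucs-2-char-type, ECASE signals an error for any other value, so a string that lacks a valid byte order mark is rejected rather than silently misread.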

Now you can define length- and terminator-delimited string types for UCS-2-encoded strings that read the byte order mark and use it to determine which variant of UCS-2 character to pass as the character-type argument to read-value and write-value. The only other wrinkle is that you need to translate the length argument, which is a number of bytes, to the number of characters to read, accounting for the byte order mark.

    (define-binary-type ucs-2-string (length)
      (:reader (in)
        (let ((byte-order-mark (read-value 'u2 in))
              (characters (1- (/ length 2))))
          (read-value
           'generic-string in
           :length characters
           :character-type (ucs-2-char-type byte-order-mark))))
      (:writer (out string)
        (write-value 'u2 out #xfeff)
        (write-value
         'generic-string out string
         :length (length string)
         :character-type (ucs-2-char-type #xfeff))))

    (define-binary-type ucs-2-terminated-string (terminator)
      (:reader (in)
        (let ((byte-order-mark (read-value 'u2 in)))
          (read-value
           'generic-terminated-string in
           :terminator terminator
           :character-type (ucs-2-char-type byte-order-mark))))
      (:writer (out string)
        (write-value 'u2 out #xfeff)
        (write-value
         'generic-terminated-string out string
         :terminator terminator
         :character-type (ucs-2-char-type #xfeff))))
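The length translation in ucs-2-string is easy to get wrong by one, so here is the arithmetic worked on its own. Both helper names are hypothetical; they only restate the (1- (/ length 2)) computation and its inverse, assuming two bytes per character and a two-byte byte order mark.

```lisp
;; Hypothetical helpers relating a UCS-2 field's byte length to the
;; number of characters it holds, counting the byte order mark.
(defun ucs-2-character-count (byte-length)
  (assert (and (evenp byte-length) (>= byte-length 2)))
  (1- (/ byte-length 2)))   ; halve, then drop the byte order mark

(defun ucs-2-byte-length (character-count)
  (* 2 (1+ character-count))) ; one extra 16-bit unit for the mark

(ucs-2-character-count 12) ; => 5
(ucs-2-byte-length 5)      ; => 12
```

A 12-byte field therefore holds the mark plus five characters, and a field of only two bytes holds an empty string.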
