Not All Documents Are Created Equal

You never know what you're going to get when your word processor makes a document -- even with a "TXT" or "RTF" extension. If it contains special or accented characters, it may not be suitable for closed captioning.

The reason is that the document is nothing more than a bunch of numbers. Each number represents a different character. 65 117 116 111 67 97 112 116 105 111 110, for example, displays "AutoCaption" by calling for the 65th, 117th, 116th and so forth characters in whatever font my computer's using.
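As a minimal sketch of the idea, the Python snippet below turns a string into its character numbers and back again. It uses only built-in functions; the variable names are just for illustration.

```python
# Turn text into the numbers that represent it, then back again.
text = "AutoCaption"

# ord() gives the number behind each character.
codes = [ord(ch) for ch in text]
print(codes)  # [65, 117, 116, 111, 67, 97, 112, 116, 105, 111, 110]

# chr() goes the other way, from number to character.
restored = "".join(chr(n) for n in codes)
print(restored)  # AutoCaption
```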

But what if the characters in the font I'm using aren't in the same order as the font you're using? You might see "!Kzpb>ktyPp%" instead.

Well, it's pretty obvious that the basic alphabet has to have a consistent numbering scheme. But consistency takes a back seat when it comes to special characters like &, £, ¥, á, è and ö. Each individual font foundry is free to decide which special characters to include and how to number them. So not all characters are in every font and the numbering scheme is free to vary.
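To see how one number can mean different things, here is a hedged Python sketch that decodes a single byte under two character sets. The pairing is only an example chosen for illustration: `cp437` is Python's name for the original IBM PC character set and `latin-1` is its name for ISO-8859-1.

```python
# One byte value, two different characters depending on the character set.
raw = bytes([0xE9])  # decimal 233

as_latin1 = raw.decode("latin-1")  # ISO-8859-1
as_cp437 = raw.decode("cp437")     # original IBM PC character set

print(as_latin1)  # é
print(as_cp437)   # Θ

# The same letter also gets a different number in each set.
print("é".encode("latin-1"))  # b'\xe9'
print("é".encode("cp437"))    # b'\x82'
```

The basic alphabet decodes identically in both sets; only the values above 127 disagree.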

Clearly some sort of standardization was necessary for at least the basic alphabet. Otherwise communication between terminals and data processing equipment made by different vendors would produce gibberish.

In 1963 what is now the American National Standards Institute (ANSI) started trying to bring order to at least the basic North American character set. After five years of wrangling, the 128-character American Standard Code for Information Interchange (ASCII) was adopted in 1968.

Here's a chart of the characters in the ASCII set (position your pointer on the character to get the decimal and hexadecimal (hex) value). The green squares are characters in the closed caption character set later promulgated by the United States Federal Communications Commission (FCC) in the early 1990s:

Current communication standards like ISO-8859-1, ANSI-X3.4-1986(R1997) and even Unicode (ISO/IEC 10646) incorporate and expand on the original ASCII effort, attributed largely to Robert W. Bemer.

Unfortunately many "standards" that expand the original 128 characters still call their version "ASCII."

Documents that claim to be ASCII can contain numbers for characters like "sexed" quotes (“ and ”), the curly apostrophe (’) or the ellipsis character (…) that combines three periods into a single character.
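One rough way to test a file that claims to be ASCII is to look for any byte above 127, since true 7-bit ASCII never uses those values. The sketch below assumes the file's contents are already in memory as bytes; `non_ascii_bytes` and the sample data are hypothetical names made up for this example.

```python
def non_ascii_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, value) pairs for every byte outside 0-127."""
    return [(i, b) for i, b in enumerate(data) if b > 127]

# Hypothetical sample: "It's" with a curly apostrophe, saved as Windows-1252.
sample = "It\u2019s".encode("cp1252")
print(non_ascii_bytes(sample))   # [(2, 146)]

# The same word typed with a plain apostrophe is clean.
print(non_ascii_bytes(b"It's"))  # []
```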

These characters are typically inserted by word processors where an AutoCorrect feature is enabled.

To make matters worse, none of these characters are legal closed captioning characters. Closed caption decoders render illegal characters as a rectangle.
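A possible cleanup step, sketched in Python, is to swap these characters for their plain equivalents before captioning. The mapping below covers only the characters named above and is an assumption about what a reasonable substitute looks like, not a description of how AutoCaption handles them.

```python
# Map word-processor punctuation to plain ASCII equivalents.
PLAIN = str.maketrans({
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / curly apostrophe
    "\u2026": "...",  # ellipsis character
})

def plain_punctuation(text: str) -> str:
    """Replace curly quotes and the ellipsis character with ASCII."""
    return text.translate(PLAIN)

print(plain_punctuation("\u201cWait\u2026 it\u2019s late,\u201d she said."))
# "Wait... it's late," she said.
```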

The same caveats apply to accented characters. In a Rich Text document, accented characters are probably stored in Unicode, with a non-Unicode alternative in case the rendering font can't handle Unicode. Unfortunately, these characters are never part of 7-bit ANSI ASCII; most of the time they come from one version or another of "enhanced" or "extended" ASCII. The result is that the accented characters may either disappear or get distorted when a Rich Text document is converted.

The most common ASCII extension is the "Latin" character set used by the original IBM PC (Technical Reference, revised edition, 1984, IBM Corporation publication number 6322507, §7-12 Characters, Keystrokes, and Colors). It includes the closed captioning characters shown in green squares (with the exception of the quarter note, trademark symbol and registered trademark symbol that AutoCaption adds at the positions shown in pink):

We've had some luck converting Rich Text documents to Latin 9 (ISO) plain text (*.txt) in Microsoft® Word.™ But one should always check carefully when working with accented or special characters.
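Outside of Word, the same kind of conversion and check can be sketched in Python. This is not what Word does internally; it is an illustration that assumes the text is already loaded as a string, and `to_latin9` is a hypothetical helper name. Python's codec name for Latin 9 is `iso-8859-15`.

```python
def to_latin9(text: str) -> tuple[bytes, list[str]]:
    """Encode text as ISO-8859-15 (Latin 9).

    Returns the encoded bytes plus a list of the characters that have no
    Latin 9 equivalent; those are replaced with "?" in the output.
    """
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-15")
        except UnicodeEncodeError:
            missing.append(ch)
    return text.encode("iso-8859-15", errors="replace"), missing

data, missing = to_latin9("caf\u00e9 \u20ac5 \u201cspecial\u201d")
print(data)     # b'caf\xe9 \xa45 ?special?'
print(missing)  # ['“', '”']
```

Listing the characters that failed to convert makes the "check carefully" step concrete: anything in `missing` needs a manual substitute.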

Note: When closed captioning, quite a few foreign language characters in the original IBM® PC character set are not in the character set authorized by the FCC. The missing characters include ì, ò, ù, ¡, €, Ç, all umlauted characters, and most of the accented upper case characters except the Ñ (upper case "N" with a tilde accent).
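A script can flag such characters before they reach a decoder. The sketch below shows only the checking technique: `ILLUSTRATIVE_ALLOWED` is a placeholder (printable ASCII plus Ñ), not the FCC's table, and would need to be replaced with the real caption character set. The function name is likewise hypothetical.

```python
import string

# PLACEHOLDER ONLY: printable ASCII plus Ñ. This is not the FCC table;
# substitute the real caption character set before relying on the result.
ILLUSTRATIVE_ALLOWED = set(
    string.ascii_letters + string.digits + string.punctuation + " \n"
) | {"\u00d1"}

def unsupported_characters(text, allowed=ILLUSTRATIVE_ALLOWED):
    """Return (position, character) pairs for characters not in `allowed`."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in allowed]

print(unsupported_characters("MA\u00d1ANA \u00fcber"))  # [(7, 'ü')]
```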