"UTF" stands for Unicode Transformation Format. The "8" says it expresses characters in units of 8 bits (bytes), using one or more bytes for each character. UTF-16, on the other hand, expresses characters in units of 16 bits.
OK. That information isn't very helpful for captioning. What's important is to realize that UNICODE characters can be expressed in more than one way. So just knowing that your DVD authoring system wants UNICODE text is only part of the story. You also need to know if the UNICODE needs to be expressed in UTF-8 or UTF-16.
One particularly smart thing the UNICODE designers did was to make the first 128 characters of UNICODE the same as ANSI ASCII, so pure ASCII documents are also UNICODE. The 'Latin 1' character set (ISO-8859-1) is a variation of the original IBM PC's 8 bit character set. When graphics cards became popular, it wasn't necessary to reserve characters for drawing lines and boxes around text. So this variant replaces some of the drawing characters with characters unique to a greater variety of languages. Later, these extra 96 characters retained their character numbers in UNICODE. In short, ASCII is a subset of ISO-8859-1 which, in turn, is a subset of UNICODE.
The focus of the UNICODE designers was on assigning a unique number to every printable character in every language; they were less interested in telling people how that number must be expressed. Although UNICODE provides a number of ways to express characters above hexadecimal 7F (127 in decimal), none are quite the same as ISO-8859-1.
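The subset relationship is easy to check for yourself. Here is a minimal sketch in Python (our choice of language for illustration only; nothing in your captioning or DVD authoring tools requires it). It assumes Python's built-in "latin-1" codec as a stand-in for ISO-8859-1.

```python
# Every Latin 1 byte value decodes to the UNICODE character with the same number.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, latin1_bytes))

# Pure ASCII text is byte-for-byte identical in ASCII, Latin 1 and UTF-8.
text = "Hello, captions!"
ascii_identical = (
    text.encode("ascii") == text.encode("latin-1") == text.encode("utf-8")
)

print(same_numbers, ascii_identical)
```

Both checks should come out true, which is why plain English transcripts rarely cause trouble and accented characters are where the problems start.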
Further reference material on how characters are expressed can be found in Roman Czyborra's excellent The ISO 8859 Alphabet Soup (go to the author's site or an archived copy at Linköpings University in Sweden).
8 bits can express only 256 different values (0 through 255). But UNICODE has a lot more than 256 characters. So more than one system evolved to express the "extra" UNICODE characters.
UTF-8 is one such system. It uses a variable number of bytes (from 1 to 4 according to RFC3629) to express each character. It is a good solution for 8 bit microprocessors used in industrial equipment for example.
But many applications evolved using only 7 bits. People didn't like to wait for things like modems, so clever engineers figured that they could speed up communications by not transmitting one of the 8 bits in each character. After all, they reasoned, the basic alphabet for each western language fits in 128 characters, the most that 7 bits can express. And surely the sender and receiver will speak the same language.
The ANSI ASCII standard, and expressing characters in 7 bits, evolved from this need for efficient standardized communications. Everything would work fine in Western languages as long as the sender and receiver agreed on the language for the transmission.
By the time UNICODE got off the ground, using 7 bits to express each character was common. By then it was also pretty unnecessary to conserve one crummy bit. Hard drives were bigger, RAM was cheap and 300 baud modems were used only for door stops. So the UTF-8 folks went back to expressing characters with 8 bits and cleverly put the 8th bit to work.
In characters expressed according to UTF-8, the most significant bit (MSB) of each byte will be 0 for single byte characters. If the MSB is 1, it signifies this byte is part of a multibyte character. The idea is to signal that more than one byte will be used to express a single character.
Unfortunately, a file by itself does not say which scheme it uses. At a glance, UTF-8 text is hard to tell apart from 8-bit expressions such as ANSI ASCII (e.g. Latin1), in which all characters are 8 bits and all characters beyond 127 have the high bit set.
So somehow the captioner has to know if characters in a document are expressed according to UTF-8 or ANSI ASCII or some other scheme. Unfortunately, there is no surefire way to tell, but here's what to look for:
- Check accented characters. Are they rendered correctly? If they are missing or wrong, you probably have UTF-8 when ANSI ASCII was expected.
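To see why the two schemes get confused, here is a small sketch in Python (used here only for illustration) that takes one word and reads its bytes the wrong way in each direction. The word "café" is just a sample we picked.

```python
word = "café"

utf8_bytes = word.encode("utf-8")      # the accented letter becomes two bytes
latin1_bytes = word.encode("latin-1")  # the accented letter stays one byte

# UTF-8 bytes read as Latin 1: each byte becomes its own (wrong) character.
misread_as_latin1 = utf8_bytes.decode("latin-1")

# Latin 1 bytes read as UTF-8: the lone high-bit byte is not a valid
# multibyte sequence, so we ask Python to substitute a replacement mark.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(len(utf8_bytes), len(latin1_bytes))
print(misread_as_latin1)
print(misread_as_utf8)
```

The first misreading produces the familiar two-character garble in place of the accented letter; the second produces a missing or placeholder character.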
- If you have smile faces and goofy blotchy characters, you probably have UTF-8 when ANSI ASCII was expected.
- If all you have in an entire document is one or two characters, the odds are you have UTF-16 when UTF-8 or ANSI ASCII was expected. The reason is that UTF-16 expresses every character with at least two bytes, one of which is often a null (zero) in Western languages. The null is often considered the end of a document by UTF-8 and ANSI ASCII processors.
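The checks above can be roughly automated. The following is only a sketch in Python under our own assumptions: `guess_encoding` is a hypothetical helper name, the "one quarter nulls" threshold is an arbitrary choice, and the result is a best guess, not a guarantee. It looks for a byte order mark first, then for the tell-tale nulls of UTF-16, then tries to read the bytes as UTF-8, and otherwise falls back to Latin 1.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: return a best guess at how the text is expressed."""
    # A byte order mark at the very start is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # Lots of null bytes suggest UTF-16 text in a Western language.
    nulls = data.count(0)
    if nulls and nulls >= len(data) // 4:
        even_nulls = data[0::2].count(0)  # nulls in first byte of each pair
        odd_nulls = data[1::2].count(0)   # nulls in second byte of each pair
        return "utf-16-be" if even_nulls > odd_nulls else "utf-16-le"

    # Valid UTF-8 (which includes pure ASCII) decodes without error.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything else is treated as Latin 1.
        return "latin-1"


print(guess_encoding("café".encode("utf-8")))
print(guess_encoding("café".encode("latin-1")))
print(guess_encoding("café".encode("utf-16-le")))
```

A short Latin 1 file can still happen to be valid UTF-8, so always check the accented characters by eye as well.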
Use your word processor.
Open the transcript or DVD subtitling asset in an ordinary word processor and use the Save As feature to convert between ANSI ASCII variations, UTF-8, and UTF-16.
Rich Text Format (RTF) UNICODE documents always start by saying how the characters are expressed, so there should be no problem importing them into AutoCaption.
Problems occur when a DVD authoring package wants old style UTF-8 multibyte assets and AutoCaption has generated UTF-16. In that case, simply open the UTF-16 file in your word processor and use Save As to save it in UTF-8.
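If you would rather not go through a word processor, the same conversion can be scripted. This is a minimal sketch in Python, not a feature of AutoCaption; the function names and the file names in the usage comment are hypothetical. It assumes the UTF-16 file starts with a byte order mark (without one, Python assumes the byte order of the machine it is running on).

```python
from pathlib import Path


def utf16_bytes_to_utf8(data: bytes) -> bytes:
    """Re-express UTF-16 bytes as UTF-8 bytes; the characters are unchanged."""
    text = data.decode("utf-16")  # uses the byte order mark if present
    return text.encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Read a UTF-16 file and write a UTF-8 copy, keeping line endings as-is."""
    Path(dst).write_bytes(utf16_bytes_to_utf8(Path(src).read_bytes()))


# Usage (hypothetical file names):
# convert_file("subtitles_utf16.txt", "subtitles_utf8.txt")
```

Working on raw bytes rather than opening the files in text mode keeps the original line endings untouched.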
The number of leading 1 bits in the first byte of a multi-byte sequence is equal to the total number of bytes. Each of the follow-on bytes will have the first bit set to 1 and the second to zero. All remaining bits (shown as 'x' below) are used to represent the character number.
1 byte character 0xxxxxxx
2 byte character 110xxxxx 10xxxxxx
3 byte character 1110xxxx 10xxxxxx 10xxxxxx
4 byte character 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
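The bit patterns above translate almost directly into code. Here is a sketch in Python (for illustration only) of an encoder written by hand from the four patterns; `encode_utf8` and `as_bits` are names we made up. It does not reject the reserved surrogate range (D800 to DFFF hexadecimal) as a real encoder would, so treat it as a teaching aid rather than a replacement for the built-in `str.encode`.

```python
def encode_utf8(code_point: int) -> bytes:
    """Express one UNICODE character number as UTF-8 bytes, pattern by pattern."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a UNICODE character number")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


def as_bits(data: bytes) -> str:
    """Show each byte as eight binary digits."""
    return " ".join(f"{b:08b}" for b in data)


for ch in ["A", "é", "€"]:
    encoded = encode_utf8(ord(ch))
    # The hand-written result should match Python's own UTF-8 encoder.
    print(ch, as_bits(encoded), encoded == ch.encode("utf-8"))
```

For "é" (character number E9 hexadecimal) this prints `11000011 10101001`: a first byte starting with 110 to announce two bytes, then a follow-on byte starting with 10.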
UTF-16 encoding is an alternative byte expression of Unicode which for most cases amounts to a fixed-width 16 bit code. ASCII and Latin1 characters (the first 256 characters) are expressed in 8 bits as normal, but with a preceding null (all bits low) byte. Although 16 bit expressions are conceptually simpler than UTF-8 they have two major drawbacks, unless, like AutoCaption, the text handling is designed for UTF-16:
- ASCII text doubles in size when expressed according to UTF-16.
- Normally null bytes are reserved to signal the end of a document or line.
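Both drawbacks are easy to demonstrate. The sketch below, in Python and using a sample word of our choosing, compares the same text expressed both ways and then imitates a processor that stops reading at the first null byte.

```python
text = "Caption"

utf8 = text.encode("utf-8")
utf16_be = text.encode("utf-16-be")  # null byte comes first in each pair
utf16_le = text.encode("utf-16-le")  # null byte comes second in each pair

# Drawback 1: ASCII text doubles in size.
print(len(utf8), len(utf16_be))

# Drawback 2: every other byte is a null.
print(utf16_be.hex(" "))

# A processor that treats null as "end of text" sees almost nothing.
seen_be = utf16_be.split(b"\x00", 1)[0]
seen_le = utf16_le.split(b"\x00", 1)[0]
print(seen_be, seen_le)
```

Depending on the byte order, such a processor sees either nothing at all or just the first letter, which matches the "one or two characters" symptom.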
The advantages of using UTF-16 far outweigh the drawbacks:
- UTF-16 does not suck up processor time to figure out how many bytes to cobble together for each letter. And that CPU time is handy for managing video, audio or time code.
- UTF-16 is more robust because an errant bit will not cause bytes to be improperly combined, which could send the processor running off the end of the document into who knows what.
- UTF-16 is the way the UNICODE standards are expressed.
For more information, visit the UTF-8 and Unicode FAQ for Unix/Linux.
The general outline and logic of this user bulletin must be attributed to the excellent discussion at sourceforge.net. We made only minor changes, hopefully making the material more appropriate to people who caption.