UTF-7 vs. UTF-8

By G.D. Palmer

UTF-7 and UTF-8 are both Unicode Transformation Formats, encodings that convert Unicode characters such as international letters and special symbols into a form that can be transmitted through 7-bit or 8-bit systems. UTF-8 is the most commonly used encoding, popular in Web pages and many email programs. UTF-7 was designed for email protocols that cannot carry 8-bit data and therefore cannot carry raw UTF-8.

Unicode

Unicode is an international standard for representing characters as integers, called code points. Code points range from 0 to 0x10FFFF, which allows for more than 1.1 million values, versus the 128 characters that 7-bit ASCII, the American Standard Code for Information Interchange, can represent. (Early versions of Unicode were limited to 16 bits, or about 65,000 characters.) This wider range makes Unicode more appropriate for East Asian languages and others with large character sets, but Unicode characters must be encoded if they are to be transmitted via 7-bit or 8-bit channels.
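To make the idea of characters as integers concrete, here is a minimal Python sketch. The helper name code_point_info is hypothetical and chosen only for this illustration; ord() and int.bit_length() are standard Python built-ins.

```python
def code_point_info(ch):
    """Return (integer value, fits in 7-bit ASCII?, bits needed) for one character.

    Hypothetical helper for illustration only.
    """
    cp = ord(ch)                      # the character's Unicode integer value
    return cp, cp < 128, cp.bit_length()


for ch in ["A", "é", "中", "😀"]:
    value, is_ascii, bits = code_point_info(ch)
    print(f"U+{value:04X}  ascii={is_ascii}  bits={bits}")
```

Only the first character fits in ASCII's seven bits, and the last needs more than 16 bits, which is why such text has to be encoded before it can travel over a 7-bit or 8-bit channel.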

UTF-8

UTF-8 is the most common Unicode Transformation Format, used to convert Unicode characters into 8-bit units for transmission over the Web, via email or through other 8-bit channels. This encoding changes each Unicode character into one to four octets, depending on the integer value of the character, and it is very efficient for documents that primarily use characters also found in the ASCII character set, since each of those takes a single octet. For non-Western alphabets, UTF-8 tends to take up more space than a legacy single-byte encoding: Greek and Cyrillic letters need two octets each, and most East Asian characters need three.
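The variable length of UTF-8 can be checked with a short Python sketch. The function name utf8_length is hypothetical; the Greek comparison assumes ISO 8859-7 as the single-byte encoding, which is just one possible choice.

```python
def utf8_length(ch):
    """Number of octets UTF-8 uses for a single character (illustrative helper)."""
    return len(ch.encode("utf-8"))


# One sample character from each length class: ASCII, Latin-1 range,
# East Asian, and a character beyond 16 bits.
for ch in ["A", "é", "中", "😀"]:
    print(ch, utf8_length(ch))

# Greek text: a single-byte encoding versus UTF-8.
greek = "αβγ"
print(len(greek.encode("iso-8859-7")), len(greek.encode("utf-8")))
```

The sample characters take one, two, three and four octets respectively, and the three Greek letters take three bytes in ISO 8859-7 but six in UTF-8.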

UTF-7

UTF-7 is a special variant of the Unicode Transformation Format first proposed in the mid-1990s. It was designed to represent Unicode text as a string of ASCII characters, producing a more efficient encoding for email than UTF-8 combined with the quoted-printable encoding needed to transmit it over a 7-bit data path. For text with many non-ASCII characters, UTF-7 output is significantly smaller than the UTF-8 plus quoted-printable equivalent.
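The size difference over a 7-bit path can be estimated with Python's standard library. This is a rough sketch, not a mail implementation: the function name seven_bit_sizes and the sample string are assumptions made for illustration, while quopri.encodestring and the "utf-7" codec ship with Python.

```python
import quopri


def seven_bit_sizes(text):
    """Compare byte counts for two 7-bit-safe representations of the same text.

    Hypothetical helper for illustration only.
    """
    utf8 = text.encode("utf-8")              # raw UTF-8, needs an 8-bit channel
    utf8_qp = quopri.encodestring(utf8)      # UTF-8 wrapped in quoted-printable
    utf7 = text.encode("utf-7")              # UTF-7, already pure ASCII
    return {"utf8": len(utf8), "utf8_qp": len(utf8_qp), "utf7": len(utf7)}


print(seven_bit_sizes("日本語"))
```

For text that is entirely non-ASCII, as in this sample, quoted-printable turns every UTF-8 octet into three ASCII characters, so the UTF-7 form comes out much shorter. For text that is mostly ASCII, the gap narrows.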

Considerations

Although UTF-7 is more efficient over 7-bit channels than UTF-8 plus quoted-printable, most authorities, including the Internet Mail Consortium and the Microsoft Developer Network, recommend using UTF-8 over UTF-7 whenever possible. This is because UTF-7 creates security and robustness issues not present in its 8-bit relative. The IMC also recommends that all mail-displaying programs created after January 1, 1999, be capable of displaying mail in UTF-8.
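One commonly cited example of such a security issue is that UTF-7 data consists only of ASCII bytes, so markup can hide from a filter that scans the raw bytes and reappear once a program decodes them as UTF-7. The sketch below is illustrative only; the variable names are assumptions, and it does not model any particular mail client or browser.

```python
# A UTF-7 byte string that contains no literal angle brackets.
payload = b"+ADw-script+AD4-"

# A naive filter looking for "<" in the raw bytes finds nothing.
raw_has_bracket = b"<" in payload

# After UTF-7 decoding, the hidden characters appear.
decoded = payload.decode("utf-7")

print(raw_has_bracket, decoded)
```

UTF-8 does not have this problem for ASCII characters, because each of them is always represented by the same single byte.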