sixbitproxywax

Character Encodings

Some notes about character encodings…

Sources

Most of these notes are based on my reading of Unicode Demystified, which I refer to simply as UN below.

How Do They Work

Relationship Between UTF-8 and UTF-16 and Unicode

Are UTF-8 and UTF-16 just different representations of Unicode? In other words, do they both fully encode Unicode?
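One way to probe this question in Java is to push every Unicode scalar value through each encoding and back. This is only a rough sketch: the class and method names are made up for illustration, and it checks round-tripping through Java's built-in charsets rather than proving anything about the standards themselves. ISO-8859-1 is included as a contrast, since it covers only the first 256 code points.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UtfRoundTrip {
    // Returns true if every Unicode scalar value survives encode + decode in the given charset.
    static boolean roundTripsAll(Charset cs) {
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Surrogate code points (U+D800..U+DFFF) are not characters on their own; skip them.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            String original = new String(Character.toChars(cp));
            String decoded = new String(original.getBytes(cs), cs);
            if (!original.equals(decoded)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:      " + roundTripsAll(StandardCharsets.UTF_8));      // true
        System.out.println("UTF-16:     " + roundTripsAll(StandardCharsets.UTF_16));     // true
        System.out.println("ISO-8859-1: " + roundTripsAll(StandardCharsets.ISO_8859_1)); // false
    }
}
```

If both UTF lines print true, that is at least consistent with the idea that UTF-8 and UTF-16 are two complete serializations of the same set of code points.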

Relationship Between Unicode and Other Encodings

Are there any characters or character sets which cannot be encoded with Unicode?

In UN, it is stated that using the 16-bit units alone, you can represent “all the characters representable with virtually all of the other character encoding methods that are in reasonably widespread use.”

Later it states that Unicode is “a superset of the other character encoding systems,” and that this means it can be used as a lingua franca for document manipulation.

It is not yet clear what the definition of “reasonably widespread use” is, though.
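To make “the 16-bit units alone” concrete: in UTF-16, a code point up to U+FFFF fits in one 16-bit unit, while anything above that needs two units (a surrogate pair). The sketch below uses Java's `Character` helpers to show this. The class name, method names, and sample code points are chosen only for illustration.

```java
class SixteenBitUnits {
    // Number of 16-bit UTF-16 code units needed for one code point (1 or 2).
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Hex dump of the UTF-16 code units for one code point.
    static String unitsHex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x4E2D, 0x1F600}; // 'A', a CJK ideograph, an emoji
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d unit(s): %s%n", cp, unitsFor(cp), unitsHex(cp));
        }
        // U+0041 -> 1 unit(s): 0041
        // U+4E2D -> 1 unit(s): 4E2D
        // U+1F600 -> 2 unit(s): D83D DE00
    }
}
```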

According To The Spec

This page seems to largely answer my questions about which character sets cannot be encoded with Unicode. It seems to imply that Unicode truly is a superset of the character sets for most languages one would realistically need to work with in most modern applications.

Many To One Glyphs

However, there seems to be some disagreement about which characters are distinct in various character sets. In other words, there may be some characters that many consider distinct but which are encoded identically in Unicode. What is not clear, though, is whether such claims are about characters in general or about characters in other character sets. That is, are these glyphs distinguished outside of data processing but not treated as distinct in any character encoding, or are there characters that are distinct in some commonly used encodings but not in Unicode?

Other Concerns

I am also wondering if there are multi-byte encodings in which one byte acts as a flag of sorts that determines how the following bytes are interpreted. For example, imagine an encoding where lower case and upper case letters share the same byte values, but there are “shift down” and “shift up” control characters which cause everything from that point until the next shift character to be treated as lower or upper case, respectively. In such a case, someone could encode

shiftdown a shiftdown b shiftdown c

which, once converted to and from Unicode, would most likely be turned into

shiftdown a b c
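The scenario can be simulated with a toy codec. Everything here is hypothetical: the control byte values, the default lower-case starting state, and the rule that the encoder emits a shift only when the case changes are assumptions made for the sake of the example, not features of any real encoding.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class ToyShiftCodec {
    static final byte SHIFT_DOWN = 0x0E; // hypothetical "lower case follows" marker
    static final byte SHIFT_UP = 0x0F;   // hypothetical "upper case follows" marker

    // Decode toy bytes into a String; letters take their case from the latest shift byte.
    static String decode(byte[] in) {
        StringBuilder sb = new StringBuilder();
        boolean upper = false; // assumed initial state: lower case
        for (byte b : in) {
            if (b == SHIFT_DOWN) {
                upper = false;
            } else if (b == SHIFT_UP) {
                upper = true;
            } else {
                char c = (char) (b & 0xFF);
                sb.append(upper ? Character.toUpperCase(c) : Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Encode ASCII text, emitting a shift byte only when the required case changes.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Boolean upper = null; // no shift emitted yet
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                boolean needUpper = Character.isUpperCase(c);
                if (upper == null || upper != needUpper) {
                    out.write(needUpper ? SHIFT_UP : SHIFT_DOWN);
                    upper = needUpper;
                }
                out.write(Character.toLowerCase(c));
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = {SHIFT_DOWN, 'a', SHIFT_DOWN, 'b', SHIFT_DOWN, 'c'};
        String text = decode(original);   // "abc"
        byte[] reencoded = encode(text);  // SHIFT_DOWN a b c
        System.out.println(text);
        System.out.println(Arrays.equals(original, reencoded)); // false
        System.out.println(original.length + " bytes in, " + reencoded.length + " bytes out");
    }
}
```

The text survives the trip, but the redundant shift bytes do not, so the output is not byte-equivalent to the input.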

Embedded Data

Using Unicode as a common format also seems to completely break if we are embedding non-character data in a document or perhaps mixing and matching encodings.
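A small Java experiment illustrates the embedded-data concern. The byte values below are arbitrary stand-ins for non-character data, and the class and method names are invented for this sketch. Java's `String` constructor replaces malformed input with U+FFFD rather than failing, so the original bytes cannot be recovered after decoding as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EmbeddedBytes {
    // True if decoding the bytes as text and re-encoding yields the identical bytes.
    static boolean survivesRoundTrip(byte[] data, Charset cs) {
        String asText = new String(data, cs);
        return Arrays.equals(data, asText.getBytes(cs));
    }

    public static void main(String[] args) {
        // Arbitrary binary data; 0x89, 0xFF and 0xFE are not valid on their own in UTF-8.
        byte[] binary = {(byte) 0x89, 'P', 'N', 'G', (byte) 0xFF, (byte) 0xFE, 0x00};

        System.out.println(survivesRoundTrip(binary, StandardCharsets.UTF_8));      // false
        System.out.println(survivesRoundTrip(binary, StandardCharsets.ISO_8859_1)); // true
    }
}
```

ISO-8859-1 happens to preserve the bytes because every byte value maps to exactly one character, but that says nothing about whether the result is meaningful as text.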

Shift Encodings

What are they?

Conversion

Are conversions from one encoding to another always bijections?
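At least in one direction the answer appears to be no. As a sketch (the class and method names are made up for illustration), encoding text into a charset that lacks some of its characters maps different strings to the same bytes. Java's `String.getBytes` substitutes the charset's replacement byte, which for ISO-8859-1 is `?`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyConversion {
    // Encode text as ISO-8859-1; unmappable characters are replaced by the encoder.
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String euro = "5\u20AC";  // "5" followed by the euro sign, which ISO-8859-1 lacks
        String question = "5?";

        // The charset can report up front whether a character is representable.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode('\u20AC')); // false

        // Two different strings produce identical bytes, so the mapping is not injective.
        System.out.println(Arrays.equals(toLatin1(euro), toLatin1(question))); // true
    }
}
```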

UN says that you can use Unicode as a lingua franca by converting documents to it, manipulating them, then converting back to the original encoding. However, it’s not clear if this gives us byte-equivalent data. For instance, when discussing possible problems with characters in general, UN mentions:

A single glyph may represent more than one character (such a glyph is often called a ligature), such as the ﬁ ligature, a single mark that represents the letters f and i together. Also, a single character might be represented by two or more glyphs: The vowel sound au in the Tamil language (ௌ) is represented by two marks, one that goes to the left of a consonant character, and another on the right; nevertheless, it’s still thought of as a single character.

among other things. Later, it is clarified that:

Unicode, as a rule, doesn’t care about any of these distinctions. It encodes underlying semantic concepts, not visual presentations (characters, not glyphs) and relies on intelligent rendering software (or the user’s choice of fonts) to draw the correct glyphs in the correct places. Unicode does sometimes encode glyphic distinctions, but only when necessary to preserve interoperability with some preexisting standard or to preserve legibility (i.e., if smart rendering software can’t pick the right glyph for a particular character in a particular spot without clues in the encoded text itself).

Which sounds, to me, like: “It’s typically not a problem, but sometimes is.” So, I probably need to figure out what kinds of examples are problematic, and how drastically they might impact outputs.
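The fi ligature is one such case: Unicode encodes it as U+FB01 for compatibility, alongside the plain letters f and i. The sketch below uses `java.text.Normalizer` to show what happens if some processing step applies compatibility normalization. Whether a given conversion pipeline normalizes at all is an assumption, and the class name is invented for this example.

```java
import java.text.Normalizer;

class LigatureNormalization {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01"; // LATIN SMALL LIGATURE FI

        // Canonical normalization leaves the ligature alone.
        System.out.println(nfc(ligature).equals(ligature)); // true

        // Compatibility normalization replaces it with the two letters "fi".
        System.out.println(nfkc(ligature).equals("fi"));    // true
    }
}
```

If a document in a legacy encoding contained the ligature as its own character, a pipeline that applied NFKC along the way would hand back two letters instead of one, which is the kind of change that would break a round trip.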

Also, I don’t think the discussion above is really considering byte-level equivalence at all, so that may be even more problematic.
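Byte-level differences can show up even when the text itself is unchanged. As a sketch, take the letter "a" stored as little-endian UTF-16 with a byte-order mark. Java's `UTF_16` charset honors the mark when decoding but always writes big-endian with a big-endian mark when encoding, so the round trip produces different bytes for the same text. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteLevelRoundTrip {
    // Decode bytes as UTF-16, then encode the resulting text as UTF-16 again.
    static byte[] throughString(byte[] utf16Bytes) {
        String text = new String(utf16Bytes, StandardCharsets.UTF_16);
        return text.getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // "a" as little-endian UTF-16 with a byte-order mark: FF FE 61 00
        byte[] original = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00};
        byte[] roundTripped = throughString(original); // FE FF 00 61

        String before = new String(original, StandardCharsets.UTF_16);
        String after = new String(roundTripped, StandardCharsets.UTF_16);

        System.out.println(before.equals(after));                   // true: same text
        System.out.println(Arrays.equals(original, roundTripped));  // false: different bytes
    }
}
```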

Java Examples

Misc

Something About Internationalization?