Asshows, base-122 is not recommended to be used on gzip compressed pages, which is the majority of served web pages . Base-122 may however be useful as a general binary-to-text encoding.
External binary resources like images, fonts, audio, etc. can be embedded in HTML through base-64 encoded data URIs . A typical use is to embed small images to avoid HTTP requests and decrease load time.
[/code] Using the embedded data URI avoids the extra HTTP request to fetch "example.png" from the server. This can improve load time in some cases. It has been recommended to use data URIs cautiously. They may help for small images, but could hurt performance otherwise.
Before discussing how base-122 improves upon base-64, we will briefly discuss how base-64 works. The Wikipedia article gives a much more in depth introduction, but we will cover the main points.
§1.2 Base-64 Encoding
Base-64 is one approach to solving the more general problem of binary-to-text encoding . For example, suppose we want to embed the single byte 01101001 in a text file. Since text files may only consist of text characters this byte needs to be represented by text characters. A straightforward way to encode this byte is to map each bit to a text character, say "A" to represent a 0 bit and "B" to represent a 1 bit. For the sake of example, suppose this silly encoding did exist and that single byte 01101001 represented a (very small) image. The corresponding data URI in HTML could look as follows.
[/code] The encoding is certainly easy to decode, but only at the cost of wasted space. Each text character "A" or "B" takes one byte in the HTML file. This means we are encoding one bit of binary data with eight bits of text data, so the data inflation ratio is 8 : 1.
Base-64 encoding is an improvement of our silly encoding. It maps chunks of six bits (representing the numbers 0-63) to one of 64 characters. Each resulting character is one byte, so the inflation ratio is 8 : 6.
[/code] Base-64 works well as a binary-to-text encoding since the characters it produces are standard ASCII characters. But to improve upon base-64, we'll have to investigate how much data we can cram into UTF-8 characters.
§1.3 Text Encodings and UTF-8
Since the majority of web pages are encoded in UTF-8, base-122 exploits the properties of how UTF-8 characters are encoded. Let's first clarify some of the terminology regarding UTF-8 and text-encodings.
Figure 1: Three representations of ε A code point is a number representing (usually) a single character. Unicode is a widely accepted standard describing a large range of code points from the standard multilingual alphabets to more obscure characters like a coffee cup at code point 0x2615 ☕ (often denoted U+2615). See this Unicode table for a reference of code points.
A text encoding on the other hand, is responsible for describing how code points are represented in binary (e.g. in a file). UTF-8 is by far the most common text encoding on the web. It has a variable length encoding and can represent 1,112,064 different code points. Code points representing ASCII characters require only one byte to encode in UTF-8, while higher code points require up to four bytes. Table 1 below summarizes the format of UTF-8 encodings of different code point ranges.
Code point range UTF-8 Format (x reserved for code point) Total Bits Bits for code point Inflation 0x00 - 0x7F 0xxxxxxx 8 7 8 : 7 0x80 - 0x7FF 110xxxxx 10xxxxxx 16 11 16 : 11 0x800 - 0xFFFF 1110xxxx 10xxxxxx 10xxxxxx 24 16 24 : 16 0x10000 - 0x10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 32 21 32 : 21 Table 1: A summary of UTF-8 encoding. x represents bits reserved for code point data Inflation is the ratio of the character bits to the code point bits. The ratio of 1 : 1 effectively means no waste. The inflation ratio worsens as the number of bytes increases since less bits are reserved for the code point.
The encoding of a UTF-8 one-byte character in Table 1 suggests that we can encode seven bits of input data in one encoded byte as in Figure 2.
Figure 2: An attempt at encoding seven bits per byte If the encoding of Figure 2 worked, it would improve the 8 : 6 inflation ratio of base-64 to 8 : 7. However, we want this binary-to-text encoding to be used in the context of HTML pages. Unfortunately, this encoding will not work since some one-byte UTF-8 characters cause conflicts in HTML.
§2.1 Avoiding Illegal Characters
The problem with the above approach is that some characters cannot be safely used in the context of an HTML page. We want our encoding to be stored in format similar to data URIs:
[code][/code] Immediately we see that our encoded data cannot contain the double quote character or the browser will not properly parse the src attribute. In addition, the newline and carriage return characters split the line. Backslashes and ampersands may create inadvertent escape sequences. And the non-displayable null character ( code point 0x00 ) is also problematic since it is parsed as an error character (0xFFFD). Therefore, the null, backslash, ampersand, newline, carriage return, and double quote are considered illegal in the one-byte UTF-8 character. This leaves us with 122 legal one-byte UTF-8 characters to use. These 122 characters can almost encode seven bits of input data. When a seven bit sequence would result in an illegal character, we need to compensate. This is the idea which leads us to our final encoding.
§2.2 Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx . If the chunk would map to an illegal character, we instead use the the two-byte UTF-8 character: 110xxxxx 10xxxxxx . Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx . The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx , equivalent to a bitwise OR with 0x80 (this can likely be improved, see). Figure 3 summarizes the complete base-122 encoding.