[Other] Base-122 – A space efficient alternative to base-64

Posted by 讨厌那个你ヾ on 2016-11-27 23:12:20
As a binary-to-text encoding, base-64 inflates the size of the data it represents by ~33%. This article presents base-122, a UTF-8 binary-to-text encoding which inflates the original data by only ~14%. Base-122 was created with the web in mind. The implementation includes a small JavaScript decoder to load base-122 encoded resources in web pages.
     Disclaimer

    As the article later shows, base-122 is not recommended for gzip-compressed pages, which make up the majority of served web pages. Base-122 may, however, be useful as a general binary-to-text encoding.
     External binary resources like images, fonts, audio, etc. can be embedded in HTML through base-64 encoded data URIs. A typical use is to embed small images to avoid HTTP requests and decrease load time.
   Using the embedded data URI avoids the extra HTTP request to fetch "example.png" from the server. This can improve load time in some cases. It has been recommended to use data URIs cautiously. They may help for small images, but could hurt performance otherwise.
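   As a rough illustration of the two forms being compared, the sketch below builds an img tag that points at an external file and one that embeds the bytes as a base-64 data URI. It assumes Node.js (for the built-in Buffer). The helper name toDataUri and the one-byte placeholder "image" are hypothetical and exist only for this example.

```javascript
// Sketch only: toDataUri is a hypothetical helper, and Buffer assumes Node.js.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// One extra HTTP request: the browser must fetch example.png from the server.
const external = '<img src="example.png" />';

// No extra request: the (placeholder) bytes travel inside the page itself.
const placeholderBytes = Uint8Array.from([0x69]);
const embedded = `<img src="${toDataUri(placeholderBytes, "image/png")}" />`;

console.log(external);
console.log(embedded); // <img src="data:image/png;base64,aQ==" />
```

   The trade-off is visible even here: the request disappears, but the embedded bytes grow by the encoding's inflation ratio.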
   Before discussing how base-122 improves upon base-64, we will briefly discuss how base-64 works. The Wikipedia article gives a much more in-depth introduction, but we will cover the main points.
   §1.2 Base-64 Encoding

   Base-64 is one approach to solving the more general problem of binary-to-text encoding. For example, suppose we want to embed the single byte 01101001 in a text file. Since text files may only consist of text characters, this byte needs to be represented by text characters. A straightforward way to encode this byte is to map each bit to a text character, say "A" to represent a 0 bit and "B" to represent a 1 bit. For the sake of example, suppose this silly encoding did exist and that the single byte 01101001 represented a (very small) image. The corresponding data URI in HTML would then carry the eight characters "ABBABAAB" in place of that one byte.
  The encoding is certainly easy to decode, but only at the cost of wasted space. Each text character "A" or "B" takes one byte in the HTML file. This means we are encoding one bit of binary data with eight bits of text data, so the data inflation ratio is 8 : 1.
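  The one-bit-per-character scheme is small enough to write out in full. The following is a minimal sketch; the function names encodeBits and decodeBits are made up for this example, and bits are taken most significant first, which is an assumption about ordering.

```javascript
// Sketch of the deliberately wasteful encoding: "A" for a 0 bit, "B" for a 1 bit.
// Bits are read most significant first (an assumption about ordering).
function encodeBits(bytes) {
  let text = "";
  for (const byte of bytes) {
    for (let bit = 7; bit >= 0; bit--) {
      text += (byte >> bit) & 1 ? "B" : "A";
    }
  }
  return text;
}

// Reverse the mapping: every run of eight characters becomes one byte again.
function decodeBits(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i += 8) {
    let byte = 0;
    for (const ch of text.slice(i, i + 8)) {
      byte = (byte << 1) | (ch === "B" ? 1 : 0);
    }
    bytes.push(byte);
  }
  return Uint8Array.from(bytes);
}

const sample = Uint8Array.from([0b01101001]);
const encodedText = encodeBits(sample);
console.log(encodedText);        // ABBABAAB
console.log(encodedText.length); // 8 characters (8 bytes) for 1 input byte
```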
  Base-64 encoding is an improvement on our silly encoding. It maps chunks of six bits (representing the numbers 0-63) to one of 64 characters. Each resulting character is one byte, so the inflation ratio is 8 : 6.
  Base-64 works well as a binary-to-text encoding since the characters it produces are standard ASCII characters. But to improve upon base-64, we'll have to investigate how much data we can cram into UTF-8 characters.
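  To make the six-bit chunking concrete, here is a hand-written sketch of a base-64 encoder using the standard alphabet and "=" padding. It is meant for illustration only; in practice a platform routine would normally be used, and the name encodeBase64 is simply a label for this example.

```javascript
// Standard base-64 alphabet: index 0-63 maps to one ASCII character.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function encodeBase64(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = Math.min(3, bytes.length - i);
    // Pack up to three bytes into a 24-bit group (missing bytes count as zero).
    const group =
      (bytes[i] << 16) | ((bytes[i + 1] || 0) << 8) | (bytes[i + 2] || 0);
    // Emit four 6-bit chunks; chunks holding no input bits become "=" padding.
    for (let chunk = 0; chunk < 4; chunk++) {
      if (chunk <= remaining) {
        out += ALPHABET[(group >> (18 - 6 * chunk)) & 0x3f];
      } else {
        out += "=";
      }
    }
  }
  return out;
}

console.log(encodeBase64([0b01101001])); // aQ==
console.log(encodeBase64([77, 97, 110])); // TWFu ("Man" in ASCII)
```

  Three input bytes always become four output characters, which is the 8 : 6 ratio stated above.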
   §1.3 Text Encodings and UTF-8

  Since the majority of web pages are encoded in UTF-8, base-122 exploits the properties of how UTF-8 characters are encoded. Let's first clarify some of the terminology regarding UTF-8 and text-encodings.
     
      Figure 1: Three representations of ε
   A code point is a number representing (usually) a single character. Unicode is a widely accepted standard describing a large range of code points from the standard multilingual alphabets to more obscure characters like a coffee cup at code point 0x2615 ☕ (often denoted U+2615). See this Unicode table for a reference of code points.
   A text encoding, on the other hand, is responsible for describing how code points are represented in binary (e.g. in a file). UTF-8 is by far the most common text encoding on the web. It is a variable-length encoding and can represent 1,112,064 different code points. Code points representing ASCII characters require only one byte to encode in UTF-8, while higher code points require up to four bytes. Table 1 below summarizes the format of UTF-8 encodings of different code point ranges.
    Code point range      UTF-8 format (x reserved for code point)   Total bits   Bits for code point   Inflation
    0x00 - 0x7F           0xxxxxxx                                   8            7                     8 : 7
    0x80 - 0x7FF          110xxxxx 10xxxxxx                          16           11                    16 : 11
    0x800 - 0xFFFF        1110xxxx 10xxxxxx 10xxxxxx                 24           16                    24 : 16
    0x10000 - 0x10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx        32           21                    32 : 21

    Table 1: A summary of UTF-8 encoding. x represents bits reserved for code point data

  Inflation is the ratio of the character bits to the code point bits. A ratio of 1 : 1 would effectively mean no waste. The inflation ratio worsens as the number of bytes increases, since fewer bits are reserved for the code point.
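  The four rows of Table 1 translate almost directly into code. The sketch below encodes a single code point by hand so the bit layout is visible; utf8Encode is a name chosen for this example, and the function deliberately skips validation (it does not reject surrogates or values above 0x10FFFF).

```javascript
// Sketch: encode one code point into UTF-8 bytes, one branch per row of Table 1.
// No validation is done (surrogates and values above 0x10FFFF are not rejected).
function utf8Encode(codePoint) {
  if (codePoint <= 0x7f) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    // 110xxxxx 10xxxxxx
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

const toHex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");
console.log(toHex(utf8Encode(0x69)));   // 69
console.log(toHex(utf8Encode(0x3b5)));  // ce b5     (ε)
console.log(toHex(utf8Encode(0x2615))); // e2 98 95  (☕)
```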
  The encoding of a UTF-8 one-byte character in Table 1 suggests that we can encode seven bits of input data in one encoded byte as in Figure 2.

      Figure 2: An attempt at encoding seven bits per byte
  If the encoding of Figure 2 worked, it would improve the 8 : 6 inflation ratio of base-64 to 8 : 7. However, we want this binary-to-text encoding to be used in the context of HTML pages. Unfortunately, this encoding will not work since some one-byte UTF-8 characters cause conflicts in HTML.
   §2.1 Avoiding Illegal Characters

  The problem with the above approach is that some characters cannot be safely used in the context of an HTML page. We want our encoding to be stored in a format similar to data URIs, that is, inside a double-quoted attribute such as the src of an image tag.
   Immediately we see that our encoded data cannot contain the double quote character or the browser will not properly parse the src attribute. In addition, the newline and carriage return characters split the line. Backslashes and ampersands may create inadvertent escape sequences. And the non-displayable null character (code point 0x00) is also problematic since it is parsed as an error character (0xFFFD). Therefore, the null, backslash, ampersand, newline, carriage return, and double quote are considered illegal in the one-byte UTF-8 character.
   This leaves us with 122 legal one-byte UTF-8 characters to use. These 122 characters can almost encode seven bits of input data. When a seven-bit sequence would result in an illegal character, we need to compensate. This is the idea which leads us to our final encoding.
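   The count of 122 can be checked mechanically. The short sketch below lists the six illegal code points named above and filters them out of the 128 possible seven-bit values; the variable names are illustrative only.

```javascript
// The six code points ruled out for use inside a double-quoted HTML attribute.
const ILLEGAL = new Map([
  [0x00, "null"],
  [0x0a, "newline"],
  [0x0d, "carriage return"],
  [0x22, "double quote"],
  [0x26, "ampersand"],
  [0x5c, "backslash"],
]);

// Every 7-bit value is a one-byte UTF-8 code point; keep those that are not illegal.
const legal = [];
for (let value = 0; value < 128; value++) {
  if (!ILLEGAL.has(value)) legal.push(value);
}

console.log(legal.length); // 122
```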
   §2.2 Base-122 Encoding

   Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single-byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx.
   The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points of at least 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR of the code point with 0x80 (this can likely be improved). That leaves seven free bits in the two-byte character. Figure 3 summarizes the complete base-122 encoding.
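   The rules in this section can be sketched as an encoder. This is an illustrative reading of the description, not the author's implementation, and it rests on several assumptions that the text on this page does not settle: bits are consumed most significant first, the final chunk is zero-padded, sss is the index of the illegal code point in ascending order, the seven free bits of a two-byte character carry the next seven-bit chunk, and sss = 111 marks an illegal chunk that has no following chunk. The names sevenBitChunks and encodeBase122 are hypothetical.

```javascript
// Assumption: sss is the index into this ascending list of illegal code points.
const ILLEGAL = [0x00, 0x0a, 0x0d, 0x22, 0x26, 0x5c];
// Assumption: sss = 111 marks an illegal chunk with no chunk after it.
const LAST_CHUNK = 0b111;

// Split bytes into 7-bit chunks, most significant bit first.
// The final chunk is zero-padded on the right (an assumption).
function sevenBitChunks(bytes) {
  const chunks = [];
  let acc = 0;   // bit accumulator
  let nbits = 0; // number of valid bits currently held in acc
  for (const byte of bytes) {
    acc = (acc << 8) | byte;
    nbits += 8;
    while (nbits >= 7) {
      nbits -= 7;
      chunks.push((acc >> nbits) & 0x7f);
    }
    acc &= (1 << nbits) - 1; // keep only the leftover bits
  }
  if (nbits > 0) chunks.push((acc << (7 - nbits)) & 0x7f);
  return chunks;
}

function encodeBase122(bytes) {
  const chunks = sevenBitChunks(bytes);
  const out = [];
  for (let i = 0; i < chunks.length; i++) {
    const chunk = chunks[i];
    const illegalIndex = ILLEGAL.indexOf(chunk);
    if (illegalIndex === -1) {
      out.push(chunk); // legal: one byte, 0xxxxxxx
      continue;
    }
    let sss = illegalIndex;
    let payload;
    if (i + 1 < chunks.length) {
      payload = chunks[++i]; // the free bits carry the next chunk
    } else {
      sss = LAST_CHUNK;      // nothing follows: repeat the chunk as payload
      payload = chunk;
    }
    out.push(0b11000010 | (sss << 2) | (payload >> 6)); // 110sss1x
    out.push(0b10000000 | (payload & 0x3f));            // 10xxxxxx
  }
  return Uint8Array.from(out);
}

const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex(encodeBase122([0x69]))); // 34 40
console.log(hex(encodeBase122([0x00]))); // c2 80
```

   In the second example both seven-bit chunks are zero, so the first becomes sss = 000 and the second rides along in the free bits, giving the two-byte character for code point 0x80.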