A grammar for HTML5

The HTML5 specification uses pseudo-code to specify how HTML documents
should be parsed. Here’s a taste:

  1. If the byte at position is one of 0x09 (ASCII TAB), 0x0A
     (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or
     0x2F (ASCII /) then advance position to the next byte and redo
     this substep.

  2. If the byte at position is 0x3E (ASCII >), then abort
     the “get an attribute” algorithm. There isn’t one.

  3. Otherwise, the byte at position is the start of the
     attribute name. Let attribute name and attribute value be the empty
     string.
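
For readers who find code easier to follow than numbered prose, here is a
rough Python sketch of the three quoted steps. It is an illustration only,
not the specification's algorithm in full: the function name
find_attribute_start is made up for this example, and the handling of
end-of-input is an assumption, since the quoted steps do not cover it.

```python
# Bytes that step 1 skips over: TAB, LF, FF, CR, space, and "/".
SKIPPABLE = {0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F}


def find_attribute_start(data: bytes, position: int):
    """Return the index where an attribute name starts, or None if there is none.

    Hypothetical helper covering only the three quoted steps.
    """
    # Step 1: advance past skippable bytes ("redo this substep").
    while position < len(data) and data[position] in SKIPPABLE:
        position += 1

    # Assumption: running out of input means there is no attribute.
    # The quoted excerpt does not say what happens here.
    if position >= len(data):
        return None

    # Step 2: ">" means there is no attribute; abort.
    if data[position] == 0x3E:
        return None

    # Step 3: this byte is the start of the attribute name.
    return position


print(find_attribute_start(b'  / charset="utf-8">', 0))
print(find_attribute_start(b" \t>", 0))
```

The first call skips two spaces, a slash, and another space before landing
on the "c" of charset; the second finds only whitespace followed by ">",
so there is no attribute.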

This style of specification has provoked consternation among some,
who prefer the “declarative” style of the HTML 4
specification, based on grammars.

Fortunately, I have spent a great deal of time over the past ten years
learning about parsing and its security aspects, and I believe I can
give here a very succinct grammar for HTML5.

A few preliminaries. I will be using a variant of Backus-Naur Form
(BNF) grammars, in which “.” will denote any single input
character, and postfix “*” will denote zero or more
repetitions of the preceding construct (Kleene closure). I will use
capitalized identifiers for the nonterminals of the grammar.

Here then is the grammar of HTML5:

HTML5 = .*

Yes! No kidding, that really is the grammar. Any input that matches
this grammar—which is to say, any input at all—is going to be
accepted by just about any web browser, which will do its best to
render something sensible. You can think of this as a degenerate case
of Postel’s Law, in which browsers are extremely
liberal in what they accept from others. They accept
everything!
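
The grammar can be tried out as a regular expression. The sketch below
uses Python's re module; the names HTML5 and is_accepted are invented for
the example. One caveat: in Python, “.” does not match a newline by
default, so the re.DOTALL flag is needed to get the “any single input
character” meaning used here.

```python
import re

# "." must match every character, including newlines, hence re.DOTALL.
HTML5 = re.compile(r".*", re.DOTALL)


def is_accepted(document: str) -> bool:
    """Return True if the whole document matches the grammar HTML5 = .*"""
    return HTML5.fullmatch(document) is not None


samples = [
    "<!DOCTYPE html><p>Hello</p>",
    "<b><i>unclosed",
    "",
    "not\nHTML at all <<<>>>",
]

for sample in samples:
    print(repr(sample), is_accepted(sample))
```

Every sample is accepted, the empty string and the multi-line garbage
included, because there is no string that “.*” rejects.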

To be fair, the HTML5 specification does discuss “parse errors”, and
says that HTML user agents can abort processing when they encounter
them. But it also says that they can continue processing, and that’s
what browsers seem to do.
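
To see what “continue processing” looks like in practice, here is a small
sketch using html.parser from the Python standard library. This is not a
browser and not an implementation of the HTML5 parsing algorithm; it is
only a lenient tokenizer, used here as a stand-in. The class
EventCollector and the function tokenize are names made up for the
example.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start tags, end tags, and text in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(document: str):
    """Feed a document to the lenient parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(document)
    parser.close()
    return parser.events


# <b> is closed while <i> is still open, and <i> is never closed at all.
events = tokenize("<p><b>bold<i>both</b>italic?</p>")
for event in events:
    print(event)
```

The misnested input is reported event by event without any complaint.
Deciding what tree those events should turn into is exactly the part that
a grammar alone leaves unanswered.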

This is no different from HTML 4, whose grammar really should be the
same as this. Web browsers have always been very tolerant of “errors”
in web pages; they try to render as much of their input as possible.
The more complicated grammar that you will find in the HTML 4
specification is incomplete: it does not say what happens when a
browser encounters “garbled” HTML, so browsers have been left to
decide this for themselves. Naturally enough, this leads to browsers
that behave differently on the same input: browser incompatibility.
And that leads, in turn, to certain security vulnerabilities.

The defenders of “declarative” specifications will note that HTML 4’s
syntax specification is not only a grammar. That’s true; there is
also a lot of English prose confusing things. Here are some questions
for the defenders: is the HTML 4 specification equivalent to “.*”? If
not, then when an input does not conform to the grammar of HTML 4,
what DOM tree will a browser produce? (The answers are “no” and “only
your browser knows”.)

The pseudo-code of the HTML5 specification is charmless, but it is
pretty easy to convince yourself that it accepts “.*”.
