An HTML document is an SGML document; that is, a sequence of characters organized physically into a set of entities, and logically as a hierarchy of elements.
In the SGML specification, the first production of the SGML syntax grammar separates an SGML document into three parts: an SGML declaration, a prologue, and an instance. For the purposes of this specification, the prologue is a DTD. This DTD describes another grammar: the start symbol is given in the doctype declaration, the terminals are data characters and tags, and the productions are determined by the element declarations. The instance must conform to the DTD, that is, it must be in the language defined by this grammar.
The SGML declaration determines the lexicon of the grammar. It specifies the document character set, which determines a character repertoire that contains all characters that occur in all text entities in the document, and the code positions associated with those characters.
The SGML declaration also specifies the syntax-reference character set of the document, and a few other parameters that bind the abstract syntax of SGML to a concrete syntax. This concrete syntax determines how the sequence of characters of the document is mapped to a sequence of terminals in the grammar of the prologue.
For example, consider the following document:
    <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <title>Parsing Example</title>
    <p>Some text. <em>&#42;wow&#42;</em></p>
An HTML user agent should use the SGML declaration that is given in 9.5, "SGML Declaration for HTML". According to its document character set, the numeric character reference `&#42;' refers to an asterisk character, `*'.
The instance above is regarded as the following sequence of terminals:
    1. start-tag: TITLE
    2. data characters: "Parsing Example"
    3. end-tag: TITLE
    4. start-tag: P
    5. data characters: "Some text. "
    6. start-tag: EM
    7. data characters: "*wow*"
    8. end-tag: EM
    9. end-tag: P
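The mapping from characters to terminals can be sketched in a few lines of Python. This is only an illustration, not an SGML parser: the names TOKEN, CHAR_REF, EXAMPLE, and terminals are hypothetical, the regular expression stands in for the concrete syntax, and whitespace-only runs between tags are simply dropped. The example document is written with the numeric character reference `&#42;' for the asterisk, which the sketch resolves with chr(); this treats the number as a Unicode code position, which agrees with the asterisk at position 42.

```python
import re

# Illustrative token pattern, not the SGML concrete syntax: a markup
# declaration, an end-tag, a start-tag, or a run of data characters.
TOKEN = re.compile(
    r"<!(?P<decl>[^>]*)>"
    r"|</(?P<end>[A-Za-z][A-Za-z0-9.-]*)\s*>"
    r"|<(?P<start>[A-Za-z][A-Za-z0-9.-]*)[^>]*>"
    r"|(?P<data>[^<]+)"
)
# Numeric character references such as &#42;
CHAR_REF = re.compile(r"&#([0-9]+);")

EXAMPLE = """<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<title>Parsing Example</title>
<p>Some text. <em>&#42;wow&#42;</em></p>
"""


def terminals(document):
    """Return the terminals of the instance as (kind, value) pairs."""
    result = []
    for match in TOKEN.finditer(document):
        kind = match.lastgroup
        if kind == "decl":
            continue  # the doctype declaration belongs to the prologue
        if kind == "start":
            result.append(("start-tag", match.group("start").upper()))
        elif kind == "end":
            result.append(("end-tag", match.group("end").upper()))
        else:
            # Resolve numeric character references to characters.
            text = CHAR_REF.sub(
                lambda ref: chr(int(ref.group(1))), match.group("data")
            )
            if text.strip():  # simplification: drop whitespace-only runs
                result.append(("data characters", text))
    return result


for number, (kind, value) in enumerate(terminals(EXAMPLE), start=1):
    print(f"{number}. {kind}: {value!r}")
```

Element names are folded to upper case here only to match the notation of the list above. A conforming parser derives all of this behavior from the SGML declaration rather than from a fixed pattern.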
The start symbol of the DTD grammar is HTML, and the productions are given in the public text identified by `-//IETF//DTD HTML 2.0//EN' (9.1, "HTML DTD"). The terminals above parse as:
    HTML
     |
     \-HEAD
     |  |
     |  \-TITLE
     |      |
     |      \-<TITLE>
     |      |
     |      \-"Parsing Example"
     |      |
     |      \-</TITLE>
     |
     \-BODY
       |
       \-P
         |
         \-<P>
         |
         \-"Some text. "
         |
         \-EM
         |  |
         |  \-<EM>
         |  |
         |  \-"*wow*"
         |  |
         |  \-</EM>
         |
         \-</P>
Some of the elements are delimited explicitly by tags, while the boundaries of others are inferred. The <HTML> element contains a <HEAD> element and a <BODY> element. The <HEAD> contains <TITLE>, which is explicitly delimited by start- and end-tags.
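How the boundaries of HTML, HEAD, and BODY might be inferred can be sketched as follows. This is a toy under stated assumptions, not a DTD-driven parser: the PARENT table is a hypothetical stand-in for the element declarations and covers only the elements of the example, the names ancestors, push, and parse are illustrative, and the resulting tree records elements and data but omits the tag terminals themselves.

```python
# Hypothetical stand-in for the DTD: the required parent of each element
# whose enclosing tags may be omitted in this example.
PARENT = {"HEAD": "HTML", "BODY": "HTML", "TITLE": "HEAD", "P": "BODY"}

# The terminals of the example instance, as (kind, value) pairs.
TERMINALS = [
    ("start-tag", "TITLE"),
    ("data characters", "Parsing Example"),
    ("end-tag", "TITLE"),
    ("start-tag", "P"),
    ("data characters", "Some text. "),
    ("start-tag", "EM"),
    ("data characters", "*wow*"),
    ("end-tag", "EM"),
    ("end-tag", "P"),
]


def ancestors(name):
    """Return the required ancestors of an element, outermost first."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain[::-1]


def push(stack, name):
    """Open a new element as a child of the innermost open element."""
    node = (name, [])
    stack[-1][1].append(node)
    stack.append(node)


def parse(terminals):
    """Build a (name, children) tree, inferring omitted tags."""
    root = ("HTML", [])  # the start symbol of the grammar
    stack = [root]
    for kind, value in terminals:
        if kind == "start-tag":
            required = ancestors(value)
            if required:
                # Keep the open elements that match the required chain.
                keep = 0
                while (keep < len(stack) and keep < len(required)
                       and stack[keep][0] == required[keep]):
                    keep += 1
                del stack[keep:]  # inferred end-tags, e.g. </HEAD>
                for missing in required[keep:]:
                    push(stack, missing)  # inferred start-tags, e.g. <BODY>
            push(stack, value)
        elif kind == "end-tag":
            # Close open elements up to and including the named one.
            while stack[-1][0] != value:
                stack.pop()
            stack.pop()
        else:
            stack[-1][1].append(value)
    return root


tree = parse(TERMINALS)
print(tree)
```

In this sketch the start-tag for P closes HEAD and opens BODY because the table says P belongs in BODY. A real SGML parser reaches the same conclusion from the content models and tag-omission rules in the DTD.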