HTML Grammar in the Very First Approximation

459 words | 2016-5-25

After getting acquainted with META parser we turn to the grammar of HTML. And also see how to parse the simplest element.

Making Order

The META kernel is an independent part of the engine, so you can put it into a separate file and put the general order in the project. The file toy-engine.asd, which describes the composition and sequence of compilation of engine modules:

;;;; toy-engine.asd

(asdf:defsystem #:toy-engine
  :description "Tiny web-engine"
  :author "Yellow Rabbit <>"
  :license "Public domain"
  :serial t
  :components ((:file "package")
               (:file "dom")
               (:file "meta-parser-core")
               (:file "html-grammar")
               (:file "toy-engine")))

The files html-grammar.lisp and toy-engine.lisp are still empty and consist of one line each:

(in-package #:toy-engine)

Grammar Diagram

The engine will understand the simplest subset of HTML, here is a diagram of its grammar: Grammar scheme for HTM

Symbol Validation Functions

These simple functions, which are needed to check the characters in the META operation @, must be available at compile time:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun gen-text-p (ch)
    "Text is always between > and <"
    (char/= ch #\<))
  (defun always-true (ch)
    "Any character is right one"
    (declare (ignore ch))
  (defun azAZ09-p (ch)
    (or (and (char>= ch #\a)
             (char<= ch #\z))
        (and (char>= ch #\A)
             (char<= ch #\Z))
        (digit-char-p ch))))

The function is called not explicitly, but through checking the membership of the data type:

(deftype gen-text ()
  "All except a <"
  '(satisfies gen-text-p))

(deftype any-text ()
  "Any character"
  '(satisfies always-true))

(deftype tag-text ()
  "Tag name is alphanumeric"
  '(satisfies azAZ09-p))

Another useful type is an empty space :smile:

(deftype whitespace ()
  '(member #\Space #\Tab #\LineFeed #\Return #\FormFeed #\Page))

Generation of nodes and the parcer’s skeleton

The parser will create a DOM tree right through the parsing. For each type of node there corresponds its own function, first we will recognize one simple element: text. And its function:

;; Create nodes
;; Text
(defun make-text-node (text)
  (make-instance 'text-node
		 :text text))

The parser accepts at the input a line for parsing, possibly the number of the first and last characters for parsing and returns the root node of the resulting tree. I do not yet know how the child nodes will join, so the parser’s skeleton looks like this:

;;; HTML parser
(defun parse-html (str &optional (index 0) (end (length str)))
  (declare (type fixnum index end))
       ;; node parsers
       (parse-text ()
          "Text until <"
          (let (ch
		 (text (with-output-to-string (s) (matchit $[@(gen-text ch) !(write-char ch s)]))))
	    (if (zerop (length text))
		(make-text-node text)))))

We check on the tree from one text node:

* (ql:quickload 'toy-engine)
To load "toy-engine":
  Load 1 ASDF system:
; Loading "toy-engine"

* (in-package :toy-engine)

* (defparameter *str* "  ''' This is a text< kj")

* (parse-html *str*)

#<TEXT-NODE {10050359A3}>
* (pp->dot "" (lambda () (pp-dom *)))


Looks good:: Tree after parsing