Comments as the Second Most Important HTML Element

396 words | 2016-5-31

Having successfully parsed the text element of HTML, we move on to a more complex element: comments.


Comments in HTML are interesting because they end with a three-character sequence: -->. If we were reading from a stream character by character rather than from a string, we would have the rare fun of looking ahead two characters. Since the whole input is already in a string, we can simply restore the index to an earlier position.

Creating a comment node is very simple:

(defun make-comment-node ()
  (make-instance 'comment-node))

So, to parse a comment, what’s the first thing? The piano! Just kidding :smile: First of all, we remember the current value of index in the string (oldindex) so that we can roll back to it if the match fails.

The beginning of a comment is easy to define: it is just the sequence ["!--" something] (see note 1). The closing --> is not so easy.

We cannot use the sequence [$@(any-text ch) "-->"], because the repeated match against any character, $@(any-text ch), simply absorbs the entire string and never gives us a chance to detect -->.

The repeated alternative ${"-->" @(any-text ch)} is not an option either: we can now detect the end of the comment, but we cannot leave the repetition.

For this to work, the comparison with --> has to not work :smile: That is, when we find --> we record the fact in the variable eoc-found (see note 2) and then declare the comparison failed with !nil. After that, a character is consumed only if --> has not yet been found, so on the next pass both alternatives fail and the repetition stops.

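       (parse-comment ()
          "<!-- ??? -->"
          (let (ch eoc-found (oldindex index))
            (or (and (matchit
                      ["!--"
                       ;; Repeat: note the --> and fail, or eat one character.
                       {${ ["-->" !(setf eoc-found t) !nil]
                           [!(not eoc-found) @(any-text ch)]
                           } !eoc-found}])
                     (make-comment-node))
                ;; No match: roll back to where we started.
                (progn (setf index oldindex) nil))))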
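
To see the flag trick without the matchit notation, here is a minimal sketch of the same scan in plain Common Lisp. It is only an illustration, not toy-engine code: the name scan-comment and its signature are assumptions, and because index is a local variable here, returning NIL takes the place of the explicit rollback to oldindex.

```lisp
;; Hypothetical helper, not part of toy-engine.
(defun scan-comment (str &optional (index 0))
  "Return the index just past the closing -->, or NIL if STR does not
hold a complete comment that starts with !-- at INDEX."
  (let ((eoc-found nil)
        (len (length str)))
    (flet ((looking-at (token)
             ;; True when TOKEN occurs in STR exactly at the current INDEX.
             (let ((end (+ index (length token))))
               (and (<= end len)
                    (string= token str :start2 index :end2 end)))))
      (when (looking-at "!--")
        (incf index 3)
        ;; Consume one character at a time until --> shows up.
        (loop until (or eoc-found (>= index len))
              do (if (looking-at "-->")
                     (setf eoc-found t
                           index (+ index 3))
                     (incf index)))
        ;; Succeed only if the terminator was actually seen.
        (and eoc-found index)))))

(scan-comment "!--  ''' This is a text< kj--  ->  --> 123")  ; => 38
(scan-comment "!-- never closed")                            ; => NIL
```

The lone -- and -> inside the comment are stepped over one character at a time; only the full three-character --> sets the flag.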

Let’s check by replacing the call to parse-text in parse-html with a call to (cons (parse-comment) (princ index)):

* (ql:quickload 'toy-engine)
To load "toy-engine":
  Load 1 ASDF system:
; Loading "toy-engine"

* (in-package :toy-engine)

* (defparameter *str* "!--  ''' This is a text< kj--  ->  --> 123")

* (length *str*)

* (parse-html *str*)
(#<COMMENT-NODE {1005025E03}> . 38)
* (pp->dot "" (lambda () (pp-dom (car *))))


As you can see from index = 38, the position just past the closing -->, the parser correctly consumed the whole comment and stopped before the trailing " 123".
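
The value 38 can be cross-checked with nothing but standard string functions. This is a small sketch; comment-end-index is a hypothetical name used only for this check.

```lisp
;; Hypothetical helper for checking the result by hand.
(defun comment-end-index (str)
  "Index just past the first --> in STR, or NIL if there is none."
  (let ((eoc (search "-->" str)))   ; SEARCH returns the start of the match
    (and eoc (+ eoc 3))))

(comment-end-index "!--  ''' This is a text< kj--  ->  --> 123")  ; => 38
```

The first --> starts at position 35 of the 42-character string, so the index just past it is 38, matching what the parser reported.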

Tree after parsing

  1. There is no < here because that character has already been used to distinguish a comment or element from plain text.

  2. eoc stands for End Of Comment