What data structures come out of the HTML parser in Go?
If you pass the two tags below into Go's HTML parser (golang.org/x/net/html, which lives in the x/net repository rather than the standard library proper), the parser's output differs in a few meaningful ways. This has been puzzling me while working on Andrew, which aims to remove the things I find tedious about maintaining a website using HTML, CSS and JavaScript by hand.
I want to sort HTML pages in the { AndrewTableOfContents } structure by date; I also want to design Andrew to get this sort of metadata either from an obvious standard HTML tag or from a custom meta tag. The outcome of calling the parser is a tree of data structures, all of which are html.Nodes.
These are the tags that get parsed differently:
<title>article title</title>
<meta name="metaname" content="metacontent" />
This is the code I wrote to inspect these things. It creates a byte slice for each tag and passes each one through html.ParseFragment, which can parse a snippet of HTML:
package main

import (
	"bytes"
	"fmt"
	"os"

	"github.com/davecgh/go-spew/spew"
	"golang.org/x/net/html"
)

func main() {
	title := []byte("<title>article title</title>")
	meta := []byte("<meta name=\"metaname\" content=\"metacontent\">")
	for i, elem := range [][]byte{title, meta} {
		// With a nil context, html.ParseFragment returns the top-level
		// nodes parsed from the snippet as a []*html.Node.
		parsed, err := html.ParseFragment(bytes.NewReader(elem), nil)
		if err != nil {
			panic(err)
		}
		// Dump each result to 0.txt and 1.txt for inspection.
		dump := []byte(spew.Sdump(parsed))
		if err := os.WriteFile(fmt.Sprint(i)+".txt", dump, 0o644); err != nil {
			panic(err)
		}
	}
}
What's in these HTML nodes?
The HTML is parsed into a list of *html.Node. Each node tracks its relationships through Parent, FirstChild, LastChild, PrevSibling and NextSibling pointers. In the structure dumps shown in this post I've redacted some fields that were irrelevant or empty.
Each new HTML node is a child of some other node. We're dumping parsed fragments of HTML, so the parser arranges for an html element node as the topmost node; its FirstChild is a head node, and the structure stays traversable by following FirstChild, NextSibling or whatever other relationship you want to traverse. There's a good recursive function provided as an example in the Go docs for traversing every element in the parsed tree.
Here's what I've figured out about the fields in the html.Node type:
- Type: a NodeType that tracks whether the node is an ErrorNode, TextNode, DocumentNode, ElementNode, CommentNode, DoctypeNode or RawNode. These seem very similar to the node types listed in the MDN Web API Node docs.
- DataAtom: an integer code for a known tag name. There's a lookup table in the godocs for the atom package mapping integers to specific HTML tag names, such as the <title>, <a> or <img> tags.
- Data: The comment in the Go package tells us a Node "consists of a NodeType and some Data (tag name for element nodes, content for text) and are part of a tree of Nodes". In spew's Sdump output, the Type field is what separates element nodes from text: it prints as an integer, 3 for an ElementNode and 1 for a TextNode. It also seems that an element node will have a DataAtom that's populated, and it will have a FirstChild that has no DataAtom but _does_ contain Data, which in our example is the text held between <title> and </title>.
- Namespace: Usually empty in HTML! In HTML you have something like <a href="foo">; in XML, you can have <a namespace:href="foo">.
title
FirstChild: (*html.Node)(0x1400013a930)({
 Parent: (*html.Node)(0x1400013a8c0)(),
 FirstChild: (*html.Node)(0x1400013a9a0)({
  Parent: (*html.Node)(0x1400013a930)(),
  //redacted empty data members
  DataAtom: (atom.Atom) ,
  Data: (string) (len=13) "article title",
  Namespace: (string) "",
  Attr: ([]html.Attribute)
 }),
 LastChild: (*html.Node)(0x1400013a9a0)({
  //redacted empty data members
  Data: (string) (len=13) "article title",
  Namespace: (string) "",
  Attr: ([]html.Attribute)
 }),
 PrevSibling: (*html.Node)(),
 NextSibling: (*html.Node)(),
 Type: (html.NodeType) 3,
 DataAtom: (atom.Atom) title,
 Data: (string) (len=5) "title",
 Namespace: (string) "",
 Attr: ([]html.Attribute)
}),
meta
FirstChild: (*html.Node)(0x14000166cb0)({
 Parent: (*html.Node)(0x14000166c40)(),
 FirstChild: (*html.Node)(),
 LastChild: (*html.Node)(),
 PrevSibling: (*html.Node)(),
 NextSibling: (*html.Node)(),
 Type: (html.NodeType) 3,
 DataAtom: (atom.Atom) meta,
 Data: (string) (len=4) "meta",
 Namespace: (string) "",
 Attr: ([]html.Attribute) (len=2 cap=2) {
  (html.Attribute) {
   Namespace: (string) "",
   Key: (string) (len=4) "name",
   Val: (string) (len=8) "metaname"
  },
  (html.Attribute) {
   Namespace: (string) "",
   Key: (string) (len=7) "content",
   Val: (string) (len=11) "metacontent"
  }
 }
}),
Description of the Differences
Title Element
The title element has been parsed as an *html.Node with a DataAtom of title and Data of "title". That matches the Go doc comment I quoted above: for element nodes Data holds the tag name, and DataAtom is the atom for that same name (it is zero when Data is not a known tag name).
The title element node has a child text node whose Data field contains the title text.
If we were to give the title an attribute, such as an id in <title id="foo">, then we'd see that attribute listed in the Attr field, at the same level of indentation as the Data field.
Meta Element
The meta element has 2 attributes, stored in the Attr field of the element node itself. It's a void (self-closing) element, so it has no child nodes; it's a leaf of the parse tree.
My current approach in Andrew for tracking when an article is published is to use a meta tag; that was where I started diving into these two element nodes, trying to understand why it was easy to grab a title but hard to get the meta tag attributes. Now I know!