What data structures come out of the HTML parser in golang?

If you pass the two tags below into Go's html parser (golang.org/x/net/html, which is maintained by the Go team but lives outside the standard library), the parser's output has a few meaningful differences. This has been puzzling me while working on Andrew, which aims to remove the things I find tedious about maintaining a website using html, css and javascript by hand.

I want to sort html pages in the { AndrewTableOfContents } structure by date; I also want to design Andrew to get this sort of metadata either through an obvious standard html tag, or from a custom meta tag. The outcome of calling Parse is a series of data structures, all of which are html.Nodes.

These are the tags that get parsed differently:

                    
<title>article title</title>
<meta name="metaname" content="metacontent" />

This is the code I wrote to inspect these things. It creates two byte slices holding the tags, and parses each through html.ParseFragment, which can parse a snippet of html:

                    
package main

import (
    "bytes"
    "fmt"
    "os"

    "github.com/davecgh/go-spew/spew"
    "golang.org/x/net/html"
)

func main() {
    title := []byte("<title>article title</title>")
    meta := []byte("<meta name=\"metaname\" content=\"metacontent\">")

    for i, elem := range [][]byte{title, meta} {
        // html.ParseFragment parses a snippet and returns its top-level nodes
        parsed, err := html.ParseFragment(bytes.NewReader(elem), nil)
        if err != nil {
            panic(err)
        }

        // dump the parsed nodes to 0.txt and 1.txt for inspection
        if err := os.WriteFile(fmt.Sprint(i)+".txt", []byte(spew.Sdump(parsed)), 0o644); err != nil {
            panic(err)
        }
    }
}

What's in these HTML nodes?

The HTML is parsed into a list of *html.Node. Each node tracks its relationships through the Parent, FirstChild, LastChild, PrevSibling and NextSibling fields. I've redacted some of these in the structure dumps below where they were irrelevant or empty.

Each new html node is a child of some element. We're dumping parsed fragments of html, so the parser arranges for an html node as the topmost node; its FirstChild is a head node, and the structure stays traversable by following FirstChild, NextSibling or whichever other relationship you want. The Go docs provide a good recursive function as an example for traversing every element in the parsed tree.

Here's what I've figured out about the fields in the html.Node type:

MDN Web API Node docs

lookup table in the godocs

Data: The comment in the Go package tells us a Node consists of a NodeType and some Data ("tag name for element nodes, content for text"). In spew's Sdump output the Type field is printed as a bare number, which makes element nodes hard to tell apart from text at a glance; 3 is html.ElementNode and 1 is html.TextNode. It _does_ seem like an element node will have a DataAtom that's populated, and it will have a FirstChild that has no DataAtom but _does_ contain Data, which in our example is the text held between <title> and </title>.

Namespace: Usually empty in html! In html you have something like <a href="foo">; in xml, you can have <a namespace:href="foo">.

title


FirstChild: (*html.Node)(0x1400013a930)({
    Parent: (*html.Node)(0x1400013a8c0)(),
    FirstChild: (*html.Node)(0x1400013a9a0)({
        Parent: (*html.Node)(0x1400013a930)(),
        //redacted empty data members
        DataAtom: (atom.Atom) , 
        Data: (string) (len=13) "article title",
        Namespace: (string) "",
        Attr: ([]html.Attribute) 
    }),
    LastChild: (*html.Node)(0x1400013a9a0)({
        //redacted empty data members
        Data: (string) (len=13) "article title",
        Namespace: (string) "",
        Attr: ([]html.Attribute) 
    }),
    PrevSibling: (*html.Node)(),
    NextSibling: (*html.Node)(),
    Type: (html.NodeType) 3,
    DataAtom: (atom.Atom) title,
    Data: (string) (len=5) "title",
    Namespace: (string) "",
    Attr: ([]html.Attribute) 
    }),

Description of the Differences

Title Element

The title element has been parsed as an *html.Node with a DataAtom of title and a Data of title. At first I didn't see why Data contains title as well, but it follows from the doc comment quoted above: for element nodes Data holds the tag name, and the package docs describe DataAtom as the atom for Data, or zero if Data is not a known tag name.

The title element node has a child node whose Data field contains the title text.

If we were to change the title to carry an attribute, such as an id attribute in <title id="foo">, then we'd see that attribute called out in the Attr section, at the same level of indentation as the title node's Data field.

Meta Element

The meta element has 2 attributes, stored inside the element node itself in its Attr slice. It's a self-closing (void) tag, so it doesn't have child elements; it's a leaf of its parse tree.

My current approach in Andrew for tracking when an article is published is to use a meta tag; that was where I started diving into these two element nodes, trying to understand why it was easy to grab a title but hard to get the meta tag attributes. Now I know!