Marzhill Musings

go-html-transform: an HTML transformation and scraping library

Published On: 2013-02-26 17:05:00

http://code.google.com/p/go-html-transform is my HTML transformation library for Go. I use it as an HTML templating language and scraping library. It's not your typical approach to HTML templating, but it's an approach I've really come to enjoy. HTML templating falls into roughly three categories.

  1. Templating languages.
  2. HTML DSLs.
  3. Functional transforms.

go-html-transform is an example of the last one. The basic theory is that an HTML template is just data. No logic lives in the template; all the logic is in the functions that operate on the template and any input data. Using the input data, you can transform a template and then render the transformed AST back into HTML. This has a number of benefits.

  • Your template transforms are context aware.
  • Multipass templating is just another transform.
  • All your logic is expressed in real, honest-to-goodness code, not a limited templating language. In the case of go-html-transform, your templating logic is actually typechecked by the Go compiler.
  • Because the output is rendered from the AST rather than glued together from strings, you can't generate malformed HTML.
  • Your mocks are your templates.
  • You can use an HTML DSL in combination with this approach as well, if the DSL outputs the same AST.
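To make the idea concrete, here is a minimal sketch of the functional-transform approach using only the Go standard library. It does not use go-html-transform: the `Node`, `Attr`, `Apply`, `SetText`, `MapAttr`, and `Render` names are hypothetical and exist only for illustration. The template is a plain tree of data, the logic lives in ordinary Go functions, and rendering escapes text and attribute values on the way out.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Attr and Node form a hypothetical, minimal HTML AST for this sketch.
// They are not the types used by go-html-transform.
type Attr struct{ Key, Val string }

type Node struct {
	Tag      string // element name; empty for a text node
	Attrs    []Attr
	Text     string // content of a text node
	Children []*Node
}

// TransformFunc holds templating logic as plain Go code acting on one node.
type TransformFunc func(*Node)

// Apply walks the tree and runs f on every element named tag.
func Apply(n *Node, tag string, f TransformFunc) {
	if n.Tag != "" && n.Tag == tag {
		f(n)
	}
	for _, c := range n.Children {
		Apply(c, tag, f)
	}
}

// SetText replaces a node's children with a single text node.
func SetText(s string) TransformFunc {
	return func(n *Node) { n.Children = []*Node{{Text: s}} }
}

// MapAttr rewrites the value of attribute key, if the node has it.
func MapAttr(key string, f func(string) string) TransformFunc {
	return func(n *Node) {
		for i := range n.Attrs {
			if n.Attrs[i].Key == key {
				n.Attrs[i].Val = f(n.Attrs[i].Val)
			}
		}
	}
}

// Render serializes the tree, escaping text and attribute values.
// Void elements such as img are not special-cased in this sketch.
func Render(n *Node) string {
	if n.Tag == "" {
		return html.EscapeString(n.Text)
	}
	var b strings.Builder
	b.WriteString("<" + n.Tag)
	for _, a := range n.Attrs {
		b.WriteString(" " + a.Key + `="` + html.EscapeString(a.Val) + `"`)
	}
	b.WriteString(">")
	for _, c := range n.Children {
		b.WriteString(Render(c))
	}
	b.WriteString("</" + n.Tag + ">")
	return b.String()
}

func main() {
	// The "template" is just data: a mock page with placeholder content.
	page := &Node{Tag: "p", Children: []*Node{
		{Tag: "a", Attrs: []Attr{{"href", "http://example.com/"}}, Children: []*Node{{Text: "home"}}},
		{Tag: "span", Children: []*Node{{Text: "placeholder"}}},
	}}

	// Each pass is just another transform over the same tree.
	Apply(page, "span", SetText("1 < 2"))
	Apply(page, "a", MapAttr("href", func(u string) string {
		return strings.Replace(u, "http:", "https:", 1)
	}))

	fmt.Println(Render(page))
	// prints: <p><a href="https://example.com/">home</a><span>1 &lt; 2</span></p>
}
```

Because the transforms only ever see nodes and the renderer is the single place where markup is written, a stray `<` in the input data ends up escaped instead of breaking the page.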

Example usage.

    package main

    import (
      "os"
      "strings"

      "code.google.com/p/go-html-transform/h5"
      "code.google.com/p/go-html-transform/html/transform"
    )

    // toSSL rewrites the first "http:" in url to "https:".
    func toSSL(url string) string {
      return strings.Replace(url, "http:", "https:", 1)
    }

    func main() {
      // os.Open does not expand "~", so use a real path here.
      f, err := os.Open("file.html")
      if err != nil { return } // handle errors here.
      defer f.Close() // defer only once Open has succeeded.
      tree, err := transform.NewDocFromReader(f)
      if err != nil { return } // handle errors here.
      t := transform.NewTransformer(tree)
      t.ApplyAll(
        // replace the contents of every span tag with foo
        transform.Trans(transform.ReplaceChildren(h5.Text("foo")), "span"),
        // turn every link and img into an ssl link
        transform.Trans(transform.TransformAttrib("href", toSSL), "a"),
        transform.Trans(transform.TransformAttrib("src", toSSL), "img"),
      )

      t.Render(os.Stdout) // render html to stdout.
    }
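One thing to watch with a toSSL built on strings.Replace: it rewrites the first "http:" wherever it appears in the value, including inside a query string. If that matters for your pages, a prefix check is a small refinement. The sketch below uses only the standard library, and `toSSLPrefix` is a hypothetical name, not part of go-html-transform.

```go
package main

import (
	"fmt"
	"strings"
)

// toSSLPrefix is a hypothetical variant of toSSL that only rewrites
// URLs that start with "http:" and leaves everything else untouched.
func toSSLPrefix(url string) string {
	if strings.HasPrefix(url, "http:") {
		return "https:" + strings.TrimPrefix(url, "http:")
	}
	return url
}

func main() {
	fmt.Println(toSSLPrefix("http://example.com/a.png"))  // https://example.com/a.png
	fmt.Println(toSSLPrefix("/go?to=http://example.com/")) // /go?to=http://example.com/
	fmt.Println(toSSLPrefix("https://example.com/"))       // https://example.com/
}
```

Since it has the same func(string) string shape as toSSL, it can be passed to TransformAttrib in the same way.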
