Marzhill Musings

Using bufio.Scanner to build a Tokenizer

Published On: 2013-05-09 00:00:00

Go's standard lib comes with a number of useful tools. Each one of them written in an idiomatic fashion illustrating how to design good api's in Go. One of the most informative and useful are the io packages. both the "io" and "bufio" packages are examples of beautiful and idiomatic Go. I've recently decided to look at using the bufio.Scanner to build a tokenizer for a very simple lisp. The tokenizer's job is to emit a stream of tokens for each element of the lisp's syntax. The syntax is very basic with only the following tokens.

  • Whitespace a series of newline, linefeed, tab or space characters
  • StartExpr: a single '('
  • EndExpr: a single ')'
  • String: anything between double or single quotes with \ escaping the next character.
  • Number: any series of numbers surrounded by whitespace
  • Word: any series of characters surrounded by whitespace and starting with a-z or A-Z

The tokenizer we are going to build should emit a stream of these tokens for us while also keeping track of where in the file these tokens occur. [Go's bufio package](http://golang.org/pkg/bufio) has a useful general purpose Scanner type that splits an io.Reader into a series of byte slices for you. Our first task is to figure out how to use the Scanner to split up any reader into a stream of slices representing each of these tokens. The Scanner type has 5 methods. We'll be focusing on just 4 of them.

  • The Scan method scans the stream looking for the next byte slice to emit. It returns false if the Scan hit an error or EOF.
  • The Bytes method returns the current byte slice after Scan has run.
  • The Err method returns the last non io.EOF error encountered by Scan.
  • The Split method sets the SplitFunc to use for the next Scan.

Together these give us all the tools necessary to tokenize any io.Reader. The first step is to create a Scanner to use.

package main

import (
       "strings"
       "bufio"
       "fmt"
)

func main() {
     s := bufio.NewScanner(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')"))
     for s.Scan() {
         tok := s.Bytes()
         fmt.Println("Found token ", tok)
     }
}

You can play with this code here http://play.golang.org/p/fhOV_4fVOI. Out of the box the scanner splits the io.Reader into lines which is obviously not what we want. In order to let us specify our own rules for splitting bufio.Scanner uses composition. We can pass in our splitting logic in the form of a SplitFunc. A SplitFunc takes a candidate slice of data for splitting and returns how far to advance the position in the reader stream, the byte slice to return for this split, and an error if any is needed. Below you'll find the basic skeleton for our splitting logic.

package main

import (
       "strings"
       "bufio"
       "fmt"
)

func consumeWord(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func consumeWhitespace(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func consumeNum(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func consumeString(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func main() {
     s := bufio.NewScanner(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')"))
     split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// Because our grammar is simple we can switch off the first
		// character in the reader.
		switch data[0] {
		case '(', ')':
			advance, token, err = 1, data[:1], nil
		case '"', '\'':
			advance, token, err = consumeString(data)
		case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
			advance, token, err = consumeNum(data)
		case ' ', '\n', '\r', '\t':
			advance, token, err = consumeWhitespace(data)
		default:
			advance, token, err = consumeWord(data)
		}
		return
	}
	s.Split(split)
     for s.Scan() {
         tok := s.Bytes()
         fmt.Println("Found token ", tok)
     }
}

The split func inspects the start of our candidate slice and determines how it should handle it. The parens case is easy so we implented it first by telling the Scanner to advance by one byte and emit a byte slice of size one containing the paren and nil for the error. We've defined names for consume functions for each of our other tokens but left them unimplemented so far. We'll start with consumeWord to give an example of how they should work.

func consumeWord(data []byte) (int, []byte, error) {
	var accum []byte
	for i, b := range data {
		if b == ' ' || b == '\n' || b == '\t' || b == '\r' {
			return i, accum, nil
		} else {
			accum = append(accum, b)
		}
	}
	return 0, nil, nil
}

Tokenizing a Word token is easy. A Word starts with any character that isn't a quote, paren or Number and they continue until you hit whitespace. Our switch statement handles the quote, paren, and number cases specially and anything left over is a Word. Then we call consumeWord with the candidate slice. consumeWords job is to scan the candidate for whitespace accumulating as we go. We could if we want bypass the accumulator but this was easier to understand for now. If no whitespace is seen then we don't have a full word yet so return 0, nil, nil. If we do encounter any whitespace then return the amount to advance, not including the whitespace, as well the word we've accumulated and nil. The scanner will keep trying our split function on increasing slice sizes until it either gets an error, non-zero advance or hits the end of the stream. We code similar consume functions for the Whitespace, String, and Number tokens.

func consumeWhitespace(data []byte) (int, []byte, error) {
	var accum []byte
	for i, b := range data {
		if b == ' ' || b == '\n' || b == '\t' || b == '\r' {
			accum = append(accum, b)
		} else {
			return i, accum, nil
		}
	}
	return 0, nil, nil
}

func consumeNum(data []byte) (int, []byte, error) {
	var accum []byte
	for i, b := range data {
		if '0' <= b && b <= '9' {
			accum = append(accum, b)
		} else {
			return i, accum, nil
		}
	}
	return len(accum), accum, nil
}

func consumeString(data []byte) (int, []byte, error) {
	delim := data[0]
	skip := false
	accum := []byte{data[0]}
	for i, b := range data[1:] {
		accum = append(accum, b)
		if b == delim && !skip {
			return i + 2, accum, nil
		}
		skip = !skip && b == '\\'
	}
	return 0, nil, nil
}

If you play with this code here http://play.golang.org/p/dksq4YjJtb. then you'll see it emit one of each of our tokens.

We still aren't done though. This Scanner while useful doesn't tell us our Token Locations. We need it to tell us the Line and column that each token starts on. To do this we'll use composition again. Only this time we'll compose two structs using Go's embedded fields.

type lineTrackingReader struct {
	// Embedded Scanner so our LineTrackingReader can be used just like
        // a Scanner.
	*bufio.Scanner
	lastL, lastCol int // Position Tracking fields.
	l, col         int
}

func newTrackingReader(r io.Reader) *lineTrackingReader {
	s := bufio.NewScanner(r)
	rdr := &lineTrackingReader{
		Scanner: s,
	}
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if rdr.l == 0 { // Initiliaze our position tracking fields using a split like a closure.
			rdr.l = 1
			rdr.col = 1
			rdr.lastL = 1
			rdr.lastCol = 1
		}
		switch data[0] {
		case '(', ')':
			advance, token, err = 1, data[:1], nil
		case '"', '\'': // TODO(jwall): Rune data?
			advance, token, err = consumeString(data)
		case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
			advance, token, err = consumeNum(data)
		case ' ', '\n', '\r', '\t':
			advance, token, err = consumeWhitespace(data)
		default:
			advance, token, err = consumeWord(data)
		}
		// If we are advancing then we should see if there are any line
		// endings here
		if advance > 0 {
			rdr.lastCol = rdr.col
			rdr.lastL = rdr.l
			for _, b := range data[:advance] {
				if b == '\n' || atEOF { // Treat eof as a newline
					rdr.l++
					rdr.col = 1
				}
				rdr.col++
			}
		}

		return
	}
	s.Split(split)
	return rdr
}

func main() {
	s := newTrackingReader(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')"))

	for s.Scan() {
		tok := s.Bytes()
		fmt.Printf(
			"Found token %q at line #%d col #%d\n",
			string(tok), s.Line(), s.Column)
	}
}

func (l *lineTrackingReader) Line() int {
	return l.lastL
}

func (l *lineTrackingReader) Column() int {
	return l.lastCol
}

You can play with the finished code here http://play.golang.org/p/_omf7AGefg.


Tags:

Using bufio.Scanner to build a Tokenizer

Published On: 2013-05-09 00:00:00

Go's standard lib comes with a number of useful tools. Each one of them written in an idiomatic fashion illustrating how to design good api's in Go. One of the most informative and useful are the io packages. both the "io" and "bufio" packages are examples of beautiful and idiomatic Go. I've recently decided to look at using the bufio.Scanner to build a tokenizer for a very simple lisp. The tokenizer's job is to emit a stream of tokens for each element of the lisp's syntax. The syntax is very basic with only the following tokens.

  • Whitespace a series of newline, linefeed, tab or space characters
  • StartExpr: a single '('
  • EndExpr: a single ')'
  • String: anything between double or single quotes with \ escaping the next character.
  • Number: any series of numbers surrounded by whitespace
  • Word: any series of characters surrounded by whitespace and starting with a-z or A-Z

The tokenizer we are going to build should emit a stream of these tokens for us while also keeping track of where in the file these tokens occur. [Go's bufio package](http://golang.org/pkg/bufio) has a useful general purpose Scanner type that splits an io.Reader into a series of byte slices for you. Our first task is to figure out how to use the Scanner to split up any reader into a stream of slices representing each of these tokens. The Scanner type has 5 methods. We'll be focusing on just 4 of them.

  • The Scan method scans the stream looking for the next byte slice to emit. It returns false if the Scan hit an error or EOF.
  • The Bytes method returns the current byte slice after Scan has run.
  • The Err method returns the last non io.EOF error encountered by Scan.
  • The Split method sets the SplitFunc to use for the next Scan.

Together these give us all the tools necessary to tokenize any io.Reader. The first step is to create a Scanner to use.

package main

import (
       "strings"
       "bufio"
       "fmt"
)

func main() {
     s := bufio.NewScanner(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')"))
     for s.Scan() {
         tok := s.Bytes()
         fmt.Println("Found token ", tok)
     }
}

You can play with this code here http://play.golang.org/p/fhOV_4fVOI. Out of the box the scanner splits the io.Reader into lines which is obviously not what we want. In order to let us specify our own rules for splitting bufio.Scanner uses composition. We can pass in our splitting logic in the form of a SplitFunc. A SplitFunc takes a candidate slice of data for splitting and returns how far to advance the position in the reader stream, the byte slice to return for this split, and an error if any is needed. Below you'll find the basic skeleton for our splitting logic.

package main

import (
       "strings"
       "bufio"
       "fmt"
)

func consumeWord(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func consumeWhitespace(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func consumeNum(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func consumeString(data []byte) (int, []byte, error) {
	// TODO
	return 0, nil, nil
}

func main() {
     s := bufio.NewScanner(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')"))
     split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// Because our grammar is simple we can switch off the first
		// character in the reader.
		switch data[0] {
		case '(', ')':
			advance, token, err = 1, data[:1], nil
		case '"', '\'':
			advance, token, err = consumeString(data)
		case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
			advance, token, err = consumeNum(data)
		case ' ', '\n', '\r', '\t':
			advance, token, err = consumeWhitespace(data)
		default:
			advance, token, err = consumeWord(data)
		}
		return
	}
	s.Split(split)
     for s.Scan() {
         tok := s.Bytes()
         fmt.Println("Found token ", tok)
     }
}

The split func inspects the start of our candidate slice and determines how it should handle it. The parens case is easy so we implented it first by telling the Scanner to advance by one byte and emit a byte slice of size one containing the paren and nil for the error. We've defined names for consume functions for each of our other tokens but left them unimplemented so far. We'll start with consumeWord to give an example of how they should work.

func consumeWord(data []byte) (int, []byte, error) {
	var accum []byte
	for i, b := range data {
		if b == ' ' || b == '\n' || b == '\t' || b == '\r' {
			return i, accum, nil
		} else {
			accum = append(accum, b)
		}
	}
	return 0, nil, nil
}

Tokenizing a Word token is easy. A Word starts with any character that isn't a quote, paren or Number and they continue until you hit whitespace. Our switch statement handles the quote, paren, and number cases specially and anything left over is a Word. Then we call consumeWord with the candidate slice. consumeWords job is to scan the candidate for whitespace accumulating as we go. We could if we want bypass the accumulator but this was easier to understand for now. If no whitespace is seen then we don't have a full word yet so return 0, nil, nil. If we do encounter any whitespace then return the amount to advance, not including the whitespace, as well the word we've accumulated and nil. The scanner will keep trying our split function on increasing slice sizes until it either gets an error, non-zero advance or hits the end of the stream. We code similar consume functions for the Whitespace, String, and Number tokens.

func consumeWhitespace(data []byte) (int, []byte, error) {
	var accum []byte
	for i, b := range data {
		if b == ' ' || b == '\n' || b == '\t' || b == '\r' {
			accum = append(accum, b)
		} else {
			return i, accum, nil
		}
	}
	return 0, nil, nil
}

func consumeNum(data []byte) (int, []byte, error) {
	var accum []byte
	for i, b := range data {
		if '0' <= b && b <= '9' {
			accum = append(accum, b)
		} else {
			return i, accum, nil
		}
	}
	return len(accum), accum, nil
}

func consumeString(data []byte) (int, []byte, error) {
	delim := data[0]
	skip := false
	accum := []byte{data[0]}
	for i, b := range data[1:] {
		if b == delim && !skip {
			return i + 2, accum, nil
		}
		skip = false
		if b == '\\' {
			skip = true
			continue
		}
		accum = append(accum, b)
	}
	return 0, nil, nil
}

If you play with this code here http://play.golang.org/p/dksq4YjJtb. then you'll see it emit one of each of our tokens.

We still aren't done though. This Scanner while useful doesn't tell us our Token Locations. We need it to tell us the Line and column that each token starts on. To do this we'll use composition again. Only this time we'll compose two structs using Go's embedded fields.

type lineTrackingReader struct {
	// Embedded Scanner so our LineTrackingReader can be used just like
        // a Scanner.
	*bufio.Scanner
	lastL, lastCol int // Position Tracking fields.
	l, col         int
}

func newTrackingReader(r io.Reader) *lineTrackingReader {
	s := bufio.NewScanner(r)
	rdr := &lineTrackingReader{
		Scanner: s,
	}
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if rdr.l == 0 { // Initiliaze our position tracking fields using a split like a closure.
			rdr.l = 1
			rdr.col = 1
			rdr.lastL = 1
			rdr.lastCol = 1
		}
		switch data[0] {
		case '(', ')':
			advance, token, err = 1, data[:1], nil
		case '"', '\'': // TODO(jwall): Rune data?
			advance, token, err = consumeString(data)
		case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
			advance, token, err = consumeNum(data)
		case ' ', '\n', '\r', '\t':
			advance, token, err = consumeWhitespace(data)
		default:
			advance, token, err = consumeWord(data)
		}
		// If we are advancing then we should see if there are any line
		// endings here
		if advance > 0 {
			rdr.lastCol = rdr.col
			rdr.lastL = rdr.l
			for _, b := range data[:advance] {
				if b == '\n' || atEOF { // Treat eof as a newline
					rdr.l++
					rdr.col = 1
				}
				rdr.col++
			}
		}

		return
	}
	s.Split(split)
	return rdr
}

func main() {
	s := newTrackingReader(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')"))

	for s.Scan() {
		tok := s.Bytes()
		fmt.Printf(
			"Found token %q at line #%d col #%d\n",
			string(tok), s.Line(), s.Column)
	}
}

func (l *lineTrackingReader) Line() int {
	return l.lastL
}

func (l *lineTrackingReader) Column() int {
	return l.lastCol
}

You can play with the finished code here http://play.golang.org/p/_omf7AGefg.


Tags:

go-html-transform an html transformation and scraping library

Published On: 2013-02-26 17:05:00

Status: DRAFT

http://code.google.com/p/go-html-transform is my html transformation library for go. I use it as an html templating language and scraping library. It's not your typical approach to html templating but it's an approach I've really come to enjoy. HTML templating can be grouped in roughly about 3 categories.

  1. Templating languages.
  2. HTML DSLs.
  3. Functional transforms.

go-html-transform is an example of that last one. The basic theory is that an html template is just data. No logic is in the template. All the logic is in the functions that operate on the template and any input data. Using the input data you can transfrom a template and then render the transformed AST back into html. This has a number of benefits.

  • Your template transforms are context aware.
  • Multipass templating is just another transform.
  • All your logic is expressed in real honest to goodness code not a limited templating language. In the case of go-html-transform your templating logic is actually typechecked by the go compiler.
  • It's impossible to generate bad html.
  • Your mocks are your templates.
  • You can use an html dsl in combination with this approach as well if the dsl outputs the same AST.

Example usage.

package main

import (
  "strings"
  "os"

  "code.google.com/p/go-html-transform/html/transform"
  "code.google.com/p/go-html-transform/h5"
)

func toSSL(url string) string {
  return strings.Replace("http:", "https:", 1)
}

func main() {
  f, err := os.Open("~/file.html")
  defer f.Close()
  if err != nil { return } // handle errors here.
  tree, err := transform.NewDocFromReader(f)
  if err != nil { return } // handle errors here.
  t := transform.NewTransformer(tree)
  t.ApplyAll(
    Trans(ReplaceChildren(h5.Text("foo"), "span"), // replace every span tags contents with foo
    // turn every link and img into an ssl link
    Trans(TransformAttrib("href", toSSL), "a"),
    Trans(TransformAttrib("src", toSSL), "img"),
  )

  t.Render(os.Stdout) // render html to stdout.
}

Tags:

From Homeless to a Career

Published On: 2013-02-22 12:14:00

A number of folks have requested I write up the story of how I went from homeless with kids and a wife to a career as a software developer. The story could start from any number of places but I'll start mine as a young teenager. Our family didn't have a lot of money. I wasn't really aware of it since my parents did a good job of ensuring we always had food and a place to live. One year though they took the tax return and used it to purchase a computer. That computer opened up a whole world to me. It came with a copy of gw-basic and some games written in basic that I could play. Not only could I play them but I could see the source code when I started them up. My first ventures into programming were modifying those first games. Mostly so I could cheat. This is s a perfectly hacker like mentality. The challenge of cheating was more interesting than playing the actual game.

Before long though, I started to wonder if I could do more. I also loved science fiction at the time and I had read a book, whose title I no longer remember, that featured Conways Game of Life as a part of the plot. The rules, of which there were only 4, seemed very well suited to a program so my first truly creative endeavor with basic was programming a version of the Game of Life. It would be many years later before I discovered on the internet how popular this game was with geeky computer people like me.

Fast forward a few years and I'm on my way to college. I wasn't as you might expect going for a computer science degree though. I wanted to be a virus hunter for the CDC. I loved biology and viruses were just really cool. I got a 95% paid scholarship to Missouri Baptist College in St. Louis and away I go. While I'm there I meet a wonderful girl, the love of my life. We end up getting married and I take a fulltime job at the college which had two benefits that were really appealing. It was a full ride tuition to the school. It was fulltime income to support my new family. Things seemed like they were progressing just they way I wanted.

What I didn't know was how hard it is to attend school, work full-time, and keep a marriage going. My progression in school slowed and to top it off I began to get the feeling that working in a biology lab somewhere probably wasn't going to be what I really wanted to do with my life. Cue life crisis. Sudenly I didn't really know what I should do next. I was coasting. The whole time my hobby with computers continued to hang in the background. Eventually the college ends up laying me off. I lose the full ride, I no longer have the scholarships since I had to give those up when I took the full ride. I have a family with 4 kids and no job. It all starts crashing down. What am I supposed to do now?

Of course when you don't have a job it's hard to make the rent, buy food, and otherwise make ends meet. I go through a succession of different jobs never really holding on to any of them until finally it all culminates with my family being homeless. My wife is sticking by me this whole time never once questioning her choice to marry me. Something that to this day still amazes me. We end up having to live with family and I'm working at labor ready to make ends meet. And the whole while my hobby is still hanging in the background. There were days when I thought I was going to have to sell my computer to get food. My wife always insisted I keep it though. I think she knew this was going to be the key to getting out of the whole we were in someday.

So there I am I have a computer, an internet connection. And pretty much nothing else. What's a Guy supposed to do when he's not job searching? Of Course! Hang out on the internet. Two very influential Open Source projects became my life at this point. Blender and Planeshift. I hung out in their irc chatrooms and message boards I downloaded the code and I started hacking in earnest. I learned C and C++ on those projects and more importantly I interacted with people who not only knew how to code but did it professionally and managed a large project. Keep in mind this is still all just my hobby.

Eventually I get a full time job working for a company managing networks in school districts in Southern Illinois and Missouri. We start to pull out of the hole we were in financially and even purchase a house. A dirt cheap house that needed a lot of work but things were looking up. Then disaster strikes again. My boss at the company had never quite figured out what to do with me. He knew I had some experience as a programmer in my hobby and I knew my way around a network. He couldn't quite decide if I was going to be a part of the new programming division or just another network admin. Eventually that indecision led to my being let go. Which could have been a problem except that I had made some contacts with people through the job and was able to immediately get work as a contractor for a local company doing web development in perl.

I start updating my blog with what I learn while working and even publish a CPAN module or two. I'm pretty satisified with where I landed and figure this is how my life will be for the forseeable future. I'm not looking for a job but my blog and personal network in open source have started to get me a little attention. Not much but enough that one day a recruiter from Chicago cold calls me. I've gotten used to being a small time operator and getting refused at all my attempts to get a job in the industry so I'm assuming of course that this is a fluke and I'll never get the job. I do a phone interview, then another phone interview, then they call me in for a face to face. I start to believe that maybe I have a chance here. After the face to face I wait and they call me back saying I've got the job. Not only that but I discover the company is getting bought by Google pending approval from various governments. Long story short I get hired, the company purchase goes through, I survive the transition to Google and I find myself working at one of the top companies in my new career.

So why is this story interesting? I think it highlights a few things that are still true about the Software Development industry today.

  1. The most important contributor to your success is your desire to learn and your persistence.
  2. Programming has more free and self driven avenues for learning than any other subject matter. And the resources have only gotten better since I started.
  3. OpenSource is a merit based approach to gaining credentials and learning in this field.

So, if your passion is Software Development, there is literally nothing standing in your way except you.

One Last thing

Throughout all of the above events I had family and friends praying for me. I'm a devout christrian and I strongly believe I'm at Google now because God wants me there. Their support was a large part of my success as well.


Tags:

A long overdue update

Published On: 2013-02-13 22:24:00

Hello there,

I haven't updated this blog in over two years. That's a long time for something on the internet. I still get traffic for various articles though so I figured I should probably get back into the swing of things.

I've been working at Google as some of you may already know. I also still do a lot of hacking in my spare time. Lately most of it has been in Go. But I do the occasional thing in Javascript, Erlang, and Haskell as well.

So lets take stock shall we?

  • Latest Projects:
  • Revamped this blog's html and css. I do hope the colors and typography are better. I'll be continuing to tweak it over the next few weeks.
  • Teaching some classes in Downtown chicago for friends.

I hope to write some articles about Go soon as well. The first will probably be about go-html-transform.

I've also been heavily active in church. We attend FBC Lansing now. It's a healthy church and I now serve on the Deacon board as well as teaching Sunday School and Youth Bible study.


Tags: