Marzhill Musings

... Next>>

Using bufio.Scanner to build a Tokenizer

Published On: 2013-05-09 00:00:00

Go's standard lib comes with a number of useful tools. Each one of them written in an idiomatic fashion illustrating how to design good api's in Go. One of the most informative and useful are the io packages. both the "io" and "bufio" packages are examples of beautiful and idiomatic Go. I've recently decided to look at using the bufio.Scanner to build a tokenizer for a very simple lisp. The tokenizer's job is to emit a stream of tokens for each element of the lisp's syntax. The syntax is very basic with only the following tokens.

  • Whitespace a series of newline, linefeed, tab or space characters
  • StartExpr: a single '('
  • EndExpr: a single ')'
  • String: anything between double or single quotes with \ escaping the next character.
  • Number: any series of numbers surrounded by whitespace
  • Word: any series of characters surrounded by whitespace and starting with a-z or A-Z

The tokenizer we are going to build should emit a stream of these tokens for us while also keeping track of where in the file these tokens occur. Go's bufio package has a useful general purpose Scanner type that splits an io.Reader into a series of byte slices for you. Our first task is to figure out how to use the Scanner to split up any reader into a stream of slices representing each of these tokens. The Scanner type has 5 methods. We'll be focusing on just 4 of them.

  • The Scan method scans the stream looking for the next byte slice to emit. It returns false if the Scan hit an error or EOF.
  • The Bytes method returns the current byte slice after Scan has run.
  • The Err method returns the last non io.EOF error encountered by Scan.
  • The Split method sets the SplitFunc to use for the next Scan.

Together these give us all the tools necessary to tokenize any io.Reader. The first step is to create a Scanner to use. 1 2 package main 3 4 import ( 5 "strings" 6 "bufio" 7 "fmt" 8 ) 9 10 func main() { 11 s := bufio.NewScanner(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')")) 12 for s.Scan() { 13 tok := s.Bytes() 14 fmt.Println("Found token ", tok) 15 } 16 }

You can play with this code here http://play.golang.org/p/fhOV_4fVOI. Out of the box the scanner splits the io.Reader into lines which is obviously not what we want. In order to let us specify our own rules for splitting bufio.Scanner uses composition. We can pass in our splitting logic in the form of a SplitFunc. A SplitFunc takes a candidate slice of data for splitting and returns how far to advance the position in the reader stream, the byte slice to return for this split, and an error if any is needed. Below you'll find the basic skeleton for our splitting logic. 1 2 package main 3 4 import ( 5 "strings" 6 "bufio" 7 "fmt" 8 ) 9 10 func consumeWord(data []byte) (int, []byte, error) { 11 // TODO 12 return 0, nil, nil 13 } 14 15 func consumeWhitespace(data []byte) (int, []byte, error) { 16 // TODO 17 return 0, nil, nil 18 } 19 20 func consumeNum(data []byte) (int, []byte, error) { 21 // TODO 22 return 0, nil, nil 23 } 24 25 func consumeString(data []byte) (int, []byte, error) { 26 // TODO 27 return 0, nil, nil 28 } 29 30 func main() { 31 s := bufio.NewScanner(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')")) 32 split := func(data []byte, atEOF bool) (advance int, token []byte, err error) { 33 // Because our grammar is simple we can switch off the first 34 // character in the reader. 35 switch data[0] { 36 case '(', ')': 37 advance, token, err = 1, data[:1], nil 38 case '"', '\'': 39 advance, token, err = consumeString(data) 40 case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9': 41 advance, token, err = consumeNum(data) 42 case ' ', '\n', '\r', '\t': 43 advance, token, err = consumeWhitespace(data) 44 default: 45 advance, token, err = consumeWord(data) 46 } 47 return 48 } 49 s.Split(split) 50 for s.Scan() { 51 tok := s.Bytes() 52 fmt.Println("Found token ", tok) 53 } 54 }

The split func inspects the start of our candidate slice and determines how it should handle it. The parens case is easy so we implented it first by telling the Scanner to advance by one byte and emit a byte slice of size one containing the paren and nil for the error. We've defined names for consume functions for each of our other tokens but left them unimplemented so far. We'll start with consumeWord to give an example of how they should work. 1 2 func consumeWord(data []byte) (int, []byte, error) { 3 var accum []byte 4 for i, b := range data { 5 if b == ' ' || b == '\n' || b == '\t' || b == '\r' { 6 return i, accum, nil 7 } else { 8 accum = append(accum, b) 9 } 10 } 11 return 0, nil, nil 12 }

Tokenizing a Word token is easy. A Word starts with any character that isn't a quote, paren or Number and they continue until you hit whitespace. Our switch statement handles the quote, paren, and number cases specially and anything left over is a Word. Then we call consumeWord with the candidate slice. consumeWords job is to scan the candidate for whitespace accumulating as we go. We could if we want bypass the accumulator but this was easier to understand for now. If no whitespace is seen then we don't have a full word yet so return 0, nil, nil. If we do encounter any whitespace then return the amount to advance, not including the whitespace, as well the word we've accumulated and nil. The scanner will keep trying our split function on increasing slice sizes until it either gets an error, non-zero advance or hits the end of the stream. We code similar consume functions for the Whitespace, String, and Number tokens. 1 2 func consumeWhitespace(data []byte) (int, []byte, error) { 3 var accum []byte 4 for i, b := range data { 5 if b == ' ' || b == '\n' || b == '\t' || b == '\r' { 6 accum = append(accum, b) 7 } else { 8 return i, accum, nil 9 } 10 } 11 return 0, nil, nil 12 } 13 14 func consumeNum(data []byte) (int, []byte, error) { 15 var accum []byte 16 for i, b := range data { 17 if '0' <= b && b <= '9' { 18 accum = append(accum, b) 19 } else { 20 return i, accum, nil 21 } 22 } 23 return len(accum), accum, nil 24 } 25 26 func consumeString(data []byte) (int, []byte, error) { 27 delim := data[0] 28 skip := false 29 accum := []byte{data[0]} 30 for i, b := range data[1:] { 31 if b == delim && !skip { 32 return i + 2, accum, nil 33 } 34 skip = false 35 if b == '\\' { 36 skip = true 37 continue 38 } 39 accum = append(accum, b) 40 } 41 return 0, nil, nil 42 }

If you play with this code here http://play.golang.org/p/dksq4YjJtb. then you'll see it emit one of each of our tokens.

We still aren't done though. This Scanner while useful doesn't tell us our Token Locations. We need it to tell us the Line and column that each token starts on. To do this we'll use composition again. Only this time we'll compose two structs using Go's embedded fields. 1 2 type lineTrackingReader struct { 3 // Embedded Scanner so our LineTrackingReader can be used just like 4 // a Scanner. 5 *bufio.Scanner 6 lastL, lastCol int // Position Tracking fields. 7 l, col int 8 } 9 10 func newTrackingReader(r io.Reader) *lineTrackingReader { 11 s := bufio.NewScanner(r) 12 rdr := &lineTrackingReader{ 13 Scanner: s, 14 } 15 split := func(data []byte, atEOF bool) (advance int, token []byte, err error) { 16 // Initiliaze our position tracking fields using a split like a closure. 17 if rdr.l == 0 { 18 rdr.l = 1 19 rdr.col = 1 20 rdr.lastL = 1 21 rdr.lastCol = 1 22 } 23 switch data[0] { 24 case '(', ')': 25 advance, token, err = 1, data[:1], nil 26 case '"', '\'': // TODO(jwall): Rune data? 27 advance, token, err = consumeString(data) 28 case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9': 29 advance, token, err = consumeNum(data) 30 case ' ', '\n', '\r', '\t': 31 advance, token, err = consumeWhitespace(data) 32 default: 33 advance, token, err = consumeWord(data) 34 } 35 // If we are advancing then we should see if there are any line 36 // endings here 37 if advance > 0 { 38 rdr.lastCol = rdr.col 39 rdr.lastL = rdr.l 40 for _, b := range data[:advance] { 41 if b == '\n' || atEOF { // Treat eof as a newline 42 rdr.l++ 43 rdr.col = 1 44 } 45 rdr.col++ 46 } 47 } 48 49 return 50 } 51 s.Split(split) 52 return rdr 53 } 54 55 func main() { 56 s := newTrackingReader(strings.NewReader("(foo 1 'bar')\n(baz 2 'quux')")) 57 58 for s.Scan() { 59 tok := s.Bytes() 60 fmt.Printf("Found token %q at line #%d col #%d\n", 61 string(tok), s.Line(), s.Column) 62 } 63 } 64 65 func (l *lineTrackingReader) Line() int { 66 return l.lastL 67 } 68 69 func (l *lineTrackingReader) Column() int { 70 return l.lastCol 71 }

You can play with the finished code here http://play.golang.org/p/_omf7AGefg.


Tags:

Advanced Nitrogen Elements

Published On: 2009-06-16 23:33:58
In my last post I walked you through creating a basic nitrogen element. In this one I'll be covering some of the more advanced topics in nitrogen elements.
  • Event handlers and delegation
  • scripts and dynamic javascript postbacks

Nitrogen Event handlers

Nitrogen event handlers get called for any nitrogen event. A nitrogen event is specified by assigning #event to an actions attribute of a nitrogen element. The event handler in the pages module will get called with the postback of the event. Postbacks are an attribute of the event record and define the event that was fired. To handle the event you create an event function in the target module that matches your postback. For Example: 1 2 % given this event 3 #event{ type=click, postback={click, Id} } 4 % this event function would handle it 5 event({click, ClickedId}) -> 6 io:format("I [~p] was clicked", [ClickedId]) . Erlangs pattern matching makes it especially well suited for this kind of event based programming. The one annoying limitation of this event though is that each page has to handle it individually. You could of course create a dispatching module that handled the event for you but why when nitrogen already did it for you. You can delegate an event to a specific module by setting the delegate attribute to the atom identifying that module. 1 2 % delgated event 3 #event{ type=click, postback={click, Id}, delegate=my_module } You can delgate to any module you want. I use the general rule of thumb that if the event affects other elements on the page then the page module should probably handle it. If, however, the event doesn't affect other elements on the page then the element's module can handle it.

Scripts and Dynamic Postback

Now lets get make it a little more interesting. Imagine a scenario where we want to interact with some javascript on a page and dynamically generate data to send back to nitrogen. As an example lets create a silly element that grabs the mouse coordinates of a click on the element and sends that back to nitrogen. A first attempt might look something like so: 1 2 -record(silly, {?ELEMENT_BASE(element_silly)}). And the module is likewise simple: 1 2 -module(element_silly). 3 -compile(export_all). 4 -include("elements.hrl"). 5 -include_lib("nitrogen/include/wf.inc"). 6 render(ControlId, R) -> 7 Id = wf:temp_id(), 8 %% wait!! where do we get the loc from?! 9 ClickEvent = #event{type=click, postback={click, Loc}} 10 Panel = #panel{id=Id, style="width:'100px' height='100px'", 11 actions=ClickEvent}, element_panel:render(Panel). 12 13 event({click, Loc}) -> 14 wf:update(body, wf:f("you clicked at point: ~s", Loc)). Well of course you spot the problem here. Since the click happens client side we don't know what to put in the Loc variable for the postback. A typical postback won't work because the data will be generated in the client and not the Nitrogen server. So how could we get the value of the coordinates sent back? The javascript to grab the coordinates with jquery looks like this: 1 2 var coord = obj('me').pageX + obj('me').pageY; To plug that in to the click event is pretty easy since action fields in an event can hold other events or javascript or a list combining both: 1 2 Script = "var coord = obj('me').pageX + obj('me').pageY;", 3 ClickEvent = #event{type=click, postback={click, Loc}, actions=Script} Now we've managed to capture the coordinates of the mouse click, but we still haven't sent it back to the server. This javascript needs a little help. What we need is a drop box. Lets enhance our element with a few helpers: 1 2 -module(element_silly). 3 -compile(export_all). 4 -include("elements.hrl"). 5 -include_lib("nitrogen/include/wf.inc"). 6 render(ControlId, R) -> 7 Id = wf:temp_id(), 8 DropBoxId = wf:temp_id(), 9 MsgId = wf:temp_id(), 10 Script = wf:f("var coord = obj('me').pageX + obj('me').pageY; $('~s').value = coord;", 11 [DropBoxId]), 12 ClickEvent = #event{type=click, postback={click, Id, MsgId}, 13 actions=Script}, 14 Panel = #panel{id=Id, style="width:'100px'; height='100px'", 15 actions=ClickEvent, body=[#hidden{id=DropBoxId}, 16 #panel{id=MsgId}]}, 17 element_panel:render(Panel). 18 19 event({click, Id, Msg}) -> 20 Loc = hd(wf:q(Id)), 21 wf:update(Msg, wf:f("you clicked at point: ~s", Loc)). Ahhh there we go. Now our element when clicked will:
  1. use javascript to grab the coordinates of the mouse click
  2. use javascript to store those coordinates in the hidden element
  3. use a postback to send the click event back to a nitrogen event handler with the id of the hidden element where it stored the coordinates.
We have managed to grab dynamically generated data from the client side and drop it somehwere that nitrogen can retrieve it. In the process we have used an event handler, custom javascript, and dynamic javascript postbacks. Edit: Corrected typo - June 16, 2009 at 11:40 pm

Tags:

Creating Custom Nitrogen Elements

Published On: 2009-05-22 19:05:57
Nitrogen is a web framework written in erlang for Fast AJAX Web applications. You can get Nitrogen on github Nitrogen comes with a set of useful controls, or elements in nitrogen parlance, but if you are going to do anything more fancy than a basic hello world you probably want to create some custom controls. This tutorial will walk you through the ins and outs of writing a custom element for Nitrogen. We will be creating a simple notification element similar to one I use in the Iterate! project. It will need to be able to:
  • show a message
  • have a way to dismiss it
  • and optionally expire and disappear after a configurable period of time
Every Nitrogen element has two main pieces: the Record and the Module. I'll go through each in order and walk you through creating our notification element.

The Record

The record defines all the state required to create a nitrogen element. Every record needs a certain base set of fields. These fields can be added to your record with the ?ELEMENT_BASE macro. The macro is available in the nitrogen include file wf.inc. That include file also gives you access to all the included nitrogen element records. Below you can see the record definition for our notify element. Since it is very simple in it's design it only needs the base elements and two additional fields. expire to handle our optional expiration time and default to false to indicate no expiration. msg to hold the contents of our notification.
%Put this line in an include file for your elements
-record(notify, {?ELEMENT_BASE(element_notify), expire=false, msg}).
% put these at the top of your elements module
-include_lib("nitrogen/include/wf.inc").
% the above mentioned include file you may call it whatever you want
-include("elements.inc").
The ELEMENT_BASE macro gives your element several fields and identifies for the element which module handles the rendering of your nitrogen element. You can specify any module you want but the convention is to name the module with element_element_name. The fields provided are: id, class, style, actions, and show_if. You can use them as you wish when it comes time to render your element. Which brings us to the module.

The Module

Of the two pieces of a nitrogen element the module does the manual labor. It renders and in some cases defines the handlers for events fired by the element. The module must export a render/2 function. This function will be called whenever nitrogen needs to render a particular instance of your element. It's two arguments are: The ControlId, and the Record defining this element instance. Of these the ControlID is probably the least understood. It is passed into your render method by nitrogen and is the assigned HTML Id for your particular element. This is important to understand because, when you call the next render method in your elements tree, you will have to pass an ID on. The rule of thumb I use is that if you want to use a different Id for your toplevel element then you can ignore the ControlId. Otherwise you should use it as the id for your toplevel element in the control. So your element's module should start out with something like this:
-module(element_notify).
-compile(export_all).
-include_lib("nitrogen/include/wf.inc").
-include("elements.hrl").
% give us a way to inspect the fields of this elements record
% useful in the shell where record_info isn't available
reflect() -> record_info(fields, notify).
% Render the custom element
render(ControlId, R) ->
    % get a temp id for our notify element instance
    Id = ControlId,
    % Our toplevel of the element will be a panel (div)
    Panel = #panel{id=Id},
    % the element_panel module is used to render the panel element
    element_panel:render(Id, Panel),
    % Or use the alternative method:
    Module = Panel#panel.module,
    Module:render(Id, Panel).
Notice that the records module attribute tells us what module we should call to render the element in the alternative method. In our case we will just hardcode the module since it's known to us. So now we have a basic element that renders a div with a temp id to our page. That's not terribly useful though. We actually need this element to render our msg, and with some events attached. Lets add the code to add our message to the panels contents.
Panel = #panel{id=Id, body=R#notify.msg},
element_panel:render(ControlId, Panel)
Now whatever is in the msg attribute of our notify record will be in the body of the panel when it gets rendered. All we need is a way to dismiss it. A link should do the trick. But now we have a slight problem. In order to add our dismiss link we need to add it to the body of the Panel. but the msg is already occupying that space. We could use a list and prepend the link to the end of the list for the body but that doesn't really give us a lot of control over styling the element. what we really need is for the msg to be in an inner panel and the outer panel will hold any controls the element needs.
Link = #link{text="dismiss"},
InnerPanel = #panel{body=R#notify.msg},
Panel = #panel{id=Id, body=[InnerPanel,Link]},
element_panel:render(ControlId, Panel)
Our link doesn't actually dismiss the notification yet though. To add that we need to add a click event to the link. Nitrogen has a large set of events and effects available. You can find them . We will be using the click event and the hide effect.
Event = #event{type=click,
actions=#hide{effect=blind, target=Id}},
Link = #link{text="dismiss", actions=Event},
Now our module should look something like this:
-module(element_notify).
-compile(export_all).
-include_lib("nitrogen/include/wf.inc").
-include("elements.hrl").
% give us a way to inspect the fields of this elements record
% useful in the shell where record_info isn't available
reflect() -> record_info(fields, notify).
% Render the custom element
render(ControlId, R) ->
    % get a temp id for our notify element instance
    Id = ControlId,
    % Our toplevel of the element will be a panel (div)
    Event = #event{type=click, actions=#hide{effect=blind, target=Id}},
    Link = #link{text="dismiss", actions=Event},
    InnerPanel = #panel{body=R#notify.msg},
    Panel = #panel{id=Id, body=[InnerPanel,Link]},
    % the element_panel module is used to render the panel element
    element_panel:render(Id, Panel).
This is a fully functional nitrogen element. But it's missing a crucial feature to really shine. Our third feature for this element was an optional expiration for the notification. Right now you have to click dismiss to get rid of the element on the page. But sometimes we might want the element to go away after a predetermined time. This is what our expire record field is meant to determine for us. There are three possible cases for this field.
  • set to false (the default)
  • set to some integer (the number of seconds after which we want to go away)
  • set to anything else (the error condition)
This is the kind of thing erlang's case statement was made for:
case R#notify.expire of
  false ->
    undefined;
  N when is_integer(N) ->
    % we expire in this many seconds
    wf:wire(Id, #event{type='timer', delay=N, actions=#hide{effect=blind, target=Id}});
  _ ->
    % log error and don't expire
    undefined
end
Notice the wf:wire statement. wf:wire is an alternate way to add events to a nitrogen element. Just specify the id and then the event record/javascript string you want to use. I've noticed that for events of type timer wf:wire works better than assigning them to the actions field of the event record. No idea why because I have not looked into it real closely yet. Now our module looks like this:
-module(element_notify).
-compile(export_all).
-include_lib("nitrogen/include/wf.inc").
-include("elements.hrl").
% give us a way to inspect the fields of this elements record
% useful in the shell where record_info isn't available
reflect() ->record_info(fields, notify).
% Render the custom element
render(_, R) ->
  % get a temp id for our notify element instance
  Id = ControlId,
  % Our toplevel of the element will be a panel (div)
  case R#notify.expire of
    false ->
      undefined;
    N when is_integer(N) ->
      % we expire in this many seconds
      wf:wire(Id, #event{type='timer', delay=N, actions=#hide{effect=blind, target=Id}});
    _ ->
      % log error and don't expire
      undefined
  end,
  Event = #event{type=click, actions=#hide{effect=blind, target=Id}},
  Link = #link{text="dismiss", actions=Event},
  InnerPanel = #panel{body=R#notify.msg},
  Panel = #panel{id=Id, body=[InnerPanel,Link]},
  % the element_panel module is used to render the panel element
  element_panel:render(ControlId, Panel).
We have now fulfilled all of our criteria for the element. It shows a message of our choosing. It can be dismissed with a click. And it has an optional expiration. One last thing to really polish it off though would to allow styling through the use of css classes. The ELEMENT_BASE macro we used in our record definition gives our element a class field. We can use that to set our Panel's class, allowing any user of the element to set the class as they wish like so:
Panel = #panel{id=Id, class=["notify ", R#notify.class],
body=[InnerPanel,Link]},
This gives us the final module for our custom element:
-module(element_notify).
-compile(export_all).
-include_lib("nitrogen/include/wf.inc").
-include("elements.hrl").
% give us a way to inspect the fields of this elements record
% useful in the shell where record_info isn't available
reflect() -> record_info(fields, notify).
  % Render the custom element
  render(_, R) ->
  % get a temp id for our notify element instance
  Id = ControlId,
  % Our toplevel of the element will be a panel (div)
  case R#notify.expire of
    false ->
      undefined;
    N when is_integer(N) ->
      % we expire in this many seconds
      wf:wire(Id, #event{type='timer', delay=N, actions=#hide{effect=blind, target=Id}});
    _ ->
      % log error and don't expire
      undefined
  end,
  Event = #event{type=click, actions=#hide{effect=blind, target=Id}},
  Link = #link{text="dismiss", actions=Event},
  InnerPanel = #panel{body=R#notify.msg},
  Panel = #panel{id=Id, class=["notify ", R#notify.class],
  body=[InnerPanel,Link]},
  % the element_panel module is used to render the panel element
  element_panel:render(ControlId, Panel).
I will cover delegated events and more advanced topics in a later tutorial.

Tags:
... Next>>