Marzhill Musings


Did you ever need to index an xml doc

Published On: 2006-05-18 16:26:57
and preserve the XML information in the index? May I present "the XML Indexer". My brother, whose very popular AJAX Bible app has been getting attention, needed an XML index of the KJV Bible, and he asked if I could help him get it. We would be parsing the KJV in XML format, and I needed to pull out the reference information for every occurrence of every word. I thought an XML indexer might be useful in more than one capacity, and there wasn't much on the net or CPAN that could do it. It needed to be light and fast because it was going to parse the entire Bible, so a DOM parser was out of the question. So I wrote my own.

xml_indexer.pm is a module that indexes the words in an XML document and preserves the XML information about each occurrence of the word. It's a little rough around the edges right now, but it works. It uses the expat parser, so it's light and fast. Look at the bible_index.pl script for an example of how it works. I'll do a tutorial on it later.

Update: This baby has been confirmed to parse the entire Bible in Zefania XML format in under 3 minutes. That is a 16 MB file, and it spits out a 23 MB index in that space of time. Quite honestly, it surprised me.

First Draft of the Bricklayer Documentation

Published On: 2005-12-07 23:44:39
I just finished the first draft of the Bricklayer development manual. You can see it here: Bricklayer Manual. Take a look and tell me if you see anything that needs more clarification or spelling correction.

Using Reusable AJAX Gateways

Published On: 2005-12-02 15:43:31
So now I have a reusable AJAX gateway. Just what exactly am I supposed to do with it? If you look around for a while, you will notice everyone describing how you can use XSLT, SOAP, and all these other things to pass objects back and forth. And they all have suggestions for libraries you can use to do it. But what if you're not quite that ambitious? What if you want the speed and power and downright fun of using AJAX without all the huge libraries? Well, as usual, I have an idea.

What I really want to do is retrieve pieces of HTML pages from the server and put them into my current page. Simple enough, right? Why, I could just use cloneNode from the DOM API to do that. In fact, if you looked at my example code from before, you saw that I did exactly that. There's just one problem, though. The cloned elements and text show up on your page all right, but they aren't really part of your HTML document. In fact, the elements don't obey any of your HTML rendering engine's rules. It's as if you just went about making up fake tags to put in there. They don't do anything. What we need is a way to take our XML document and duplicate its structure in our HTML document.

duplicate_nodes() to the rescue!!! I wrote a small function that takes our HTML fragments (as I call them) and duplicates them in our page's document.
Here is how I did it:

    function duplicate_nodes(node) {
        // get our node type name and list of children
        // loop through all the nodes and recreate them in our document
        //alert('calling duplicate_nodes: ' + node.nodeName + ' type: ' + node.nodeType);
        var newnode;
        if (node.nodeType == 1) {
            //alert('element mode');
            newnode = document.createElement(node.nodeName);
            //alert('node added');
            newnode.nodeValue = node.nodeValue;
            //test for attributes
            var attr = node.attributes;
            var n_attr = attr.length;
            for (var i = 0; i < n_attr; i++) {
                newnode.setAttribute(attr.item(i).name, attr.item(i).nodeValue);
                //alert('added attribute: ' + attr.item(i).name + ' with a value of: ' + attr.item(i).nodeValue);
            }
        } else if (node.nodeType == 3 || node.nodeType == 4) {
            //alert('text mode');
            try {
                newnode = document.createTextNode(node.data);
                //alert('node added');
            } catch(e) {
                alert('failed adding node');
            }
        }
        // only descend when we created a node; otherwise an unsupported
        // node with children would keep this loop spinning forever
        while (newnode && node.firstChild) {
            //alert('node has children');
            var childNode = duplicate_nodes(node.firstChild);
            //alert('back from recursive call with: ' + childNode.nodeName);
            // children of unsupported types (comments, say) come back undefined
            if (childNode) {
                newnode.appendChild(childNode);
            }
            node.removeChild(node.firstChild);
        }
        return newnode;
    }

Now this function currently only handles elements, their attributes, and text or CDATA nodes. Entity and other node type support can be added easily, however. Also, I still need to do some testing on the attribute handling to see if it correctly handles stuff like event handlers and id attributes, but it works. (Edit: It handles event handlers with no modification on Firefox.)

Let's do like all good code hackers do and take it apart :-) Our first task in this function is to see what kind of node we are handling. This is contained in the nodeType property of the node object. When this is a 1, it's an element. When it's a 3, it's a text node, and when it's a 4, it's a CDATA section.
Thus our if statements:

    if (node.nodeType == 1) {
    } else if (node.nodeType == 3 || node.nodeType == 4) {
    }

Elements and text or CDATA have to be handled very differently, so we check for these two types before doing anything else. In the case of an element node (type 1) we need two more pieces of information:

    node.nodeName

and

    node.nodeValue

These provide us with the details we need when recreating our element in the HTML document. They are pretty well self-explanatory: one is the name or tagName of the element and the other is the element's value. (For an element node the DOM defines nodeValue as null, so copying it is harmless but does nothing. An element's text lives in its child text nodes, which we get to further down.) Now we are ready to start creating our new element in the current document like so:

    newnode = document.createElement(node.nodeName);
    //alert('node added');
    newnode.nodeValue = node.nodeValue;

Now how do we handle its attributes? A simple for loop will do that for us. The attributes property gives us a list of the node's attributes. The length property of that list tells us how many attributes there are. And the for loop goes through each one, duplicating it in our newnode like so:

    //test for attributes
    var attr = node.attributes;
    var n_attr = attr.length;
    for (var i = 0; i < n_attr; i++) {
        newnode.setAttribute(attr.item(i).name, attr.item(i).nodeValue);
        //alert('added attribute: ' + attr.item(i).name + ' with a value of: ' + attr.item(i).nodeValue);
    }

And that's all we need to recreate our element and its attributes. Text nodes are even easier to handle. You just need one piece of information for them: the data property. Create a new text node using the document.createTextNode method with the node.data property and you're good to go:

    //alert('text mode');
    try {
        newnode = document.createTextNode(node.data);
        //alert('node added');
    } catch(e) {
        alert('failed adding node');
    }

There is just one last thing to take care of, though. What if our node has children? What do you do then? Function recursion to the rescue!!
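Before moving on to the children, here is a rough sketch of the type check and the attribute copy pulled out into standalone helpers. The names nodeKind and copyAttributes are made up for illustration and are not part of duplicate_nodes. The numeric codes are the standard DOM nodeType values, and the sketch reads each attribute's value property.

```javascript
// Named versions of the numeric nodeType codes (values from the DOM spec).
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;
var CDATA_SECTION_NODE = 4;

// Hypothetical helper: classify a node the same way the if/else chain does.
function nodeKind(node) {
    if (node.nodeType === ELEMENT_NODE) {
        return 'element';
    }
    if (node.nodeType === TEXT_NODE || node.nodeType === CDATA_SECTION_NODE) {
        return 'text';
    }
    return 'unsupported'; // comments, processing instructions, documents, ...
}

// Hypothetical helper: copy every attribute from source onto target.
function copyAttributes(source, target) {
    var attrs = source.attributes;
    for (var i = 0; i < attrs.length; i++) {
        target.setAttribute(attrs.item(i).name, attrs.item(i).value);
    }
}
```

Giving the codes names makes it easier to see where support for another node type would slot in: it becomes one more branch in nodeKind.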
The firstChild property of a node tells us whether there are any children (it is null when there are none), and a while loop will keep looping as long as it returns a node. All we have to do is:
  • call duplicate_nodes recursively with that child as an argument
  • append the returned node to the newnode
  • remove each child from the node
  • and keep looping till no more children exist
Here is the while loop:

    // only descend when we created a node; otherwise an unsupported
    // node with children would keep this loop spinning forever
    while (newnode && node.firstChild) {
        //alert('node has children');
        var childNode = duplicate_nodes(node.firstChild);
        //alert('back from recursive call with: ' + childNode.nodeName);
        // children of unsupported types (comments, say) come back undefined
        if (childNode) {
            newnode.appendChild(childNode);
        }
        node.removeChild(node.firstChild);
    }

The last task of our function is to return the duplicated node:

    return newnode;

Our duplicate function does not append the node anywhere in our document, so it won't show up. That is the job of the calling function. It can append the new node wherever it wants.
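One side effect of the loop is that it empties the source node as it goes. If you need to keep the fragment around, a variation along these lines might do. This is only a sketch: copyNode and insertFragment are hypothetical names, the target document is passed in as a parameter instead of using the global document, and the 'content' id in the closing comment is made up.

```javascript
// Sketch: duplicate a node without removing anything from the source tree.
// targetDoc is the HTML document the copy should belong to.
function copyNode(node, targetDoc) {
    var copy;
    if (node.nodeType === 1) { // element
        copy = targetDoc.createElement(node.nodeName);
        var attrs = node.attributes;
        for (var i = 0; i < attrs.length; i++) {
            copy.setAttribute(attrs.item(i).name, attrs.item(i).value);
        }
    } else if (node.nodeType === 3 || node.nodeType === 4) { // text or CDATA
        return targetDoc.createTextNode(node.data);
    } else {
        return null; // unsupported node type
    }
    // Walk the children by index instead of removing them.
    var children = node.childNodes;
    for (var j = 0; j < children.length; j++) {
        var childCopy = copyNode(children[j], targetDoc);
        if (childCopy) {
            copy.appendChild(childCopy);
        }
    }
    return copy;
}

// The caller decides where the copy ends up.
function insertFragment(container, fragmentRoot) {
    var copy = copyNode(fragmentRoot, container.ownerDocument);
    if (copy) {
        container.appendChild(copy);
    }
    return copy;
}

// In a browser the call might look like this (hypothetical id):
// insertFragment(document.getElementById('content'), request.responseXML.documentElement);
```

The trade-off is that the source fragment stays in memory until nothing references it, where the original loop takes it apart as it copies.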
