Marzhill Musings


Did you ever need to index an xml doc

Published On: 2006-05-18 16:26:57
and preserve the xml information in the index? May I present "the XML Indexer". My brother, whose very popular AJAX Bible app has been getting attention, needed an XML index of the KJV Bible. He asked if I could help him get it. We would be parsing the KJV in XML format, and I needed to pull out the reference information for every occurrence of every word. I thought an XML indexer might be useful in more than one capacity, and there wasn't much on the net or CPAN with the capability to do it. It needed to be light and fast because it was going to be parsing the entire Bible, so a DOM parser was out of the question. So I wrote my own. It is a module that indexes the words in an XML document and preserves the XML information about each occurrence of the word. It's a little rough around the edges right now, but it works. It uses the expat parser, so it's light and fast. Look at the script for an example of how it works. I'll do a tutorial on it later.

Update: This baby has been confirmed to parse the entire Bible in Zefania XML format in under 3 minutes. That is a 16 MB file. It spits out a 23 MB index in that space of time. Quite honestly it surprised me.


Using Reusable AJAX Gateways

Published On: 2005-12-02 15:43:31
So now I have a reusable AJAX gateway. Just what exactly am I supposed to do with it? If you look around for a while you will start to notice everyone describing how you can use XSLT, SOAP, and all these other things to pass objects back and forth. And again they all have suggestions for libraries you can use to do this. But what if you're not quite that ambitious? What if you wanted the speed and power and downright fun of using AJAX without all the huge libraries? Well, as usual, I have an idea.

You see, what I really want to do with this is retrieve pieces of HTML pages from the server to put into my current page. Simple enough, right? Why, I could just use cloneNode from the DOM API to do that. In fact, if you looked at my example code from before, you saw that I did exactly that. There's just one problem though. The cloned elements and text show up on your page all right, but they aren't part of your HTML document. In fact the elements don't obey any of your HTML rendering engine's rules. It's as if you just went about making up fake tags to put in there. They don't do anything. What we need is a way to take our XML document and duplicate its structure in our HTML document.

duplicate_nodes() to the rescue!!! I wrote a small function that takes our HTML fragments (as I call them) and duplicates them in our page's document.
Here is how I did it:

    function duplicate_nodes(node) {
        // get our node type name and list of children
        // loop through all the nodes and recreate them in our document
        //alert('calling duplicate_nodes: ' + node.nodeName + ' type: ' + node.nodeType);
        var newnode;
        if (node.nodeType == 1) {
            //alert('element mode');
            newnode = document.createElement(node.nodeName);
            //alert('node added');
            newnode.nodeValue = node.nodeValue;
            //test for attributes
            var attr = node.attributes;
            var n_attr = attr.length;
            for (var i = 0; i < n_attr; i++) {
                newnode.setAttribute(attr.item(i).name, attr.item(i).nodeValue);
                alert('added attribute: ' + attr.item(i).name + ' with a value of: ' + attr.item(i).nodeValue);
            }
        } else if (node.nodeType == 3 || node.nodeType == 4) {
            //alert('text mode');
            try {
                newnode = document.createTextNode(node.data);
                //alert('node added');
            } catch(e) {
                alert('failed adding node');
            }
        }
        while (node.firstChild) {
            if (newnode) {
                //alert('node has children');
                var childNode = duplicate_nodes(node.firstChild);
                //alert ('back from recursive call with:' + childNode.nodeName);
                newnode.appendChild(childNode);
                node.removeChild(node.firstChild);
            }
        }
        return newnode;
    }

Now this function currently only handles elements, their attributes, and text or CDATA nodes. Entity and other node type support can be added easily, however. Also I still need to do some testing on the attribute handling to see if it correctly handles stuff like event handlers and id attributes, but it works. (Edit: It handles event handlers with no modification on Firefox.)

Let's do like all good code hackers do and take it apart :-) Our first task in this function is to see what kind of node we are handling. This is contained in the nodeType property of the node object. When this is a 1 it's an element. When it's a 3 it's a Text node, and when it's a 4 it's a CDATA section.
Thus our if statements:

    if (node.nodeType == 1) {
    } else if (node.nodeType == 3 || node.nodeType == 4) {
    }

Elements and Text or CDATA have to be handled very differently, so we check for these two types before doing anything else. In the case of an element node (type 1) we need two more pieces of information:

    node.nodeName

and

    node.nodeValue

These provide us with the details we need when recreating our element in the HTML document. They are pretty well self-explanatory: one is the name or tagName of the element and the other is the element's value. (For element nodes the DOM defines nodeValue as null, so copying it is harmless; the element's text lives in its child text nodes, which we handle further down.) Now we are ready to start creating our new element in the current document like so:

    newnode = document.createElement(node.nodeName);
    //alert('node added');
    newnode.nodeValue = node.nodeValue;

Now how do we handle its attributes? A simple for loop will do that for us. The attributes property gives us a list of the node's attributes. The length property of that list tells us how many attributes there are. And the for loop goes through each one, duplicating it in our newnode like so:

    //test for attributes
    var attr = node.attributes;
    var n_attr = attr.length;
    for (var i = 0; i < n_attr; i++) {
        newnode.setAttribute(attr.item(i).name, attr.item(i).nodeValue);
        alert('added attribute: ' + attr.item(i).name + ' with a value of: ' + attr.item(i).nodeValue);
    }

And that's all we need to recreate our element and its attributes. Text nodes are even easier to handle. You just need one piece of information for them: the data property. Create a new text node using the document.createTextNode method with that property and you're good to go:

    //alert('text mode');
    try {
        newnode = document.createTextNode(node.data);
        //alert('node added');
    } catch(e) {
        alert('failed adding node');
    }

There is just one last thing to take care of though. What if our node has children? What do you do then? Function recursion to the rescue!!
The firstChild property of a node will tell us if there are any children, and a while loop will keep looping as long as it returns a node rather than null. All we have to do is:
  • call duplicate_nodes recursively with that child as an argument
  • append the returned node to the newnode
  • remove each child from the node
  • and keep looping till no more children exist
Here is the while loop:

    while (node.firstChild) {
        if (newnode) {
            //alert('node has children');
            var childNode = duplicate_nodes(node.firstChild);
            //alert ('back from recursive call with:' + childNode.nodeName);
            newnode.appendChild(childNode);
            node.removeChild(node.firstChild);
        }
    }

The last task of our function is to return the duplicated node:

    return newnode;

Our duplicate function does not append the node anywhere in our document, so it won't show up. That is the job of the calling function. It can append the new node wherever it wants.
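One side effect of the loop above is that it empties the source XML as it goes, since every child is removed after it is copied. If you want to keep the response around, a variation can walk childNodes by index instead. The following is only a sketch under a few assumptions: copyNodeInto and appendCopy are hypothetical names, and the target document is passed in as a parameter (targetDoc) rather than using the global document, which is not how the function above does it.

```javascript
// Sketch only: copyNodeInto, appendCopy and the targetDoc parameter are
// hypothetical names, not part of the original duplicate_nodes function.
function copyNodeInto(targetDoc, node) {
  var copy = null;
  if (node.nodeType === 1) {
    // Element: recreate the tag and its attributes in the target document.
    copy = targetDoc.createElement(node.nodeName);
    var attrs = node.attributes || [];
    for (var i = 0; i < attrs.length; i++) {
      copy.setAttribute(attrs[i].name, attrs[i].value);
    }
    // Recurse over the children by index; the source tree is left intact.
    var children = node.childNodes || [];
    for (var j = 0; j < children.length; j++) {
      var childCopy = copyNodeInto(targetDoc, children[j]);
      if (childCopy) {
        copy.appendChild(childCopy);
      }
    }
  } else if (node.nodeType === 3 || node.nodeType === 4) {
    // Text (3) or CDATA (4): both become a plain text node.
    copy = targetDoc.createTextNode(node.data);
  }
  // Any other node type (comments and so on) is skipped and yields null.
  return copy;
}

// The caller still decides where the copy ends up.
function appendCopy(targetDoc, parentId, node) {
  var copy = copyNodeInto(targetDoc, node);
  if (copy) {
    targetDoc.getElementById(parentId).appendChild(copy);
  }
  return copy;
}
```

In a page you would call something like appendCopy(document, 'target', req.responseXML.documentElement), where 'target' stands in for whatever id your container element has. The trade-off is a little more memory, because the XML response stays whole, in exchange for being able to copy the same fragment more than once.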


Reusable AJAX gateways

Published On: 2005-11-28 16:13:20
Everyone knows about AJAX these days. You just about can't go anywhere on the net without hearing about it. And if you're a coder who wants to know more than just what library you should download to start using it, you've probably done a little googling and come up with this site: XMLHttpRequest Objects. You even played around with the examples and made a few demo apps, then realized: Hey!! How can I make these things reusable without ugly global variables and functions that check to see if the response came back yet? In short: how do I use this in a real app?

Apple has done a really good job of showing how the XMLHttpRequest object works. They even do a good job of showing some useful ways to use it. But if you're like me you want to go a bit farther. I like reusability. I also don't like using global variables as a gatekeeper. So let's take a look at how we can make this code a little more reusable.

The first thing to do is come up with a way to use multiple different functions as the handler for that onreadystatechange property. Using the same handler every time really cramps our style. Additionally, having to write all that code to test our object's state is a real drag. It would be nice if we could avoid having to write that for every single function we use as a handler. Here is the solution. Let's start with this function:

    function loadXMLDoc(url) {
        req = false;
        // branch for native XMLHttpRequest object
        if (window.XMLHttpRequest) {
            try {
                req = new XMLHttpRequest();
            } catch(e) {
                req = false;
            }
        // branch for IE/Windows ActiveX version
        } else if (window.ActiveXObject) {
            try {
                req = new ActiveXObject("Msxml2.XMLHTTP");
            } catch(e) {
                try {
                    req = new ActiveXObject("Microsoft.XMLHTTP");
                } catch(e) {
                    req = false;
                }
            }
        }
        if (req) {
            req.onreadystatechange = processReqChange;
            req.open("GET", url, true);
            req.send("");
        }
    }

Now for this to do what we really need it to, we need a couple of different things.
That processReqChange function needs to be able to change dynamically. So let's add another function argument that will hold a function passed in to be used here. Like so:

    loadXMLDoc(url, func)

Then you can change

    req.onreadystatechange = processReqChange;

to

    req.onreadystatechange = func;

This will allow us to pass any function we want as the state change handler. One more change the example below relies on: declare req with var inside loadXMLDoc and return it at the end, so the caller gets the request object back instead of reaching for a global.

Don't go deleting that processReqChange function yet though. We still need it. In fact, let's take a look at that one right now, shall we?

    function processReqChange() {
        // only if req shows "loaded"
        if (req.readyState == 4) {
            // only if "OK"
            if (req.status == 200) {
                // ...processing statements go here...
            } else {
                alert("There was a problem retrieving the XML data:\n" + req.statusText);
            }
        }
    }

We need this to keep checking our state and tell us when our response has come back. We also need it to work with any XMLHttpRequest object we hand it. What we don't need it to do is retrieve our response for us. In short, we need it to receive a request object in its arguments and return a value saying whether it's OK to process our response. So let's modify it a little, shall we?

    function processReqChange(req) {
        // only if req shows "loaded"
        if (req.readyState == 4) {
            // only if "OK"
            if (req.status == 200) {
                // it's safe now, go ahead
                return 1;
            } else {
                alert("There was a problem retrieving the XML data:\n" + req.statusText);
            }
        }
        // it's not safe yet
        return 0;
    }

Now when we pass this function a request object, it returns 1 when we have our response and 0 when the response is not ready yet. Both of these functions are now reusable. But how exactly do we start using them? I thought you would never ask.
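Before moving on, it may help to see the whole of loadXMLDoc with the handler argument in place. This is a sketch rather than the exact code from the example script: it assumes req is declared with var and returned to the caller, and it reaches the constructors through window so the function has a single point of contact with the browser.

```javascript
// Sketch of loadXMLDoc after the changes described above.
// Assumption: req is local and returned; the original listing keeps it global.
function loadXMLDoc(url, func) {
  var req = false;
  // branch for native XMLHttpRequest object
  if (window.XMLHttpRequest) {
    try {
      req = new window.XMLHttpRequest();
    } catch (e) {
      req = false;
    }
  // branch for IE/Windows ActiveX version
  } else if (window.ActiveXObject) {
    try {
      req = new window.ActiveXObject("Msxml2.XMLHTTP");
    } catch (e) {
      try {
        req = new window.ActiveXObject("Microsoft.XMLHTTP");
      } catch (e2) {
        req = false;
      }
    }
  }
  if (req) {
    // any function can be the state change handler now
    req.onreadystatechange = func;
    req.open("GET", url, true);
    req.send("");
  }
  // false when no request object could be created
  return req;
}
```

Because the function returns false when no request object is available, a caller can test the return value before relying on it.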
Let's build an example:

    function append_to_id(el, contents) {
        var element = document.getElementById(el);
        //alert('appending: ' + contents.nodeValue );
        element.appendChild(contents);
    }

    function append(url, el) {
        //alert('starting append operation');
        var func = function() {
            if (processReqChange(req)) {
                var ajax_return = req.responseXML;
                while (ajax_return.hasChildNodes()) {
                    append_to_id(el, ajax_return.firstChild);
                    ajax_return.removeChild(ajax_return.firstChild);
                }
            }
        };
        var req = loadXMLDoc(url, func);
    }

In the append function we create a dynamic function that we can pass to our loadXMLDoc function. That dynamic function contains the meat of what we want to do. It uses an if statement that checks our processReqChange function for a valid return. When it gets a valid return, the body of the if statement processes our response. It couldn't be any easier. You can see full example code here: Example Script
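If you would rather not repeat the processReqChange check inside every handler, one possible variation is to let the gateway do the check and hand the finished request to the handler. What follows is a sketch, not the code from the example script: loadWhenReady and requestIsReady are hypothetical names, and the onError and createRequest parameters are assumptions added so that failures can be handled without alert and the request object can be swapped out.

```javascript
// Sketch only: requestIsReady, loadWhenReady, onError and createRequest are
// hypothetical names introduced for illustration.
function requestIsReady(req) {
  // readyState 4 means the response is complete; 200 means HTTP OK.
  return req.readyState === 4 && req.status === 200;
}

function loadWhenReady(url, onReady, onError, createRequest) {
  // Use the supplied factory if there is one, else the native object.
  var req = createRequest ? createRequest() : new XMLHttpRequest();
  req.onreadystatechange = function () {
    if (req.readyState !== 4) {
      return; // still loading, nothing to do yet
    }
    if (requestIsReady(req)) {
      onReady(req);
    } else if (onError) {
      onError(req);
    }
  };
  req.open("GET", url, true);
  req.send(null);
  return req;
}
```

A handler written for this version only has to deal with the response, for example reading req.responseXML and appending its contents. Since the request is passed in as an argument, the handler no longer depends on a variable that is assigned after loadXMLDoc returns.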
