In one of the projects I am working on, there was a requirement to parse a very large XML file (around 1.2 GB) in Ruby. The traditional method of parsing, in which the whole XML file is loaded into memory before it is processed, was not feasible for a file of this size.
So, I started exploring different methods for XML parsing and came across the libxml library.
Parsing with libxml in this way is event based (SAX style): the parser reads the file sequentially and looks for XML elements. When an element is encountered, an event is fired. To parse the contents of the file, these events need to be handled.
To get started we need to install the following:
The structure of a typical program to parse using libxml is as follows:
Let us try this out with an example. For this example I am using the XML file from the following location:
I have saved the XML file as large_file.xml. As this is just an example, I am using a small file; however, the code structure described above will work for large files too, without any change.
Sample from the XML file:
So the code to parse the XML containing book elements as shown above is as follows:
As you can see above, the event handlers parse the XML file element by element.
Sample output of the above code:
The code for the above can be found at the following location:
I hope this helps. Let me know if you need further information.