Tuesday, 21 October 2014

Parse large XML files in Ruby

In one of the projects I am working on, there was a requirement to parse a very large XML file (around 1.2 GB) in Ruby. The traditional method of parsing, wherein the whole XML file is loaded into memory and then parsed, was not a feasible approach for this.
So I started exploring different methods for XML parsing and came across the libxml library.
Parsing using libxml's SAX parser is event based: the parser reads the file sequentially, without loading all of it into memory, and looks for XML elements. When an element is encountered, an event is fired. To parse the contents of the file, these events need to be handled.
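To make the event model concrete, here is a minimal sketch that does not use libxml at all. The event list is written by hand to stand in for what an event-based parser would fire for a tiny document, and a handler records each callback as it arrives. The names EVENTS and EventLogger are illustrative assumptions and not part of any library.

```ruby
# The events an event-based parser would fire, in order, for:
#   <book id="bk101"><title>XML</title></book>
# This list is written by hand for illustration; a real parser produces
# these calls itself while it reads the file.
EVENTS = [
  [:on_start_element, 'book', { 'id' => 'bk101' }],
  [:on_start_element, 'title', {}],
  [:on_characters, 'XML'],
  [:on_end_element, 'title'],
  [:on_end_element, 'book']
].freeze

# Hypothetical handler: each callback just records what it saw.
class EventLogger
  attr_reader :log

  def initialize
    @log = []
  end

  def on_start_element(name, attributes)
    attrs = attributes.map { |key, value| " #{key}=#{value}" }.join
    @log << "start #{name}#{attrs}"
  end

  def on_characters(chars)
    @log << "text #{chars}"
  end

  def on_end_element(name)
    @log << "end #{name}"
  end
end

logger = EventLogger.new
# Dispatch each event to the matching callback, as a parser would.
EVENTS.each { |callback, *args| logger.public_send(callback, *args) }
puts logger.log
```

The handler never sees the document as a whole, only one event at a time, which is why memory use does not grow with the size of the file.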
To get started we need to install the following:
sudo apt-get install libxml2
sudo apt-get install libxml2-dev libxslt1-dev
gem install libxml-ruby
The structure of a typical program to parse using libxml is as follows:
require 'libxml'
include LibXML

class Parser
  include XML::SaxParser::Callbacks

  def initialize
    # Constructor: set up any state the callbacks need.
  end

  def on_start_element(element, attributes)
    # This event is fired when the start of an element is found.
  end

  def on_cdata_block(cdata)
    # This event is fired when a CDATA block is found.
  end

  def on_characters(chars)
    # This event is fired when characters are encountered between the start and end of an element.
  end

  def on_end_element(element)
    # This event is fired when the end of an element is found.
  end
end

parser = XML::SaxParser.file("large_file.xml")
parser.callbacks = Parser.new
parser.parse
Let us try this out with an example. For this example I am using the XML file from the following location:
I have saved the XML file as large_file.xml. As this is just an example, I am using a small file; however, the code above will work for large files too without any change.
Sample from the XML file:
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer’s Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>
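Since the sample file is small, one way to try the parser on a bigger input is to generate a synthetic catalog with the same shape. The sketch below is only an illustration: write_catalog and its parameters are hypothetical names, and the book values are placeholders rather than data from the real file. It writes one book at a time, so the generator itself never holds the whole document in memory.

```ruby
require 'tempfile'

# Hypothetical helper: writes a catalog with `count` <book> elements to `io`,
# one element at a time, mirroring the structure of the sample above.
def write_catalog(io, count)
  io.puts '<?xml version="1.0"?>'
  io.puts '<catalog>'
  count.times do |i|
    id = format('bk%03d', 101 + i)
    io.puts %(   <book id="#{id}">)
    io.puts "      <author>Author #{i}</author>"
    io.puts "      <title>Title #{i}</title>"
    io.puts '      <genre>Computer</genre>'
    io.puts '      <price>9.95</price>'
    io.puts '      <publish_date>2000-10-01</publish_date>'
    io.puts "      <description>Placeholder description #{i}.</description>"
    io.puts '   </book>'
  end
  io.puts '</catalog>'
end

# Usage: write 1,000 books to a temporary file and report its size.
Tempfile.create(['large_file', '.xml']) do |file|
  write_catalog(file, 1_000)
  file.flush
  puts "Wrote #{File.size(file.path)} bytes to #{file.path}"
end
```

Raising the count gives a file of any size to test with; the temporary file is removed when the block ends, so write to a regular File instead if the output should be kept.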
So the code to parse the XML containing book elements as shown above is as follows:
require 'libxml'
include LibXML

class Parser
  include XML::SaxParser::Callbacks

  # Elements whose text is printed, mapped to the label used in the output.
  LABELS = {
    "author"       => "Author",
    "title"        => "Title",
    "genre"        => "Genre",
    "price"        => "Price",
    "publish_date" => "Publish Date",
    "description"  => "Description"
  }

  def initialize
    # Holds the text of the element being read; nil outside those elements.
    @read_string = nil
  end

  def on_start_element(element, attributes)
    case element.to_s
    when "catalog"
      puts "Catalog Started"
    when "book"
      puts "ID : " + attributes["id"].to_s
    else
      # Start collecting text for the elements we want to print.
      @read_string = "" if LABELS.key?(element.to_s)
    end
  end

  def on_cdata_block(cdata)
    puts "CDATA Found: " + cdata.to_s
  end

  def on_characters(chars)
    # Can fire more than once per element, so append rather than assign.
    @read_string += chars unless @read_string.nil?
  end

  def on_end_element(element)
    case element.to_s
    when "catalog"
      puts "Catalog Ended"
    when "book"
      # Blank line between books.
      puts "\n"
    else
      if LABELS.key?(element.to_s)
        puts LABELS[element.to_s] + " :" + @read_string
        @read_string = nil
      end
    end
  end
end

parser = XML::SaxParser.file("large_file.xml")
parser.callbacks = Parser.new
parser.parse
As you can see above, the event handlers process the XML file element by element.
Sample output of the above code:
Catalog Started
ID : bk101
Author :Gambardella, Matthew
Title :XML Developer’s Guide
Genre :Computer
Price :44.95
Publish Date :2000-10-01
Description :An in-depth look at creating applications 
      with XML.
 
ID : bk102
Author :Ralls, Kim
Title :Midnight Rain
Genre :Fantasy
Price :5.95
Publish Date :2000-12-16
Description :A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.
 
ID : bk103
Author :Corets, Eva
Title :Maeve Ascendant
Genre :Fantasy
Price :5.95
Publish Date :2000-11-17
Description :After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.
.....
.....
Catalog Ended
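For a large file it is often more useful to hand each record to the caller than to print it. The following is a minimal sketch of that idea, under stated assumptions: BookCollector, TEXT_FIELDS, and the block-based interface are hypothetical names and not part of libxml-ruby. It buffers text because an event-based parser may deliver the text of one element in more than one on_characters call. To keep the sketch runnable without the gem, the callbacks are invoked by hand at the end; with libxml-ruby, the class would also include XML::SaxParser::Callbacks and be assigned to parser.callbacks, as in the parser code earlier in this post.

```ruby
# Hypothetical handler that builds one Hash per <book> and yields it to a
# block as soon as the closing tag is seen, so only one book is held at a time.
class BookCollector
  TEXT_FIELDS = %w[author title genre price publish_date description].freeze

  def initialize(&on_book)
    @on_book = on_book
    @book = nil    # Hash for the book currently being read
    @buffer = nil  # text of the current field; nil outside a text field
  end

  def on_start_element(element, attributes)
    if element == 'book'
      @book = { 'id' => attributes['id'] }
    elsif @book && TEXT_FIELDS.include?(element)
      @buffer = +''
    end
  end

  def on_characters(chars)
    # May be called several times for one element, so append.
    @buffer << chars if @buffer
  end

  def on_end_element(element)
    if element == 'book'
      @on_book.call(@book)
      @book = nil
    elsif @buffer && TEXT_FIELDS.include?(element)
      @book[element] = @buffer
      @buffer = nil
    end
  end
end

# Hand-driven usage, standing in for the parser.
books = []
collector = BookCollector.new { |book| books << book }
collector.on_start_element('book', { 'id' => 'bk102' })
collector.on_characters("\n      ")      # whitespace between elements is ignored
collector.on_start_element('title', {})
collector.on_characters('Midnight ')     # text delivered in two chunks
collector.on_characters('Rain')
collector.on_end_element('title')
collector.on_end_element('book')

books.each { |book| puts "#{book['id']}: #{book['title']}" }
```

In real use the block would write each book to a database or another file instead of appending it to an array, so that memory use stays flat however many books the file contains.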
The complete parser code from this post can be found at the following location:
Hope this helps. Let me know if you need further information.
