Hadoop Tutorials: Ingesting XML in Hive using XPath
In the first of my series of Hadoop tutorials, I want to share an interesting case that arose when I was getting poor performance running queries and computations on a set of XML data. The computations were mathematical as well as statistical, so the data needed to be ingested into a platform that could handle huge amounts of data and be queried easily. The tool then used for processing this data was too expensive and too slow, so we needed to come up with a cheaper solution that also performed better.
Since we were already in the Hadoop world, we decided to use either Hive or Pig. Either would be cost effective and would perform well, since both benefit from Hadoop’s distributed storage and processing. The end users were more comfortable with SQL, so we decided to go with Hive. XML can be ingested directly into Hive using XPath, but a problem arises when you have a few hundred fields for which you need to generate XPath tags. Even though XPath is an excellent way to read from an XML file, the user still has to manually specify every tag that is to be read.
The solution was to have a piece of code that would go through a part of the XML file containing a few records and spit out the XPath tags. Each XML tag can have multiple tags within it, so we had to loop into tags and maintain counters on the parent as well as the child tags. In any XML file there are also some parent tags that do not hold any value but just encapsulate other child tags that hold values; these need to be separated out.
Here we have policy as the parent tag within which we have policyLimit .. vehicleCoverage. vehicleCoverage is a child to policy but parent to other tags such as coverageLimit .. coverageCode. We need to make sure that none of these parent tags show up in our final XPath list. We also need to specify the base node tag for our input XML which would be policy in this case.
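To make the parent/child distinction concrete, here is a minimal Java sketch. It is not the code used in the project: the class and method names (XPathLister, collect, leafXPaths) and the sample record are hypothetical, and only the tag names are taken from the description above. It walks a small record and keeps paths only for tags that hold a value, setting aside tags that merely wrap other tags.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XPathLister {
    // Hypothetical sample record; the values are made up for illustration.
    static final String SAMPLE =
          "<policy>"
        + "<policyLimit>1000</policyLimit>"
        + "<vehicleCoverage>"
        + "<coverageLimit>500</coverageLimit>"
        + "<coverageCode>ABC</coverageCode>"
        + "</vehicleCoverage>"
        + "</policy>";

    // Recursively visits every element. Elements that contain other elements
    // are recorded as parent tags; all other elements are recorded as leaves.
    static void collect(Element e, String path,
                        Set<String> leafPaths, Set<String> parentTags) {
        boolean hasChildElement = false;
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (k.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) k, path + "/" + k.getNodeName(),
                        leafPaths, parentTags);
            }
        }
        if (hasChildElement) {
            parentTags.add(path);
        } else {
            leafPaths.add(path);
        }
    }

    // Returns the paths of value-holding tags, in document order.
    static Set<String> leafXPaths(String xml) {
        try {
            DocumentBuilder db =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = db.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8))).getDocumentElement();
            Set<String> leaves = new LinkedHashSet<>();
            Set<String> parents = new LinkedHashSet<>();
            collect(root, root.getNodeName(), leaves, parents);
            return leaves;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        for (String p : leafXPaths(SAMPLE)) {
            System.out.println(p);
        }
    }
}
```

For the sample record this prints policy/policyLimit, policy/vehicleCoverage/coverageLimit and policy/vehicleCoverage/coverageCode; policy and vehicleCoverage themselves are left out because they only encapsulate other tags.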
Here is sample Java code that iterates through the XML file and handles the following:
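import java.util.HashSet;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

HashSet<String> unwantedParentTags = new HashSet<String>(); // parent tags that only wrap other tags
Hashtable<String, Integer> nodeCounters = new Hashtable<String, Integer>(); // counter per tag
String currentNode = null;
Integer currentNodeCounter = 0;
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder(); // DOM parser for the sample XML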
Fully qualified XPaths in the format basenode/childnode
The code shown above only works at the first level of the loop while iterating through the XML, and can be modified to loop recursively through each level. Once we have a complete list of XPath expressions as well as unwanted parent tags, the next step is to iterate through the entire tag list and remove the unwanted parent tags. Now that we have our final list of XPath tags, the only step left is to write the Hive script that reads from our XML file.
Before that, we must pre-process the raw XML into a set of Hive-friendly, newline-terminated XML records, cleansing embedded newlines and other formatting. The source XML in our case contains formatting whitespace and newlines for readability. We delete literal ampersands, remove all whitespace and newlines, then insert a newline at the end of each record-level tag.
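The pre-processing step could be sketched in Java roughly as follows. This is an illustration under assumptions, not the script we actually ran: XmlRecordFlattener and flatten are hypothetical names, and the sketch takes the conservative reading of "remove all whitespace" by stripping only line breaks and the whitespace between tags, so that spaces inside values and attributes survive.

```java
class XmlRecordFlattener {
    // Turns pretty-printed XML into one record per line.
    // recordTag is the record-level tag name, e.g. "policy".
    static String flatten(String rawXml, String recordTag) {
        String closing = "</" + recordTag + ">";
        return rawXml
            .replace("&", "")                   // delete literal ampersands (note: this also mangles entities such as &amp;)
            .replaceAll("[\\r\\n]+", "")        // drop embedded newlines
            .replaceAll(">\\s+<", "><")         // drop formatting whitespace between tags
            .trim()
            .replace(closing, closing + "\n");  // newline after each record-level closing tag
    }

    public static void main(String[] args) {
        String raw =
              "<policy>\n"
            + "  <policyLimit>1000</policyLimit>\n"
            + "</policy>\n"
            + "<policy>\n"
            + "  <policyLimit>2000</policyLimit>\n"
            + "</policy>\n";
        System.out.print(flatten(raw, "policy"));
    }
}
```

The sketch loads the whole input into memory and does nothing special with an enclosing wrapper element around the records; a file of realistic size would more likely be streamed.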
CREATE TABLE xpath_table_final AS SELECT * FROM xpath_view;
SELECT * FROM xpath_table_final WHERE coveragecode = 'ABC';
Also, retaining the original source XML allows us to create specific XPath views to fulfill different requirements. We ran multiple queries on the resulting table, including joins, averages and sums, and they gave us the desired output. The Java code that generates the XPath tags takes only a few seconds since it works on a very small sample, whereas the time taken to create the table depends entirely on the cluster configuration.
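Since each requirement-specific view is just a different selection of XPath expressions, the view definition itself can be generated from the XPath list. The Java sketch below shows one way this might look. It uses Hive's built-in xpath_string function, but everything else is an assumption made for illustration: the HiveViewGenerator class, the source table xml_source and its single string column xmldata are hypothetical names, not the ones from our project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class HiveViewGenerator {
    // Builds a CREATE VIEW statement with one xpath_string() column per XPath.
    // The column alias is the last path segment in lower case.
    static String createView(String viewName, String sourceTable,
                             String xmlColumn, List<String> xpaths) {
        List<String> cols = new ArrayList<>();
        for (String xp : xpaths) {
            String alias = xp.substring(xp.lastIndexOf('/') + 1)
                             .toLowerCase(Locale.ROOT);
            cols.add("  xpath_string(" + xmlColumn + ", '" + xp + "') AS " + alias);
        }
        return "CREATE VIEW " + viewName + " AS\nSELECT\n"
             + String.join(",\n", cols)
             + "\nFROM " + sourceTable + ";";
    }

    public static void main(String[] args) {
        // xml_source and xmldata are hypothetical table and column names.
        System.out.println(createView("xpath_view", "xml_source", "xmldata",
            Arrays.asList("policy/policyLimit",
                          "policy/vehicleCoverage/coverageLimit",
                          "policy/vehicleCoverage/coverageCode")));
    }
}
```

The sketch does not escape quotes inside an XPath and does not handle two leaf tags with the same name under different parents, which would produce duplicate column aliases; with a few hundred fields, both cases are worth checking before running the generated script.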