Tuesday, 21 October 2014

Harpoon helps in facing your data: HDFS is not a data warehouse.

This is the first in a series of blog posts written to give you a better technical understanding of our product Harpoon.
Based on our practical experiences, we will show what, in our opinion, are the typical problems people face when using Hadoop in the context of business intelligence on top of massive amounts of data. Let’s start.
Hadoop, as you may know, is a complex ecosystem of platforms and tools meant mainly for acquiring and analysing big data sets. Its most fundamental component is HDFS (Hadoop Distributed File System), one of the cheapest ways to reliably store vast amounts of data. It is a distributed file system that provides the abstraction of a single global file system on top of the local file systems of the nodes that form a Hadoop cluster.
By connecting relatively inexpensive servers into a cluster, it is possible to store terabytes of data without incurring the typical costs of big storage systems. HDFS implements a data replication mechanism that allows it to cope with the loss of one or more cluster nodes (you cannot imagine how often this happens in a big cluster). Even though it doesn’t provide all the functionality of a typical storage system, it excels at streaming data, which is the most common use case in data analysis and business intelligence.
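To make the cost of replication concrete, here is a rough back-of-the-envelope sketch. It assumes the stock HDFS replication factor of 3; the function names and the cluster size are purely illustrative, and real clusters lose further space to temporary data and OS overhead.

```python
def usable_capacity_tb(nodes, disk_tb_per_node, replication=3):
    """Approximate usable capacity when every block is stored `replication` times."""
    raw_tb = nodes * disk_tb_per_node
    return raw_tb / replication


def tolerated_replica_losses(replication=3):
    """Copies of a block that can be lost before the block itself is gone.

    This only holds if the losses happen before HDFS has had time to
    re-replicate the block onto other nodes.
    """
    return replication - 1


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes with 12 TB of disk each.
    print(usable_capacity_tb(10, 12))
    print(tolerated_replica_losses())
```

In other words, each terabyte of data costs roughly three terabytes of raw disk, which is still cheap on commodity hardware.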
Given these characteristics, HDFS is becoming more and more popular as a sort of “universal” repository for data coming from different sources. Tools like Sqoop and Flume help collect data from an RDBMS or any other source into this big pot called HDFS. At the other end, HDFS is more than good enough, especially taking the cost savings into account, for performing complex business intelligence tasks on data where full OLTP capabilities aren’t required, i.e. the tasks that are performed most of the time.
Putting together a simple prototype that collects some data and does something meaningful with it on the Hadoop platform is pretty easy. Tools like Cloudera Manager greatly simplify setting up and maintaining a Hadoop cluster and all the services it runs.
So far so good. However, HDFS is just a file system: it doesn’t provide any mechanism for keeping the data you put into it ordered and catalogued. HDFS is not a database; it only provides simple abstractions like directories and files. Everything relies on discipline and best practices enforced by good people. We have observed companies fall in love with Hadoop (understandably, Hadoop is an amazing platform!) and start to use it to consolidate many different kinds of data coming from different sources. Can you guess what happens after a while? A total mess! Different teams start to organise the data in different ways, nobody keeps track of data history, format and so on, and that’s not to mention all the security-related problems. Frankly speaking, this is pretty normal. What do you expect if you give multiple users free access to a plain file system? Think of what happens to a wiki, which is essentially a tree-based document management system that doesn’t impose any particular structure. Think how painful it is to keep an enterprise wiki clean and well organised.
This problem is well understood and there are attempts to solve it. Hive, for example, a popular SQL engine that generates Map/Reduce jobs, provides a meta-data repository and organises the data in terms of databases and tables (of course). It’s very nice, and it even allows access to the data through a low-latency SQL engine like Impala or the fast Hive engine based on Tez. However, this comes at a price: you are forced to use a specific set of tools, and the interoperability among high-level tools like Pig and Hive is limited and only covers a subset of data types. A tool like HCatalog adds some meta-data support, but it’s still in its infancy and doesn’t fully solve the aforementioned interoperability problems.
During the design of Harpoon we took into account all these problems and we implemented a set of features specifically meant to mitigate them:
    1. Simple data organization model
       Harpoon imposes a strict way to organise your data by offering a simple-to-understand model very similar to Hive’s: data is organised in databases, each a collection of tables. A table contains a list of records in which each field has a name and a type. This organisation imposes a very simple layout in HDFS: a directory per database and a directory per table.
    2. Powerful meta-data repository
       All the data schemas are collected and stored in the meta-data repository of Harpoon. Harpoon keeps track of the names and types of all the entities stored inside its repository. It also stores the actual data format of a table, e.g. Avro, Parquet or something else.
    3. Support for multiple data formats
       It is extremely easy to extend Harpoon to support additional data formats. It also keeps track of the lineage of data and of security information like read/write rights, data visibility and data ownership.
    4. Homogenization of data types
       Harpoon achieves interoperability among the different tools by enforcing common formats for specific data types like dates and timestamps. For example, even though the date data type is not supported by Avro, a date is stored in Avro as a string in a format compatible with Pig, Hive and Impala at the same time. The meta-data repository then keeps track of the fact that this specific field is actually a date and not a string.
    5. Automatic support for data format conversion
       The data is always stored in a self-describing format; we currently support Avro and will support Parquet soon. In this way it’s possible to use any tool to access the data stored inside Harpoon without the need to access Harpoon’s meta-data repository. Harpoon also provides an automatic mechanism to convert data from one format to another.
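The date handling described in point 4 can be illustrated with a small sketch. This is not Harpoon’s actual code: the meta-data dictionary, the field names and the choice of the ISO `YYYY-MM-DD` string format are all assumptions made for the example. The idea is only that the file stores a plain string while a separate meta-data entry remembers that the field is really a date.

```python
from datetime import date, datetime

# Assumed string format for dates; the real format Harpoon uses is not
# stated in this post.
DATE_FORMAT = "%Y-%m-%d"

# Hypothetical meta-data entry: "physical" is what the file stores,
# "logical" is what the field actually means.
FIELD_META = {
    "order_id": {"physical": "int", "logical": "int"},
    "order_date": {"physical": "string", "logical": "date"},
}


def encode_record(record, meta):
    """Turn logical dates into strings before writing the record."""
    encoded = {}
    for name, value in record.items():
        if meta.get(name, {}).get("logical") == "date":
            encoded[name] = value.strftime(DATE_FORMAT)
        else:
            encoded[name] = value
    return encoded


def decode_record(stored, meta):
    """Use the meta-data to turn stored strings back into dates."""
    decoded = {}
    for name, value in stored.items():
        if meta.get(name, {}).get("logical") == "date":
            decoded[name] = datetime.strptime(value, DATE_FORMAT).date()
        else:
            decoded[name] = value
    return decoded


if __name__ == "__main__":
    original = {"order_id": 7, "order_date": date(2014, 10, 21)}
    stored = encode_record(original, FIELD_META)
    print(stored)
    print(decode_record(stored, FIELD_META) == original)
```

A tool that knows nothing about the meta-data still sees a readable string, while a tool that consults it gets a proper date back.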
Collectively, these features are meant to mitigate the problems of using plain HDFS for storing data. At the cost of being forced to use a simpler data organization model, Harpoon automatically takes control of the data layout on the file system, keeps track of the different data formats and ensures that different query tools can access the same data sets transparently. All of this runs under the strict control of a powerful authentication and authorisation system that extends what Hadoop currently offers.
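As an illustration of the “directory per database, directory per table” layout, here is a minimal sketch of how a table’s location could be derived. The root directory `/data/harpoon` and the function name are hypothetical, not Harpoon’s real configuration.

```python
import posixpath


def table_path(root, database, table):
    """Return the HDFS directory for a table: <root>/<database>/<table>."""
    for part in (database, table):
        # Reject empty names and names that would escape the fixed layout.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError("invalid name: %r" % part)
    return posixpath.join(root, database, table)


if __name__ == "__main__":
    # Hypothetical root directory, database and table names.
    print(table_path("/data/harpoon", "sales", "orders"))
```

Because every location is derived from the database and table names, users never choose their own directories, and that is what keeps the file system from turning into the mess described earlier.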
In the next post I’ll talk a bit more about the different tools Hadoop provides for data analysis and manipulation, their pros and cons and how they are evolving. I’ll also show how Harpoon greatly simplifies the use of those tools, eliminating the interoperability problems among them by providing a unified view of the data.
Stay tuned.