Big Data technologies have been getting a lot of attention over the last few years, and there are several trends and innovations happening in this space. InfoQ would like to learn which new Big Data trends you are currently using or planning to use in the future.
Streaming Big Data analytics
Storm: Apache Storm is an open source distributed real-time computation system. Storm makes it easy to process streams of data, doing for real-time processing what Hadoop did for batch processing.
Spark: Spark is an in-memory data-processing platform that is compatible with Hadoop data sources but runs much faster than Hadoop MapReduce. It’s well suited for machine learning jobs, as well as interactive data queries, and is easier for many developers because it includes APIs in Scala, Python and Java.
Twitter's Summingbird: Summingbird is a library that lets you write streaming MapReduce programs and execute them on distributed MapReduce platforms like Storm and Scalding.
AWS Kinesis: Amazon Kinesis is a managed service for real-time processing of streaming data. It can collect and process large volumes of data from many different sources, allowing you to write applications that process information in real time from sources such as website clickstreams, marketing and financial information, manufacturing instrumentation, social media, operational logs, and metering data.
DataTorrent: DataTorrent is a real-time streaming platform that enables businesses to process or transform structured or unstructured data in real time as it streams into the data center. The product leverages Hadoop 2.0 and YARN technologies.
Spring XD: The Spring XD framework supports streams for ingesting event-driven data from a source to a sink, passing through any number of processors along the way. The streams are backed by Spring Integration adapters.
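To make the streaming ideas above more concrete, here is a minimal sketch of a source, processor, and sink chain that keeps a running word count, in the spirit of a streaming MapReduce job. It uses plain Python generators and does not use the API of Storm, Summingbird, Spring XD, or any other product listed here; the function names (`source`, `tokenize`, `running_counts`, `sink`, `run_pipeline`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def source(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical source: yields events one at a time, as a stream would."""
    for line in lines:
        yield line


def tokenize(events: Iterator[str]) -> Iterator[Tuple[str, int]]:
    """Processor (the 'map' step): emit a (word, 1) pair for every word."""
    for event in events:
        for word in event.lower().split():
            yield word, 1


def running_counts(pairs: Iterator[Tuple[str, int]]) -> Iterator[Counter]:
    """Processor (the 'reduce' step): fold pairs into a running total.

    Addition is associative, so the same merge logic could be applied
    per event (streaming) or per batch.
    """
    totals: Counter = Counter()
    for word, n in pairs:
        totals[word] += n
        yield Counter(totals)  # snapshot of the counts after this update


def sink(snapshots: Iterator[Counter]) -> Counter:
    """Sink: keep only the most recent snapshot."""
    latest: Counter = Counter()
    for snapshot in snapshots:
        latest = snapshot
    return latest


def run_pipeline(lines: Iterable[str]) -> Counter:
    """Wire source -> processors -> sink together."""
    return sink(running_counts(tokenize(source(lines))))
```

Calling `run_pipeline(["storm spark", "spark kinesis"])` pulls each event through the chain one at a time. A real streaming system would distribute these stages across machines and handle failures, which this sketch does not attempt.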
Big Data (Hadoop) as a Service
Elastic MapReduce: Amazon Elastic MapReduce (Amazon EMR) is a web service that can be used to process large amounts of data. It uses Hadoop to distribute the data and processing across a resizable cluster of Amazon EC2 instances.
Qubole: Qubole's Big Data as a Service provides a Hadoop cluster with built-in data connectors and a graphical editor for Big Data projects.
Mortar: Mortar is a general-purpose platform for high-scale data science. It's built on the Amazon Web Services cloud, using Elastic MapReduce (EMR) to launch Hadoop clusters and process large data sets. Mortar runs Apache Pig, a data flow language built on top of Hadoop, and builds on open source technologies such as Hadoop, Pig, Java, Jython, and Luigi so that users can focus on data science without worrying about IT infrastructure.
Rackspace: With Rackspace Hadoop clusters, you can run Hadoop on Rackspace managed dedicated servers, spin up Hadoop on the public cloud, or configure your own private cloud.
Joyent: Joyent Solution for Hadoop is a cloud-based hosting environment for your big data projects based on Apache Hadoop. It provides data storage services to capture, analyze and access data in any format, data management services to process, monitor and operate Hadoop, and data platform services to secure, archive and scale for consistent availability.
Apache Hive: Apache Hive facilitates querying and managing large datasets residing in distributed storage. It also allows MapReduce programmers to plug in custom mappers and reducers.
Impala: Cloudera’s Impala is an open source massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop. It enables users to directly query data stored in HDFS and Apache HBase without requiring data movement or transformation.
Shark: Shark is a data warehouse system for Spark designed to be compatible with Apache Hive. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions.
Spark SQL: Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. Spark SQL is currently an alpha component.
Apache Drill: Apache Drill, currently an Apache incubation project, provides ad hoc queries against different data sources, including nested data. Inspired by Google's Dremel, Drill is designed for scalability and for querying large sets of data. This project is backed by MapR.
Apache Tajo: Apache Tajo is a relational and distributed big data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources.
Presto: Presto, from Facebook, is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes.
Phoenix: Phoenix, from Salesforce, is an open source SQL query engine for Apache HBase. It is accessed as a JDBC driver and enables querying and managing HBase tables using SQL. It was submitted as a proposal to become an Apache Incubator project.
Pivotal's HAWQ: HAWQ, part of Pivotal's Big Data Suite, is an MPP SQL processing engine optimized for analytics with full transaction support. It breaks complex queries into small tasks and distributes them to MPP query processing units for execution.
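Several of the engines above are described as massively parallel: a query is split into small tasks, each task works on its own slice of the data, and the partial results are merged. The following sketch illustrates that general pattern for a grouped sum. It is an assumption-laden toy, not how HAWQ, Impala, or any other engine listed here is implemented: Python threads stand in for query processing units, and the names `partial_sum`, `merge`, and `grouped_sum` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partial_sum(rows):
    """One 'small task': aggregate a single partition locally."""
    out = defaultdict(float)
    for region, amount in rows:
        out[region] += amount
    return dict(out)


def merge(partials):
    """Coordinator step: combine the per-partition results."""
    total = defaultdict(float)
    for part in partials:
        for region, amount in part.items():
            total[region] += amount
    return dict(total)


def grouped_sum(rows, workers=3):
    """Roughly: SELECT region, SUM(amount) FROM rows GROUP BY region."""
    # Round-robin the rows into one partition per worker.
    partitions = [rows[i::workers] for i in range(workers)]
    # Threads stand in for MPP query processing units (an assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return merge(partials)
```

For example, `grouped_sum([("east", 10.0), ("west", 5.0), ("east", 2.5)])` sums the amounts per region across all partitions. The key property is that the aggregate can be computed independently on each partition and then merged, which is what lets this kind of query scale out.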
Big Data Lambda Architecture
The Lambda Architecture (LA) provides a hybrid platform that combines real-time data with data pre-computed by the Hadoop environment to give a near-real-time view of the data at all times. Lambda Architecture frameworks include the following: