Today, recommendations are everywhere online. Major e-commerce websites like Amazon provide product recommendations in many different forms across their web properties. Financial planning sites like Mint.com provide recommendations for things like credit cards that a user might want to sign up for or banks that can offer better interest rates. Google augments search results based on its knowledge of the users’ past searches to find the most relevant results.
These brands use recommendations to provide contextual, relevant user experience in order to increase conversion rates and user satisfaction. Traditionally, these sorts of recommendations have been computed by batch processes that generate new recommendations on a nightly, weekly or even monthly basis.
However, for certain types of recommendations, it’s necessary to react in a much shorter timeframe than batch processing allows, such as offering a consumer a geo-location-based recommendation. Consider a movie recommendation system: if a user historically watches action movies but is currently searching for a comedy, batch recommendations will likely surface more action movies instead of the most relevant comedies. In this article, you will learn how to use the Kiji framework, an open source framework for building Big Data Applications, to build a system that provides real-time recommendations.
Kiji, Entity-Centric Data, and the 360º View
To build a real-time recommendation system, we first need a system that can be used to store a 360º view of our customers. Moreover, we need to be able to retrieve data about a particular customer quickly in order to produce recommendations as they interact with our website or mobile app. Kiji is an open-source, modular framework for building real-time applications that collect, store and analyze this sort of data.
More generally, the data necessary for a 360º view can be termed entity-centric data. An entity could be any number of things such as a customer, user, account, or something more abstract like a point-of-sale system or a mobile device.
The goal of an entity-centric storage system is to be able to store everything about a particular entity in a single row. This is challenging with traditional, relational databases because the information may be both stateful data (like name, email address, etc.) and event streams (like clicks). A traditional system requires storing this data in multiple tables, which must be joined together at processing time, making real-time processing harder. To deal with this challenge, Kiji leverages Apache HBase, which stores data in four dimensions: row, column family, column qualifier, and timestamp. By leveraging the timestamp dimension, and the ability of HBase to store multiple versions of a cell, Kiji is able to store event-stream data alongside the more stateful, slowly-changing data.
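To make the four-dimensional storage model concrete, here is a minimal sketch of a cell map keyed by (row, column family, column qualifier, timestamp). It is purely illustrative; the real HBase and Kiji APIs look nothing like this, but it shows how stateful fields and event streams can live in one row.

```python
from collections import defaultdict
import time

class EntityTable:
    """Toy model of HBase's cell coordinates: row -> (family, qualifier) -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts=None):
        ts = ts if ts is not None else time.time()
        self._rows[row][(family, qualifier)][ts] = value

    def get_latest(self, row, family, qualifier):
        # Stateful data: only the newest version matters.
        versions = self._rows[row][(family, qualifier)]
        return versions[max(versions)] if versions else None

    def get_all_versions(self, row, family, qualifier):
        # Event-stream data: every version is an event, in time order.
        versions = self._rows[row][(family, qualifier)]
        return [versions[ts] for ts in sorted(versions)]

table = EntityTable()
# Stateful data: the latest version wins.
table.put("user123", "info", "email", "old@example.com", ts=1)
table.put("user123", "info", "email", "new@example.com", ts=2)
# Event-stream data: each click is a new version of the same cell.
table.put("user123", "events", "click", "product-42", ts=10)
table.put("user123", "events", "click", "product-99", ts=11)
```

The point is that both kinds of data share one row per entity, so a single read fetches everything needed to score that entity.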
HBase is a key-value store built on top of HDFS, the distributed filesystem provided by Apache Hadoop, and it supplies the scalability that is necessary for a Big Data solution. A large challenge in developing applications on HBase is that it requires all the data going in and out of the system to be byte arrays. To deal with this, the final core component of Kiji is Apache Avro, which Kiji uses to store easily-processed data types like standard strings and integers, as well as more complex user-defined data types. Kiji handles any necessary serialization and deserialization for the application when reading or writing data.
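To illustrate why such a serialization layer matters, the sketch below round-trips a typed record through bytes, with Python's json module standing in for Avro (the actual Avro encoding is binary and schema-driven; this only shows the shape of the problem):

```python
import json

# HBase stores only byte arrays; a framework like Kiji inserts a
# (de)serialization layer so application code sees typed records.
def to_bytes(record: dict) -> bytes:
    return json.dumps(record).encode("utf-8")

def from_bytes(data: bytes) -> dict:
    return json.loads(data.decode("utf-8"))

record = {"name": "Alice", "purchases": 11}
raw = to_bytes(record)      # what the storage layer actually holds
restored = from_bytes(raw)  # what the application sees
```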
Developing Models for Use in Real Time
Kiji provides two APIs for developing models, in Java and Scala, both of which have a batch and a real-time component. The purpose of this split is to break a model down into distinct phases of execution. The batch phase is a training phase: a learning process in which the model is trained over a dataset for the entire population. The output of this phase might be parameters for a linear classifier, locations of clusters for a clustering algorithm, or a similarity matrix relating items to one another in a collaborative filtering system. The real-time phase, known as the scoring phase, takes the trained model and combines it with an entity’s data to produce derived information. Critically, this derived data is considered first-class, in that it can be stored back in the entity’s row for use in serving recommendations or as input to later computations.
The Java APIs are called KijiMR, and the Scala APIs form the core of a tool called KijiExpress. KijiExpress leverages the Scalding library to provide APIs for building complex MapReduce workflows, while avoiding a significant amount of boilerplate code typically associated with Java, as well as the job scheduling and coordination that is necessary for stringing together MapReduce jobs.
Individuals Versus Populations
The differentiation between batch training and real-time scoring rests on an observation at the heart of Kiji: population trends change slowly, while individual trends change quickly.
Consider a dataset for a user population that contains ten million purchases. One more purchase is not likely to dramatically affect trends for the population and their likes or dislikes. However, if a particular user has only ever made ten purchases, the eleventh purchase will have a huge effect on what a system can determine the user is interested in. Given this assertion, an application only needs to retrain its model once enough data has been gathered to affect the population trends. However, we can improve recommendation relevancy for an individual user by reacting to their behavior in real time.
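A quick back-of-the-envelope calculation makes the asymmetry concrete (the numbers are illustrative):

```python
# One new purchase is ~9% of the individual's history, but well under
# a millionth of a percent of the population's.
population_purchases = 10_000_000
user_purchases = 10

# Weight of the single new data point relative to each sample:
population_shift = 1 / (population_purchases + 1)
user_shift = 1 / (user_purchases + 1)

print(f"population: {population_shift:.8%}")
print(f"individual: {user_shift:.2%}")
```

This is why retraining can stay on a slow batch cadence while scoring must react per interaction.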
Scoring Against a Model in Real Time
In order to score in real time, the KijiScoring module provides a lazy computation system that allows an application to generate refreshed recommendations only for users that are actively interacting with the application. Through lazy computation, Kiji applications can avoid generating recommendations for users that visit infrequently or may never return for a second visit. This has the added benefit that Kiji can take into account contextual information, like the location of a user's mobile device at the time of the recommendation.
The primary component in KijiScoring is called a Freshener. Fresheners are really a combination of a couple of other Kiji components: ScoringFunctions and FreshnessPolicies. As mentioned earlier, a model will consist of both a training and a scoring phase. The ScoringFunction is the piece of code that describes how a trained model and a single entity’s data are combined to produce a score or recommendations. A FreshnessPolicy defines when data becomes stale or out-of-date. For example, a common FreshnessPolicy will say that data is out-of-date when it is older than an hour or so. A more complex policy might mark data as out-of-date once an entity has experienced some number of events, like clicks or product views. Finally, the ScoringFunction and FreshnessPolicy are attached to a particular column in a Kiji table which will trigger a refresh of the data, if necessary.
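The two policy styles just described might be sketched like this; the class and method names are hypothetical, not the real KijiScoring interfaces:

```python
import time

class OlderThan:
    """Data is stale once it is older than ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds

    def is_fresh(self, written_at, events_since=0, now=None):
        now = now if now is not None else time.time()
        return (now - written_at) <= self.ttl

class FewerEventsThan:
    """Data is stale once the entity has experienced n new events."""

    def __init__(self, n):
        self.n = n

    def is_fresh(self, written_at, events_since=0, now=None):
        return events_since < self.n

hourly = OlderThan(3600)            # "older than an hour" policy
per_five_clicks = FewerEventsThan(5)  # "stale after 5 clicks" policy
```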
Applications that do real-time scoring will include a tier of servers called KijiScoring servers, which fill the role of an execution layer for refreshing stale data. When a user interacts with the application, the request will be passed to the KijiScoring server tier, which communicates directly with the HBase cluster. The KijiScoring server will request the data, and once retrieved, determine whether or not the data is up-to-date, according to the FreshnessPolicy. If the data is up-to-date, it can just be returned to the client. However, if the data comes back stale, the KijiScoring server will run the specified ScoringFunction for the user that made the request. The important piece to understand is that the data or recommendations that are being refreshed are only being refreshed for the user that is making the request, rather than a batch operation, which would refresh the data for all users. This is how Kiji avoids doing more work than is necessary. Once the data is refreshed, it’s returned to the user, and written back to HBase for use later on.
A typical Kiji application will include some number of KijiScoring servers, which are stateless Java processes that can be scaled out, and that are able to run a ScoringFunction using a single entity’s data as input. A Kiji application will funnel client requests through the KijiScoring server, which determines whether or not data is fresh. If necessary, it will run a ScoringFunction to refresh any recommendations before they are passed back to the client, and write the recomputed data back to HBase for later use.
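The request path through the scoring tier can be sketched as: read, check freshness, rescore only if stale, write back. All names below are illustrative, not the KijiScoring API:

```python
def serve(entity_id, store, is_fresh, score_fn):
    """Return fresh data for one entity, recomputing only if stale."""
    cached = store.get(entity_id)            # read the entity's row
    if cached is not None and is_fresh(cached):
        return cached["value"]               # fresh: return as-is
    value = score_fn(entity_id)              # stale: rescore this entity only
    store[entity_id] = {"value": value, "fresh": True}  # write back for later
    return value

# A dict stands in for the entity table here.
store = {"u1": {"value": ["old-recs"], "fresh": False}}
recs = serve("u1", store,
             is_fresh=lambda c: c["fresh"],
             score_fn=lambda e: ["new-recs"])
```

Note that only u1's row is touched; no other user's recommendations are recomputed.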
Deploying Models to a Production System
A major goal in a real-time recommendation system is to be able to iterate on the underlying predictive models easily, and avoid application downtime to push new or improved models into production. To do that, Kiji provides the Kiji Model Repository, which combines metadata about how the models execute with the code that is used to train and score the models. The KijiScoring server needs to know what column accesses should trigger freshening, the FreshnessPolicy to be applied, and the ScoringFunction that will be executed against user data, as well as the locations of any trained models or external data necessary for scoring against the model. This metadata is stored in a Kiji system table, which is just another HBase table at the lowest level. Additionally, the Model Repository stores code artifacts for registered models in a managed Maven repository. The KijiScoring server periodically polls the Model Repository for newly-registered or -unregistered models, and loads or unloads code as necessary.
Putting It All Together
A very common way to provide recommendations is through the use of collaborative filtering. Collaborative filtering algorithms typically involve building a large similarity matrix that relates each product to the other products in the catalog. Each row in the matrix represents a product pi, and each column represents another product pj. The value at (pi, pj) is the similarity between the two products.
In Kiji, the similarity matrix is computed via a batch training process, and can then be stored in a file or a Kiji table. Each row of the similarity matrix would be stored in its own column of a single row in the Kiji product table. In practice, this column has the potential to be very large, since it would list every product in the catalog along with its similarity. Typically, the batch job will also do the work of picking only the most similar items to put into the table.
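The batch training step can be sketched as follows: build an item-item cosine-similarity matrix from user ratings, then keep only the top-k most similar items per product (what would get written to the product table). The data and names are illustrative, and a real job would run as MapReduce rather than in memory:

```python
from math import sqrt

ratings = {                      # user -> {product: rating}
    "u1": {"p1": 5, "p2": 4},
    "u2": {"p1": 4, "p2": 5, "p3": 1},
    "u3": {"p2": 2, "p3": 5},
}

def item_vector(product):
    """The column of ratings for one product, keyed by user."""
    return {u: r[product] for u, r in ratings.items() if product in r}

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in common)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

products = ["p1", "p2", "p3"]

def top_k_similar(product, k=2):
    """The truncated matrix row that the batch job would store."""
    scores = [(other, cosine(item_vector(product), item_vector(other)))
              for other in products if other != product]
    return sorted(scores, key=lambda s: -s[1])[:k]
```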
This similarity matrix is accessed at scoring time through the KeyValueStore API, which gives processes access to external data. For matrices that are too large to fit in memory, storing the matrix in a distributed table enables the application to request only the data that is necessary for the computation, dramatically reducing the memory requirements.
Since we’ve done a lot of the heavy lifting ahead of the scoring phase, scoring becomes a fairly simple operation. If we wanted to display recommendations based on an item that was viewed, a non-personalized scoring function would just look up the related products from the product table and display those.
It’s a relatively simple task to take this process a little further and personalize the results. In a personalized system, the scoring function would take a user’s recent ratings and use the KeyValueStore API to find products similar to the products that the user had rated. By combining the ratings and the product similarities stored in the products table, the application can predict the ratings that the user would give related items and offer recommendations of the products with the highest predicted ratings. By limiting both the number of ratings used and the number of similar products per rated product, the system can easily handle this operation as the user is interacting with the application.
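The personalized scoring step can be sketched as a similarity-weighted average of the user's recent ratings. The similarity values below stand in for lookups against the precomputed matrix; all numbers and names are illustrative:

```python
user_ratings = {"p1": 5.0, "p2": 2.0}   # the user's recent ratings

similar = {                              # candidate -> {rated item: similarity}
    "p3": {"p1": 0.9, "p2": 0.1},        # p3 is mostly similar to liked p1
    "p4": {"p1": 0.2, "p2": 0.8},        # p4 is mostly similar to disliked p2
}

def predict(candidate):
    """Predicted rating: similarity-weighted average over rated items."""
    sims = similar[candidate]
    num = sum(sims[p] * user_ratings[p] for p in sims if p in user_ratings)
    den = sum(sims[p] for p in sims if p in user_ratings)
    return num / den if den else 0.0

# Recommend candidates with the highest predicted rating first.
recommendations = sorted(similar, key=predict, reverse=True)
```

Because both the number of ratings and the number of similar items per rating are bounded, this stays cheap enough to run per request.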
In this article, we’ve seen, at a high level, how Kiji can be used to develop a recommendation system that refreshes recommendations in real time. By leveraging HBase for low-latency processing, using Avro to store complex data types, and processing data using MapReduce and Scalding, applications can provide relevant recommendations to users in a real-time context. For those who are interested in seeing an example of this system, the code for a very similar application is located on the WibiData GitHub.
About the Author
Jon Natkins (@nattyice) is a field engineer at WibiData, where he is focused on helping users build Big Data Applications on Kiji and WibiEnterprise. Prior to WibiData, Jon worked in software engineering roles at Cloudera and Vertica Systems.
Elasticsearch is an open source, distributed real-time search and analytics engine for the cloud. It’s built on the Apache Lucene search engine library and provides full text search capabilities, multi-language support, a query language, support for geolocation, context-aware did-you-mean suggestions, autocomplete and search snippets.
Elasticsearch supports a RESTful API using JSON over HTTP for all of its operations, whether it's search, analytics or monitoring. In addition, native clients for different languages like Java, PHP, Perl, Python, and Ruby are available. Elasticsearch is available for use under the Apache 2 license. The first milestone of elasticsearch-hadoop, 1.3.M1, was released in early October.
InfoQ spoke with Costin Leau from the Elasticsearch team about the search and analytics engine and how it integrates with Hadoop and other Big Data technologies.
InfoQ: Hi Costin, can you describe what Elasticsearch is and how it helps with Big Data requirements?
Elasticsearch is a scalable, highly-available, open-source search and analytics engine based on Apache Lucene. It is easy to "dig" through your data and to "zoom" in and out - all in real-time. At Elasticsearch, we’ve put a lot of work into delivering a good user experience out of the box. We set good defaults that make it easy to get started, but we also give you full access, when you need it, to customize virtually every aspect of the engine.
For example, you can use it to search your data, from the typical queries ('find all items X that match Y') to filtering (or “views” in Elasticsearch terms), highlighted search snippets which provide context for each result, geolocation ('find all items within Z miles'), did-you-mean suggestions and powerful aggregations (Elasticsearch’s “facets”) such as date histograms or statistics.
Elasticsearch can both search and store your data. It offers a semi-structured, schema-free, JSON based model; you can just toss JSON documents at it and Elasticsearch will automatically detect your data types and index your documents, or you can customize the schema mapping to suit your purposes, e.g. boosting individual fields or documents, custom full text analysis, etc.
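What "tossing a JSON document" at the engine looks like can be sketched with plain dicts: one document to index, and one full-text query body in the query DSL. The index and field names here are made up for illustration:

```python
import json

# A schema-free document: Elasticsearch would detect the field types
# (string, list, date) and index it automatically.
document = {
    "title": "Real-time search with Elasticsearch",
    "tags": ["search", "analytics"],
    "published": "2013-10-01",
}

# A full-text query body: find documents whose title matches the text.
query = {"query": {"match": {"title": "real-time search"}}}

# In practice both travel as JSON over HTTP.
doc_payload = json.dumps(document)
query_payload = json.dumps(query)
```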
You can start with a small instance on your laptop or take it to the cloud with tens or hundreds of instances, all with minimal changes. Elasticsearch will automatically scale horizontally and grow with your app.
It runs on the JVM and uses JSON over a RESTful HTTP interface, so any client/language can interact with it. There are a plethora ofclients and framework integrationsin various languages that provide native APIs and dedicated DSLs to minimize 'friction' and maximize performance.
Elasticsearch is a great fit for "Big Data" because its scalable, distributed nature allows it to search - and store - vast amounts of information in near real-time. Through the Elasticsearch-Hadoop project, we are enabling Hadoop users (including Hive, Pig and Cascading) to enhance their workflow with a full-blown search engine. We give them a rich language to ask better questions in order to get clearer answers, significantly faster.
InfoQ: Elasticsearch is used for real-time full text search. Can you tell us how real-time full text search differs from traditional data search?
In layman’s terms, traditional search is a subset of full text search.
Search as implemented by most data stores is based on metadata or on parts of the original data; for efficiency reasons, a subset of data that is considered relevant is indexed (such as the entry id, name, etc...) and the rest is ignored. This results in a small index when compared to the data size, but one that doesn't fully cover the data set. Full text search alleviates this problem by indexing and searching the entire corpus at the expense of increased need for storage.
Traditional search is typically associated with structured data because it is easier for the user to know what is relevant and what is not; however, when you look at today's requirements, most data is unstructured. Now, you store all data once and then, when necessary, look at it many times across several different formats and structures; a full-text search approach becomes mandatory in such cases, as you can no longer afford to just ignore data.
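The distinction can be sketched with a toy inverted index: "indexing the entire corpus" means every token in every document becomes searchable, not just a few chosen metadata fields. This is only an illustration of the idea, not how Lucene stores its indices:

```python
from collections import defaultdict

docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red dog",
}

# token -> set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(term):
    """Every word in the corpus is a valid search term."""
    return sorted(index.get(term.lower(), set()))
```

A metadata-only index would have covered, say, just document ids and titles; here no token is ignored, at the cost of a larger index.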
Elasticsearch supports both structured data search and full text search. It provides a wide variety of query options from keywords, Boolean queries, filters and fuzzy search just to name a few, all exposed via a rich query language.
Note that Elasticsearch provides more than simple full text search with features such as:
Geolocation: Find results based on their location.
Aggregation/Facets: Aggregate your data as you query it: e.g. find the countries your visitors come from for a certain article, or the most popular tags on a given day. As aggregations are computed in real-time, they change when queries change; in other words, you get immediate feedback on your data set.
InfoQ: What are the design considerations when using Elasticsearch?
Data is king so focus on that. In order for Elasticsearch to work with the data the way you want to, it needs to understand your 'requirements'. While it can make best effort guesses about your data, your domain knowledge is invaluable in configuring your setup to support your requirements. It all boils down to data granularity or how the data is organized. To give you an example, take the logging case which seems to be quite common; it's better to break down the logs into time periods - so you end up with an index per month, or per week or even per day, etc. - instead of having them all under one big index. This separation makes it easy to handle spikes in growth and the removal or archiving of old data.
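The time-partitioned layout suggested above can be sketched as one index per day, so archiving means dropping whole indices rather than deleting rows inside one monolithic index. The naming scheme is illustrative:

```python
from datetime import date, timedelta

def daily_index(day):
    """One index per day instead of one big 'logs' index."""
    return f"logs-{day:%Y.%m.%d}"

def indices_for_range(start, days):
    """The set of indices a query over a date range would touch."""
    return [daily_index(start + timedelta(d)) for d in range(days)]

week = indices_for_range(date(2013, 10, 1), 7)
# Removing old data = deleting "logs-2013.10.01" as a unit.
```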
InfoQ: Can you discuss the design and architecture patterns supported by the Elasticsearch engine?
An index consists of multiple shards, each of which is a “mini” search engine in its own right; an index is really a virtual namespace which points at a number of shards. Having multiple shards makes it easy to scale out by just adding more nodes. Having replica shards - copies of each primary shard - provides high availability and increased read throughput.
Querying an index is a distributed operation, meaning Elasticsearch has to query one copy of each shard in the index and collate the results into a single result set. Querying multiple indices is just an extension of the same process. This approach allows for enormous flexibility when provisioning your data store.
With the domain specific knowledge that a user has about their application, it is easy to optimize queries to only hit relevant shards. This can make the same hardware support even greater load.
InfoQ: How does Elasticsearch support data scalability?
Elasticsearch is distributed by nature in order to be highly available and scalable. From a top-level view, Elasticsearch stores documents (or data records) under indices (or collections). Each collection is broken down into multiple pieces called shards; the bigger an index is, the more shards you want to allocate. (Don't be afraid to overdo it; shards are cheap.) Shards are distributed equally across an Elasticsearch cluster, depending on your settings and size, for two reasons:
For redundancy reasons: By default, Elasticsearch uses one copy for each shard so in case a node goes down, there's a backup ready to take its place.
For performance reasons: Each query is made on an index and is run in parallel across its shards. This workflow is the key component for improving performance; if things are slow, simply add more machines to the cluster, and Elasticsearch will automatically distribute the shards, and their queries, across the new nodes.
This approach gives organizations the freedom to scale both vertically (if a node is slow, upgrade the hardware) and horizontally (if a cluster is slow, add more nodes to increase its size).
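The document-to-shard mapping behind this can be sketched as a hash of the document id modulo the shard count, which spreads documents (and the queries over them) evenly across the cluster. Elasticsearch's real routing hash differs, but the principle is the same:

```python
import hashlib

NUM_SHARDS = 5

def shard_for(doc_id):
    """Pick a shard deterministically from the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Route 1000 documents and count how many land on each shard.
counts = [0] * NUM_SHARDS
for i in range(1000):
    counts[shard_for(f"doc-{i}")] += 1
# Each shard ends up with roughly 1000 / 5 = 200 documents, which is
# why adding nodes (and moving shards to them) spreads the load.
```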
InfoQ: What are the limitations or cautions of using this solution?
The main challenge that we see is with users moving from a SQL world to what you could call a contextual search one. For retrieving individual data entries (the typical get), things are still the same - specify the id and get the data back; however, when it comes to data exploration there are different constructs to be used, from the type of analysis performed to what type of search or matching algorithm is used, e.g. fuzzy queries.
InfoQ: Can you talk about the advantages of using Elasticsearch along with Hadoop technology?
Hadoop by design is a distributed, batch-oriented platform for processing large data sets. While it's a very powerful tool, its batch nature means it takes some time to produce the results. Further, the user needs to code all operations from scratch. Libraries like Hive and Pig help, but don't solve the problem completely; imagine reimplementing geolocation in Map/Reduce.
With Elasticsearch, you can leave search to the search engine and focus on the other parts, such as data transformation. The Elasticsearch-Hadoop project provides native integration with Hadoop so there is no gap for the user to bridge; we provide dedicated InputFormat and OutputFormat for vanilla Map/Reduce, Taps for reading and writing data in Cascading, and Storages for Pig and Hive so you can access Elasticsearch just as if the data were in HDFS.
Usually, data stores integrated into Hadoop tend to become a bottleneck due to the number of requests generated by the tasks running in the cluster for each job. The distributed nature of the Map/Reduce model fits really well on top of Elasticsearch because we can correlate the number of Map/Reduce tasks with the number of Elasticsearch shards for a particular query. So every time a query is run, the system dynamically generates a number of Hadoop splits proportional to the number of shards available so that the jobs are run in parallel. Your Hadoop cluster can scale alongside Elasticsearch, and vice-versa.
Furthermore, the integration enables cluster co-locations by exposing shard information to Hadoop. Job tasks are run on the same machines as the Elasticsearch shards themselves, eliminating network traffic and improving performance through data locality. We actually recommend running Elasticsearch and Hadoop clusters on the same machines for this very reason, especially as they complement each other in terms of resource usage (IO vs. CPU).
Last but not least, Elasticsearch provides near real-time responses (think milliseconds) that significantly improve a Hadoop job’s execution time and the cost associated with it, especially when running on 'rented resources' such as Amazon EMR.
InfoQ: Is there any integration between Spring Framework and Elasticsearch?
Yes, check out the Spring Data Elasticsearch project on GitHub. The project was started by our community members at BioMed Central, and we are happy to participate in the development process with them by using and improving it. The project provides the well-known Spring template as a high-level abstraction, as well as Repository support on top of Elasticsearch and extensive configuration through XML, JavaConfig and CDI. We are currently looking into aggregating existing integrations under the same umbrella, most notably David Pilato's spring-elasticsearch.
About the Interviewee
Costin Leau is an engineer at Elasticsearch, currently working with NoSQL and Big Data technologies. An open-source veteran, Costin led various Spring projects and authored an OSGi specification.
After at least four years of tough criticism, it's time to come to an intermediate conclusion about the state of NoSQL. So many things have happened around NoSQL that it is hard to get an overview and value what goals have been achieved and where NoSQL failed to deliver.
In many fields NoSQL has been more than successful, in industry and academia alike. Universities are starting to understand that NoSQL must increasingly be part of the curriculum. It is simply not enough to teach database normalization up and down. This, of course, does not mean that a profound relational foundation is wrong. To the contrary, NoSQL is certainly an important addition.
The NoSQL space has exploded in just 4-5 years to somewhere between 50 and 150 new databases. nosql-database.org lists about 150 such databases, including some quite old but still strong dinosaurs like object databases. And, of course, some interesting mergers have happened, such as the CouchDB and Membase deal leading to Couchbase. But we will discuss each major system later in this article.
Many people had been predicting a huge consolidation in the NoSQL space. However, this has not happened. The NoSQL space simply exploded and is still exploding. As with all areas in computer science - programming languages, for example - more and more niches are opening up for a huge number of databases. And this is all in line with the explosion of the Internet, big data, sensors and many more technologies to come, leading to more data and different requirements for its treatment. In the past four years we saw only one significant system leave the stage: the German graph database Sones. The vast number of NoSQL databases continues to live happily either in the open-source space, without any considerable money turnaround, or in the commercial space.
Visibility and Money
Another important point is visibility and industry adoption. In this space we can see a huge difference between the old industry - protecting its investments - and the new industry, mostly startups. While nearly all of the hot web startups such as Pinterest and Instagram have a hybrid (SQL + NoSQL) architecture, the 'old' industry is still struggling with NoSQL adoption. But the observation here is that more and more of these companies are trying to carve out parts of their data streams to be processed, and later analyzed, with NoSQL solutions like Hadoop, MongoDB, Cassandra, etc.
This also leads to strongly increased demand for developers and architects with NoSQL knowledge. A recent survey showed the following developer skills most requested by the industry:
So there are two NoSQL databases in the top ten technology requirements here, one of them even ahead of iOS. If that isn't praise, what is?!
But NoSQL adoption is going faster and deeper than one might think at first glance. In a well-known whitepaper in the summer of 2011, Oracle stated that NoSQL DBs feel like an ice cream flavor that you should not get too attached to, because it may not be around for too long. Only a few months later, Oracle showed its Hadoop integration in a Big Data Appliance. And even more, we saw the launch of their own NoSQL database, a revised BerkeleyDB. Since then, there has been a race among all vendors to integrate Hadoop. Microsoft, Sybase, IBM, Greenplum, Pervasive, and many more already have a tight integration. A pattern that can be seen everywhere: can't fight it, embrace it.
But one of the strongest, if silent, signs of broad NoSQL adoption is that NoSQL databases are becoming a PaaS standard. Thanks to the easy setup and management of many NoSQL databases, DBs like Redis or MongoDB can be seen in dozens of PaaS offerings such as Cloud Foundry, OpenShift, dotCloud, Jelastic, etc. As everything moves more and more into the cloud, this builds huge momentum for NoSQL to put pressure on classic relational databases. Having the choice between, for example, MySQL/PostgreSQL and MongoDB/Redis will force developers to think twice about their model and requirements, and raise other important questions.
An interesting indicator for technologies is also the ThoughtWorks radar which always contains a lot of interesting stuff, even if you do not fully agree with everything contained in it. Let's have a look at their radar from October 2012 in picture 1:
In their platform quadrant they list five databases:
MongoDB (trial, but close to adopt)
If you look at this list you see that at least four of these databases have received a lot of venture capital. If you add up all the venture capital in the entire NoSQL space, you will surely count up to something between $100M and a billion dollars! Neo4j is one of these examples, having received $11M in Series B funding. Other companies that received $10-30M in funding were Aerospike, Cloudera, DataStax, MongoDB, Couchbase, etc. But let's have a look at the list again: Neo4j, MongoDB, Riak and Couchbase have been in this space for the last four years and have constantly proven to be among the market leaders for specific requirements. The fifth database, Datomic, is more than astonishing: a completely new database, with a completely new paradigm, written by a small team. It must be really hot stuff, and we will dig into it a bit later when discussing all the DBs briefly.
Many people have asked for NoSQL standards, failing to see that NoSQL covers a really wide range of models and requirements. Hence unified languages for all major areas such as Wide Column, Key/Value, Document and Graph Databases will surely not be available for a long time because it's impossible to cover all areas. Several approaches, such as Spring Data, try to add a unified layer but it's up to the reader to test if this layer is a leap forward in building a polyglot persistence environment or not.
Mostly the graph and document databases have come up with standards in their own domains. The graph world is more successful with its TinkerPop Blueprints, Gremlin, SPARQL, and Cypher. In the document space we have UnQL and Jaql filling some niches, although the former lacks real-world support by any NoSQL database. But with the force of Hadoop, many projects are working on bridging famous ETL languages such as Pig or Hive to other NoSQL databases. So the standards world is highly fragmented, but only because NoSQL is, luckily, a very wide area.
One of the best overviews of the database landscape has been given by Matt Aslett in a report of the 451 Group. He recently updated his picture, giving us more insight into the categories he mentioned. As you can see in the following picture, the landscape is highly fragmented and overlapping:
Picture 2: The database landscape by Matt Aslett (451 group)
As you can see, there are several dimensions in one picture: relational vs. non-relational, analytic vs. operational, NoSQL vs. NewSQL. The last two categories have the well-known sub-categories key-value, document, graph and big tables for NoSQL, and storage engines, clustering/sharding, new databases and cloud service solutions for NewSQL. The interesting part of this picture is that it is increasingly difficult to place a database at an exact location. Everyone is now trying fiercely to integrate features from databases found in other spaces. NewSQL systems implement core NoSQL features. NoSQL systems try more and more to implement 'classic' features such as SQL support, ACID, or at least often configurable persistence.
It all started with the Hadoop integration that tons of relational databases now offer. But there are many other examples: MarkLogic, for instance, is now starting to ride the JSON wave, and is thus also hard to position. Furthermore, more multi-model databases are appearing, such as ArangoDB, OrientDB or AlchemyDB (which is now part of the promising Aerospike DB). They allow you to start with one database model (e.g. the document/JSON model) and add other models (graph or key-value) as new requirements pop up.
Another wonderful sign of beginning maturity is the book market. After two German books published in 2010 and 2011, we saw the Wiley book by Shashank Tiwari: structured like a hurricane and full of great, deep insights. The race continued with two nice books in 2012. 'Seven Databases in Seven Weeks' is surely a masterpiece: freshly written and full of practical hands-on insights, it takes six famous NoSQL databases and adds PostgreSQL to the mix, making it a top recommendation. On the other side, Pramod J. Sadalage and Martin Fowler take a more holistic approach, covering all the characteristics and helping you evaluate your path and decisions with NoSQL.
But there is more to come. It was just a matter of time until a Manning book appeared on the scene: Dan McCreary and Ann Kelly are writing a book called 'Making Sense of NoSQL', and the first MEAP chapters are already available.
After starting with concepts and patterns, their chapter 3 will surely look attractive:
Building NoSQL Big Data solutions
Building NoSQL search solutions
Building NoSQL high availability solutions
Using NoSQL to increase agility
This is a new fresh approach and will surely be worth reading.
State of the Leaders
Let's give each NoSQL leader a quick consideration. As one of the clear market leaders, Hadoop is a strange animal. On one hand it has enormous momentum. As mentioned before, every classic database vendor is in a hurry to announce Hadoop support. Companies such as Cloudera and MapR continue to grow, and new Hadoop extensions and successors are announced every week. Even Hive and Pig continue to gain acceptance. Nevertheless, there is a fly in the ointment: companies still complain about an unstructured mess (reading and parsing files could be even faster), MapReduce is far 'too batch' (even Google is moving away from it), management is still hard, there are stability issues, and local training and consultants are still hard to find. Even if some of these issues get addressed, it remains a hot question whether Hadoop will grow as it is or change dramatically.
The second leader, MongoDB, also suffers from flame wars; it may be the nature of things that leading DBs get the most criticism. Nevertheless, MongoDB moves at a fast pace, and most criticism either a) concerns old versions or b) stems from a lack of knowledge about how to use it correctly. This recently culminated in absurd complaints that the 32-bit version can only handle 2 GB, although MongoDB states this clearly in the download section and recommends the 64-bit version.
Anyway, MongoDB's partnerships and funding rounds are pushing an ambitious roadmap with hot features:
the industry called for some security / LDAP features which are currently being developed
full-text search is coming soon
V8 for MapReduce is coming
locking at a finer level than collection-level locking is coming
and a Hash Shard Key is on the way
This last point especially catches the interest of many architects. MongoDB was often blamed (also by competitors) for not implementing consistent hashing, which is not entirely correct, because such a key can easily be defined. But in the future there will be a configuration option for a hash shard key. This means the user can decide whether a hashed sharding key is useful or whether the (perhaps even rare) advantages of selecting his own sharding key matter more. Surely this increases the pressure on other vendors and will lead to fruitful discussions about when to use which kind of sharding key.
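To make the hashed-vs-chosen shard key choice concrete, here is a minimal sketch of the `shardCollection` admin command that requests hashed sharding; the database, collection, and field names (`shop.events`, `user_id`) are illustrative, and the actual call requires a running sharded cluster, so it is only shown in a comment.

```python
# Sketch: requesting a hashed shard key in MongoDB (names are made up).
# With a hashed key, documents are spread evenly across shards by a hash
# of the field, instead of being range-partitioned on the raw key value.

def shard_collection_command(namespace, field):
    """Build the admin command document that asks for hashed sharding."""
    return {"shardCollection": namespace, "key": {field: "hashed"}}

cmd = shard_collection_command("shop.events", "user_id")
print(cmd)

# Against a live cluster this would be submitted via a mongos, roughly:
#   from pymongo import MongoClient
#   MongoClient("mongodb://mongos:27017").admin.command(cmd)
```

Choosing your own (non-hashed) shard key instead keeps range queries on that key local to a few shards, which is exactly the trade-off the configuration option leaves to the user.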
Cassandra is next in line and doing quite well, adding more and nicer features such as better querying. However, rumors persist that running a Cassandra cluster is no piece of cake and requires some hard work. The most attractive development here is surely DataStax. The new company on top of Cassandra, with a $25 million series C funding round, is mostly addressing analytics and some operational issues. The analytics focus was a surprise to many, because in the early days Cassandra was not known as a powerful query machine. But as this has changed in the latest versions, the query capabilities may now be sufficient for some modern analytics.
Couchbase also looks like a brilliant solution in terms of scalability and latency, despite the strong winds that Facebook, and hence Zynga, are now facing. It's surely not a hot query machine, but if querying improves in the future the portfolio would be quite complete. The merger with the CouchDB founders was definitely a strong step, and it's worthwhile to see the strong influence of CouchDB in Couchbase. At every database conference it's also funny to hear the discussions about whether CouchDB is doing better or worse since Damien, Chris, and Jan left. One can only hear extreme opinions here. But who cares, as long as the DB is doing fine. And it looks like it does.
The last NoSQL DB to be mentioned here is of course Riak, which has also improved dramatically in functionality and monitoring. It continues to have a good reputation, mostly in terms of stability: "rock solid, invisible and good for your sleep". The Riak CS fork also looks interesting in terms of the modularity of this technology.
Beside the market leaders, newcomers are always interesting to evaluate. Let's dig into some of them.
ElasticSearch is surely one of the hottest new NoSQL products and just got $10 million in series A funding, and for good reason. As a scalable search engine on top of Lucene it brings many advantages: a) a company on top providing services, and b) leverage of all the achievements Lucene has accumulated over the years. It will surely infiltrate the industry even more than before, attacking all the big players in the semi-structured information space.
Google also sent its small but fast LevelDB into the field. It serves as a basis for many use cases with specific requirements, such as integrated compression. Even Riak integrated LevelDB. It remains to be seen when all the new Google-internal databases such as Dremel or Spanner will find their way out as open-source projects (like Apache Drill or Cloudera Impala).
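What makes LevelDB attractive for embedding is its simple data model: an ordered key-value store with cheap range scans. The following toy sketch is not the real library, only an in-memory imitation of that model; the class and method names are invented, and the real LevelDB persists via log-structured merge trees with optional block compression (e.g. Snappy).

```python
# Toy sketch of LevelDB's data model (not the real library): keys are
# kept sorted, so scans over a key range are cheap and ordered.
import bisect

class TinyOrderedKV:
    def __init__(self):
        self._keys = []   # sorted list of keys
        self._data = {}   # key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.remove(key)

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for k in self._keys[lo:hi]:
            yield k, self._data[k]

db = TinyOrderedKV()
for k, v in [("user:3", "carol"), ("user:1", "alice"), ("user:2", "bob")]:
    db.put(k, v)
print(list(db.scan("user:1", "user:3")))  # keys come back in sorted order
```

This ordered-scan property is also why Riak can use LevelDB as a backend for secondary indexes, where range queries over keys matter.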
Another tectonic shift was surely DynamoDB at the start of 2012. Amazon calls it the fastest-growing service it has ever launched. It's the ultimate scaling machine. New features are coming slowly, but the focus on SSDs and latency is quite amazing.
Multi-model databases are also a field worth a look. OrientDB, their famous representative, is by far not a newcomer, but it is improving its capabilities quite fast. Perhaps too fast, because some customers will now be happy that OrientDB has reached version 1.0 and thus hopefully gained a lot more stability. Graph, document, and key-value support combined with transactions and SQL are reason enough to give it a second try. Especially the good SQL support makes it interesting for analytic solutions such as Pentaho. Another newcomer in this space is ArangoDB. It is moving fast, and it doesn't flinch from comparing itself in benchmarks against the established players. The native JSON and graph support saves a lot of effort when new requirements have to be implemented and the new data has a different model that must be persisted.
By far the biggest surprise this year was Datomic. Written by some rock stars of the Clojure programming language in an incredibly short time, it unveils a whole bunch of new paradigms. Furthermore, it has made its way into the ThoughtWorks radar with a recommendation to have a look at it. And although it is 'just' a layer on top of established databases, it brings a huge number of advantages, such as:
a time machine
a fresh and powerful query approach
a new schema approach
caching & scaling features
Currently DynamoDB, Riak, Couchbase, Infinispan, and SQL are supported as the underlying storage engines. It even allows you to mix and query different DBs simultaneously. Many veterans were surprised that such a radical paradigm shift could be possible. Luckily it is.
To conclude, let us address three points:
Some new articles by Eric Brewer on the CAP theorem should have come several years earlier. In this article he states that "2 of 3" is misleading, explaining why the world is more complicated than a simple CP/AP, i.e. ACID/BASE, choice. Nevertheless, thousands of talks and articles kept praising the CAP theorem without any critical review for years. Michael Stonebraker was the strongest critic of the NoSQL movement (and the NoSQL space owes him a lot), pointing to these issues some years ago! Unfortunately, not many were listening. But now that Eric Brewer has updated his theorem, the time of simple CAP statements is definitely over. Please be at the very front in pointing out the true and diverse CAP implications.
As we all know, the weaknesses of classical relational databases have led to the NoSQL field. But it was just a matter of time before the empire struck back. Under the term "NewSQL" we see a bunch of new engines (such as database.com, VoltDB, GenieDB, etc.; see picture 2) improving classic solutions, sharding, and cloud solutions. Thanks to the NoSQL movement.
But as many DBs try to implement every feature, clear frontiers vanish. The decision for a database is getting more complicated than ever. You have to know about 50 use cases and 50 DBs, and you should answer at least 50 questions. The latter have been gathered by the author over two years of NoSQL consulting and can be found here: Select the Right Database, Choosing between NoSQL and NewSQL.
It's common wisdom that every technology shift, since client-server and before, is about ten times more costly to switch to: for example, from mainframe to client-server, client-server to SOA, SOA to the web, or RDBMS to hybrid persistence. As a consequence, many companies hesitate and struggle with adding NoSQL to their portfolio. But it is also known that the early adopters who try to get the best out of both worlds, and thus integrate NoSQL quickly, will be better positioned for the future. In this regard, NoSQL solutions are here to stay and will always be a rewarding area for evaluation.
About the Author
Prof. Dr. Stefan Edlich is a senior lecturer at Beuth HS of Technology Berlin (University of Applied Sciences). He wrote more than 10 IT books for publishers such as Apress, O'Reilly, Spektrum/Elsevier, and others. He runs the NoSQL Archive, has done NoSQL consulting, organizes NoSQL conferences, wrote the world's first two NoSQL books, and is addicted to the Clojure programming language.