Tuesday, 21 October 2014

The Developer’s Guide to Data Science

When developers talk about using data, they are usually concerned with ACID, scalability, and other operational aspects of managing data. But data science is not just about making fancy business intelligence reports for management. Data drives the user experience directly, not after the fact.
Large scale analysis and adaptive features are being built into the fabric of many of today’s applications. The world is already full of applications that learn what we like. Gmail sorts our priority inbox for us. Facebook decides what's important in our newsfeed on our behalf. E-commerce sites are full of recommendations, sometimes eerily accurate. We see automatic tagging and classification of natural language resources. Ad-targeting systems predict how likely you are to click on a given ad. The list goes on and on.
Many of the applications discussed above emerged from web giants like Google, Yahoo, and Facebook and other successful startups. Yes, these places are filled to the brim with very smart people, working on the bleeding edge. But make no mistake, this trend will trickle down into “regular” application development too. In fact, it already has. When users interact with slick and intelligent apps every day, their expectations for business applications rise as well. For enterprise applications it's not a matter of if, but when.
This is why many enterprise developers will need to familiarize themselves with data science. Granted, the term is incredibly hyped, but there's a lot of substance behind the hype. So we might as well give it a name and try to figure out what it means for us as developers.

From developer to data scientist

How do we cope with these increased expectations? It's not just a software engineering problem. You can't just throw libraries at it and hope for the best. Yes, there are great machine learning libraries, like Apache Mahout (Java) and scikit-learn (Python). There are even programming languages squarely aimed at doing data science, such as the R language. But it's not just about that. There is a more fundamental level of understanding you need to attain before you can properly wield these tools.
This article will not be enough to gain the required level of understanding. It can, however, show you the landmarks along the road to data science. This diagram (adapted from Drew Conway's original) shows the lay of the land:
Data science venn diagramg
As software engineers, we can relate to hacking skills. It's our bread and butter. And that's good, because from that solid foundation you can branch out into the other fields and become more well-rounded.
Let's tackle domain expertise first. It may sound obvious, but if you want to create good models for your data, then you need to know what you're talking about. This is not strictly true for all approaches. For example, deep learning and other machine learning techniques might be viewed as an exception. In general though, having more domain-specific knowledge is better. So start looking beyond the user-stories in your backlog and talk to your domain experts about what really makes the clock tick. Beware though: if you only know your domain and can churn out decent code, you're in the danger zone. This means you're at risk of re-inventing the wheel, misapplying techniques, and shooting yourself in the foot in a myriad of other ways.
Of course, the elephant in the room here is “math & statistics.” The link between math and the implementation of features such as recommendation or classification is very strong. Even if you're not building a recommender algorithm from scratch (which hopefully you wouldn't have to), you need to know what goes on under the hood in order to select the right one and to tune it correctly. As the diagram points out, the combination of domain expertise and math and statistics knowledge is traditionally the expertise area of researchers and analysts within companies. But when you combine these skills with software engineering prowess, many new doors will open.
What can you do as developer if you don't want to miss the bus? Before diving head-first into libraries and tools, there are several areas where you can focus your energy:
  • Data management
  • Statistics
  • Math
We'll look at each of them in the remainder of this article. Think of these items as the major stops on the road to data science.

Data management

Recommendation, classification, and prediction engines cannot be coded in a vacuum. You need data to drive the process of creating/tuning a good recommender engine for your application, in your specific context. It all starts with gathering relevant data, which might already be in your databases. If you don’t already have the data, you might have to set up new ways of capturing relevant data. Then comes the act of combining and cleaning data. This is also known as data wrangling or munging. Different algorithms have different pre-conditions on input data. You'll have to develop a strong intuition for good data versus messy data.
Typically, this phase of a data science project is very experimental. You'll need tools that help you quickly process lots of heterogeneous data and iterate on different strategies. Real world data is ugly and lacks structure. Dynamic scripting languages are often used to filter and organize data because they fit this challenge perfectly. A popular choice is Python with Pandas or the R language.
It's important to keep a close eye on everything related to data munging. Just because it's not production code, doesn't mean it's not important. There won't be any compiler errors or test failures when you silently omit or distort data, but it will influence the validity of all subsequent steps. Make sure you keep all your data management scripts, and keep both mangled and unmangled data. That way you can always trace your steps. Garbage in, garbage out applies as always.


Once you have data in the appropriate format, the time has come to do something useful with it. Much of the time you’ll be working with sample data to create models that handle yet unseen data. How can you infer valid information from this sample? How do you even know your data is representative? This is where we enter the domain of statistics, a vitally important part of data science. I've heard it said: “a Data Scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician.”
What should you know? Start by mastering the basics. Understand probabilities and probability distributions. When is a sample large enough to be representative? Know about common assumptions such as independence of probabilities, or that values are expected to follow a normal distribution. Many statistical procedures only make sense in the context of these assumptions. How do you test the significance of your findings? How do you select promising features from your data as input for algorithms? Any introductory material on statistics can teach you this. After that, move on the Bayesian statistics. It will pop up more and more in the context of machine learning.
It's not just theory. Did you notice how we conveniently glossed over the “science” part of data science up till now? Doing data science is essentially setting up experiments with data. Fortunately, the world of statistics knows a thing or two about experimental setup. You'll learn that you should always divide your data into a training set (to build your model) and a test set (to validate your model). Otherwise, your model won’t work for real-world data: you’ll end up with an overfitting model. Even then, you're still susceptible to pitfalls like multiple testing. There's a lot to take into account.


Statistics tells you about the when and why, but for the how, math is unavoidable. Many popular algorithms such as linear regression, neural networks, and various recommendation algorithms all boil down to math. Linear algebra, to be more precise. So brushing up on vector and matrix manipulations is a must. Again, many libraries abstract over the details for you, but it is essential to know what is going on behind the scenes in order to know which knobs to turn. When results are different than you expected, you need to know how to debug the algorithm.
It's also very instructive to try and code at least one algorithm from scratch. Take linear regression for example, implemented with gradient descent. You will experience the intimate connection between optimization, derivatives, and linear algebra when researching and implementing it. Andrew Ng's Machine Learning class on Coursera takes you through this journey in a surprisingly accessible way.

But wait, there's more...

Besides the fundamentals discussed so far, getting good at data science includes many other skills, such as clearly communicating the results of data-driven experiments, or scaling whatever algorithm or data munging method you selected across a cluster for large datasets. Also, many algorithms in data science are “batch-oriented,” requiring expensive recalculations. Translation into online versions of these algorithms is often necessary. Fortunately, many (open source) products and libraries can help with the last two challenges.
Data science is a fascinating combination between real-world software engineering, math, and statistics. This explains why the field is currently dominated by PhDs. On the flipside, we live in an age where education has never been more accessible. Be it through MOOCs, websites, or books. If you want read a hands-on book to get started, read Machine Learning for Hackers, then move on to a more rigorous book like Elements of Statistical Learning. There are no shortcuts on the road to data science. Broadening your view from software engineering to data science will be hard, but certainly rewarding.

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...