The Data Science ProcessA data science project may begin with a very well-defined question -- Which of these 200 genetic markers are the best predictors of disease X? -- or an open-ended one -- How can we decrease emergency room wait time in a hospital? Either way, once the motivating question has been identified, a data science project progresses through five iterative stages:
- Harvest Data: Find and choose data sources
- Clean Data: Load data into pre-processing environment; prep data for analysis
- Analysis: Develop and execute the actual analysis
- Visualize: Display the results in ways that effectively communicate new insights, or point out where the analysis needs to be further developed
- Publish: Deliver the results to their intended recipient, whether human or machine
Below is an outline of challenges that arise in each stage; it is meant to give the reader a sense of the scale and complexity of such challenges, not to enumerate them exhaustively:
Challenges of stage 1: Harvest Data
This is a classic needle-in-a-haystack problem: there exist millions of available data sets in the world, and of those only a handful are suitable, much less accessible, for a particular project. The exact criteria for what counts as "suitable data" will vary from project to project, but even when the criteria are fairly straightforward, finding data sets and proving that they meet those criteria can be a complex and time-consuming process.
When dealing with public data, the data sets are scattered and often poorly described. Organizations ranging from the federal government to universities to companies have begun to publish and/or curate large public data sets, but this is a fairly new practice, and there is much room for improvement. Dealing with internal data is not any easier: within an enterprise, there can be multiple data warehouses belonging to different departments, each contributed to by multiple users with little integration or uniformity across warehouses.
In addition to format and content, metadata, particularly provenance information, is crucial: the history of a data set, who produced it, how it was produced, when it was last updated, etc. also determine how suitable a data set is for a given project. However, this information is not often tracked and/or stored with the data, and if it is, it may be incomplete or manually generated.
Challenges of stage 2: Cleanse/Prep Data
This stage can require operations as simple as visually inspecting samples of the data to ones as complex as transforming the entire data set. Format and content are two major areas of concern.
With respect to format, data comes in a variety of formats, from highly structured (relational) to unstructured (photos, text documents) to anything in between (XML, CSVs), and these formats may not play well together. The user may need to write custom code in order to convert the data sets to compatible formats, use programming languages or purpose-built software, or even manually manipulate the data in programs like Excel. This latter path becomes a non-option once the data set exceeds a certain size.
With respect to content and data quality, there are numerous criteria to consider, but some major ones are accuracy, internal consistency, and compliance with applicable regulations (e.g. privacy laws, internal policies). The same data may be stored in different ways across data sets (e.g. multiple possible formats for date/time information), or the data set may have multiple "parent" data sets whose content must meet the same criteria.
In the Hadoop ecosystem, one common tool for initially inspecting and prepping data is Hive. Hive is commonly used for ad-hoc querying and data summarization, and in this context, Hive's strengths are its familiar SQL-like query language (HiveQL) and its ability to handle both structured and semi-structured data.
However, Hive lacks the functional flexibility needed for significantly transforming raw data into a form more fitting for the planned analysis, often a standard part of the "data munging" process. Outside of Hadoop, data scientists use languages such as R, Python or Perl to execute these transformations, but these tools are limited to the processing power of the machines - often the users' own laptops - on which they are installed.
Challenges of stage 3: Analyze
Once the data is prepared, there is often a "scene change;" that is, the analytics take place in an environment different from the pre-processing environment. For instance, the latter may be a data warehouse while the former is a desktop application. This can prove to be another logistical challenge, particularly if the pre-processing environment has a greater capacity than the analytical one.
This stage is where data science most clearly borrows from, or is an extension of, statistics. It requires starting with data, forming a hypothesis about what that data says about a given slice of reality, formally modelling that hypothesis, running data through the model, observing the results, refining the hypothesis, refining the model and repeating. Having specialist-level knowledge of the relevant sector or field of study is also very helpful; on the other hand, there is also a risk of confirmation bias if one's background knowledge is given undue weight over what the numbers say.
Given that data sets can contain up to thousands of variables and millions or billions of records, performing data science on these very large data sets often calls for approaches such as machine learning and data mining. Both involve programs based on statistical principles that can complete tasks and answer questions without explicit human direction, usually by means of pattern recognition. Machine learning is defined by algorithms and performance metrics enabling programs to interpret new data in the context of historical data and continuously revise predictions.
A machine learning program is typically aimed at answering a specific question about a data set with generally known characteristics, all with minimal human interaction. In contrast, data mining is defined by the need to discover previously unknown features of a data set that may be especially large and unstructured. In this case, the task or question is less specific, and a program may require more explicit direction from the human data scientist to reveal useful features of the data.
Challenges of stage 4: Visualize
In the former, the data scientist uses visualization to more easily see the results of each round of testing. Graphics at this stage are often bare-bones and simple: scatterplots, histograms, etc. - but effectively capture feedback for the data scientist on what the latest round of modeling and testing indicates.
The emphasis on visualization in data science is a result of the same factors that have produced the increased demand for data science itself: the scale and complexity of available data has grown to a point where useful insights do not lie in plain view but must be unearthed, polished, and displayed to their best advantage. Visualization is a key part of accomplishing those goals.
Challenges of stage 5: Publish
This stage could also be called "Implementation." The potential challenges here are as varied as the potential goals of the use case. A data science project could be part of product development for a smartphone app, in which case the intended recipient of the output could be either the app designers, or could be supporting an already-deployed app.
Similarly, a data science project could be used by financial firms to inform investment decision-making, in which case the recipient could be either a piece of automated trading software or a team of brokers. Suffice it to say that data scientists will need to be concerned with many of the same features of the output data sets as they were with the input data sets - format, content, provenance - and that a data scientist's mastery of the data also involves knowing how to best present it, whether to machines or to humans.