Thursday, 31 July 2014

BigData using R

The role of programming languages in data science

In data science, programming languages are ubiquitous and essential tools. Two of the most popular are Python and R; other commonly used languages include Perl, Java, C#, and C++. These languages are used during all stages of data science: harvesting, cleaning or "munging," analysis, and visualization.
While the debates over which language is best are often characterized as "religious wars," some favor the balanced approach of using both at different stages, depending on the strengths of each language. To quote Rachel Schutt, currently a data scientist at Google, "Don't get too attached to tools, languages and methods; use what gets the job done. Be versatile."
With that in mind, this section will discuss one such language/tool, R, as a "Swiss Army knife" for statistical analysis.

R: a Swiss Army knife for data scientists

R Itself

R is an open-source statistical programming language and environment. In addition to being available to any user free of charge, R provides considerably more flexibility to users than proprietary statistical software packages. With R, data scientists can be especially creative in how they approach their problem, from powerful facilities for data cleaning to cutting-edge analytical techniques to refined visualization.
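
As a rough illustration of that range, the sketch below walks through cleaning, analysis, and visualization using only base R. It assumes the built-in airquality data set that ships with R; the choice of variables and of a simple linear model is purely illustrative and not a recommended analysis.

```r
# Built-in data set shipped with R (datasets package)
data(airquality)

# Cleaning: keep only rows where both Ozone and Temp were recorded
clean <- airquality[complete.cases(airquality[, c("Ozone", "Temp")]), ]

# Analysis: simple linear regression of ozone on temperature
fit <- lm(Ozone ~ Temp, data = clean)
slope <- coef(fit)[["Temp"]]

# Visualization: scatter plot with the fitted regression line
plot(clean$Temp, clean$Ozone,
     xlab = "Temperature (F)", ylab = "Ozone (ppb)")
abline(fit)
```

Each stage is a line or two of code, which is part of what makes R attractive for exploratory work.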

At first, R may not seem like the right tool for everyone. Compared to other analytics tools, there is a steep learning curve. R is a programming language in its own right, an implementation of the S language. The basic R download also includes a minimal user interface, relying only on the command line in the terminal or console.

Fortunately, other groups have developed more user-friendly interfaces, such as the RStudio IDE, which give R a feel similar to other statistical packages. There are also numerous resources for R users at all levels, from novices to advanced users, as described below. For learning and troubleshooting, R has thorough and accessible documentation supported by a responsive and knowledgeable online community.

Screenshot of RStudio

The most effective R users will have some background in both programming and statistics, like most data scientists. Today, R is used widely in fields such as public health, biostatistics, climate science, market research, economics and financial analysis. Large enterprises often use R for prototyping analysis from start to finish. Known corporate users include Google, Facebook, The New York Times Infographics, Kickstarter, Bing, and Zillow.
One of the most powerful features of R is its extensive library of packages, which provide advanced statistical techniques and custom functions developed by a robust community of expert users. These include packages for domain-specific analysis such as PerformanceAnalytics and quantmod (finance), geoplot and RgoogleMaps (location data), and Bioconductor (bioinformatics). Other packages enable advanced data science methods such as machine learning and data mining. Finally, packages extend R's already-impressive graphics capabilities, including ggplot2 and lattice for static graphics and D3 for interactive graphics.

Learning R

There are many resources for learning and using R. The Comprehensive R Archive Network (CRAN) is an online repository for documentation from the R Development Core Team and all packages developed for R. Key documents include An Introduction to R, R Data Import/Export, and R Language Definition. The R FAQ also provides a broad overview of the language.
For novices, various websites offer tutorials in R. Online learning portals such as Coursera offer courses in data analysis using R as the language of instruction.

In addition to these online resources, there are various books on R. R in a Nutshell by Joseph Adler and R Cookbook by Paul Teetor provide an excellent introduction and reference.

For advanced users, R has a strong community represented in numerous websites and blogs. Much of the content is accessible through the dedicated search engine RSeek. The R Development Core Team organizes an annual conference called useR! and also publishes The R Journal.


One immediate challenge for R is that all operations are performed in memory. In the context of very large data sets, this slows down computation or makes it impossible to even load the data into R, if the machine does not have sufficient memory resources. However, new R packages are available that adapt R for use in Hadoop. While they are not meant to duplicate all R functionality for use in Hadoop, they greatly enhance the appeal of R for enterprises interested in data science.
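
To get a feel for the in-memory constraint, the sketch below measures how much RAM a modest numeric vector occupies and projects that to a larger column. The one-billion-value figure is a hypothetical chosen for illustration, and the projection ignores the extra copies R often makes during computation.

```r
# One million doubles, at 8 bytes each, held entirely in RAM
x <- numeric(1e6)
size_bytes <- as.numeric(object.size(x))

# Human-readable size of the vector
print(format(object.size(x), units = "MB"))

# Rough projection for a hypothetical column of one billion doubles
projected_gb <- 8 * 1e9 / 1024^3
print(projected_gb)
```

A single column of that size already approaches the memory of a typical workstation, before any analysis is run on it.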


RHadoop: rhbase, rhdfs, rmr

RHadoop is a collection of three open-source R libraries. The rhdfs library allows R to read files from and write files to the Hadoop Distributed File System (HDFS). The rhbase library gives R access to data stored in HBase. Finally, the rmr library allows R users to write MapReduce jobs in R itself: users specify the 'map' and 'reduce' portions of a job as ordinary R functions, using familiar R constructs and syntax. Altogether, RHadoop provides an interface to Hadoop that is familiar to R users and addresses the limitations of performing computations in memory on very large data sets.
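
The map and reduce idea can be sketched in plain base R, without Hadoop. The word count below is a minimal illustration and is not the rmr API: the names map_fn and reduce_fn and the sample input are hypothetical, and everything here runs in memory on one machine, whereas rmr would hand the equivalent functions to a Hadoop cluster.

```r
# Hypothetical input: a few lines of text standing in for records in HDFS
lines <- c("r is a language", "hadoop is a framework", "r meets hadoop")

# Map: emit one (word, 1) pair for every word in a line
map_fn <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(lines, map_fn))

# Shuffle: group the emitted values by key
grouped <- split(pairs$value, pairs$key)

# Reduce: collapse each key's values into a single count
reduce_fn <- function(values) sum(values)
counts <- vapply(grouped, reduce_fn, integer(1))

print(counts)
```

The point of the split into map and reduce steps is that each line can be mapped independently and each key reduced independently, which is what lets Hadoop spread the work across many machines.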
