Big data: The answer is… 42! | Mobileum

Written by Pedro Duque | 16/05/2014

For those who are not fans of Douglas Adams and his fabulous Hitchhiker’s Guide to the Galaxy, “42” is “The Answer to the Ultimate Question of Life, the Universe, and Everything.” It is calculated by an enormous supercomputer over a period of 7.5 million years. Unfortunately, no one knows what the question is.

Big Data is everywhere lately.

The term (and concept) “Big Data” as we use it today has been around since the ’90s. It is often credited to John R. Mashey, who popularized it while working for SGI. Since then data has been growing at a tremendous rate, with estimates of 2.5 quintillion bytes of data being produced daily. This data comes from multiple sources, including not only the web (social networks, user clickstreams) but also industrial sensors and the Internet of Things. The data stream is growing at the same time as cloud computing is reshaping the IT industry.

Every medium and large company is looking at its humongous data stores, filled with unstructured information from several sources, and trying to figure out how to get intelligence from them. And they turn to Big Data tools to find the solution.

Big Data tools are mainly based on NoSQL solutions that originated in the open source world. Companies understand that information is crucial no matter what the business is, and that to take control of their data they need to engage in data management. In data management, some data is best suited to NoSQL solutions and other data to traditional data stores. Data with high volume, variety, and complexity makes NoSQL solutions attractive, although a hybrid solution is usually the best approach.

The cost of using structured data rises with volume and complexity. Big Data is more cost-effective for storage and data access, although knowledge of how best to deal with Big Data is still scarce and expensive. There is a threshold beyond which users are willing to give up the mature capabilities of a relational database for the ability to store and access the data cost-effectively.

SQL-based relational databases holding structured data are the industry standard, with deep roots in IT. Big Data is relatively new, and as a new technology it presents several pitfalls, both in how best to approach the data and in the availability of technical knowledge. Learning from data before it is fully organized is completely different from organizing data after knowing what you want to do with it.

Big Data offers several business advantages. Businesses can rely on Big Data for speedier processes, to get more data into the analysis, and to handle and relate complex data. For example:

A credit card company changes its fraud detection process from hourly micro-batch analysis to streaming analysis of all data, with near-real-time pattern detection.
A web store captures all of its users’ interactions and serves real-time recommendations.
An industrial company monitors its facilities with sensors checking temperature, pressure, and power data in order to optimize energy usage.
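
To make the first example more concrete, here is a minimal sketch of what checking each transaction as it arrives, rather than once an hour, might look like. Everything in it is an assumption made for illustration: the `VelocityDetector` class, the 60-second window, and the threshold of three transactions are hypothetical, and a real fraud model would look at far more than transaction frequency. It also assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class VelocityDetector:
    """Flags a card that makes too many transactions inside a sliding time window.

    Hypothetical rule for illustration only; not a real fraud model.
    """

    def __init__(self, window_seconds=60, max_transactions=3):
        self.window_seconds = window_seconds
        self.max_transactions = max_transactions
        # One queue of recent timestamps per card.
        self.recent = defaultdict(deque)

    def observe(self, card_id, timestamp):
        """Process one event as it arrives; return True if it looks suspicious."""
        window = self.recent[card_id]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_transactions


detector = VelocityDetector(window_seconds=60, max_transactions=3)

# (card, timestamp in seconds), assumed to arrive in time order.
events = [
    ("card-1", 0),
    ("card-1", 10),
    ("card-2", 15),
    ("card-1", 20),
    ("card-1", 30),   # fourth transaction by card-1 within 60 seconds
    ("card-1", 200),  # much later, the window has emptied
]

flags = [detector.observe(card, ts) for card, ts in events]
print(flags)
```

The decision is made per event, so a suspicious pattern can be flagged seconds after it happens instead of at the end of the next hourly batch.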

There are also some operational advantages:

Optimize incumbent database licensing costs, since some existing data stores were forced to fit Big Data workloads into a traditional approach (e.g., data stored in traditional databases with high hardware and software costs when Hadoop could be used over commodity hardware).
Improve search/reporting performance via horizontal or vertical partitioning.
Take advantage of Cloud services due to Big Data’s parallel nature.
Add new fields to Big Data solutions effortlessly, making it easy to change the data schema on the fly (e.g., it’s really easy to add a new field into a Hadoop store, whereas adding a new column in a relational database with billions of entries might take a non-negligible amount of time).
Gain high availability and better disaster recovery, as Big Data solutions have embedded data replication mechanisms.
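
The point about changing the schema on the fly can be illustrated with a small sketch. It assumes records are stored as one JSON document per line, which is only one of many possible storage formats; the `currency` field, the `read_record` helper, and the `"USD"` default are all hypothetical. The idea is that old records are left untouched and the reader fills in the missing field when the data is read.

```python
import json

# Records written before the new field existed (hypothetical sample data).
old_lines = [
    '{"user": "a", "amount": 10.0}',
    '{"user": "b", "amount": 5.5}',
]

# Records written later simply carry the extra "currency" field.
new_lines = [
    '{"user": "c", "amount": 7.0, "currency": "EUR"}',
]


def read_record(line, default_currency="USD"):
    """Parse one stored line, filling in the new field if it is absent."""
    record = json.loads(line)
    record.setdefault("currency", default_currency)
    return record


# Old and new records can be read together; nothing was rewritten on disk.
records = [read_record(line) for line in old_lines + new_lines]
for record in records:
    print(record["user"], record["amount"], record["currency"])
```

The trade-off is that the work moves from the writer to every reader, which now has to know how to handle records with and without the field.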

But Big Data is no silver bullet.

Going for the Big Data approach is not a decision to make lightly. The open source nature of most tools, notably Apache Hadoop, means that IT managers won’t get the typical support from software providers for solution maintenance and operation.

Also, for data exploration, data analysts must typically rely on some form of MapReduce as the data access paradigm, which requires a mindset that is alien to most developers.
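
For readers who have not met the paradigm, here is a minimal in-memory sketch of the MapReduce idea in plain Python. It does not use Hadoop or any real framework API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the click records are made up for illustration. A real job would run the same three steps distributed across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all values that share a key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def click_mapper(record):
    # Emit one (page, 1) pair per click.
    yield (record["page"], 1)


def count_reducer(key, values):
    # Sum the ones emitted for this page.
    return sum(values)


def page_counts(records):
    return reduce_phase(shuffle(map_phase(records, click_mapper)), count_reducer)


clicks = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/cart"},
    {"user": "a", "page": "/home"},
]

print(page_counts(clicks))  # {'/home': 2, '/cart': 1}
```

The shift in mindset is that the analyst no longer writes a query describing the result, but two small functions describing what to emit per record and how to combine what was emitted.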

Big Data is not a magic wand that you wave to make data organize itself and give you the answers you need. Before searching for results, you need to know what you are looking for. Big Data can help you formulate the questions based on the findings. It is an iterative process, but “unfortunately” to formulate the right question you still need to rely on people. Business knowledge is key, and you’ll need experts with analytical skills in order to make sense of all the answers you can get from your Big Data initiatives.

Otherwise you risk ending up only with “42”.
