A qualified “Woo-hoo” for big data

Big Data tools are a game-changer. Relational Database Management Systems (RDBMSs) are reaching the point where they (at least on their own) can no longer enable us to get the maximum insight into the vast amount of information available.

June 12, 2015

Also, the way in which our systems are required to scale is becoming less organic. We need to meet unpredictable peaks readily, while being able to scale back when those peaks recede (think of the latest mobile app craze here – what were its data storage and analysis requirements 6 months ago? What will they be in 6 months’ time?).

In Back to the Future, Doc Brown frequently used the phrase “One point twenty-one gigawatts!”. Back in 1985 the prefix giga was pretty unfamiliar to a layman, which perhaps explains why Christopher Lloyd mis-pronounced it “jiggawatts” and no-one called him on it.


In 2015 jiggawatts sounds wrong because we’re all very familiar with the prefix: the measurements we encounter frequently run into the billions of units. The laptop I’m writing this on is in the gigabytes of memory and storage and in the gigahertz of processing speed. I have an external hard drive in the terabytes, and that prefix is familiar to most people these days too.

In 2013 the total volume of global unique digital data is estimated to have surpassed 4.45 zettabytes. That’s four point forty-five trillion jiggabytes, as Doc Brown would have put it.
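The conversion is easy to check. The sketch below assumes decimal (SI) prefixes, so a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes, and it uses "trillion" in the short-scale sense of 10^12. The function name is purely illustrative.

```python
# SI (decimal) prefixes are assumed here, not the binary gibi/zebi variants.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes: float) -> float:
    """Convert a volume in zettabytes to gigabytes."""
    return zettabytes * BYTES_PER_ZETTABYTE / BYTES_PER_GIGABYTE


gigabytes = zettabytes_to_gigabytes(4.45)
trillions_of_gigabytes = gigabytes / 10**12  # short-scale trillion
print(f"{trillions_of_gigabytes:.2f} trillion gigabytes")
```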


Where’s it all coming from?

The volume of digital data we use in our daily lives has increased exponentially over the last 20 years, but that is nothing to the rate at which the data we produce (knowingly and unknowingly) is increasing.

Several factors are contributing to this, not least the “internet of things”. The way in which our devices are connected to each other is significant in data expansion because up until now the rate at which human beings can actively create data has been a brake. If you’re only measuring the things that human beings actively record then you’re limited by the fact that humans are, well, pretty slow.

We used to collect data that ultimately, somewhere along the line, had been generated by an actual person recording it manually. This is no longer the case. If you go for a 5-minute walk with your smartphone in your pocket then the data relating to your physical movements is recorded in multiple places (by the OS of the phone, by apps that sit on it, by systems at your mobile provider’s Network Operations Centre), and in more detail than you could possibly enter into a spreadsheet if you had a full day to devote to it. That may well seem a bit creepy, and it probably should, but that’s a different article.


What to do with it all

Collecting all this data is one thing, analysing it another, and making commercial decisions based on your findings another altogether. Most organisations are starting to find that the data available to them vastly exceeds the amount they could exploit with relational database engines, the analytical tools that come with them and, significantly, the hardware they could possibly afford to run it all on.

Big Data tools such as NoSQL (Not Only SQL)-based databases are increasingly seen as the solution to these problems.

Relational databases are traditionally judged by their ability to meet a quartet of requirements known as ACID: Atomicity, Consistency, Isolation, Durability.

Distributed computer systems are traditionally thought to be able to meet only two of the three properties referred to in the CAP (or Brewer’s) theorem – Consistency, Availability, Partition tolerance.

These two sets of limitations have been traditionally considered self-evident and inviolable, and NoSQL approaches tend to relax or abandon one of the three CAP requirements – usually consistency.

NoSQL breaks apart the relational model, abandons the restrictions of pre-defined schemas and reverts to a document-based repository model. This is ideal for distributing across a sharded architecture that can utilise lots of smaller, cheaper, bits of equipment rather than requiring a huge hardware outlay.

These distributed architectures are what is called “eventually consistent”, meaning that database write operations can be performed with maximal speed and the engine is then left to cascade the new values around the nodes of the system in its own time. If this sounds a bit lax then, well, it kind of is, but it’s not without precedent. The global DNS that directs domain-targeted HTTP requests to the appropriate IP addresses operates in a very similar way. Plus, it’s a trade-off that is necessary to make Big Data work, and it’s important to note that Big Data doesn’t come without drawbacks.
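To make “eventually consistent” a little more concrete, here is a toy model in Python. It is a sketch under heavy assumptions rather than how any real NoSQL engine works: the class and method names are invented for illustration, replication is a simple in-memory queue, and conflicting writes are not handled at all.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: a write lands on one node and reaches the others later."""

    def __init__(self, node_count: int) -> None:
        self.nodes = [{} for _ in range(node_count)]
        self.pending = deque()  # replication messages not yet delivered

    def write(self, node: int, key: str, value: str) -> None:
        # Acknowledge as soon as a single node has the new value...
        self.nodes[node][key] = value
        # ...and queue the update for every other node.
        for other in range(len(self.nodes)):
            if other != node:
                self.pending.append((other, key, value))

    def read(self, node: int, key: str):
        return self.nodes[node].get(key)

    def replicate_one(self) -> bool:
        """Deliver one queued update; return False when nothing is pending."""
        if not self.pending:
            return False
        node, key, value = self.pending.popleft()
        self.nodes[node][key] = value
        return True


store = EventuallyConsistentStore(node_count=3)
store.write(0, "status", "shipped")

stale = store.read(2, "status")  # replication has not reached node 2 yet
while store.replicate_one():
    pass
fresh = store.read(2, "status")  # every node has now converged
print(stale, fresh)
```

The gap between the first and second read is the price of the fast write: for a short window, different nodes can give different answers to the same question.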

Big Data tools offer huge advantages in scalability and use of processing tools, but remember that use of Big Data tools requires a paradigm shift and you need developers and DBAs who are either experienced in the use of them, or who at the very least “get it”.

Remember that rules of referential integrity in relational databases, when properly applied, have made it hard to mess things up too badly for 30 years or more. If your normalised database has been sensibly constructed and a developer asks the application to commit a value to the wrong (or at least non-primary) location for that value, then either the change will cascade to the other places that value is stored, or the database engine will say no. This built-in enforcement of rigour has allowed developers to delegate a lot of responsibility when working with data, and they need to be aware that this can no longer be the case with Big Data and NoSQL databases.
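Both behaviours, the engine saying no and a change cascading, can be seen in a few lines using Python’s built-in sqlite3 module. This is only an illustrative sketch: the customers and orders tables are made up for the example, and SQLite needs foreign key enforcement switched on explicitly, which other relational engines typically do by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id) ON UPDATE CASCADE,"
    " item TEXT NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (customer_id, item) VALUES (1, 'lawnmower')")

# An order for a customer that does not exist: the engine says no.
try:
    conn.execute("INSERT INTO orders (customer_id, item) VALUES (99, 'scissors')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Changing the customer's key cascades to the rows that reference it.
conn.execute("UPDATE customers SET id = 2 WHERE id = 1")
order_customer_ids = [row[0] for row in conn.execute("SELECT customer_id FROM orders")]
print(rejected, order_customer_ids)
```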

The document-based, schema-free model of Big Data requires data to be denormalised – not just in a bolt-on data warehouse, but in the core model. This means that the same data point can be stored in more than one place, and it’s up to the architect and developers to make sure it stays consistent.

You need to make sure the rules of what goes where are now enforced in your application, ideally via a very tightly controlled Data Access Layer. Consider only allowing a small subset of your developers to work on this part of the application (or at least make sure that your revision control system uses strictly enforced gated check-ins).
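As a rough idea of what such a layer might look like, the sketch below keeps a denormalised customer name consistent across two collections. Everything in it is hypothetical: the class name, the document shapes, and the use of plain Python dictionaries in place of a real document store.

```python
class CustomerDataAccessLayer:
    """Hypothetical DAL: the only code allowed to write customer names."""

    def __init__(self) -> None:
        # Two in-memory "collections" standing in for a document store.
        self.customers = {}  # customer_id -> customer document
        self.orders = {}     # order_id -> order document with a copied name

    def add_customer(self, customer_id: str, name: str) -> None:
        self.customers[customer_id] = {"name": name}

    def add_order(self, order_id: str, customer_id: str, item: str) -> None:
        customer = self.customers[customer_id]  # KeyError if the customer is unknown
        self.orders[order_id] = {
            "customer_id": customer_id,
            "customer_name": customer["name"],  # denormalised copy
            "item": item,
        }

    def rename_customer(self, customer_id: str, new_name: str) -> None:
        # The rule of "what goes where" lives here, not in the database:
        # update the primary copy, then every denormalised copy.
        self.customers[customer_id]["name"] = new_name
        for order in self.orders.values():
            if order["customer_id"] == customer_id:
                order["customer_name"] = new_name


dal = CustomerDataAccessLayer()
dal.add_customer("c1", "Ada Lovelace")
dal.add_order("o1", "c1", "lawnmower")
dal.rename_customer("c1", "Ada King")
print(dal.orders["o1"]["customer_name"])
```

If any other part of the application writes to either collection directly, the two copies of the name can drift apart and nothing will complain, which is why access to this layer is worth guarding.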


A lawnmower and nail scissors

Before the invention of the lawnmower, groundsmen used to maintain bowling greens with scissors. That sounds made-up but it isn’t. Now, if you were a groundsman then you would have been pretty excited when the first lawnmowers arrived. Well, maybe not the first ones as they were made out of cast iron and impossibly heavy, but some of the later refinements would have made a manufactured lawnmower a life-changing boon. Most people would agree, though, that that doesn’t mean a lawnmower should replace scissors in all their multiple uses, and that trying to use one to cut your toenails would be taking things a step too far.

It’s the same with Big Data tools, and there is nothing wrong with taking a hybrid approach. If your data landscape takes the form of a reasonably finite volume of customers, users, organisations, objects etc, producing a vast and unwieldy body of data then why not have a relational database storing the former and a NoSQL repository storing and processing the latter?
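A hybrid arrangement of this kind might be sketched as follows. The relational side uses Python’s built-in sqlite3 module; the “NoSQL repository” is just a list of schema-free dictionaries standing in for a real document store, and the table, field names and sample records are all invented for the example.

```python
import sqlite3
from collections import Counter

# Relational side: the small, well-defined set of customers.
relational = sqlite3.connect(":memory:")
relational.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
relational.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# Document side: a plain list standing in for a NoSQL event repository.
# Note that the documents do not all share the same fields.
events = [
    {"customer_id": 1, "type": "page_view", "path": "/mowers"},
    {"customer_id": 1, "type": "page_view", "path": "/scissors"},
    {"customer_id": 2, "type": "purchase", "sku": "MOWER-1", "price": 199.0},
]

# Process the bulky data where it lives, then join the small result
# back to the relational records.
events_per_customer = Counter(event["customer_id"] for event in events)
report = {
    name: events_per_customer[customer_id]
    for customer_id, name in relational.execute(
        "SELECT id, name FROM customers ORDER BY id"
    )
}
print(report)
```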

Apache have a tool called Sqoop which transfers data from RDBMS to Hadoop and back again, and many organisations are using what is known as Polyglot Persistence (the concurrent use of multiple data storage technologies, and their associated languages) in their approach to Big Data.


Integrated hybrid approaches?

There are also a few providers who are starting to produce their own Hybrid NoSQL solutions, and Google recently published a new paper, “F1: A Distributed SQL Database That Scales”, that describes a database that claims to maintain the ACID requirements while being truly distributed. It also uses a more SQL-like dialect that traditional database developers will find more familiar and comfortable. It sounds too good to be true, but Google are using it in anger for their AdWords program and it was a Google white paper that kick-started the Big Data discussion 8 years ago, so it would be wise to pay attention.

Ultimately, Big Data is here and can be hugely powerful to your organisation. It’s something that pretty much everyone should be considering, and most will end up adopting in some form or another. Just make sure you’re doing so for the right reasons and not getting caught up in group hysteria – there are caveats, and these should be understood, but the good news is that most of the drawbacks can be mitigated by taking a measured approach and using the right tools for the right job. If you’re not sure if you need or can make use of Big Data at the moment, then caution might be the best strategy – this is a maturing area and in a couple of years’ time the marketplace will be offering more sophisticated hybrid products that will represent the best of both worlds.