Managing Big Data – it’s not about how big it is, but how you use it

In 2006 I joined the London Organising Committee of the Olympic and Paralympic Games (LOCOG) as employee number 73 with a brief to build a website for the Games and help reconnect the Olympics with young people. At that time, while the internet was widely used in most westernised economies, the majority of connections were dial-up – not broadband. The big names were Myspace and Bebo; Facebook was in its infancy and Twitter hadn’t been invented. 16- to 24-year-olds used the social media platforms available twice as much as any other age group, and only the most determined mobile phone users used their handset to access the internet – the iPhone wasn’t released until 2007. In the end, instead of building one website, by 2012 my small team had built 80, as well as three apps and 150 social media channels, which collectively attracted a global audience of 150m people by the end of the Games.

A big part of the appeal of the products and services we built was data. In data terms, the Games is essentially 15,000 athletes competing in 7,000 heats of live sport covering 305 events over 16 days – where an event is something, such as women’s 100m breaststroke or men’s basketball, which leads to a medal. That’s a lot of data to record, process and present.

But on the face of it, it wasn’t rich or complex enough to count as what is commonly understood to be “big” data – although I think some of the disciplines we followed and learnt in managing and presenting it have lessons for the management of data sets of any size.

Organisations – usually large ones – have been using computers to analyse data for decades. A meaningful change in the art of the possible came about in the last handful of years as reliable, scalable distributed computing using everyday hardware has become a reality and has made large scale data analysis affordable.

Probably the most famous software now commonly used to analyse large data sets is Hadoop, which was created by a Yahoo! engineer in 2005. Given Hadoop was named with a refreshing lack of hubris after its inventor’s son’s toy elephant, it’s probably not surprising to find that it’s been freely available since its inception. However, the hardware it needs to run on was still expensive when it was first released and it only really took off at the end of the last decade when “Cloud” or “Elastic” instances of networked computers, such as Amazon’s EC2, became widely available.

Using parallel processing (where networks of small computers pretend to be a big one), organisations and individuals can, in theory, now process all the data they have collected without having to watch their children have grandchildren in the meantime.
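The split, process and merge pattern behind this kind of parallel processing can be sketched in a few lines of Python. This is only an illustration, not how Hadoop itself is implemented: the worker threads stand in for separate machines, and the click log and the function names (`count_events`, `split`, `parallel_count`) are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_events(chunk):
    """Map step: each worker counts the entries in its own slice."""
    return Counter(chunk)


def split(records, n_chunks):
    """Deal the records out into n_chunks roughly equal slices."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def parallel_count(records, n_workers=4):
    """Split the records, count each slice on its own worker, merge the results."""
    chunks = split(records, n_workers)
    # Threads stand in for the "small computers" here (an assumption for brevity).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_events, chunks))
    # Reduce step: combine the partial counts into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical click log: one entry per page view.
clicks = ["tickets", "schedule", "tickets", "medals", "tickets", "schedule"]
totals = parallel_count(clicks, n_workers=3)
```

The point of the design is that no worker needs to see the whole data set: each one handles its own slice, and only the small partial results are brought back together.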

For start-ups, this has had huge benefits. Scaling, keeping indexes on massive databases, getting search results back quickly, storing unstructured data in searchable ways and so on were almost impossible in the mid-2000s. Without this ability, Facebook – founded in 2004 – and Twitter – invented in 2006 – couldn’t exist in their current form.

Scaling is now considered reasonably simple and cheap. Even individual server capacity – a big issue for start-ups in the dot-com boom – has changed completely. Were it not for reasons of security and efficiency, some large companies could probably process all their hundreds of millions of daily transactions on one big machine with one big disk. Of course, there is a difference between having the capacity and programming it right.

One of the biggest data sets we built for London 2012 was a database of 2.5 million ‘Expressions of Interest’ in ticket buying which ran between 2006 and 2011. It started with 46,000 people who signed up when we won the bid to host the Games in July 2005, and the list grew to 5 million. In a carefully managed online ballot, the allocation for most tickets was sold out in a matter of days. Who knew rhythmic gymnastics would be the most popular sport at the time of application?

We learnt a lot about people’s intent: when they wanted to apply in an application window; the platforms they used – Apple or Windows, mobile or desktop; where they were from; and what sports they liked. We were feeling pretty good about ourselves. Then we opened the real-time sale at 6:00am on a spring morning. By 6:15, the system was in meltdown; by 6:20, the BBC’s lead Olympic journalist was ringing to tell us that we faced the biggest crisis in Olympic history; and within an hour, we had sold a key venue over three times as the ticketing system fell apart. Fortunately, that venue was the water polo arena and the event was synchronised swimming. When we emailed the ticket holders, they were only too happy to accept an alternative.

If you are doing it right, data in 2014 isn’t really BIG until it can’t fit on one machine, can’t be usefully queried by one machine, and can’t be queried interactively at all. You have to be a very large business indeed to need to be analysing “big data” on a day-to-day basis in your normal run of business.

For an online business, big data means server logs, clicks on a website, interactions with apps and emails being opened. This sort of data has the potential to bring value to your business. But it’s important to bear in mind a few rules of conduct:

• Do not conflate size with usefulness. The amount of data a business can collect is increasing exponentially. But that doesn’t mean it contains more information. And if you set up a big data project simply to analyse your waste product, you are – as Americans would say – simply “huffing your own fumes”.
• Only analyse data that is likely to yield insight. If your CIO tells you that you need a big data project simply because you have a lot of data, then unless it’s for compliance reasons, you need to ask them how much of that data, when analysed, is going to deliver any real insight. For the informative part of your data to provide anything insightful, it has to be interpretable, relevant and show something new.
• Be careful not to keep on adding data sets to your mix, even if you think they might be insightful. Past a certain point, the more data you collect, the less insight you are likely to get from it because more of it will be redundant: more noise, even less signal.
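One rough way to test the last rule before adding a new data set is to check whether it simply moves in step with something you already hold. The sketch below uses a plain Pearson correlation for that. The figures, the metric names and the 0.95 threshold are all invented for illustration, and a high correlation is only a hint of redundancy, not proof.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def is_redundant(candidate, existing, threshold=0.95):
    """True if the candidate tracks any existing series almost exactly."""
    return any(
        abs(pearson(candidate, series)) >= threshold
        for series in existing.values()
    )


# Hypothetical daily figures for five days.
page_views = [120, 150, 90, 200, 170]
existing = {"page_views": page_views}

sessions = [60, 75, 45, 100, 85]      # exactly half of page views: nothing new
email_opens = [30, 10, 50, 20, 40]    # moves differently: may carry new signal

sessions_redundant = is_redundant(sessions, existing)        # True
email_opens_redundant = is_redundant(email_opens, existing)  # False
```

A series that fails this check might still be worth keeping for other reasons, but it is unlikely to tell you much that the existing data does not.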

However, understanding the data sets themselves is much less important than the business structure around the analysis process. There is little point spending time and money analysing data unless you are genuinely enthusiastic about it at the highest level of your business, and can commit to putting good people on it who can communicate results in language everyone can understand. I had a constant battle reporting digital analytics to a senior board and resorted to everyday comparisons – this week we are bigger than Chelsea, this week we are bigger than Sainsbury’s – and, by the end, even to telling the story through song lyrics. If and when you have actionable insight, you also need a business structure that is ready and able to learn from it and do something about it. Otherwise it’ll just be another report gathering dust in a desk drawer or “big data” email archive.

Where a lot of Silicon Valley firms are being successful in using big data sets is in data mining, or what you might call machine learning. Pulling relevant samples from big data sets has led to success. Google, for example, has used it to good effect for machine translation, voice recognition and image processing, where millions of samples (some of which have been pre-classified) are used to train a model to classify the rest.
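The idea of training on pre-classified samples and then classifying the rest can be shown with a deliberately tiny example. The nearest-centroid classifier below is a stand-in chosen for brevity, not the technique Google uses, and the two-number feature vectors, the labels and the function names are all hypothetical.

```python
from math import dist


def train_centroids(labelled):
    """Average the feature vectors of each label to get one centroid per label."""
    grouped = {}
    for features, label in labelled:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], tuple(features)))


# Hypothetical pre-classified samples: (features, label).
labelled = [
    ((0.9, 0.1), "sport"),
    ((0.8, 0.2), "sport"),
    ((0.1, 0.9), "tickets"),
    ((0.2, 0.7), "tickets"),
]
model = train_centroids(labelled)

# Classify samples that nobody has labelled yet.
first = classify(model, (0.7, 0.3))
second = classify(model, (0.3, 0.6))
```

Real systems use far richer features and models, but the workflow is the same: a labelled sample trains the model, and the model then labels the much larger unlabelled remainder.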

But even with sampling, you need to be sure that big data is the answer. In London, we were constantly approached by firms who promised they could tell us what the world thought of us by analysing social media. As we had 100 journalists covering our beat, watching social media and ringing our teams every day, what we really needed was insight into what issues weren’t currently live but were about to break – and not one company could help with that. In the end, the main practical value of social listening for us was to contextualise the noise. When a journalist rang telling us that the world was raging about the latest ticket sale – based on three tweets they had just read – our Communications team were able to counter by telling them that while people were talking about the issue, ten times as many were interested in David Beckham’s new haircut or Pippa Middleton’s rear view.

In short, the insights you can glean from analysing big data are only as useful as the organisation’s ability to learn from them and act upon them.
