The Future of Data

Stephan Shakespeare

An initial perspective on the future of data by Stephan Shakespeare - Co-founder and CEO of YouGov

YouGov has partnered with the world’s largest open foresight programme, Future Agenda 2.0, to help bring experts together from around the world to debate and explore emerging issues that will shape the future of data.

The initial perspective written below by YouGov's Co-founder and CEO, Stephan Shakespeare, will be shared and challenged by experts from leading organisations in a series of five qualitative workshops in Dubai, Cologne, London, New York and Singapore, to help build a stronger, richer and deeper perspective on the future of data over the next 10 years.

The Global Challenge

In the last ten years we have seen an explosion in the amount of structured data we produce through our everyday activities. All online activity, such as credit card payments, web searches and mobile phone calls, leaves a data exhaust: little puffs of evidence about our behaviour, what we do and how we think. This can now be stored, shared and analyzed, transforming it from meaningless numbers into life-changing tools.

Like it or not, we live in a world where personal information can be accessed on a massive scale at the touch of a key. Although there are myriad benefits (medicine, education and the allocation of resources are obvious areas), there are also significant risks. The threat of cyber warfare is a good example. There is no turning back, so what does this mean for society going forward? I believe that in order to maximize the benefits and minimize the risks over the next ten years we will have to fundamentally change our behaviours, our structures and our businesses.

Writing today, my real concern is that we haven’t yet got a clear understanding of the risks this new data-fuelled world brings and therefore even less about how to deal with them. That doesn’t mean we should over-react. Indeed the opposite: if we haven’t thought them through, we are more likely to over-react in some areas and under-prepare in others.  We are obviously severely under-prepared against cyber-terrorism, as we see with the recent Sony debacle.

As an example of over-reaction, look at concerns about health data, which, in the main, can be addressed through the judicious use of sandbox technologies and severe penalties for misuse. Surely it is counterintuitive to miss out on the enormous social benefit of sharing health data because we haven’t thought properly about how to deal with potential risks? How do we exploit data knowledge to positive effect and what are the key challenges going forward?

The first big issue is how to keep the opportunities equal. I believe that all levels of society should benefit from the information data crunching can deliver. But just because the capability is there, it is not a guarantee that it will be shared universally. Currently this is an area where new inequalities could grow, and existing inequalities could get worse. Data sharing and the science of getting value from data are obviously much more developed in the advanced economies. It’s quite possible that these skills will be used to accelerate their own national well-being, both commercial and social, leaving less technologically based societies behind. It would be wrong to assume that technology will be a leveler at all times. Yes, it has the potential, but the hope that it will have an equalizing effect is by no means assured.

There are obvious tensions between sharing, privacy and freedom. But we must be wary of erecting a virtual net curtain, hiding the voyeur and leaving the public vulnerable. Why shouldn’t youthful misdemeanors be left in the ether? I think they should. After all, we know that silly things sometimes happen – even to ourselves. The trick is for us all to know and acknowledge what is public, and to act accordingly. Years ago, we lived in small communities. Our doors were unlocked and our neighbours knew our every move. It was considered normal. Our community is now global, but the principle remains the same. Some guidelines do need to be established if we are to maximize the social benefit of data; we must develop an agreement about what privacy really is in reality as well as in the virtual world. This will involve thinking afresh about the relationship between the citizen, governments, and corporations.

Understanding data ownership will become a bigger issue than it already is today. Consumers and end users will want to own and control their personal data, but this seemingly straightforward statement grows more difficult to achieve with each passing day. There isn’t much information that we can easily say belongs to just one person. Consider two people having a chat in a café. The content belongs to both of them; the fact of their meeting belongs to all who observe it. If I have a contagious disease, we don’t consider that information my personal property. When a doctor takes your temperature, does that information belong to you, the doctor or the hospital? Data is useful to everyone, so we must get used to sharing, particularly as more and more of our lives become digitised and new issues arise. The challenge is to develop our ethical and legal apparatus for this, establishing a set of agreed principles and a regulatory framework that can act as the basis

History is littered with evidence that shows how we consistently fail to identify the next big threat. The Greeks didn’t recognize the Trojan Horse; the Allies in the First World War weren’t initially concerned about aerial warfare. Similarly, I believe we are currently under-playing the potential impact of cyber-attack. As more control systems are connected to the web, more vulnerability will inevitably appear.

Cyber-security, which involves protecting both data and people, is facing multiple threats; cybercrime and online industrial espionage are growing rapidly. Last year, for example, over 800 million records were lost, mainly through cyber attacks. A recent estimate by the think tank the Center for Strategic and International Studies (CSIS) puts the annual global cost of digital crime and intellectual property theft at $445 billion, a sum roughly equivalent to the GDP of a smallish, rich European country such as Austria.

Although the attacks on Target, eBay and Sony have recently raised the risk profile in boardrooms around the world, law enforcement authorities are only now grappling with the implications of a complex online threat that knows no national boundaries. Protection against hackers remains weak, and security software is continuously behind the curve. Wider concerns have been raised by revelations of mass surveillance by the state; a growing number of countries now see cyberspace as a new stage for battle, and are actively recruiting hackers as cyber warriors. How to minimize this threat is key to all of our futures.

Options and Possibilities

The way data will be optimized is changing. It is not enough to know single lines of information. Data must be connected and multi-layered to be relevant. It means knowing not one thing or ten things or even 100 things about consumers but tens and hundreds of thousands of things. It is not big data but rather connected data – the confluence of big data and structured data – that matters. Furthermore, with the growth in social tools, applications and services, the data in the spider’s web of social networks will release a greater value. In the UK alone, YouGov now knows 120,000 pieces of information about over 190,000 people. This is being augmented every day. The analysis of this allows organisations, both public and private, to shape their strategy for the years ahead.

We are also growing a huge data-store of over a million people’s opinions and reported behaviours. These are explicitly shared with us by our panelists to use commercially as well as for wider social benefit (indeed we pay our panelists for most of the data shared).

But many companies exploit data that has been collected without genuine permission; it’s used in ways that people do not realize, and might object to if they did. This creates risks and obstacles for optimising the value of all data.  Failure to address this will undermine public trust.  We all have the right to know what data others have and how they are using it, so effective regulation about transparency and the use of data is needed.  Europe is leading the way in this respect.

Governments, however, are the richest sources of data, accounting for the largest proportion of organized human activity (think health, transport, taxation and welfare). Although the principle that publicly-funded data belongs to the public remains true, certainly in the UK, we can expect to see more companies working with, through and around governments. Having the largest coherent public sector datasets gives Britain huge advantages in this new world.

It is clear that encouraging business innovation through open data could transform public services and policy making, increasing efficiency and effectiveness. The recent Shakespeare Review* found that data has the potential to deliver a £2bn boost to the economy in the short term, with a further £6-7bn down the line. However, the use of public data becomes limited when it involves private companies. To address this in the future, when companies pitch to work with governments, preference should be given to those that share an open data policy, or at least the relevant parts. Furthermore, where there is a clear public interest in wide access to privately generated data – such as trials of new medicines – there is a strong argument for even greater transparency.

Aside from governments (whose data provision is by no means perfect), access to large, cheap data sets is difficult. The assumption is that everything is available for crunching and that the crunching will be worth the effort. But the reality is that there are different chunks of big data – scientific, business and consumer – which are collected, stored and managed in multiple ways. Access to relevant information, let alone the crunching of it, will take some doing. On top of this, much corporate and medical data is still locked away, stuck on legacy systems that will take years to unpick. Many would say the sensible thing is to adopt a policy of standardization, particularly for the medical industry, given the growing number of patients living with complex long-term conditions. And yet, many standards abound. So in addition to regulation around transparency, over the next ten years we can expect to see agreement on standardisation in key areas.

But the potential benefits from this wealth of information are only available if there are the skills to interpret the data. Despite Google’s chief economist, Hal Varian, saying that “the sexy job of the next ten years will be statisticians”, number crunchers are in short supply (or at least not always available in the right locations at the right time). By 2018 there will be a “talent gap” of between 140,000 and 190,000 people, says the McKinsey Global Institute. The shortage of analytical and managerial talent is a pressing challenge, one that companies and policy makers must address.

Separately, it is entirely plausible that the infrastructure required for the storage and transmission of data may struggle to keep pace with the increasing amounts of data being made available. Data generation is expanding at an eye-popping pace: IBM estimates that 2.5 quintillion bytes are being created every day and that 90% of the world’s stock of data is less than two years old. A growing share of this is being kept not on desktops but in data centres such as the one in Prineville, Oregon, which houses huge warehouses containing rack after rack of computers for the likes of Facebook, Apple and Google. These buildings require significant amounts of capital investment and even more energy. Locations where electricity generation can be unreliable or where investment is limited may be unable to effectively process data and convert it to useful, actionable knowledge. Yet it is the growing populations in these same areas – parts of Asia and Africa, for example – that will accelerate data creation, as more of their inhabitants develop online activities and exhibit all the expected desires of a newly emerging middle class. How should this be managed?

*Shakespeare Review: An independent Review of Public Sector Information, May 2013

Proposed Way Forward

Economically, connected data can clearly benefit not only private commerce but also national economies and their citizens. For example, the judicious analysis of data can provide the public sector with a whole new world of performance potential. In a recent report, consultancy firm McKinsey suggested that if US healthcare were to use big data effectively, the sector could create more than $300 billion in value every year, while in the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone.

It is understandable that many citizens around the world regard the collection of personal information with deep suspicion, seeing the data flood as nothing more than a state or commercial intrusion into their privacy. But there is scant evidence that these sorts of concerns are causing a fundamental change in the way data is used and stored.

That said, we must all have a care. As public understanding increases, so will concerns about privacy violation and data ownership. If it is discovered that companies are exploiting data that has been collected without genuine permission and are using it in ways that have no societal benefit, there is a considerable risk of a public backlash that will limit opportunities for everyone. The shelf life of the don’t-know-so-don’t-ask approach to data collection will be short.

Some in the industry believe governments need to intervene to protect privacy. In Britain, for instance, the Information Commissioner’s Office is working to develop new standards to publicly certify an organisation’s compliance with data-protection laws. But critics think such proposals fall short of the mark, especially in light of revelations that America’s National Security Agency (NSA) ran a surveillance programme, PRISM, which collected information directly from the servers of big technology companies such as Microsoft, Google and Facebook.

From a marketing perspective, detailed awareness of customer habits will enable technology to discriminate in subtle ways. Some online retailers already use “predictive pricing” algorithms that charge different prices to customers based on a myriad of factors, such as where they live, or even whether they use a Mac or a PC.

Transport companies provide another interesting use case for connected data. Instead of simply offering peak and off-peak pricing, they can introduce a far more granular, segmented model. Customers can see the cost of catching a train, and the savings that can be made by waiting half an hour for the next one. They can also see the relative real-time costs of alternative transport to the same destination, and perhaps decide to take a bus rather than a train. They have the ability to make informed, value-based judgments on the form of travel that will best suit their requirements. Such dynamic systems will provide greater visibility of loading and so allow the use of variable pricing to nudge passengers into making alternative choices that can improve the efficiency of the overall network.  Benefits all round.  That said, although there may be innocuous reasons for price discrimination, there are currently few safeguards to ensure that the technology does not perpetuate unfair approaches.
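To make the granular model described above more concrete, here is a minimal sketch of load-based pricing. It is an illustration under stated assumptions, not a description of any real operator's system: the `Departure` type, the `dynamic_fare` function and every parameter value (target load, sensitivity, floor and cap) are hypothetical. The floor and cap stand in for the kind of simple safeguard that could limit how far prices move.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Departure:
    """One travel option a customer could choose (hypothetical structure)."""
    label: str          # e.g. "08:00 train"
    base_fare: float    # undiscounted fare
    load_factor: float  # forecast share of seats filled, from 0.0 to 1.0


def dynamic_fare(dep, target_load=0.7, sensitivity=0.5, floor=0.6, cap=1.5):
    """Scale the base fare up when a service is fuller than the target load
    and down when it is emptier. All parameter values are assumptions."""
    multiplier = 1.0 + sensitivity * (dep.load_factor - target_load) / target_load
    # Clamp the multiplier so prices cannot swing beyond agreed limits.
    multiplier = max(floor, min(cap, multiplier))
    return round(dep.base_fare * multiplier, 2)


def cheapest(options):
    """Return the option with the lowest dynamic fare."""
    return min(options, key=dynamic_fare)


options = [
    Departure("08:00 train", 20.0, 0.95),  # crowded peak service
    Departure("08:30 train", 20.0, 0.50),  # quieter service half an hour later
    Departure("08:10 bus", 12.0, 0.70),    # alternative mode at target load
]
for dep in options:
    print(dep.label, dynamic_fare(dep))
```

With these made-up numbers the crowded train costs more than the later one, which is the nudge the paragraph describes. Whether such a scheme is fair depends entirely on which inputs are allowed into the multiplier.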

Open access to data is reaping its own rewards. London’s Datastore makes information available on everything from crime statistics to tube delays to, as their website states, “encourage the masses of technical talent that we have in London to transform rows of text and numbers into apps, websites or mobile products which people can actually find useful.” Many are taking up the challenge, and are delivering real social benefits. A professor at UCL, for example, has mapped how many people enter and exit Tube stations, and how this has changed over time. This information has now been used by Transport for London to improve the system.

Impacts and Implications

Looking ahead, I believe the best approach to future-proof access to big data is to ensure there is agreement around its use, not its collection.  Governments should define a core reference dataset, designed to strategically identify and combine the data that is most effective in driving social and economic gain. This will then become the backbone of public sector information, making it possible for other organisations to discover innovative applications for information that were never considered when it was collected.

This approach has the potential for huge societal benefit. The shorter-term economic advantages of open data clearly outweigh the potential costs. A recent Deloitte analysis quantifies the direct value of public sector information in Britain at around £1.8bn, with wider social and economic benefits taking that up to around £6.8bn. Even though these estimates are undoubtedly conservative, they are quite compelling.

And yet, at the same time individuals need to be protected. There are instances where, for very good reasons, ‘open’ cannot be applied in its widest context. I therefore suggest we acknowledge a spectrum of uses and degrees of openness.

For example, with health data, access even to pseudonymous case level data should be limited to approved, legitimate parties whose use can be tracked (and against whom penalties for misuse can be applied). Access should also be limited to secure sandbox technologies that give access to researchers in a controlled way, while respecting the privacy of individuals and the confidential nature of data. Under these conditions, we can create access that spans the whole health system, more quickly and to more practitioners, than is currently the case. The result: We gain the benefits of ‘open’ but without a significant increase of risk.
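The conditions above (pseudonymous records, approved parties, trackable use) can be sketched in a few lines. This is a toy illustration only, far short of a real secure research environment: the `Sandbox` class, the record fields and the party names are all hypothetical, and the keyed hash from Python's standard `hmac` module is one assumed way of producing pseudonyms, not a technique named in the text.

```python
import hashlib
import hmac


def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash. Without the key, the
    pseudonym cannot be recomputed from the original identifier."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


class Sandbox:
    """Hypothetical controlled environment: callers receive aggregate
    answers only, and every request is written to an audit log."""

    def __init__(self, records, secret_key, approved_parties):
        # Store pseudonymous records only; raw identifiers are discarded.
        self._records = [
            {"pid": pseudonymise(r["patient_id"], secret_key),
             "diagnosis": r["diagnosis"]}
            for r in records
        ]
        self._approved = set(approved_parties)
        self.audit_log = []  # (party, query name, whether it was allowed)

    def count_by_diagnosis(self, party):
        allowed = party in self._approved
        # Log the attempt whether or not it succeeds, so use can be tracked.
        self.audit_log.append((party, "count_by_diagnosis", allowed))
        if not allowed:
            raise PermissionError(f"{party} is not an approved party")
        counts = {}
        for record in self._records:
            counts[record["diagnosis"]] = counts.get(record["diagnosis"], 0) + 1
        return counts
```

The design choice worth noting is that refused requests are logged too: penalties for misuse only work if attempted misuse is visible.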

Nor should we consider ‘free’ (that is, at marginal cost) to be the only condition that maximises the value of public information. There may be some particular cases when greater benefits accrue to the public with an appropriate charge. Finally, as big data unquestionably increases the potential of government power to accrue unchecked, rules and regulations should be put in place to restrict data mining for national security purposes.

We will also have to look to how we focus resources within academia. The massive increase in the volume of data generated, its varied structure and the high rate at which it flows have led to the development of a new branch of science – data science. Many existing businesses will have to engage with big data to survive. But unless we improve our base of high-level skills, few will have the capacity to create new approaches and methodologies that are simply orders of magnitude better than what went before. We should invest in developing real-time, scalable machine learning algorithms for the analysis of large data sets, to provide users with the information to understand their behavior and make informed decisions.

We should of course strive for an increased shift in capital allocations by governments and companies to support the development of efficient energy supply and robust infrastructure. These investments can prepare us for serving continued growth in world productivity, and help offset the increasing risk of the massive, destructive disruptions in the system that will inevitably come with our growing dependency on data and data storage.

Innovation in storage capabilities should also be considered. Take legacy innovation, for example. The clever people at CERN use good old-fashioned magnetic tape to store their data, arguing that it has four advantages over hard disks for the long-term preservation of data: speed (extracting data from tape is about four times as fast as reading from a hard disk); reliability (when a tape snaps, it can be spliced back together; when a terabyte hard disk fails, all the data is lost); energy conservation (tapes don’t need power to preserve data held on them); and security (if the 50 petabytes of data in CERN’s data centre were stored on disk, a hacker could delete it all in minutes; to delete the same amount from the organisation’s tapes would take years).

The key thing to remember is that numbers, even lots of numbers, simply cannot speak for themselves.  In order to make proper sense of them we need people who understand them and their impact on the world we live in.  To do this we need to massively spread academia vertically and horizontally, engaging globally at all levels, from universities to government to places of work.  The current semi-fractured structure of academia is actually an advantage; it will help us ensure plurality of ideas and approaches. Remember, we’re not just playing with numbers; we’re dealing with fundamental human behaviors. We need philosophers and artists as well as mathematicians, and we must allow them to collectively develop the consensus.

If we get it right, over the next 10 years I would expect to see individuals being more comfortable with living in the metaphorical glass house, allowing their personal information to be widely accessible in return for the understanding that it will enable them to enjoy a richer, more ‘attuned’ life. I would also expect to see a maturing of our individual data usage, a coming of age with regard to appreciating and integrating data, and less of a fascination with its very existence. We will also perhaps see a new segment appearing: those who elect to reduce their data noise by avoiding needless posts of photos of their lunch and such.

We will also see a structural shift in employment, markets and economies as the focus in maturing economies continues to shift away from manufacturing and production and toward a new tier of data-enabled jobs and businesses. As we demand more from our data, we will need to match it with a skilled workforce that can better exploit the information available.

After all the noise, perhaps it would be wise to remember that big data, like all research, is not a crystal ball and statisticians are not fortune tellers. More information, and the increasing ability to analyse it, simply allows us to be less wrong. I believe that we will have continued growth in world productivity, probably accelerating over the next ten years, even as the risk of massive destructive disruptions in the system increases. There will be huge challenges and even dangers, but I am confident we will be the better for it. Every time humans have faced a bigger crisis, they have emerged stronger. Although we can’t be sure that this will always be the case, now is the time to be bold and ambitious.

Register your interest

To register your interest in receiving the final outputs from all five Future of Data workshops please email Antonia Stockwell.