Big data is the latest buzz phrase, and like most buzz phrases it is still widely misunderstood. In fact, I have met people who think it is just another name for Information Overload: much data, right?
Wrong, of course. There is much information in Big Data, but it is a very different animal from good ol’ info overload; and, as we shall see, it may even play a role in mitigating the latter.
What we mean by Big Data is not “Lots of data”
What we really mean is a class of computerized algorithms for analyzing very large data sets in such a way as to allow the creation of interesting and valuable insights. For example: analyze all the articles published in the New York Times over the past 150 years, add all of Wikipedia – that’s a really huge corpus – and identify interesting correlations.
Say it turns out that whenever a poor country suffered a drought, and a year or so later a large rainstorm hit the same country, a cholera epidemic broke out after the storm. If this chain of events keeps recurring, you can predict epidemics ahead of time! This is exactly the kind of prediction that Dr. Kira Radinsky, an Israeli computer scientist and entrepreneur, made for Cuba in 2011. Cuba hadn’t had any cholera for over a century, but when a drought was followed by a storm, sure enough, the prediction came true. There were many similar insights in 150 years of the NYT, and Kira went on to create a company based on such capabilities. Her company was recently bought for $40 million… small wonder, considering that it could tell the future!
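For the curious, here is a toy sketch of what "spotting a chain of events" can look like in code. It is not Dr. Radinsky's actual method, which the post does not describe; the event records, the `pattern_support` function, and the time windows are all invented for illustration. It simply counts how often a drought followed by a storm was in turn followed by cholera.

```python
from collections import defaultdict

# Hypothetical toy records: (country, year, event). Invented for illustration only.
EVENTS = [
    ("A", 1990, "drought"), ("A", 1991, "storm"), ("A", 1991, "cholera"),
    ("B", 2000, "drought"), ("B", 2001, "storm"), ("B", 2002, "cholera"),
    ("C", 1985, "storm"),
    ("D", 1970, "drought"), ("D", 1971, "storm"),
]


def pattern_support(events, max_gap=2, outbreak_window=1):
    """Count drought->storm sequences and how many were followed by cholera.

    max_gap and outbreak_window (in years) are assumed values, not taken
    from any real study.
    """
    # Group the years of each event type by country.
    by_country = defaultdict(lambda: defaultdict(set))
    for country, year, kind in events:
        by_country[country][kind].add(year)

    sequences = 0  # droughts followed by a storm within max_gap years
    followed = 0   # ...that were also followed by a cholera outbreak
    for kinds in by_country.values():
        for drought_year in kinds["drought"]:
            for storm_year in kinds["storm"]:
                if 0 < storm_year - drought_year <= max_gap:
                    sequences += 1
                    if any(0 <= c - storm_year <= outbreak_window
                           for c in kinds["cholera"]):
                        followed += 1
    return sequences, followed


if __name__ == "__main__":
    seq, hit = pattern_support(EVENTS)
    print(f"{hit} of {seq} drought-then-storm sequences were followed by cholera")
```

On a real corpus the hard part is extracting such events from millions of articles in the first place; the counting itself is the easy bit.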
Big data algorithms would be of no use without huge sets of data to analyze, and these days, when everything is connected and networked, such sets are readily available. Researchers and companies can get their hands on sets of hundreds of millions, or even billions, of tweets, news stories, and records of the countless sensor readings collected by the growing “Internet of Things”. Want to design a public transit system for your city? Grab the data from every taxi and bus ride over the past decade, throw in traffic density measurements from traffic cameras and sensors, and figure out where citizens need to go and at what hours. Want to research how language evolves over time? Feed in millions of books and documents, and have the computer figure out how words morph and change in form and frequency, and how this is affected by geography, demography, and so on. Want to understand how political sentiment runs across a country? Take a billion tweets, analyze each for sentiment, plot them as colored dots on a map… and the patterns spring out at you. And so on… the possibilities are endless, and span numerous fields of study and practice.
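To make the tweet example a little more concrete, here is a deliberately naive sketch of the "analyze each for sentiment, then aggregate by place" step. Everything in it is an assumption made for illustration: the tiny word lists, the `score` and `sentiment_by_region` functions, and the sample posts. Real systems use trained language models rather than word counting, and the plotting step is left out.

```python
from collections import defaultdict

# Hypothetical mini-lexicons; a real analysis would use a trained model.
POSITIVE = {"great", "love", "hope"}
NEGATIVE = {"bad", "hate", "fear"}

# Invented sample data: (region, text).
POSTS = [
    ("north", "I love this great plan"),
    ("north", "bad idea"),
    ("south", "I hate it and fear the worst"),
]


def score(text):
    """Positive words minus negative words in a piece of text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_region(posts):
    """Average the per-post scores for each region."""
    scores = defaultdict(list)
    for region, text in posts:
        scores[region].append(score(text))
    return {region: sum(vals) / len(vals) for region, vals in scores.items()}


if __name__ == "__main__":
    for region, avg in sentiment_by_region(POSTS).items():
        print(region, avg)
```

Each regional average could then be mapped to a color and drawn on a map, which is where the patterns start to show.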
Of course, being endless, these possibilities can get scary – they are not all as beneficial as predicting epidemics. The ability to distill insights from data that would be impossible to manage manually can have many repercussions. For example, consider the statement “Today in the United States we have somewhere close to four or five thousand data points on every individual … So we model the personality of every adult across the US, some 230 million people”. This was not said by the KGB, nor the CIA, mind you; it was said during the recent US election campaign by the CEO of Cambridge Analytica, a private company that assisted the Republican party in voter research. Five thousand data points should allow one to know the person better than their own mother would; such insights allow micro-targeting the campaign and manipulating public opinion at the individual voter level. Research shows us that statistical analysis of a sufficiently big data set can achieve far better prediction than a human expert would. From here it is a short hop to predicting with good accuracy the tendencies of individuals to act in certain ways – for example, to commit crimes. Yes, as in Minority Report.
So where does this tie into Information Overload?
Not via the data load: Big Data allows the computer to handle the load, and computers are pretty good at it (and don’t complain either). The connection is indirect, and goes through Artificial Intelligence. AI is increasingly playing a role in solutions to Information Overload – the kind we knowledge workers must deal with. It is the key to Knowmail’s power as an Inbox assistant. But in order to be effective, AI systems must be trained, both initially and over time, by feeding them examples of correct decisions – whether in user preferences, in Natural Language recognition, or in other classification domains. The availability of big data sets enables these systems to train to much better levels of performance. Siri, Cortana, or Google Now wouldn’t be able to converse with you so smoothly without access to millions of earlier conversations by other users. AI is enabled by Big Data, and at the same time is generating ever more data. Scary? Exhilarating? You decide (or ask Cortana for her opinion)!