Science fiction fans have long pondered technologies to predict crimes and random events. But some say creative uses of “big data,” the vast amount of digital information we create in our everyday lives, may make such a predictive tool a reality. The potential of the approach, however, is raising sensitive legal and ethical questions. Our guests explore the possibilities — and dangers — of a future in which everything we do is quantified.
- Viktor Mayer-Schönberger Professor of Internet Governance and Regulation, Oxford University; co-author "Big Data: A Revolution That Will Transform How We Live, Work, and Think" (Houghton Mifflin Harcourt)
- Kenneth Cukier Data Editor, The Economist; co-author "Big Data: A Revolution That Will Transform How We Live, Work, and Think" (Houghton Mifflin Harcourt)
MR. KOJO NNAMDIFrom WAMU 88.5 at American University in Washington welcome to "The Kojo Nnamdi Show," connecting your neighborhood with the world. You know, Google processes daily more information than is printed in the entire Library of Congress and Wal-Mart handles more than a million transactions every hour. So we know it's no coincidence when an ad for a hotel in Miami pops up after a search for flights to Miami International Airport. But what about that coupon for a baby stroller you get in the mail after you buy unscented body lotion and prenatal vitamins?
MR. KOJO NNAMDIThe information being collected is already more detailed and more personal than we'd like to think. It may be disturbing to some that a search engine and a department store may know more about you than some people in your own family, but that's just the beginning. As computers grow ever more powerful and the amount of information we generate expands exponentially, we'll be able to do things never before imagined, much of it extremely useful, like detecting an engine part that's about to fail or an imminent heart attack, or who might be likely to commit a crime. That's the promise of big data.
MR. KOJO NNAMDIAlong with that potential, however, is the possibility that the information may be misused. Joining us in studio to discuss all of this is Kenneth Cukier. He is co-author of the book "Big Data: A Revolution That Will Transform How We Live, Work and Think." He is the data editor for the Economist. Kenneth Cukier, thank you for joining us.
MR. KENNETH CUKIERThank you. Happy to be here.
NNAMDIAlso with us in studio is Viktor Mayer-Schonberger. He is the co-author of the book. He's a professor of internet governance and regulation at Oxford University and the author of "Delete: The Virtue of Forgetting in the Digital Age." Viktor Mayer-Schonberger, thank you for joining us.
MR. VIKTOR MAYER-SCHONBERGERThank you.
NNAMDIYou, too, can join this conversation. Give us a call at 800-433-8850. Do you think using data to predict things about our health or our behavior is an exciting or a dangerous new frontier? Call us at 800-433-8850. You can send email to email@example.com or send us a Tweet at @kojoshow. Viktor, let's start with the basics. There's no rigorous definition but can you explain the concept of big data?
MAYER-SCHONBERGERYes. Big data currently is seen as something that just denotes bigness, denotes size. But in the book we argue that there are three characteristics that make big data so special and so different from the small data age in which we have lived. We label them more, messy, and correlations. More means that we now have more data available, not just in absolute terms but more data available relative to the problem that we want to study than ever before. And that gives us a lot of insights into the details, into the granular things that we didn't have before.
MAYER-SCHONBERGERMessiness means that as we have more data, we can also accept some inexactitude in how we collect the data and how perfectly curated the data is because we just have so much of it. And these two qualities together fuel what we think is the biggest of the three shifts: a move away from causality, an elusive quest to find causality, towards something much more pragmatic and much simpler called correlations. It means that we are not looking for the why. We are looking for the what and that's good enough.
NNAMDIHow much information do we actually have today? Google, for example, processes more than 24 petabytes of data every day. Can you give us a sense of what that means and how that number is growing, Kenneth?
CUKIERYes. So in human terms what that means is there's about 3 billion searches per day on Google and that's a vast amount of internet searches. And you can do interesting new things with it. But another way of looking at it is this way. A few millennia ago the largest stock of human knowledge was the library of Alexandria. And today we've got about between 200 and 300 libraries of Alexandria for every single person on earth. So that's a huge explosion of information.
CUKIERAnd another way of looking at it is that there -- the stock of information in the world today doubles every three or four years. This is digital and includes analog information. But analog is now so small relative to the digital amount. So it's a complete tsunami. We're being swamped by information. And this, in some ways, is actually a difficulty because we have -- it's harder for us to understand things.
NNAMDIYeah, because at the same time as technology is advancing, information is compounding. Can you talk about what that means in terms of the amount of data we have and what we can do with it?
CUKIERWell, the thing to focus on is what we can do with it because what we've learned in the past, in a small data world, with the techniques we had to collect information and to process it, is that we got swamped when we had a lot of information. The situation's now changed. We have lots of information today but the good news is that we have new techniques with which we can learn new things from that information.
CUKIERSo take the example of a self-driving car, right. For years we have tried to program a car to drive by itself by giving it the explicit instructions on here is what it looks like when the road ends and you need to make a left. Here is what a red light looks like and you need to brake and then you need to go. Fine. It's a very complex problem.
CUKIERBut a far better way of tackling and cracking that nut would be to simply give lots of data to the car. Because when you give a lot of data to the car we let the car infer the rules of the road. The car can learn with a predictive capability that this is a red light, not a green light. That it can learn that this is probably a bunch of leaves on the street and not a baby carriage. That the bicyclist alongside me is probably not going to swerve. Oh no, it's swerving towards me. I'm going to go out of its way or brake.
CUKIERAnd with lots of data you can do something that you couldn't do with just a little bit of data. You can make these predictions a lot more accurate. And as a result you have a new service like a self-driving car.
NNAMDIViktor, you point out that size matters. You use the example of nanotechnology, the converse of big data. What becomes possible when you're dealing with either something very small or with a vast amount of information?
MAYER-SCHONBERGERLet me give you an example from human health, our human DNA. There is now a private company out there in California, a startup called 23andMe. And for a couple hundred dollars they will take your DNA sample and analyze, look at about a million markers to tell you whether you're susceptible to some disease, some illness. And that's very helpful. But the problem is that because they're only looking at a million markers -- and we have about 3 billion base pairs in our DNA, so they're only looking at a very small subset of the information in our DNA -- they can only tell us about diseases where we already know where the markers are.
MAYER-SCHONBERGERIf research comes up with a new marker for a disease like Alzheimer's, then 23andMe needs to re-sequence the entire DNA. We need to do another saliva sample and so forth. Compare that to Steve Jobs who, when he found out that he had cancer, had his entire DNA sequenced, not just markers but his entire DNA sequence. That gave his doctors the ability to know exactly the genetic composition of his tumor and to customize the medication that he received so that whenever one medication would lose its potency, they could switch to a different medication.
MAYER-SCHONBERGERThat didn't save him but it gave him years of extra life. And that was possible because they had all the data rather than just parts of it.
NNAMDIIn case you're just joining us, we're having a conversation about the uses, implications and future of big data with Viktor Mayer-Schonberger. He is co-author of the book "Big Data: A Revolution That Will Transform How We Live, Work and Think." He's a professor of internet governance and regulation at Oxford University. He's also the author of "Delete: The Virtue of Forgetting in the Digital Age."
NNAMDIAlso joining us in studio is Kenneth Cukier. He is co-author of the book "Big Data: A Revolution That Will Transform How We Live, Work and Think." He's the data editor for the Economist. We're inviting your calls at 800-433-8850. What do you think it might be possible to do with all the digital information that we are collecting? 800-433-8850. You can send email to firstname.lastname@example.org. Ken, it's an emerging science, but these concepts are already in use. We already rely on big data for apparently countless everyday things. Can you talk about that?
CUKIERSure. So one great example is spam filters. That's a really difficult problem as well because the spam changes. So what you want to do is be reactive to the way that the spam might change. So, obviously, if you block the word Viagra you might ensnare legitimate mail, but you would also ensnare the spam, and that would be a useful thing. But what if you substitute the number 6 for the letter g? Our human eye would still read it as Viagra, even though there's a typo, however the spam filter would not because it's been programmed to look for Viagra spelled correctly.
CUKIERSo what we do is we build a system that actually can learn and adapt over time by seeing how things change. And here you would want to look for signals. And a signal could be, first, that it's a variant. So there's a statistical probability that it looks like a word that's banned. Secondly, you can imagine that if you sample a few email accounts and watch what's going on in terms of seeing the early signals of people moving it into their spam folder, then you will know that this is probably spam and you can start deleting it sooner as well. And you can delete it before people have to actively delete the spam.
CUKIERIf you're the internet service provider, or the host of the email such as Gmail or Hotmail, you can shimmy it into the spam folder where the person then has to look at it with their human eyes to determine if it's spam or not.
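The two signals described here, a word that statistically resembles a banned word and early reports from a sample of users, can be sketched in a few lines of Python. This is only an illustration, not how any real mail provider works: the block list, the 0.8 similarity cutoff, and the function names are all assumptions made for the example.

```python
from difflib import SequenceMatcher

BANNED_WORDS = {"viagra"}       # hypothetical block list
SIMILARITY_THRESHOLD = 0.8      # assumed cutoff, chosen for illustration


def looks_like_banned(word):
    """Signal 1: the word is a close variant of a banned word."""
    word = word.lower()
    return any(
        SequenceMatcher(None, word, banned).ratio() >= SIMILARITY_THRESHOLD
        for banned in BANNED_WORDS
    )


def variant_score(message):
    """Fraction of words in the message that resemble a banned word."""
    words = [w.strip(".,!?") for w in message.split()]
    if not words:
        return 0.0
    return sum(1 for w in words if looks_like_banned(w)) / len(words)


def early_report_rate(sampled_accounts, accounts_that_flagged):
    """Signal 2: share of sampled accounts whose owners marked the mail as spam."""
    if sampled_accounts == 0:
        return 0.0
    return accounts_that_flagged / sampled_accounts


print(looks_like_banned("Via6ra"))        # the misspelling is still caught
print(variant_score("buy via6ra now"))
print(early_report_rate(200, 150))
```

A real filter would learn its thresholds from data rather than hard-coding them; the point is only that a fuzzy match survives the character swap that defeats an exact-match rule.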
NNAMDIViktor, we mentioned earlier some disturbing examples of how our online searches can be used to target us with advertising. But there are also examples that show, if you will, the positive side of all this information. How has, for instance, Google participated in tracking flu outbreaks?
MAYER-SCHONBERGERWell, you might remember, Kojo, in 2009 we had the H1N1 flu crisis, a new flu virus being discovered. And the fear at the Centers for Disease Control in Atlanta and throughout the world was that this might be a very deadly virus. And we didn't have a vaccine for that. And so the Centers for Disease Control asked general practitioners out there to report cases to them so that they could map the spread of the flu. If you have no vaccine, you need at least to know where the flu is to take some measures against it.
MAYER-SCHONBERGERBut that took a couple of days, perhaps two weeks, to relay back to the CDC. And that's not helpful. If you have a pandemic that's deadly -- potentially deadly -- and you know only where the flu virus was two weeks ago, you have no visibility into the present. And Google thought at that time that they could do better. And what they did was to use mathematical models and big data analysis and compare search queries, search terms that have been sent to Google, with historical data of the flu and how it spread.
MAYER-SCHONBERGERAnd they analyzed, believe it, 50 million search terms. And out of the 50 million search terms they pulled 45 search terms that put together had the best prediction. They checked out 500 million models to find the right one. They found the right one and from that moment on they could predict the spread of the flu in the United States down to the region by just looking at what people search for.
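The core idea, scoring search terms by how well their volume tracks historical flu data and keeping the best ones, can be illustrated with a toy Python sketch. The term names and weekly counts below are invented for the example, and the sketch only ranks individual terms by Pearson correlation; it does not reproduce the combined models described above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_terms(term_counts, flu_cases, k):
    """Return the k search terms whose weekly volume best tracks flu cases."""
    ranked = sorted(
        term_counts,
        key=lambda term: pearson(term_counts[term], flu_cases),
        reverse=True,
    )
    return ranked[:k]


# Made-up weekly figures, for illustration only.
flu_cases = [10, 20, 40, 80, 40, 20]
term_counts = {
    "cough medicine": [12, 22, 41, 79, 38, 21],
    "fever remedy": [5, 11, 19, 42, 22, 9],
    "football scores": [50, 48, 52, 49, 51, 50],
}

best = top_terms(term_counts, flu_cases, k=2)
print(best)
```

Note that nothing in the ranking asks why a term tracks the flu, only whether it does, which is the correlation-over-causality point made earlier in the conversation.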
NNAMDIKen, you talk about Farecast to illustrate a shift in how we think about data as no longer static. Talk about that.
CUKIERWell, that's right. The way that you can think of data is the myriad of reuses to which it can be put. So in the case of Farecast, it's very simple. There was a business model. A person was on an airplane and he asked passengers around him how much they paid. And it turns out they all paid less for the air seat than he did, even though he booked his air ticket in advance. And we all sort of think we know that the way you should do it is book farther in advance before the flight and therefore you save money. But in this case it didn't, so he was very upset.
CUKIERHe realized that this was a data problem so he collected the data. And what the data was, was the flight price records of other flights on the same route going to the same destination. And he looked at how long in advance people had booked the ticket and what price they paid. Took a small sample, it was very good. He could save people money. So then what he did is he took almost every single flight in American civil aviation for an entire year and every seat on that flight and every price that they paid and ran it in the analysis. This is something that he just simply couldn't have done ten years ago because he didn't have the technology, but also because he didn't have the data.
CUKIERSo he took it and he looked at all of it and now he could save people lots of money. And what we found from that is that the data had become a new raw material. It was a new vital input to business. If you will, it was a new resource. We tend to think of land, labor and capital as these resources. Here the data was put to a secondary use. And it's certainly not the use that the airlines wanted when they digitized passenger records to find out who was on the plane or not. So it's a very clever way with which you can extract money. He ended up selling his company and he sold it for $100 million.
NNAMDIInterestingly enough he wasn't interested in how the airline prices these seats. He was simply interested in looking at the pattern of seat pricing.
CUKIERWell, that's precisely right.
NNAMDIAmazing. Here is David in Washington, D.C. David, you're on the air. Go ahead, please.
DAVIDHi. I wondered if your guests might discuss situations where, in perhaps a real time circumstance, excessive data could actually be exactly that, excessive. I'm thinking in particular of circumstances where if, for instance, you had a flight control on an aircraft that had been calibrated to read to the tenth digit of precision. And by taking it that far, you had actually gone beyond the realm of what was useful to you. And then if an error occurred at that tenth digit of precision, it could throw things out of whack. Whereas if the level of precision were taken perhaps only to five places, such a disturbance would never show up and the control would function normally.
NNAMDIHere's Viktor Mayer-Schonberger. Viktor, go ahead, please.
MAYER-SCHONBERGERWell, thank you very much for that very important question. It seems to me that we have two issues here. One is the obvious question of precision and exactitude. And the beauty of big data is that we don't need to be as precise and as exact anymore. We don't have to go down ten digits behind the comma because we have more data points. And so two digits behind the comma is just good enough.
MAYER-SCHONBERGERNot for all problems but for many problems it's useful to know the direction things are developing and to give us visibility into the future rather than to have everything calculated down to the last penny, the last inch and the last atom. But there is another element to his question that's very important and that is that of excessive data. We, for millennia as human beings, have lived in a world of data poverty. We had very little data available and so we needed to make decisions based on little data and lots of theory.
MAYER-SCHONBERGERThat's helpful if we have little data and not more, but in the big data age we have massive amounts of data. We can make better decisions. The thing that we need to do is to change the way we make decisions, the processes, the institutions because they all stem from a small data, a data-starved world.
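One statistical reading of the claim that more data points can stand in for precision is that the average of many coarse, noisy readings typically lands closer to the true value than the average of a few. The Python sketch below simulates that under stated assumptions: the "true" value, the uniform noise of plus or minus 1, and the rounding to two decimals are all invented for the illustration.

```python
import random


def average_of_coarse_readings(true_value, n_readings, noise, decimals, seed=0):
    """Average n noisy readings, each rounded to a fixed number of decimals."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    readings = [
        round(true_value + rng.uniform(-noise, noise), decimals)
        for _ in range(n_readings)
    ]
    return sum(readings) / len(readings)


TRUE_VALUE = 20.123456789  # hypothetical quantity being measured

few = average_of_coarse_readings(TRUE_VALUE, 5, noise=1.0, decimals=2)
many = average_of_coarse_readings(TRUE_VALUE, 100_000, noise=1.0, decimals=2)

print("error with 5 readings:      ", abs(few - TRUE_VALUE))
print("error with 100,000 readings:", abs(many - TRUE_VALUE))
```

This only holds when the errors are random rather than systematic; a sensor that is consistently off in one direction is not rescued by volume, which is why the answer above is limited to "many problems" and not all of them.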
NNAMDIDavid, thank you very much for your call. Before we go to break, Ken, getting back to the story you told earlier about Farecast, analyzing vast amounts of data, as he did in this case, can often tell us the what, what's going on, but not the why. But as in the case of Farecast, knowing the what might be enough in many cases. Can you explain?
CUKIERYeah well, that's right. It depends on the outcome that you want to get from it. So it's always nice to know why things happen. Causality is a good thing. The problem is that sometimes when we think we have it we don't. And other times just trying to get it is very costly and cumbersome. So we find that now that we can collect a lot of data, there's a myriad of things that we can apply our minds to where just knowing what, just getting the pragmatic result is simply all we need.
CUKIERSo in the case of the airfare situation, if you want to save money on an air ticket it just doesn't really matter whether you're saving money because there's a Saturday night stay involved or whether because you're flying from one route to another that's more trafficked or it's a hub, just knowing they're saving money is all you want to do. And so that's good enough. It's what, not why that matters.
NNAMDIIf, in fact however, we can use big data to try to predict things like who is likely to commit a crime, that may tell us who is likely to commit a crime. It won't tell us why. Are we at that point therefore stepping into what might be fairly treacherous waters?
CUKIERAbsolutely. I mean, it sounds a little bit like "Minority Report," the idea of pre-crime. And, in fact, it is. But this is an issue that society's going to have to face because what if I could tell you, Kojo, that one of your listeners -- and I know which one -- is going to commit murder with a 98 percent probability? Well, would I be remiss in not taking that person off the street for the purpose of public safety, or would I be denying that person his free will because he might use that 2 percent margin, where he's not predicted to do it, to exercise moral choice and to put down the knife?
CUKIERWe -- the criminal justice system doesn't have any experience at that but we, as the stewards of big data have to come to some decisions.
NNAMDIWe have found the joker but don't know exactly why he does what he does. We're going to take a short break. When we come back we will continue our conversation on big data. You can join it by calling 800-433-8850. With all the information about you out there, do you still worry about your privacy or have you given up, thrown your hands up in the air, 800-433-8850? I'm Kojo Nnamdi.
NNAMDIWelcome back. We're talking big data with Kenneth Cukier. He is co-author of the book "Big Data: A Revolution That Will Transform How We Live, Work and Think." He's a data editor for the Economist. And Viktor Mayer-Schonberger. He is co-author of the book. He's a professor of internet governance and regulation at Oxford University and the author of "Delete: The Virtue of Forgetting in the Digital Age."
NNAMDIWhen we talk about big data we can say that human history has seen other major leaps forward in terms of information. But you see this as a coming revolution on par with the invention of the printing press or the internet. Can you explain, Viktor?
MAYER-SCHONBERGERYes, absolutely. The printing press changed dramatically who had -- who was able to communicate with others on a mass basis. The internet gave us that ability, not just on a one-to-many but on a many-to-many base. Many hundreds of millions of people now take part in the internet and share their views and their sentiments. But what big data does is it takes what the printing press does, namely spread the power of information. And it spreads the power of data.
MAYER-SCHONBERGERData now has a lot of hidden value that we now have the tools to uncover and to share for our benefit and for society's benefit.
NNAMDIWhat's the potential for all this information, Ken?
CUKIERWhat's the potential? There's a lot to play for. You can imagine that it's going to actually change the way that businesses run. They'll find their most precious asset might not be what they're actually building but the data that goes into it because they can learn from it and cross apply it to other things. It's going to change how we educate our young people, both in terms of what they should learn, thanks to statistics and computer science, hard topics, not the humanities but also the way that we learn.
CUKIERFor example, teachers tend to rank students with a score of 75, 85, 90, et cetera and they treat it as a blunt instrument. What they don't do is look at it from a data-driven approach. For example, imagine an algebra teacher would find out that 60 percent of her students got the same question wrong with the exact same answer. She would therefore learn that in fact maybe she taught the algebra wrong, that maybe she wasn't clear enough. That they thought that they could invert a sequence, A B or B A, but actually the sequence in which they did the process mattered.
NNAMDIOr they all copied from the same student.
CUKIERWell, there's that too. And the good news about big data is that we're learning to detect cheating. Because people don't get answers wrong in a pattern. They get answers wrong in a randomized way. When you start seeing a pattern in terms of erasures and fixes to questions, or where the people get certain questions wrong and correct, whether they're easy or they're difficult -- this is classic in the SAT. They do these sorts of things today -- you can actually identify cheating.
CUKIERIt leaves fingerprints. These are fingerprints that are invisible to the naked eye but that a computer can detect with big data analysis.
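The classroom example, many students giving the same wrong answer to the same question, is easy to express as a count. The Python sketch below uses invented answer sheets and the 60 percent figure from the conversation as its threshold; a shared wrong answer flagged this way could mean unclear teaching or copying, and the code makes no attempt to tell the two apart.

```python
from collections import Counter


def shared_wrong_answers(responses, answer_key, threshold=0.6):
    """Flag questions where one wrong answer was given by at least
    `threshold` of all students. Returns {question: (answer, share)}."""
    flagged = {}
    n_students = len(responses)
    for question, correct in answer_key.items():
        wrong = Counter(
            sheet[question] for sheet in responses if sheet[question] != correct
        )
        if not wrong:
            continue  # everyone got this question right
        answer, count = wrong.most_common(1)[0]
        share = count / n_students
        if share >= threshold:
            flagged[question] = (answer, share)
    return flagged


# Hypothetical answer key and five made-up answer sheets.
answer_key = {"q1": "A", "q2": "C"}
responses = [
    {"q1": "A", "q2": "D"},
    {"q1": "A", "q2": "D"},
    {"q1": "A", "q2": "D"},
    {"q1": "B", "q2": "C"},
    {"q1": "A", "q2": "D"},
]

flagged = shared_wrong_answers(responses, answer_key)
print(flagged)
```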
NNAMDIYou coined the term datafication in the book and you differentiated it from digitization. Can you explain the difference, Viktor or Ken?
MAYER-SCHONBERGERThere's a way to look at it in terms of everyday usage. So imagine I took a book and I scanned it. I would have an image of it and that would be a digital image. And I get some of the benefits of digitization from that. I can store it cheaply, I can send it quickly, et cetera. And that's good but that's still just a scanned image. I can't actually do anything with the text. It's like a photograph but it's now digital. It's with bits and it's not analog.
MAYER-SCHONBERGERBut what if I was to actually extract the letters from each one through optical character recognition? This is how Google's book scan project worked. And now I can treat the actual words as data. What can I do with it? Well, I might be able to learn different trends in terms of writing over time. I'd learn that sentence structure changes and that might be interesting. But you could -- there's more to play for.
MAYER-SCHONBERGERSo one thing that some researchers have done at Harvard is they looked at German-language writings in the 20th century. And they found the incidences of certain artists and intellectuals in the 1930s were very prominent. And then they suddenly started to disappear from the language. And they came back up in the 1940s. And what they had found, if you will, were the fingerprints of censorship.
MAYER-SCHONBERGERThe artist in this instance was Marc Chagall. He was a Jewish artist and you could actually see the trail that was left behind of censorship in the text. But this is something that you wouldn't otherwise see. You needed big data analysis to treat words as data and learn from it this way.
NNAMDIOn to the telephones. We will start with Neal in Tyson's Corner, Va. Neal, you're on the air. Go ahead, please. Hi Neal, are you there?
NEALYeah, it's a very interesting subject today. I have experience with this. And when it comes to big data it's interesting to hear it on the radio. I wanted to comment that I think quite a bit of the privacy concerns that are out there might have some basis, but may be overblown. When it comes to the process of big data there's really nothing new. This has been something that we've been able to do since the '70s, of breaking up data. And processing it in the way we do is not a new thing.
NEALBut the idea that we could possibly invade someone's privacy, while it is possible I think there are still a lot of steps that are in place that prevent that. So my comment is that in practice I think it's something that a lot of people aren't aware of, but we don't actually have as much data, particularly personally identifiable information that would connect back to a person. In other words, we would never know if we were dealing with Kojo Nnamdi. He would be user one, two, three, four, five.
NEALSo the level of data and the level of privacy that we're seeing on the internet for instance, might be somewhat overblown. And I would actually push forward that individual phishing and scams are more dangerous than what you might be concerned of is happening in a company like Google.
NNAMDIWell, allow me to get a response because, Viktor, you say the predictive possibilities can mean the end of privacy as we know it. Can you explain as you respond to Neal?
MAYER-SCHONBERGERYes, of course. There was a very dramatic case a few years ago. Internet company AOL released hundreds of thousands of search terms to the research community. But these search terms were carefully de-identified. They took out the name, they took out the IP address and so forth of the people who sent the search terms to AOL so that the researchers got a de-identified, anonymous set of search terms. But within two days reporters for the New York Times were able to re-identify a person, namely a 62-year-old widow in rural Georgia, just based on the search terms that she sent to AOL and that she searched on the internet for.
MAYER-SCHONBERGERWhat this tells us is that as we have so much data and so many different data sources available, by combining them, we can easily re-identify even the most anonymized data that you can think of and thereby expose people and violate their privacy. So in that sense, big data -- the cases show -- and there's another one for Netflix where Netflix opened up its anonymized data and researchers were able to out a closeted lesbian in the Midwest. And she then sued Netflix for millions of dollars as a result.
MAYER-SCHONBERGERAll this tells us that there is a real privacy problem, a real privacy challenge out there that big data is causing, or is responsible for. And that we need to react to that.
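The general mechanism of re-identification by combining data sources can be sketched as a join on attributes that both sources share. The Python example below is a generic illustration with entirely made-up records and field names (`zip`, `birth_year`, `user_id`); it is not a reconstruction of how the AOL or Netflix data were actually re-identified.

```python
def reidentify(anonymized, public, keys):
    """Link anonymized records to named public records that share the
    same values for `keys`. Only unambiguous (single-candidate) matches
    are returned, as {user_id: name}."""
    index = {}
    for person in public:
        signature = tuple(person[k] for k in keys)
        index.setdefault(signature, []).append(person["name"])

    matches = {}
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:
            matches[record["user_id"]] = candidates[0]
    return matches


# Hypothetical "anonymized" release: names removed, other attributes kept.
anonymized = [
    {"user_id": 1001, "zip": "30001", "birth_year": 1950, "searches": ["garden tips"]},
    {"user_id": 1002, "zip": "20001", "birth_year": 1980, "searches": ["flight deals"]},
]

# Hypothetical public directory with names attached.
public = [
    {"name": "Person A", "zip": "30001", "birth_year": 1950},
    {"name": "Person B", "zip": "20001", "birth_year": 1980},
    {"name": "Person C", "zip": "20001", "birth_year": 1980},
]

matches = reidentify(anonymized, public, keys=["zip", "birth_year"])
print(matches)
```

User 1002 stays anonymous here only because two people share the same attributes. Every additional attribute in the release shrinks such crowds, which is why stripping names and IP addresses alone offers limited protection.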
CUKIERYeah, Neal had a second point as well and that was whether this is something that's new or not. And his point was that it wasn't. And I don't think that's right. I actually think there is something new that's happening here and there's two elements to it. The first one is that it is true that some of these techniques are very old techniques. But what's new about it is that in the past it might've taken -- in the case of the 1880 census it took about eight years to complete. You might consider that a big data project but it took eight years, okay.
CUKIERWhat's happening now is we can process that same amount of information really in milliseconds at almost zero -- at effectively zero cost. So that incredible change of scale has led to a change of state. That's one of the dimensions of why what we're handling is relatively new for us, because we can do it so much more and we can apply it to so many more things. But there's a second thing as well. And that is that in the past when we had a limited amount of data, things -- some of these techniques didn't work very well. But now that we have more data they do.
CUKIERSo an example would be the self-driving car or, if you will, a machine translation, when you're on the internet and trying to translate something. When we had a limited amount of data we applied these same techniques. And the results were kind of mediocre. But we found that when you increase the scale of data by orders of magnitude suddenly these techniques really blossomed. And now they're working very well.
NNAMDIYet we are only at the dawn of this information revolution. Computer processing power is growing by leaps and bounds but nevertheless scientists do bump up against limitation in many areas, do they not?
MAYER-SCHONBERGERThey always will, ever thus.
NNAMDINeal, thank you very much for your call. We move on to Diane in Laurel, Md. Diane, you're on the air. Go ahead, please.
DIANEHi. I'd like to go back a little bit to the beginning of the show when he was talking about the cars and...
NNAMDIYes, the self driving cars, yes.
DIANEMy question to you is this, if you could possibly answer it. I know in the movies they do a lot of computerized tricks and stuff, but some of the stuff that you see with the cars that can drive themselves and do that stuff, I think possibly there may be companies that do actually have kind of self-driving robotic cars. And I was wondering where do you see that in the future where you have a lot of companies that -- you know, private companies that will be selling those types of cars. And the government will be controlling it on the highways. Is that where you see it going?
NNAMDII got to tell you, we've been discussing this on the show and we've been told 10 to 15 years without a doubt, Ken.
CUKIERYeah, okay. So technology evolves but not in ways that we expect. So we tend to think of self-driving cars but thinking of cars as the same thing that we're using today. You could imagine it being different. An example would be, I would think that we're going to see self-driving vehicles in rural Australia in the outback going to mining sites where it takes 20 hours to drive a truck to, picking up the raw zinc from the soil and then driving it another 20 miles to the -- 20 hours' worth of travel time to a processing plant.
CUKIERRight now it takes a human being to do that. There's absolutely nothing in the desert there in rural Australia where you have these big mining sites. And that sort of vehicle is where we will first see this application. So when we think of self-driving cars, on one hand we're going to be seeing it in different places a lot sooner than the five and ten years horizon. It's going to happen much faster for these applications.
CUKIERHowever, we're already going to have quasi partially self-driving cars today. Luxury models of certain vehicles have something called driver assist, and what that means is that when you're in a very crowded space in an urban setting, the car sort of takes control for you and swerves in and out at a very low speed, but in a very controlled motion in tight areas, that's hard for a human being to drive, but good for a car.
CUKIERYou still hold onto the steering wheel, in fact, you might even be careful, but you have to let your fingertips touch it lightly, because it's going to be moving for you, helping you along.
NNAMDIWe got an email from Jim in Silver Spring. "I read in the Post yesterday that by 2014, new cars will be required to have devices that capture all sorts of info about their drivers, way beyond just mileage and speed. Why, and who gets to share that info?" Viktor?
MAYER-SCHONBERGERWell, the why answer is easy. This is extremely valuable information. It's extremely valuable information because we can mine it, we can extract new insights from it. We can learn when cars break down, what kinds of models of cars break down more frequently, what kind of engine parts need to be changed, when -- what kind of cars are more accident prone, what kind of streets are more crowded. All these kind of things can be mined and taken out of telemetry information, and information -- sensor information that is being collected by the car.
MAYER-SCHONBERGERIn fact, modern cars today have almost three dozen computers that are networked inside. So there's lots of sensors, there's lots of data that they collect, and that can then be exposed. The real question is who gets to benefit from it? Who gets to have the money in the pocket, and the question is, is it going to be the consumer, is it going to be the individuals, is it going to be really well positioned intermediaries who are just reaping all the benefits without much of the cost, or is it going to be some regulation that comes in and redistributes some of the benefits?
NNAMDIGot to take a short break. When we come back, we'll continue this big data conversation with Kenneth Cukier and Viktor Mayer-Schonberger. They are co-authors of the book "Big Data: A Revolution That Will Transform How We Live, Work, and Think." If you have called already, stay on the line. We'll get to your calls. We still have a few lines open at 800-433-8850, or you might want to send us an email to email@example.com. I'm Kojo Nnamdi.
NNAMDIWelcome back. We're talking with Viktor Mayer-Schonberger. He is co-author of the book "Big Data: A Revolution That Will Transform How We Live, Work, and Think." He's a professor of internet governance and regulation at Oxford University. He's also the author of "Delete: The Virtue of Forgetting in the Digital Age." Kenneth Cukier is his co-author in the book "Big Data," and he is the data editor for the Economist. You can call us 800-433-8850. Allow me to go to Steven in Washington D.C. Steven, you're on the air. Go ahead, please.
STEVENThank you, Kojo, for taking my call.
STEVENI feel like this -- we're in the midst of a new gold rush, and what's happening, instead of mining gold, we're mining information. The Chinese are trying to mine information from us, we're trying to mine information from Iran, the Russians and back and forth. The more data we have, the more the technology is moving ahead with hackers to get information. We seem to have more people trying to be under the radar, not being on Facebook, trying to not let companies know the data that they have. For example, insurance companies, the redlining, if your --
NNAMDIIf you have a precondition or a predisposition to some illness.
STEVENYes. Exactly. So you would have that and perhaps -- and here, you know, we're going back to like the Minority Report and different things, books that people have written about that people being who they are, and power brokers being who they are, data -- and the more it seems that people know about our thoughts and what information we have, what your credit rating is, insurance companies charging you more for auto insurance because your credit rating might be...
NNAMDIWhat's your concern, Steven? How this information will be used?
STEVENYeah. What -- the concern I have is -- is with all this information, and the ability for this information to be hacked and then used against...
NNAMDIWell, allow me...
STEVEN... (unintelligible) against our country.
NNAMDIAllow me to have Viktor respond. Viktor, you say that we need new tools and new thinking to tap into all that data. Can you explain whether that has anything to do with Steven's question?
MAYER-SCHONBERGERYes, absolutely. Thank you very much Steven for pointing out a particular aspect of the challenge. The problem is that if we go and use big data analysis to predict future behavior, and then start penalizing people, then what we are doing is to use a tool that can only see correlations, that can only see the what, and apply it for a causal purpose -- to a causal purpose, and use it to infer the why, the guilt of a person.
MAYER-SCHONBERGERSo this is not necessarily a problem of the big data tool, it is a problem of the use of big data, and we must be vigilant as a society to not misuse or abuse big data that way.
NNAMDIWe got an email from Mike in Beltsville who says, "Regarding personal information, I once heard a commentator say the day would come when everything knowable about a person would be available freely. He said when that day arrived people's reaction would be so what? How can you blackmail me if all my secrets are no longer secrets," to which you say what, Ken?
CUKIERI say the day's never going to come. That is preposterous.
CUKIERWell, because on what presumption do we think we can know everything about a single individual? The data is only a simulacrum of reality. It is not the real thing, firstly. We'll never have all the data. That's not actually possible. The data -- we have to -- though there's values within the data for what we actually collect, in the same way if you will, that a map is not the territory. So this sort of hypothetical thought experiment of what will happen when we know everything about everyone, that day is just not going to happen. So on a practical level, to think about it and to get wound up in knots doesn't seem useful.
NNAMDIThank you very much for your call, Steven. In the past, most analysis relied on samples because we didn't have, or simply couldn't process all the information, but now what are we able to do?
MAYER-SCHONBERGERWe are able to look at a phenomenon with much more data available. So to give you an example, for many years, there was a rumor in Japan that sumo wrestlers taking part in the national sport of Japan were taking part in match fixing, but they could never find out how and why. They did a sample, and it didn't show anything. And then they took all the data of ten years of sumo matches and ran an analysis and suddenly found that match fixing was happening, the fraud was happening.
MAYER-SCHONBERGERBut not where everybody expected it, but somewhere else. And that is precisely the beauty, the value, the power of big data. Not -- with the old -- in the old days, with small data, we had to come up with a hypothesis with a question, and then use the data to prove or disprove the question. Now, we can let the data speak. We can use the data to come up with hypotheses and to show us where stuff really is happening.
NNAMDIThe other side of that coin though, is that we may now have a lot of granular, detailed information, but big data is actually quite messy and quite imprecise. How do you explain that, if you will, contradiction?
CUKIERWell, it's not actually a contradiction, and the reason why is because you can apply this data for lots of different purposes. So let's think about something that we talked about earlier which is machine translation, how computers can translate different words from one to another. It turns out when we had a little bit of data and we used a process called statistical machine translation, which basically means we take some items of text that are in two different languages, say French and English, and we look for the statistical probability that a word in one language is the correct substitute, the best substitute for a word in another.
CUKIERThis is a great way of translating text because the alternative is, if you will, to look at -- to download a French-English dictionary and just presume a one to one correspondence of words, just doesn't really work. It just becomes laughable. So when IBM tried to do this very technique, and they applied it to the Canadian parliamentary transcripts that are in both French and English, it wasn't bad, but it wasn't great either.
CUKIERNow, the data was extremely good. It wasn't messy at all. It was very highly curated, high quality data. Now, Google, many years later, about a decade later, marched on in, and instead of using just the Canadian parliamentary transcripts, they availed themselves of the entire worldwide web. That's corporate web sites, that's all EU documents, European Union, in 21 different languages. They took books from their book scanning project that were translated in one language and another and applied it.
CUKIERNow, in the case of the book scanning project, when they tried to get which words were which, the words might not have been totally correct, because they had to take it through optical character recognition and translate it from the dataized text -- excuse me, digitized text to data. However, even though the data was messier, because they had much more data, they were able to do extremely good translations. It was far, far better than IBM had when they had very clean data. So if you will, more and messy data trumps less and clean data.
NNAMDIOnto Christine in Washington D.C. Christine, you're on the air. Go ahead, please.
CHRISTINEHi. I was wondering if you had any future projections on how this big data could be used in cyberspace by non-state or illicit state actors against the U.S. So, you know, further attacks, more specifically aligned with what you expect technology to be in five years from now or ten years from now.
MAYER-SCHONBERGERWell, the -- the power of big data is a power that any actor can utilize. It's like electricity or like antibiotics. If you have it, you can utilize it. You can use it. So what we're going to see in the future is that non-state actors attacking, or trying to attack the United States using big data analysis might be able to find weak spots in the infrastructure of the United States -- in the defense of the United States in cyberspace much more easily than currently. Right now they don't know where the weak spot is, and so they have to attack from all kinds of different angles.
MAYER-SCHONBERGERWith big data analysis, they might be able to find that weak spot in the defense and go after that because big data is empowering them too.
NNAMDIOn -- and thank you for your call, Christine. Onto Dale in Falls Church, Va. Dale, you're on the air. Go ahead, please.
DALESure. Actually, Kojo, love the show. You always have great topics and thanks for taking my call. I actually have a question about when does a person lose ownership of their identity when it comes to this big data, the internet, you know. So a lot of sites now get into the practice of forcing people to give up pieces of their personal information.
NNAMDIViktor, your previous book was kind of about this. When does your personal data become big data?
MAYER-SCHONBERGERIn a way it becomes big data when you give it away to somebody else. But at the same time, what we have to be careful about in the big data age is that the danger is not that other people collect information about you, the danger is -- the potential danger is in what they use it for. If they use it for something bad, then that's highly problematic. If they use it for something good like predicting the flu for example for public health, that might be essential in case of a pandemic. It's the use, not the collection.
NNAMDIAnd I think there's enough concern about invasion of privacy that when you talk about how we'll have to look at big data in a different way, that that's going to be one of the major issues that comes up. But Viktor, in more specific terms, how might this change how people with particular expertise, say a doctor, might do his or her job in the future?
MAYER-SCHONBERGERWe think that with big data in the future, no professional, no expert will be able to render a decision that isn't based on big data empirics, that isn't based on big data analysis. In fact, I foresee that five years down the road if you go to a doctor, and the doctor says oh, Kojo, you have the flu, you say show me the data, and if he can't produce the data, then you should switch your doctor.
NNAMDIWell, on the issue of privacy, Ken, you point out that the danger now shifts from privacy over our personal information to what all the data might predict about our future. What do you mean about that?
CUKIERSure. So privacy is still a big problem in the big data age, but a new issue is prediction or propensity. That is to say that an algorithm believes that we are susceptible or likely to do something, and we're penalized for it before we've actually committed the crime. So this does sound like Minority Report because it sort of is. The idea is that big data is going to be about taking lots of different variables, and although today we are, say, giving a credit score to someone based on a few variables that are well known in advance and explicitly decided by humans, and also applies to a group, tomorrow a lot of these same things change.
CUKIERIt's going to be highly individualized. It may be based on a thousand variables. We may not be able to explain it quite as much because it's a bit of a black box, the algorithms. And the troubling aspect to this is that we may find that we are going to be imprisoned by these predictions. It will believe that we have a 99 percent likelihood to shoplift, and we might have an intervention like a door knock from a social worker, possibly an arrest warrant from the police, and we will not have actually yet committed the crime, yet we will not have let fate play out to find out if I exercise moral choice and didn't shoplift when I could have.
NNAMDIYou talk about the secondary uses of information. What does that mean?
MAYER-SCHONBERGERSure. So we have collected and used information for its primary purpose. Say if you're a cell phone carrier you know where people are for the primary purpose of routing telephone calls. But in a big data age, the value shifts also to the myriad secondary uses that you can put the information towards. So in this case, you can imagine that once you know where everyone is with their smartphones in terms of GPS, then all you -- you now can actually target an advertisement to them, for example, a free Starbucks coffee because they're walking past Starbucks, something like that, because you've applied the same data that you've used for one thing, routing the telephone call, therefore knowing where the telephone subscriber is, to something else, in this case advertising.
NNAMDIAnd I'm afraid we're just about out of time. We are talking with the co-authors of the book, "Big Data: A Revolution That Will Transform How We Live, Work, and Think." Kenneth Cukier the data editor for the Economist. Ken, thank you so much for joining us.
CUKIERThank you, Kojo.
NNAMDIViktor Mayer-Schonberger is a professor of internet governance and regulation at Oxford University. He's also the author of the book "Delete: The Virtue of Forgetting in the Digital Age." Viktor, thank you for joining us.
NNAMDI"The Kojo Nnamdi Show" is produced by Brendan Sweeney, Michael Martinez, Ingalisa Schrobsdorff, Tayla Burney, Kathy Goldgeier, and Elizabeth Weinstein, with help from Stephannie Stokes. Our engineer, of course, Tobey Schreiner. Natalie Yuravlivker is on the phones. Podcasts of all shows, audio archives, CDs and free transcripts are available at our website, kojoshow.org. Thank you all for listening. I'm Kojo Nnamdi.