Dr. Tyrone Grandison on Data, Privacy and Security

State of Cybercrime

English - September 11, 2017 08:00 - 35 minutes - 32.4 MB - ★★★★★ - 48 ratings

Dr. Tyrone Grandison has done it all. He is an author, professor, mentor, board member, and a former White House Presidential Innovation Fellow. He has held various positions in the C-Suite, including his most recent role as Chief Information Officer at the Institute for Health Metrics and Evaluation, an independent health research center that provides metrics on the world's most important health problems.

In our interview, Tyrone shares what it’s like to lead a team of forty highly skilled technologists who provide tools, infrastructure, and technology to enable researchers to develop statistical models, visualizations and reports. He also describes his adventures wrangling petabytes of data, the promise and peril of our data economy, and what board members need to know about cybersecurity.

Transcript
Tyrone Grandison: My name is Tyrone Grandison. I am the Chief Information Officer at the Institute for Health Metrics and Evaluation, IHME, at the University of Washington in Seattle. And IHME is a global nonprofit in the public health and population health space, where we're focused on how do we get people to have a long life and have that long life at the highest health capacity possible.

Cindy Ng: Often times, the bottom line drives businesses forward, where your institute is driven by helping policy makers and donors determine how to help people live longer and healthier lives. What is your involvement in ensuring that that vision is sustained and carried through?

Tyrone Grandison: Perfect. So I lead the technology team here, which is a team of 40 really skilled data scientists, software engineers, system administrators, project and program managers. And what we do is that we provide the base, the infrastructure. We provide tools and technologies that enable researchers to, one, ingest data. So we get data from every single country across the world. Everything from surveys to censuses to death records. No matter how small or poor or politically closed a country is. And we basically house this information. We help the researchers develop statistical models. Like, very sophisticated statistical models and tools on them that make sense of the data. And then we actually put it out there to a network of over 2,400 collaborators.

And they help us produce what we called the Global Burden of Disease that, you know, shows what in different countries of the world is the predominant thing that is actually shortening lives in particular age groups, for particular genders and all demographic information. So, now people can, if they wanted to, do an apples-to-apples comparison between countries across ages and over time. So, if you wanted to see the damage done by tobacco smoking in Greece and compare that to the healthy years lost due to traffic injuries in Guatemala, you can actually do that. If you wanted to compare both of those things with the impact of HIV in Ghana, then that's now possible. So our entire thing is, how do we actually provide the technology base and the skills to, one, host the data, support the building of the models and support the visualization of it. So people can actually make these comparisons.

Cindy Ng: You're responsible for a lot and let's try to break it down a bit. When you receive a bunch of data sets from various sources, take me through what your plan is for it. Last time we spoke, we spoke about obesity. Maybe that is a good one that everyone can relate to?

Tyrone Grandison: Sure. So, say we get an obesity data set from the health entities within a particular country. It goes through a process where we have a team of data analysts look at the data and extract the relevant portions of it. We then put it into our ingestion pipeline, where we then vet it. Vet it in terms of what can it apply to. Does it apply to specific diseases? Obviously, it's going to apply to a specific country. Does it apply to a particular age group and gender? From that point on, we then include it in models. And we have our modeling pipeline that does everything from estimating the number of years lost from obesity in that particular country. Also, as I mentioned before, it actually sees if that particular statistic that we got from that survey is relevant or not.

From there, we basically use it to figure out, okay, well what is the overall picture across the world for obesity? And then, we visualize it and make it accessible. And provide people with the ability to tell stories on it with the hope that at some point, a policymaker or somebody within the public health institute within a particular country is gonna see it and actually use it in their decision making in terms of how to actually reduce obesity in their particular country.

Cindy Ng: And when you talk about relevant and modeling, people say in the industry that there is a lot of unconscious bias. How do you reconcile that? And how do you work with certain factors that people think is controversial? For instance, people have said that using a body mass index isn't accurate.

Tyrone Grandison: That's where we actually depend a lot on the network of collaborators that we spoke about. Not only do we have like a team that has been doing epidemiology and can advance the population health metrics for, you know, over two decades. We do depend upon experts within each particular country once we actually produce, like, you know, the first estimates based upon the initial models to actually look at these estimates and say, "Nope. This does not make sense. We need to actually adjust your model to add a factor in, that same unconscious bias." Or, to kind of remove that the model says that we're seeing but that the model may need to be tweaked or is wrong about. It all boils down to having people vet what the models are doing.

So, it's more along the lines of how do you create systems that are really good at human computation. Marrying the things that machines are good with and then putting in a step there that forces a human to verify and kind of improve the final estimate that you actually want to produce.

Cindy Ng: Is there a pattern that you've seen over time where time and time again, the model doesn't account for X, Y and Z? And then, the human gets involved and then figures out what's needed and provides the context? Is there a particular concept or idea that you've seen?

Tyrone Grandison: There is. And there is to the point where we basically have included it in our initial processing. So, there is this concept, right. The idea of a shock. Where a shock is an event that models cannot predict and it may have wide ranging impact essentially on what you're trying to produce. So, for example, you could consider the earthquake in Haiti as a shock. You could consider the HIV epidemic as a shock. Every single country in any one given year may have a few shocks depending upon what the geolocation is that you're looking at. And again, the shocks are different and we are really grateful to the collaborative network for providing insight and telling us that, "Op, this shock is actually missing from your model for this particular location, for this particular population segment."

Cindy Ng: It sounds like there's a lot of relationship building, too, with these organizations because sometimes people aren't so forthcoming with what you need to know.

Tyrone Grandison: So, I mean, it's relationship building. The work that we've been doing here has been going on for 20 years. So, imagine 20 years of work just producing this Global Burden of Disease. And then, probably another decade or two before that just building the connections across the world. Because our Director has been in this space for quite a while now. He's worked everywhere from WHO to MIT doing this work. So, the connections there and the connections from the executive team have been invaluable in making sure that people actually speak candidly and honestly about what's going on. Because we are the impartial arbiters of the best data on what's happening in population health.

Cindy Ng: And it certainly really helps when it's not driven by the bottom line. It's the most important thing is to improve everyone's health outcome. What are the challenges of working with disparate data sets?

Tyrone Grandison: So, the challenge is the same everywhere, right? The set of challenges all relate to, okay, well, are we talking about the same things? Right. Are we talking the same language? Do we have the same semantics? Basic challenge. Two is, well, does the data have what we need to actually answer the question? Not all data is relevant. Not all data is created equal. So, just figuring out what is gonna actually give us insight into, you know, the question as to how many years do you lose for a particular disease? And the third thing, which is pretty common to, you know, every field that is trying to push into the open data area. Do we have the right facets in each data set to actually integrate them? Does it make sense to integrate them at all? So, the challenges are not different from what the broader industry is facing.

Cindy Ng: You've developed relationships for over 20 years. Back then, we weren't able to assess so many different, I'm guessing billions and trillions of data sets. Have you seen the transition happen? And how has that transition been difficult? And how has it made your lives so much better?

Tyrone Grandison: Yeah. So, the Global Burden of Disease actually started on a cycle that was, you know, when we considered we had enough data to actually make those estimates, we would actually produce the next Global Burden of Disease. Right, and we just moved starting this year to an annual cycle. So, that's the biggest change. The biggest change is because of the wealth of data that exists out there. Because of the advances of technology, now we can actually increase the production of this data asset, so to speak. Whereas before, it was a lot of anecdotal evidence. It was a lot of negotiation to get the data that we actually need. Now, there are far more open data sets. So, lots more that's actually available.

A willingness, due to past demonstrations of the power of open data, for governments and people to actually provide and produce them, because they know that they can actually use them. It's more of the technology hand-in-hand with the cultural change that's happened. Those have been the biggest changes.

Cindy Ng: What have you learned about wrangling petabytes of data?

Tyrone Grandison: A lot. In a nutshell, it's very difficult and if I was to give advice to people, I would start with, so what's the problem you're trying to solve? What's the mission you're trying to achieve? And figure out what are the things that you need in your data sets that would help you answer that question or mission. And finally, as much as possible, stick with a standardize-and-simplify kind of methodology. Leverage a standard infrastructure and a standard architecture across what you are doing. And make it dead simple because if it's not standard or simple, then getting to scale is really difficult. And scale meaning processing tens, hundreds of petabytes worth of data.

Cindy Ng: There are a lot of health trackers, too, where they're trying to gather all sorts of data in hopes that they might use it later. Is that a recommended best practice approach for figuring your solution or the problem out? Because, you know, what if you didn't think of something and then a new idea popped into your head? And then there's a lot of controversy with that. What is your insight...

Tyrone Grandison: The controversy is, in my view, actually very real. One, what is the level of data that you are collecting, right? So, at IHME, like, we're lucky to be actually looking at population level data. If you're looking at or collecting individual records, then you have a can of worms in terms of data ownership, data privacy, data security. Right. And, especially in America, what you're referring to is the whole argument around secondary use of health data. The concern or issue is just like with HIPAA, the Health Insurance Portability and Accountability Act. You're supposed to just have data for one person for a specific purpose and only that purpose. The issue or concern, like, you just brought up is, one, a lot of companies actually view data that is created or generated on the particular individual as being their own property. Their own intellectual property. Which you may or may not agree with.

At some point, there's no tack list that says the person who this data is about should actually have a say in this in the current model, the current infrastructure. Right. And I can just say it like, personally, I believe that if the data is about you, that data's created by you, then technically you should own it. And the company should be good stewards of the data. Right. Being a good steward simply means that you're going to use the data for the purpose that you told the owner that you're going to use it for. And that you will destroy the data after you finish using it. If you come up with a secondary use for it, then you should ask the person again, do they want to actually participate in it?

So, the issue that I have with it is basically the disenfranchisement of the data owner. The neglect of, like, consent, or even asking for it to be used in a secondary function or for a secondary purpose. And the fact that there are inherent things in that scenario with that question that are still unresolved and are just assumed to be true that people just need to look at.

Cindy Ng: When you say when the project is over, how do you know when the project is over? Because I can, for instance, write a paper and keep editing and editing and it will never feel completed and done.

Tyrone Grandison: Sure. So, it's... I mean, put it this way. If I say to the people that are involved in a particular study or that gave me their data, that I want to use this data to test a hypothesis and the hypothesis is that drinking a lot of alcohol will cause liver damage. Okay, obvious. And I, you know, publish my findings on it. It gets revised. You know, that at the very end, there has to be a point where either the paper's published in a journal somewhere or not. Right. I'm assuming. If that's the case and, you know, I publish it and I found out that, hey, I can actually use the same data to actually figure out the effects of alcohol consumption on some other thing. That is a secondary purpose that I did not have an agreement with you on, and so I should actually ask for your consent on that. Right.

So, the question is just not when is the task done, but when have I actually accomplished the purpose that I negotiated and asked you to use your data for.

Cindy Ng: So, it sounds like that's the really best practice when you're gathering or using someone's personal data. That that's the initial contract. If there is a secondary use that they should also know about it. Because you don't want to end up in a situation like Henrietta Lacks and they're using your cells and you don't even know it, right?

Tyrone Grandison: Yup. But Henrietta Lacks actually is like a good example. It highlights the current practices of the industry. Right. And again, luckily public health does not have this issue because we have aggregated data on different people. But like in the general healthcare scenario where you do have individual health records, what companies are doing and what they did in the Henrietta Lacks case was they may have actually specified in some legal document that, "Hey, we're gonna use your information for X, and X is the purpose." And they make either X so broad, so general that it encompasses like every possible thing that you can imagine. Or, they basically say, "We're going to do a really specific purpose and anything else that we find." And that is now the common practice within the field. Right?

And to me, the heart of that seems very deceptive. Right. Because you're saying to somebody that, you know, we have no idea what we're going to do with your data, we want access to do it and, oh, we assume that you're not going to own it. That we assume that any profits or anything that we get from it is going to be ours. Do you see the model itself just seems perverse? It's tilted or veered towards how do we actually get something from somebody for free and turn it into an asset for my business. Where I have carte blanche to do what I want with it. And I think that discussion has not happened seriously by the healthcare industry.

Cindy Ng: I'm surprised that businesses haven't approached your institution in assisting with this matter. Well, just it sounds like it would make total sense because I'm assuming that all of your data perhaps might have all the names and PHI stripped.

Tyrone Grandison: We don't even get to that level at this point.

Cindy Ng: Oh, you don't even...

Tyrone Grandison: It's information on a generalized level. So there are multiple techniques that you can actually use to, let's say, protect privacy for people. Like, one, would be just suppression. Okay, so I suppress the things that I call or consider PII. Or the other is like generalization. Right. So, it's basically, I'm going to look at or get information that is not at the most granular level. But it's at the level above it. Don't look at you and all your peers. You just go a level above this and say, "Okay. Well, let's look at everyone that lives in a particular zip code or a particular state or country." So, that way, you have protection from hiding in a crowd. So, you can't really identify one particular person in a data set itself. So, at IHME we don't have the PHI/PII issue because we work on generalized data sets.
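The two techniques mentioned here, suppression and generalization, can be illustrated with a minimal Python sketch. It is not IHME's pipeline; the records, the field names, and the helper functions `suppress`, `generalize`, and `smallest_crowd` are all hypothetical and made up for illustration. The sketch drops a directly identifying field, coarsens zip code and age, and then measures the size of the smallest "crowd" a person can hide in.

```python
from collections import Counter

# Hypothetical individual-level records; every value is invented for illustration.
records = [
    {"name": "A. Smith", "zip": "98101", "age": 34, "condition": "obesity"},
    {"name": "B. Jones", "zip": "98102", "age": 37, "condition": "diabetes"},
    {"name": "C. Lee", "zip": "98103", "age": 31, "condition": "obesity"},
    {"name": "D. Kim", "zip": "98104", "age": 58, "condition": "asthma"},
    {"name": "E. Diaz", "zip": "98105", "age": 52, "condition": "obesity"},
]


def suppress(record, fields=("name",)):
    """Suppression: drop the fields treated as directly identifying (PII)."""
    return {key: value for key, value in record.items() if key not in fields}


def generalize(record):
    """Generalization: report zip code and age one level above the most granular."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"   # keep only the 3-digit zip prefix
    decade = record["age"] // 10 * 10       # bucket the age into a 10-year band
    out["age"] = f"{decade}-{decade + 9}"
    return out


def smallest_crowd(rows, quasi_identifiers=("zip", "age")):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


protected = [generalize(suppress(record)) for record in records]

print(smallest_crowd(records))    # raw data: someone is alone in their group
print(smallest_crowd(protected))  # generalized data: everyone shares a group
```

In the raw records every person has a unique zip and age combination, so the smallest crowd has one member. After suppression and generalization, each person shares their zip prefix and age band with at least one other record.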

Cindy Ng: You've held many different roles. You've been a CDO, a CIO, a CEO. Which role do you enjoy doing most?

Tyrone Grandison: So, any role that actually allows me to do two things. Like, one, create and drive the direction or strategy of an organization. And, two, enables me to help with the execution of that strategy to actually produce things that will positively impact people. The roles that I have been fond of so far would be CEO and CIO because at those levels, you basically also get to set what the organizational culture is, which is very valuable in my mind.

Cindy Ng: And since you've also been a board member, what do you think the board needs to know when it comes to privacy and cyber security?

Tyrone Grandison: First of all, I think it should be an agenda item that you deal with upfront and not after a breach or an incident. It should be something that you bake into your plans and into the product life cycle from the very beginning. You should be proactive in how you actually view it. The main thing I've actually noticed over time is just like, people do not pay attention to privacy, cyber security, cyber crime until, you know, after there is a... This is a horrible analogy but until there's a dead body in the sea. What happened? And then you start having reputational damage and financial damage because of it.

When, you know, thinking about the process technology, people and tools that would actually help you fix this from the very get-go would have actually saved you a lot of time. And, you know, the whole perception, not perception, but the whole thought of both of these things, privacy and security, being cost centers, you don't see a profit from them. You don't see revenue being generated from them. And you only actually see the benefit, the cost savings, so to speak, after everyone else has actually been breached or damaged from an episode and you're not. Right. Yeah. It's a little bit more proactive upfront rather than reactive and, you know, post-fact.

Cindy Ng: But do you also think that it's been said that IT makes technology now more complicated than it really is? And they're unable to follow what IT is presenting and so they're confused, and there's not a series of steps you can follow? Or maybe they asked for a budget for the one thing one year and then want some more money next year. And as you said, it costs money. But do you also think that there's a value proposition that's not carried across in a presentation? How can the point be driven home then?

Tyrone Grandison: So, I mean, the biggest thing you just identified a while ago is the language barrier. The translation problem. So, I don't fundamentally believe that anyone, tech or otherwise, is purposely trying to sound complex. Or purposely trying to confuse people. It's just a matter of, you know, you have skilled people in a field or domain. Whatever the domain is. So, if you went tomorrow and started talking to an oncologist or a water engineer, and they just went off and just used a bunch of jargon from their particular fields. They're not trying to be overly complex. They're not trying to not have you understand what they're doing. But they've been studying this for decades. And they're just, like, so steeped in it that that's their vocabulary.

So, the number one issue is just that, one, understanding your audience. Right. If you know that your audience is not tech or is from a different field or a different era in tech or is the board, and understanding the audience and knowing what their language is and then translating your language lingo into things that they can understand, I think that would go a long, long way in actually helping people understand the importance of privacy and cyber security.

Cindy Ng: And we often like to make the analogy that we should treat data like money. But do you think that data can potentially be more valuable than money when the attacks aren't financially driven and they're out to destroy data instead? We react in a really different way. I wanted to hear your thoughts on the analogy of data versus money.

Tyrone Grandison: Interesting. So, money is just a convenient currency. Right. To enable a trade. And money has been associated with giving value to certain objects that we consider important. So, I'm viewing data. And data as something that needs to have a value assigned to it. Right. Which money is going to be that medium. Right. Whether the money is actual physical money or it's Bitcoin. So, I don't see the two things being in conflict. Or the two things having a comparison between value. I just think that data is valuable. A chair is valuable. A phone is valuable. Money is just, like, that medium that allows us to have one standard unit to compare the value between all those things.

Is data going to be more valuable than the current physical IT assets that a company has? Over time, I think, yes. Because the data that you're using, that you're hopefully going to be using, is going to be driving more, one, insights. More, hopefully, revenue. More creative uses of the current resources. So, the data itself is going to influence how much of the other resources you will actually acquire, or how much of the other resources you need to place in particular spots or instances or allocate across the world. So, I see data as a good driving force to making these value driven decisions. So, I think the importance of it versus the physical IT assets is going to increase over time. You can see that happening already. To say data is more valuable than cash. I'm not too sure that's the right question.

Cindy Ng: We've talked about the value of data, but what about the data retention and migration? It's sort of dull, but yet so important.

Tyrone Grandison: Well, multiple perspectives here. Data retention and migration is important for multiple reasons. Right. And the importance normally lies in risk. In minimizing the risk or the harm that can potentially be done to the owner of the data, or the subjects that are referenced in the data sets. Right. That's all the importance. That's why you have whole countries, states actually saying that they have a data retention policy or plan. And that means that after a certain time, either the stuff has to be gone, completely deleted, or be stored somewhere that is secure and not easily accessible.

And the whole premise of it is just like you assume for a particular period of time, that companies are going to need to use that data to actually accomplish a purpose that they specified initially, but then after that point, the risk or the potential harm of that becomes so high that you need to do something to reduce that risk. And that thing normally is a destruction or migration somewhere else.

Cindy Ng: What about integrating that data set with another, so probably a secondary use, but integrating it with other institutes? I hear that people want a one health solution in terms of patient data. So that all organizations can access it. It's definitely a risk. But is that something that you think is a good idea that we should even entertain it? Or we're going to create a monster and that the results of having a one single unit, a database where everything and all the data integrates is a bad solution? It's great for analytics and technology and use.

Tyrone Grandison: I agree with everything you just said. It's both. So, for certain purposes and scenarios, you know, it's good. Because you get to see new things and you get a different picture, a better picture, a more holistic picture once you integrate data sets. That being said, once you integrate data sets, you basically also increase the risk profile of the resulting data sets. And you lower the privacy of the people that are referenced in the data sets. Right. The more data sets you integrate...

So there's this paper that a colleague of mine, Star Ying and I wrote, like last year or the year before last. That basically says there's no privacy in big data. Simply because, like, with big data you assume the three Vs. So, velocity, volume and variety. As you actually add more and more data sets in to get, like, a larger, just say, like a larger big data set, as we call it. What you have happening is that the set of things that can actually be uniquely combined to identify the subject in that larger big data set becomes larger and larger.

So, I mean, a quick, let me see what the quick example would be. So, if you have access to toll data, you have access to the data of, you know, people that are going on, you know, your local highway or your state highway. And you have the logs of when a particular car went through a certain point. The time, the license plates, the owner. All that stuff. So, that's one data set by itself. You have a police data set that had a list of crimes that happened in particular locations. And you pick something else. You have a bunch of records from the DMV that tell you when somebody actually came in to actually have some operations in. All by themselves very innocuous. All by themselves if you anonymized them, or put techniques on them to protect the privacy of the individuals. Perfectly. Okay. Perfectly safe. Right. Not perfectly but relatively.

If you start combining the different data sets just randomly. You combine the toll data with the police data. And you found out that there's a particular car that was at a scene of a crime where somebody was murdered. And that car was at a toll booth that was nearby, like, one minute afterward. Now you have something interesting. You have interesting insight. So that's a good case.

We want to actually have this integration be possible. Because you get insights that you couldn't get from just having that one data set itself. If you start looking at other cases where, you know, somebody wants to actually be protected, you have, and this is just within one data set, you have a data set of all the hospital visits across four different hospitals for a particular person. What you can do if you start merging them is that you can actually use the pattern of visits to uniquely identify somebody. If you start merging that with, again, the transportation records and that may be something that gives you insight as to what somebody's sick with. That may be used...

You can identify them first of all, which they don't want because they went to one hospital. And that would be used to actually do something negative against them. Like deny them insurance or whatever the use case is. But you see, like in multiple different cases, the, one, the privacy of individuals that can hold the...is actually decreased. And, two, it can be used for, you know, positive or negative purposes. For and against the individual data subject or data owner.
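The toll-and-police example from this answer can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the plates, booth names, timestamps, field names, and the `link` helper are invented, and real linkage would involve far messier data. The point it shows is that two data sets that are fairly innocuous alone yield a new, identifying fact once they are joined on place and time.

```python
from datetime import datetime, timedelta

# Hypothetical toll log: when each plate passed each booth. All values are invented.
toll_log = [
    {"plate": "ABC123", "booth": "Exit 12", "time": datetime(2017, 9, 1, 22, 16)},
    {"plate": "XYZ789", "booth": "Exit 12", "time": datetime(2017, 9, 1, 18, 0)},
    {"plate": "JKL456", "booth": "Exit 40", "time": datetime(2017, 9, 1, 22, 17)},
]

# Hypothetical incident list, with the toll booth nearest to each incident.
incidents = [
    {"case": "case-1", "nearest_booth": "Exit 12", "time": datetime(2017, 9, 1, 22, 15)},
]


def link(incidents, toll_log, window=timedelta(minutes=5)):
    """Pair each incident with plates seen at the nearby booth shortly afterward."""
    matches = []
    for incident in incidents:
        for passage in toll_log:
            same_place = passage["booth"] == incident["nearest_booth"]
            elapsed = passage["time"] - incident["time"]
            if same_place and timedelta(0) <= elapsed <= window:
                matches.append((incident["case"], passage["plate"]))
    return matches


print(link(incidents, toll_log))
```

Neither data set names a suspect on its own. The join singles out one plate, which is the useful insight in the crime example and the privacy risk in the hospital example.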

Cindy Ng: People have spoken about these worries. How should we intelligently synthesize this information? Because it's interesting, it's worrisome. But it can also be very beneficial. Because we tend to sensationalize everything.

Tyrone Grandison: Yup. That's a good question. So, I mean, I would say to look at the major decisions in your life that you plan to be making for the next couple of years. And then look at the tools, software, things that you have online right now that a potential employer may actually look at. Then not employer but a potential person that you're looking to do something with, get a service from. May actually look at to evaluate whether you get the service or not. Whether it be getting a job or getting a new car. Whatever it is. Whatever that thing is that, you know, you want to actually get done.

And you know, see if the current things, the current questions that the person on the other side will be asking and looking at. Would that be interpreted negatively on you? A quick example would just be, okay, you're a Facebook user and look at all the things that you do on there and all the kinda good apps that you have. And then look at who has access to all that. And in those particular instances, is that going to be a positive with that interaction or a negative with that interaction? I mean, I think that's just being responsible in the digital age, right?

Cindy Ng: Right. What is a project that you're most proud of?

Tyrone Grandison: I'm proud of a lot of things. I'm proud of the work that we do here at IHME. I think it's groundbreaking work that's gonna help a lot of people. The data that we produce have actually been used to do pollution legislation. And numbers come out. Different ministers see it. The Ministry in China saw it and said, "Oh, we have an issue here. And we need to actually figure out how do we actually improve our longevity in terms of carbon emission."

We've had the same thing in Africa where there was somebody from the Ministry. I think it was, sorry, was it Gambia or Ghana. I'll find out for you afterwards. And they saw the numbers for, like, deaths due to in-house combustion. And started a program that gave a few hundred, well, a few thousand pots to different households and within like a few years, saw that number go down. So, literally saving lives.

I'm proud of the White House Presidential Innovation Fellows. That group of people that I worked with two and a half years ago. The work that they did. So, one of the fellows in my group worked with the Department of Interior to increase the number of kids that were going to National Parks. And, you know, they did it by actually going out and talking to kids and figuring out, like, what the correct incentive scheme would be. To actually have kids come to the park when they had their summer breaks. And that program is called, like, Every Kid in a Park. And it's hugely successful at getting people, kids and parents, like connected back into nature in life. Right. I'm proud of the work the Commerce Data Service team did at the Department of Commerce. And that did help a lot of people.

We routinely just created data products with the user, the average American citizen in mind. And, like, one of the things that I'm really so proud of is that we helped them democratize and open up U.S. Census Bureau data. Which, you know, is very powerful. It's actually freely open to everybody and it's been used by a lot of businesses that make a lot of money from selling the data itself. Right. So we looked at and exposed that data through something called CitySDK and, you know, that led to everything from people building apps to help food trucks find out where demand was. To people building websites to help accessibility-challenged people figure out how to get around particular cities. To people helping supermarkets figure out how to get fresh foods to communities that didn't have access to them. That was awesome to actually see.

The other thing was exposing the income inequality data and just like showing people that, like, the narrative that like people are hearing about the gender and the race inequality amongst different professionals is actually far worse than is actually mentioned out there in the public. So, I mean, I'm proud of all of it because it was all fun work. All impactful work. All work that hopefully helped people.
