From The Federal Drive with Tom Temin:

Cybersecurity and data management are closely linked. That’s why many agencies are refining their strategies for gathering and managing large stores or lakes of network and other data in service of better cybersecurity. For ways to approach the cyber data problem, the Federal Drive with Tom Temin spoke with the senior fellow for cybersecurity and emerging threats at the R Street Institute, Bryson Bort.

Tom Temin: Given the fact that this data lake type of technology, or any kind of mass storage in the cloud, is available, do agencies run the danger of collecting so much data that it gets hard to identify and sift through to the needles you’re actually looking for in the haystack?

Bryson Bort: Yeah, I call that the NSA problem. We collect everything we can. And then you have the challenge of, with everything you have and as that builds up, how easy is it for me to answer the questions that I want to ask of that data? Part of the challenge, of course, is I don’t always know what question I want to ask before I look through it. So structuring that properly helps get you there quicker. But that’s not always realistic. Things change; we learn different things from the data itself. But yeah, that first part is that we start creating very large haystacks. And, Tom, here’s the worst part: You talk about finding a needle in a haystack. The worst part is sometimes you’re looking through the haystack and there’s no needle to be found.

Tom Temin: Yeah, that could be a lot of spinning wheels and hourglasses of death, I suppose, as you wait for an answer to come out. And what is good practice, first of all, for architecting a data lake nowadays? I don’t think anyone wants to invest in the type of storage hardware infrastructure that they might have had in the ’80s, ’90s and 2000s.

Bryson Bort: Yeah, so first, the concept of a data lake is possible because of the cross-platform accessibility that we get with the cloud. I don’t have to log into a particular server somewhere and access it in a client-server, retrieve-this-file approach. It’s more that I’m accessing the large crater that I filled with all of the data, like a lake, which is where the term data lake comes from.

So what are some of those challenges? First is the same problem that we’ve had since the 1980s: configuration management. What do I have? What is it? How do I categorize it? How do I maintain the status of it? There is status, there’s versioning to this. This gets into the problem, then, that if I don’t have the ability to maintain that status, I have problems with duplication, and I have problems with knowing what the current data is. I’m looking at two copies of the same thing that are different: which one is prime? That’s how I keep from getting confused by history. Then there’s being able to assess the current infrastructure: what is the structure that looks best for that? Tied to the configuration management, the challenge is that we’re doing this on something that already exists. There is a large beast of different data in different forms in different silos, and there’s, of course, no common Rosetta Stone for understanding all of that, or even what’s out there. So a typical approach is usually programmatically based (sometimes it can be department-based), where you’re going to go in and try to encapsulate as much of that as possible, recognizing even then that you’re not going to have caught everything, and establishing that process. We go back, we find the things that are already there, and we establish the process to identify the new things that are going to be created, so that we’re filling in the lake. And then we maintain the quality of the water, I guess, which is our analogy here for that data in that lake.
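To make the configuration-management point concrete, here is a minimal sketch in Python of the kind of cataloging Bort describes: each copy of a dataset is registered with a version and a content hash, so byte-identical duplicates can be spotted and one copy identified as the current, prime version. All names and fields here are illustrative assumptions, not anything from the interview.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class DatasetVersion:
    """One registered copy of a dataset in the lake (hypothetical schema)."""
    name: str
    version: int
    content: bytes
    checksum: str = field(init=False)

    def __post_init__(self):
        # A content hash lets us spot byte-identical duplicates.
        self.checksum = hashlib.sha256(self.content).hexdigest()

class Catalog:
    """Tracks every version of every dataset; the highest version is 'prime'."""
    def __init__(self):
        self.entries = {}  # dataset name -> list of DatasetVersion

    def register(self, dv: DatasetVersion):
        versions = self.entries.setdefault(dv.name, [])
        if any(v.checksum == dv.checksum for v in versions):
            print(f"duplicate of {dv.name} ignored")  # the duplication problem
            return
        versions.append(dv)

    def prime(self, name: str) -> DatasetVersion:
        # "Which one is prime?" -- here, simply the latest registered version.
        return max(self.entries[name], key=lambda v: v.version)

catalog = Catalog()
catalog.register(DatasetVersion("netflow-2023", 1, b"raw flow records"))
catalog.register(DatasetVersion("netflow-2023", 2, b"raw flow records, corrected"))
catalog.register(DatasetVersion("netflow-2023", 2, b"raw flow records, corrected"))  # duplicate
print(catalog.prime("netflow-2023").version)  # -> 2
```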

Tom Temin: Well, you can swim through a lake, but you can’t swim through a haystack. So maybe that’s some advantage.

Bryson Bort: We’re mixing metaphors here, just like real data problems.

Tom Temin: Pretty much, because the data is generated by network sensor devices, your various traffic control devices: routers and switches and so forth. But also, say, for the purpose of fraud detection, which might yield a cybersecurity clue, you’ve got transaction data from the systems deployed to the public or to other agencies. So you’ve got many, many different formats of data coming in from many different database programs. Or maybe they’re not database programs at all, just data thrown off in the course of a piece of equipment operating. What’s current best practice for rationalizing all that, such that the data is searchable, having come from all these different sources?

Bryson Bort: So when I think of rationalizing, I think of what we can cut, and that’s always a challenge. Nobody ever wants to not have the data. The back end of that is data retention: how long do we keep particular data around? And there are liability questions that can tie into that. So it’s not so much a question of rationalization as of normalization. How am I taking disparate data sets of different kinds? Not everything is simply numerical: some things are temporal, some things are geospatial, and there can be others. How do I get those all into a common place where they’re able to work and interface with each other? Then there are the sources of the data: where do I have visibility, and what’s generating the data? You talked about networking devices and databases, but there’s also the people aspect; people generate data too. And there are other devices. Just to throw out an example, consider things like the [DoD Joint Artificial Intelligence Center]: we’re going to have machine learning and artificial intelligence that depend on data, from a training-set perspective, with a certain level of integrity, and any potential bias in that data is going to affect the results. But that’s also going to create its own data as an output of those operations. So with data, it starts with: what are my sources? What is my visibility into my sources? What’s my comprehension of those sources compared to the questions I want to ask, the missions, and then my ability to normalize and centralize that data for analysis and use?
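As a rough illustration of that normalization step, here is a sketch that maps records of different kinds (numerical transactions, timestamped log lines, geospatial sensor readings) into one common event schema that also tracks the originating source. The source names and field names are invented for the example, not drawn from any real agency system.

```python
from datetime import datetime, timezone

def normalize(source: str, raw: dict) -> dict:
    """Map a source-specific record into one common event schema.
    Sources and field names are hypothetical."""
    if source == "transactions":        # numerical data
        return {"source": source,
                "time": raw["posted_at"],
                "kind": "numeric",
                "value": float(raw["amount"])}
    if source == "syslog":              # temporal data
        return {"source": source,
                "time": datetime.fromtimestamp(raw["epoch"], tz=timezone.utc).isoformat(),
                "kind": "event",
                "value": raw["message"]}
    if source == "field_sensor":        # geospatial data
        return {"source": source,
                "time": raw["observed_at"],
                "kind": "geo",
                "value": (raw["lat"], raw["lon"])}
    raise ValueError(f"no visibility into source: {source}")

# Disparate records land in one queryable shape.
lake = [
    normalize("transactions", {"posted_at": "2023-04-01T12:00:00Z", "amount": "129.95"}),
    normalize("syslog", {"epoch": 1680350400, "message": "login failed"}),
    normalize("field_sensor", {"observed_at": "2023-04-01T12:01:00Z", "lat": 38.9, "lon": -77.0}),
]
print([e["source"] for e in lake])
```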

Tom Temin: All right, and does that involve, then, a process of stripping out some of the formatting and some of the metadata around the data, and getting to elements that are then much more interoperable?

Bryson Bort: Yeah, I mean, there’s a filter there to put it in a particular format; that’s part of that normalization.

Tom Temin: And we mentioned the idea that it’s hard to find a needle in a haystack, if there is a needle in the first place. And there’s a time element to cybersecurity discovery and mitigation, even for things that have a long dwell time, which they sometimes do and sometimes don’t. So how do you analyze a data lake quickly? What are some of the technologies or techniques for sorting through large amounts of data such that you’re timely in responding to what might happen?

Bryson Bort: So this is correlation. Think of data by itself as singular atoms. What I want to be able to do is apply structured or unstructured queries in different ways that match the questions I either already know or the ones I want to ask. The structured ones are where I have identified a pattern: this and this together is always going to answer this question for me. In simple security terms, let’s look at threat hunting: if I see this particular host activity tied to this other host activity tied to this network traffic, that is a common attack chain for this kind of Chinese espionage campaign. I don’t want to be continually asking that question; I’ve identified that insight, so it becomes a standing query, where data matching that query is now going to trigger an alert. That gets human intervention: the data is doing the work for us, based on what we can see in our visibility, to bring a human in to detect, respond and remediate what we now know is a breach.

Then there are the unstructured queries, the questions that I don’t yet know. Say we want to look and identify something new, again using the security example. I’ve got that structured set, but what are some variables around it that I could start to look at? Things like: is the traffic from this particular location, or what’s the round trip on that traffic? That’s actually how we could have identified SolarWinds, which has been in the news a lot. The traffic for SolarWinds had to go back to somebody, and that somebody was not in this country. So the round-trip time on the packets was actually longer than what it should have been. That’s the kind of thing where this data can give you rich insights: you don’t have to know that it was three obfuscated bounce hops through the internet away to get to Moscow. The first thing looks good, but this data made you question it, because there’s a pattern there that would have given it away.
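Here is a compact sketch of both ideas: a structured correlation rule that turns a known attack-chain pattern into a standing query that raises an alert, and an unstructured check that flags traffic whose round-trip time sits far above a baseline, the kind of anomaly Bort says could have surfaced SolarWinds. The event types, thresholds, and host names are all assumptions for illustration, not real indicators.

```python
from statistics import mean, stdev

# --- Structured query: a known pattern becomes a standing alert rule. ---
ATTACK_CHAIN = ["suspicious_process", "credential_dump", "outbound_c2"]  # hypothetical

def attack_chain_alert(events):
    """Fire when the known host + host + network sequence appears in order."""
    step = 0
    for e in events:
        if e["type"] == ATTACK_CHAIN[step]:
            step += 1
            if step == len(ATTACK_CHAIN):
                return True  # trigger the alert; bring a human in
    return False

# --- Unstructured query: probe variables around the pattern, e.g. round trip. ---
def rtt_outlier(new_sample_ms, baseline_ms, threshold_sigmas=3.0):
    """Flag a round-trip time far above baseline: traffic that looks fine but
    takes too long to come back may be bouncing somewhere it shouldn't."""
    base, spread = mean(baseline_ms), stdev(baseline_ms)
    return new_sample_ms > base + threshold_sigmas * spread

events = [
    {"type": "suspicious_process", "host": "hq-ws-14"},
    {"type": "credential_dump", "host": "hq-ws-14"},
    {"type": "outbound_c2", "host": "hq-ws-14"},
]
print(attack_chain_alert(events))            # -> True
print(rtt_outlier(180, [22, 25, 24, 23, 26]))  # -> True (far above baseline)
print(rtt_outlier(25, [22, 25, 24, 23, 26]))   # -> False
```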
