Skills That Matter in a World Awash in Data


There is a paradox named for the French philosopher Jean Buridan, commonly referred to as Buridan’s Ass. One interpretation goes something like this: Take a donkey and stake it equidistant between two identical piles of hay. Since the donkey is incapable of rational choice and the piles of hay are indistinguishable, the donkey will die of hunger. Of course, in the real world, we all presume the donkey would somehow “pick” a pile. We accept these situations all around us: fish seem to “choose” a direction to swim, birds of the same species seem to “decide” whether or not to migrate, and data seems to “suggest” things that we wish to prove. Which of these is not like the others? The answer is the data. Data has no ability to “act” on its own. We can use it or not, and it simply doesn’t care. The choice is entirely ours. The challenge is how we decide rationally what data to use and how to use it, when we have enough data, and when we have the “right” data. Making the wrong choice has serious consequences. Making the right choice can lead to enormous advantage.

Let’s look at the facts. We know that we are living in a world awash in data. Every day, we produce more data than the previous day, and at a rate that is arguably impossible to measure or model because we have lost the ability to see the boundaries. Data is not only created in places we can easily “see,” such as the Internet or corporate servers. It is created in devices, it is created in the cloud, it is created in streams that may or may not be captured and stored, and it is created in places intentionally engineered to be difficult or impossible to perceive without special tools or privileges. Things are now talking to other things and producing data that only those things can see or use. There is no defensible premise that we can simply scale our approach to data from ten years ago to address the dynamic nature of data today.

This deluge of data is resulting in three inconvenient truths:

  1. Organizations are struggling to make use of the data already in hand, even as the amount of “discoverable” data increases at unprecedented rates.
  2. The data which can be brought to bear on a business problem is effectively unbounded, yet the requirements of governance and regulatory compliance make it increasingly difficult to experiment with new types of data.
  3. The skills we need to understand data never seen before are extremely nuanced, and very different from those which have led to success so far.

Data already in hand – think airplanes and peanut butter.

Recently, I was on a flight which was delayed due to a mechanical issue. In such situations, the airline faces a complex problem, trying to estimate the delay and balance regulations, passenger connections, equipment availability, and many other factors. There is also a human element as people try to fix the problem. All I really wanted to know was how long the delay would be. Did I have time to leave the gate and do something else? Did I have time to find a quiet place to work? In this situation, the answer was yes. The flight was delayed 2 hours. I wandered very far from the gate (bad idea). All of a sudden, I got a text message saying that, as of 3:50 PM, my flight was delayed until 3:48 PM. I didn’t have time to wonder about time travel… I sprinted back to the gate, only to find a whole lot of nothing going on. It seemed that the airline systems that talk to each other to send out messaging were not communicating correctly with the ones that ingested data from the rest of the system. Stand down from red alert… No plane yet. False alarm.
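A notice like that one (sent at 3:50 PM, announcing a 3:48 PM departure) could in principle be caught by a simple consistency check before it goes out. The following is only a minimal sketch in Python, not a description of how any airline system works; the function name, the parameters, and the date are all hypothetical.

```python
from datetime import datetime, timedelta


def is_plausible_delay_notice(sent_at: datetime,
                              new_departure: datetime,
                              min_lead: timedelta = timedelta(0)) -> bool:
    """Return True only if the announced departure is not earlier than
    the moment the notice is sent (plus an optional lead time)."""
    return new_departure >= sent_at + min_lead


# The date is an arbitrary placeholder; only the times come from the story.
sent_at = datetime(2015, 1, 1, 15, 50)        # notice sent at 3:50 PM
new_departure = datetime(2015, 1, 1, 15, 48)  # "delayed until 3:48 PM"

if not is_plausible_delay_notice(sent_at, new_departure):
    print("Hold the message: announced departure is already in the past.")
```

A check this small does not fix the underlying miscommunication between systems, but it illustrates the idea of testing a signal against what we already know before acting on it.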

While the situation is funny in retrospect, it wasn’t at the time. How many times do we do something like this to customers or colleagues? How many times do the complex systems we have built speak to one another in ways that were not intended and reach the wrong conclusions or send the wrong signals? I am increasingly finding senior executives who struggle to make sense of the data already on hand within their organization. In some cases, they are simply not interested in more data because they are overwhelmed with the data on hand.

This position is a very dangerous one to take. We can’t just “pick a pile of hay.” There is no logical reason to presume that the data in hand is sufficient to make any particular decision without some sort of analysis comparing three universes: data in hand, data that could be brought to bear on the problem, and data that we know exists but which is not accessible (e.g. covert, confidential, not disclosed). Only by assessing the relative size and importance of these three distinct sets of data in some meaningful way can we rationally make a determination that we are using sufficient data to make a data-based decision.

There is a phenomenon in computer science known as the “dispositive threshold.” This is the point at which sufficient information exists to make a decision. It does not, however, determine that there is sufficient information to make a repeatable decision, or an effective decision. Imagine that I asked you if you liked peanut butter and you had never tasted it. You don’t have enough information. After confirming that you know you don’t have a peanut allergy, I give you a spoon of peanut butter. You either “like” it or you don’t. You may feel you have enough information (dispositive threshold) until you learn that there is creamy and chunky peanut butter and you have only tasted one type, so you ask for a spoon of the other type. Now you learn that some peanut butter is salted and some isn’t. At some point, you step back and realize that all of these variations are not changing the essence of what peanut butter is. You can make a reasonable decision about how you feel about peanut butter without tasting all potential variations of peanut butter. You can answer the question “do you like peanut butter” but not the question “do you like all types of peanut butter.” The moral here, without getting into lots of math or philosophy, is this:

It is possible to make decisions with data if we are mindful about what data we have available. However, we must at least have some idea of the data we are not using in the decision-making process and a clear understanding of the constraints on the types of decisions we can make and defend.

Governance and regulatory compliance – bad guys and salad bars.

Governance essentially boils down to the three time-worn pieces of advice: “say what you’re going to do, do it, say you did it.” Of course, in the case of data-based decision making, there are many nuances in terms of deciding what you are going to do. Even before we consider rules and regulations, we can look at best practice and reasonableness. We must decide what information we will allow in the enterprise, how we will ingest it, evaluate it, store it, and use it. These become the rules of the road and governance is the process of making sure we follow those rules.

So far, this advice seems pretty straightforward, but consider what happens when the governance system gets washed over by a huge amount of data that has never been seen before. Some advocates of “big data” would suggest ingesting the data and using techniques such as unsupervised learning to tell us what the data means. This is a dangerous strategy akin to trying to eat everything on the salad bar. There is a very real risk that some data should never enter the enterprise. I would suggest that we need to take a few steps first to make sure we are “doing what we said we will do.” For example, have we looked at the way in which the data was created, what it is intended to contain, and a small sample of the data in a controlled environment to make sure it lives up to the promised content? Small steps before ingesting big data can avoid big, possibly unrecoverable mistakes.
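As one illustration of such a small step, the sketch below checks a small random sample of records against the content a provider has promised before anything is ingested. It is a minimal sketch in Python under stated assumptions: the schema, the field names, and both function names are hypothetical, and a real governance process would look at far more than field names and types.

```python
import random

# Hypothetical "promised content": the fields the provider says each
# record contains, and the type each field is supposed to hold.
PROMISED_SCHEMA = {"customer_id": int, "country": str, "revenue": float}


def record_matches(record: dict, schema: dict) -> bool:
    """True if the record has exactly the promised fields and types."""
    return (record.keys() == schema.keys()
            and all(isinstance(record[name], kind)
                    for name, kind in schema.items()))


def sample_pass_rate(records: list, schema: dict,
                     sample_size: int = 100, seed: int = 0) -> float:
    """Check a small random sample and return the fraction that matches."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    if not sample:
        return 0.0
    return sum(record_matches(r, schema) for r in sample) / len(sample)


# Made-up records for illustration: nine conform, one does not.
good = {"customer_id": 1, "country": "US", "revenue": 10.0}
bad = {"customer_id": "1", "country": "US"}
records = [good] * 9 + [bad]

rate = sample_pass_rate(records, PROMISED_SCHEMA, sample_size=10)
print(f"{rate:.0%} of sampled records match the promised content")
```

What pass rate is acceptable is a governance decision, not a technical one; the point of the sketch is only that the question gets asked before the data enters the enterprise.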

Of course, even if we follow the rules very carefully, the system changes. In the case of governance, we must also consider the changing regulatory environment. For example, the first laws concerning expectations of privacy in electronic communication were in place before the Internet changed the way we communicate with one another. Many times, laws lag quite significantly behind technology, or lawmakers are influenced by changes in policy, so we must be careful to continuously re-evaluate what we are doing from a governance perspective to comply not only with internal policy, but also with evolving regulation. Sometimes, this process can get very tricky.

Consider the situation of looking for bad behavior. Bad guys are tricky. They continue to change their behavior, even as systems and processes evolve to detect bad behavior. In science, this kind of problem is often called an “observer effect,” where the thing being observed changes by virtue of being observed. Even the definition of “bad” changes over time or from the perspective of different geographies and use cases. When we create processes for governance, we look at the data we may permissibly ingest. When we create processes for detecting (or predicting) bad behavior, the tension is that we must use data in permissible ways to detect malfeasant acts that are unconstrained by those same rules. So in effect, we have to use good data in good ways to detect bad actors operating in bad ways. The key take-away here is a tricky one:

We must be overt and observant about how we discover, curate and synthesize data to discover actions and insights that often shape or redefine the rules.

The skills we need – on change and wet babies.

There is an old saying that only wet babies like change all the time. The reality is that the massive amounts of data facing an enterprise are forcing leaders to look very carefully at the skills they are hiring into the organization. It is not enough to find people who will help “drive change” in the organization – we have to ensure we are driving the right change because the cost of being wrong is quite significant when the pace of change is so fast. I was once in a meeting where a leader was concerned about having to provide a type of training to a large group because their skill level would increase. “They are unskilled workers. What happens if we train them, and they leave?” he shouted. The smartest consultant I ever worked with crystallized the situation with the perfect reply: “What happens if you don’t train them and they stay?” Competitors and malefactors will certainly gain ground if we spend time chasing the wrong paths of inquiry, yet we can just as easily become paralyzed with analysis and do nothing, which is itself a decision that has a cost (the cost of doing nothing is often the most damaging).

The key to driving change in the data arena is to balance the needs of the organization in the near term with the enabling capabilities that will be required in the future. Some skills, like the ability to deal with unstructured data, non-regressive methods (such as recursion and heuristic evaluation), and adjudication of veracity will require time to refine. We must be careful to spend some time building out the right longer-term capabilities so that they are ready when we need them.

At the same time, we must not ignore skills that may be needed to augment our capability in the short term. Examples might include better visualization, problem formulation, empirical (repeatable) methodology, and computational linguistics. Ultimately, I recommend weighing three categories when thinking about skills in the data-based organization:

Consider what you believe, how you need to behave, and how you will measure and sustain progress.

Ultimately, the skills that matter are those that will drive value to the organization and to the customers served. As leaders in a world awash in data, we must be better than Buridan’s Ass. We must look beyond the hay. We live in an age where we will learn to do amazing things with data or become outpaced by those who gain better skills and capability. The opportunity goes to those who make a conscious decision to look at data in a new way, unconstrained and full of opportunity if we learn how to use it.

Posted in Big Data, Data Management by Anthony Scriffignano

Anthony is the Chief Data Scientist of Dun & Bradstreet.
