Four Pitfalls to Avoid when Starting a Big Data Project

In spite of all the excitement over Big Data, many companies will struggle with their Big Data initiatives in 2014. Why’s that? Because there are some common myths and misperceptions in the market about the best ways to approach Big Data projects – and they can really trip you up.

Pitfall #1. Putting the cart before the horse. Clients often come to their Big Data projects with clear goals: to mine social networks, gather sensor data or add public data. That’s the promise of a Big Data initiative. The reality can be quite a bit less exciting. Most of the clients I talk to consistently ignore the basic practicalities of dealing with ever-growing data stores. When I examine their data architectures, it’s common to find that their data is in a poor state – riddled with errors, with records duplicated between systems. In the wild, data proliferates. And since storage is relatively inexpensive, data sprawl is common, both on premises and in the cloud, with companies keeping multiple copies of their data in different systems. The most extreme case of this I’ve seen was a client that was keeping nine copies of corporate data – not a good foundation for any Big Data initiative.

Companies with Big Data ambitions for the New Year should start by improving the quality, or data hygiene, of the data they already own. A good first step is to synchronize all copies of their corporate data. It makes no sense to add Big Data to the unmanaged mix of corporate data that is so common in most environments.
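As a rough illustration of what checking copies against each other can look like, here is a minimal Python sketch. It assumes each system's copy has already been exported into a dictionary keyed by a shared record ID; the function name, the system names and the record fields are all hypothetical, and a real synchronization effort would involve far more than this comparison.

```python
from collections import defaultdict


def find_divergent_records(copies):
    """Report record keys that are missing from a system or differ between systems.

    `copies` maps a (hypothetical) system name to a dict of
    record key -> record, where each record is a flat dict of field values.
    """
    # Collect every record key seen in any system.
    all_keys = set()
    for records in copies.values():
        all_keys.update(records)

    report = {}
    for key in sorted(all_keys):
        # Systems that do not hold this record at all.
        missing = sorted(name for name, records in copies.items() if key not in records)

        # Group the systems by the exact content they hold for this record.
        versions = defaultdict(list)
        for name, records in copies.items():
            if key in records:
                fingerprint = tuple(sorted(records[key].items()))
                versions[fingerprint].append(name)

        if missing or len(versions) > 1:
            report[key] = {"missing_from": missing, "distinct_versions": len(versions)}
    return report
```

Calling the function with, say, a CRM export, a billing export and a warehouse export returns only the record keys that need attention, which gives a first estimate of how far apart the copies have drifted.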

Pitfall #2. Putting data quality at the end of the process. There’s a common myth that it’s best to address data quality at the end of the Big Data pipeline, during the later stages of processing. While it may be true that the new data added in a Big Data project, such as behavioral data or location data, can initially contain ‘junk’ data, that approach can cause serious problems when you’re dealing with your core corporate dataset.

A better approach is to apply data quality checks at the beginning of the pipeline, to improve the quality of the data you already own and to lay a strong foundation for the Big Data project. This is best done before acquiring new streams and types of data, such as external data, public data, machine-to-machine (M2M) data and the like. Applying data quality checks is also part of the process of bringing in new datasets, such as public data, before combining that data with existing corporate datasets.

There are a variety of solutions you can use to de-duplicate, validate and complete your key data, such as your customer and company information. Vendors such as D&B offer dataset integration with familiar analyst tools, such as Power BI in Office 365 and Excel, which makes their Cleanse and Match and Business Verification products attractive in ‘clean-at-the-beginning’ scenarios for Big Data.
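To make the ‘clean-at-the-beginning’ idea concrete, here is a minimal Python sketch of a validate-and-de-duplicate pass over customer rows. It is not how any vendor product works; the field names (`name`, `email`), the deliberately simple email pattern and the exact-match rule on normalized values are all assumptions chosen for illustration. Commercial matching services use much richer rules and reference data.

```python
import re

# A deliberately simple pattern (an assumption): something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_name(name):
    """Collapse runs of whitespace, trim, and lowercase a name for comparison."""
    return re.sub(r"\s+", " ", name).strip().lower()


def clean_customers(rows):
    """Split rows into (clean, rejected).

    Rows with an empty name or an invalid email are rejected.
    Rows whose normalized (name, email) pair was already seen are dropped as duplicates.
    """
    seen = set()
    clean, rejected = [], []
    for row in rows:
        name = normalize_name(row.get("name", ""))
        email = row.get("email", "").strip().lower()

        # Validate before the row is allowed further down the pipeline.
        if not name or not EMAIL_PATTERN.match(email):
            rejected.append(row)
            continue

        # De-duplicate on the normalized values.
        key = (name, email)
        if key in seen:
            continue
        seen.add(key)
        clean.append({"name": name, "email": email})
    return clean, rejected
```

Keeping the rejected rows, rather than silently discarding them, gives the business side a list to review and correct at the source.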

Pitfall #3. Embracing new technologies – even if they don’t solve real-world problems. There’s a lot of buzz around Hadoop, in-memory computing and actionable analytics. All these technologies have their uses. But keep in mind that tomorrow’s magic data visualization tool won’t help anyone make sense of source data that is basically unsound.

At the risk of repeating myself: Start by evaluating and cleaning your data. It’s a good idea, and it can have a side benefit. When cleaning source data, I often find a very large amount of waste in the source systems, resulting in what I call “Small Big Data” projects. These projects can run on source systems such as relational databases, because the data volumes don’t warrant use of Hadoop or NoSQL solutions. Running “Small Big Data” projects using current systems and processes can be a time and money saver, since there are relatively few (if any) training needs for your existing staff.

I’ve guided teams to upgrade to SQL Server 2012 to better accommodate these types of projects and, during that process, helped them understand how to make use of Enterprise features. This route makes it much simpler for these teams to get immediate business value. I’ve also guided teams toward using MongoDB rather than Hadoop, as the adoption curve is gentler for the former when coming from a relational system. That being said, there are of course Big Data projects where Hadoop is warranted, e.g. when you are using newer technologies. Guiding factors are the commonly stated volume, variety and velocity of data – meaning how much data, of what type(s) and how fast.

Pitfall #4. Taking an IT-driven approach. Another common fallacy of Big Data projects is that they should be driven by the IT department. However, my experience is that successful Big Data projects have a strong corporate sponsor and are usually driven by the business analysts, with support from the IT group. Analysts best understand the core business processes, and executive sponsorship is always key to any successful change to the core IT structure. IT can and should play a partner role (rather than a lead) in these types of projects.

Simply put – business goals should be the primary driver and technology implementation is secondary. When IT leads Big Data initiatives, I have seen these projects produce results, i.e. a new model, cube, data store, report, etc. However, success is measured by the value these new objects provide to the business. If they are not accepted and/or used by the business side of the house, then there is little value in the project.

To summarize, Big Data projects are best started by getting your current (data) house in order. This process can be aided by some of the D&B data services available on Microsoft’s Windows Azure Marketplace, such as Cleanse & Match, Business Verification, and many others that help you clean, enrich and glean valuable insight from your business data. Also, the integration into current technologies, such as SQL Server SSIS and Excel to name a couple of examples, further improves usability.

Wishing you a clean, rich and verified 2014!


Posted in Big Data, Data Quality by Lynn Langit

Lynn Langit is a Big Data & Cloud expert and a D&B MVP. Read more about her at and
