How to avoid data lake failures

Gartner believes that 70 percent of mature organisations will have more data flowing from data lakes to data warehouses than vice versa by 2021

Organisations are enthusiastically pressing ahead with data lake implementations. The lack of documented failures has convinced many organisations that data lakes are a magical answer to their data and analytics requirements. Many will likely fail; they just haven’t yet.

Data lakes support unknown data (less organised, raw or exogenous) and unknown questions (discovery and data science oriented) to enable exploration and innovation.

Enterprises implement data lakes for a variety of reasons, many of which aren’t necessarily sound reasons for using one. The most common is to provide data to highly skilled information consumers more quickly, enabling them to bypass data warehousing and data mart environments. Other reasons include the desire to create experimental environments for data scientists or to replace existing data warehousing environments.

Most organisations we speak with about this topic believe that a data lake’s flexible data storage and diverse processing options will simplify data management tasks such as governance, data lifecycles, data quality and metadata.

No silver bullet

The popular view is that a data lake will be the one destination for all the data in an enterprise and the optimal platform for all its analytics. This view, however, rests on three assumptions that have proven to be incorrect.

The first is that everyone in the enterprise is data literate enough to derive value from large amounts of raw or uncurated data. The reality is that only a handful of staff are skilled enough to cope with such data, and they’re likely doing so already.

The second is that the enterprise will be able to define cohesive governance and security policies across all datasets residing on a single cluster of physical infrastructure. The same was attempted with data warehouse implementations, but it proves far less successful with data lakes because the data they contain isn’t modelled. Creating policies for data without context is impossible.

The third is that data lake implementation technologies perform as well as expected. In practice they perform far worse, which leads to wild overestimations of their benefits.

Improve your chance of success

Here are the most common data lake failure scenarios and how you can avoid them.

Failure 1: Enterprise data lake

Enterprises implementing an enterprise data lake aim to unify multiple data silos into a single piece of physical infrastructure. The intention is to make all data readily available to different groups throughout the organisation and to centralise data access for analytics. This scenario typically fails because of a variety of governance, performance and organisational challenges.

Putting all enterprise data in a single location is a long-held vision in the world of IT, and the characteristics of data lakes appear finally to make it realisable. The reality, however, is often quite different.

Reconciling the various governance, performance, political and cultural issues may take months or years. In the meantime, more autonomous parts of the organisation will create their own data lakes. These business unit lakes will be optimised for specific workloads, users and skills, and will likely be much more successful than their more aspirational enterprise data lake counterparts.

These smaller data lakes also point to a much larger trend towards the decentralisation of data and analytics within organisations. Implementing an enterprise data lake, by contrast, runs counter to the way many enterprises actually work with data.

Failure 2: Data lake is my data and analytics strategy

A growing trend among data and analytics leaders is to create an overarching data and analytics strategy supporting broader digital business initiatives. Some look to data lakes as a quick substitute for more formal strategy development. Others have an ego-driven perspective: they see data lakes as a means by which to be viewed as thought leaders, or to introduce major change to an enterprise they’ve recently joined.

This scenario typically fails for multiple reasons, including a misunderstanding of what constitutes a data and analytics strategy; a lack of organisational clout or social capital; an underestimation of how immature data management capabilities are; and a misunderstanding of the diverse requirements of a data and analytics platform for digital business.

This scenario is the most difficult to avoid because the decision to pursue it is made largely for political rather than business reasons. Accepting that success is unlikely is the best way to avoid it. You must define a data and analytics strategy. There’s no technology infrastructure that enables you to skip this step.

Failure 3: Infinite data lake

Organisations implementing an infinite data lake believe that all data maintains its original value indefinitely and that data doesn’t depreciate like other enterprise assets. Accordingly, these organisations expect their data lake infrastructure to scale indefinitely.

The value proposition is essentially that organisations no longer have to be concerned with data lifecycle or storage optimisation because the data lake will accommodate any amount of data, now and in the future.

This scenario is challenging to avoid because many people are obsessed with the idea that just having more data will solve their problems and help them seize new opportunities. There’s no doubt that possession of data can confer competitive advantage, but the data must be timely and relevant to current challenges and market opportunities.

Having more data for the sake of it doesn’t deliver beneficial business outcomes; it creates liability.

Nick Heudecker is a VP analyst at Gartner, offering guidance on data infrastructure for operations and analytics, as well as information management strategy. Additionally, he covers real-time analytics, in-memory technologies, and the acquisition and management of open-source software. Nick will be presenting at the Gartner Data & Analytics Summit in Sydney, 18-19 February 2019.