GRC Data Lake

In our ongoing blog series covering our Governance Execution Framework (GEF), we have introduced GRC analytical models, Hadoop, and federated data management configurations for GRC applications, with a specific focus on industry benchmarking. This post digs a bit deeper and gives an overview of the concept of the GRC Data Lake.
As Wikipedia reports, a data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs".
The term was coined by James Dixon, Pentaho chief technology officer. Dixon used the term initially to contrast with "data mart", which is a smaller repository of interesting attributes extracted from the raw data. He wrote: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." Dixon argued that data marts have several inherent problems, and that data lakes are the optimal solution.
Dixon identified two shortcomings of data marts: "Only a subset of the attributes are examined, so only pre-determined questions can be answered." and "The data is aggregated so visibility into the lowest levels is lost." These problems are often referred to as "siloing" and, in agreement with Dixon, PricewaterhouseCoopers says that data lakes could "put an end to data silos". In its study on data lakes, PwC notes that "Enterprises across industries are starting to extract and place data for analytics into a single, Hadoop based repository." It cites organizations such as UC Irvine Medical Center, Google and Facebook that have embraced the data lake concept.
One example of a data lake is the distributed file system used in Apache Hadoop (HDFS). Many companies also use cloud storage services such as Amazon S3. Academic interest in the concept of data lakes is growing. For instance, Personal DataLake is an ongoing project at Cardiff University to create a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.
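The two shortcomings Dixon describes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record fields (business_unit, control, status, owner, system), the sample values, and the helper names build_data_mart and query_lake are hypothetical and do not come from any GRC product or from Dixon's post.

```python
from collections import defaultdict

# Hypothetical raw GRC control-test records, as they might land in a data lake.
RAW_EVENTS = [
    {"business_unit": "Retail", "control": "AC-2", "status": "fail", "owner": "alice", "system": "crm"},
    {"business_unit": "Retail", "control": "AC-2", "status": "pass", "owner": "bob", "system": "pos"},
    {"business_unit": "Retail", "control": "CM-6", "status": "fail", "owner": "alice", "system": "crm"},
    {"business_unit": "Treasury", "control": "AC-2", "status": "pass", "owner": "carol", "system": "ledger"},
]


def build_data_mart(events):
    """Data-mart style: keep two attributes and aggregate failures per business unit."""
    failures = defaultdict(int)
    for event in events:
        # Only business_unit and status survive; owner, system and control are dropped.
        failures[event["business_unit"]] += 1 if event["status"] == "fail" else 0
    return dict(failures)


def query_lake(events, **criteria):
    """Data-lake style: filter the raw records on any attribute, decided at query time."""
    return [e for e in events if all(e.get(k) == v for k, v in criteria.items())]


mart = build_data_mart(RAW_EVENTS)

# The mart can answer the pre-determined question "how many failures per unit?"
# but not "which system failed?", because that attribute was never loaded.
failing_on_crm = query_lake(RAW_EVENTS, system="crm", status="fail")
```

In this sketch the mart can only report failure counts per business unit, while the raw records still support questions nobody planned for, such as which controls failed on a particular system.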
As the Spring issue of Big Data Quarterly magazine reports in its article "Designing the Data Lake for Faster Time to Value", the data lake is a major opportunity. From our own GRC perspective, we would add that it is not only a major opportunity but one with great potential to address the strategic shareholder value of GRC convergence (i.e., enterprise integration where GRC best practices are baked into the everyday business model). Data has become the lifeblood of nearly every industry-leading company. But the ability to turn this data into valuable industry insights, with the right data delivered at the right time, is what separates industry leaders from laggards. The data lake pattern has developed as a means to economically harness and derive value from exploding data volume and variety. New data sources such as web, mobile and connected devices, along with new forms of analytics such as text, graph, and pathing, have necessitated a new data lake design pattern.
John O'Brien has been a strong influence on how our Members plan and strategize their own GRC data lake deployments. John is Principal Analyst and CEO at Radiant Advisors, and he has shared data lake insights drawn from his analysis of a number of early adopters. His advice is to focus on three high-level business objectives to get started:
1.) Decide how you are going to organize data in your data lake.
2.) Determine how to unify workloads in the data lake.
3.) Shift your mindset from a "current project" to a long-term perspective.
We will explore these objectives in future blog posts, so stay tuned via our RSS feed.
Category: GRC Data Lake

