GEF, GRC Data Lakes and Hadoop

We get a lot of questions from our Members about our Governance Execution Framework (GEF) and our Federated GRC Data Lake capabilities. These brand names are intimately tied to the world of Hadoop, big data, the Internet of Things, predictive risk analytics, anti-fraud, cyber security, insider threat analysis and many other GRC-specific areas of concern. Whenever I start to describe these major GRC initiatives, I get an immediate red flag in the form of a question: what is Hadoop? So, I thought I would devote a number of blog posts to these topics. Here we go... let's start with Hadoop. This is an appropriate starting point, because Hadoop has just reached its 10th anniversary since it was conceived.
Note: Our GRC Sphere federated data management architecture capitalizes on both Google BigQuery and Amazon Redshift. We are ready to help you with your GRC Data Lake project!


Wikipedia defines Hadoop as follows: Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed. This approach takes advantage of data locality — nodes manipulating the data they have access to — to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
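To make the map/shuffle/reduce flow concrete, here is a minimal local simulation of the MapReduce model using the classic word-count example. This is an illustrative sketch only, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample "splits" are invented for this post, and the real framework performs the shuffle across the cluster rather than in memory.

```python
from collections import defaultdict

def map_phase(document):
    # "map" step: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle step: group all emitted values by key, as Hadoop does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # "reduce" step: sum the counts for each word
    return (key, sum(values))

# Two toy "input splits" standing in for HDFS blocks on different nodes
splits = ["Hadoop splits files into blocks",
          "Hadoop distributes blocks across nodes"]

mapped = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"])  # 2
```

In a real cluster, each mapper would run on the node holding its block (data locality), and only the much smaller intermediate (word, count) pairs would travel over the network.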


The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and
Hadoop MapReduce – an implementation of the MapReduce programming model for large scale data processing.
The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and the Google File System.
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program.
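As a sketch of what Hadoop Streaming looks like in practice, the word-count example below implements the "map" and "reduce" parts in Python, reading from standard input and writing tab-separated records the way Streaming expects. The file name (`wordcount.py`) and the command-line layout are assumptions for illustration; consult the Hadoop Streaming documentation for your distribution's exact invocation.

```python
import sys
from itertools import groupby

def mapper(lines):
    # "map" part: emit one tab-separated "word<TAB>1" record per word,
    # which Hadoop Streaming reads from the script's stdout
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase, so
    # records for the same word arrive contiguously on stdin
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # The same script serves as both phases, e.g. (hypothetical paths):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /data/in -output /data/out \
    #     -mapper "python wordcount.py map" \
    #     -reducer "python wordcount.py reduce"
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if phase == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

Because Streaming only cares about stdin and stdout, the same pattern works in Ruby, Perl, shell, or any other language available on the cluster nodes.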
Category: GRC Data Lake

