For decades, data scientists (née statisticians) have had sandboxes to explore data and find valuable insights. In what seemed like a happy compromise, analysts could quickly load, manipulate, and combine enterprise and industry data in search of new insights and predictions without worry that they would compromise sensitive data or production workflows. While this accelerated the creation of new insights, putting them into production was a nightmare. A bevy of custom code and data created in an ungoverned environment needed to be converted, quality-controlled, and optimized before deployment. It often took the better part of a year for a business to get value from an insight gleaned in a few weeks.
The specter of big data threatened to make the situation worse – in a big way. Now analysts were using data structures and programming languages foreign to IT. The volume and complexity of external data sources were exploding. Without a new approach, insights found in a big data sandbox might never make it into production.
What has emerged is a new paradigm that brings data governance – a term that is anathema to most analysts – to big data. But instead of heavy-handed restrictions on data usage and documentation, big data governance is agile, collaborative and efficient. It engages analysts in capturing their learnings to accelerate production readiness, rather than separating them from the process. Most importantly, it replaces the massive conversion of sandbox data with a “promotion” process that ensures analytics data is made production-ready on the big data platform itself.
Big data governance requires us to rethink governance from the ground up. Instead of physically separating sandbox and production data, big data governance logically controls access and usage as data matures from “raw” to “ready.” How can you tell if data is ready for production? Metadata. Any big data platform supporting production usage must have metadata tracking the lifecycle of data ingestion, validation, preparation and use. The metadata needs to manage data access rights and capture data profiling results along with commentary from data developers and end users. Metadata stores the policies that define production readiness and can enforce them. Without metadata, a data lake becomes a data swamp.
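To make the “raw to ready” idea concrete, here is a minimal sketch of lifecycle-tracking metadata with policy-gated promotion. All names (`DatasetMetadata`, `Stage`, the `has_profile` policy) are hypothetical illustrations, not a real product’s API:

```python
from dataclasses import dataclass, field
from enum import Enum

class Stage(Enum):
    RAW = "raw"
    VALIDATED = "validated"
    PREPARED = "prepared"
    READY = "ready"

@dataclass
class DatasetMetadata:
    name: str
    stage: Stage = Stage.RAW
    access_roles: set = field(default_factory=set)   # who may use the data
    profile: dict = field(default_factory=dict)      # data profiling results
    comments: list = field(default_factory=list)     # developer/user commentary

    def promote(self, to: Stage, policies) -> bool:
        """Advance the dataset's stage only if every readiness policy passes."""
        if all(policy(self) for policy in policies):
            self.stage = to
            return True
        return False

# One readiness policy: data must be profiled before it can be promoted.
def has_profile(md: DatasetMetadata) -> bool:
    return bool(md.profile)

md = DatasetMetadata("orders_2024")
md.profile = {"row_count": 10_000, "null_pct": 0.02}
ok = md.promote(Stage.VALIDATED, [has_profile])
```

The point of the sketch is that promotion is a metadata operation: data never moves between environments; its recorded stage changes once the stored policies are satisfied.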
But for this to be practical, metadata capture must be automated and relevant. A second tenet of big data governance contradicts current dogma: use schemas from the start to enrich metadata. Most business data is structured, whether it’s relational tables, log files, XML, or mainframe copybooks. That structure can be used to automatically assess the quality, completeness, and content of raw data. This not only gives analysts insight into the data but also establishes a metadata foundation to build on.
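A minimal sketch of what schema-driven profiling might look like: given a declared schema, raw records can be scanned automatically for completeness and type conformity. The schema, records, and `profile` function here are invented for illustration:

```python
# Hypothetical schema: column name -> expected Python type.
schema = {"order_id": int, "amount": float, "order_date": str}

# Raw records as they might arrive in the lake, defects included.
records = [
    {"order_id": 1,   "amount": 19.99, "order_date": "2024-01-03"},
    {"order_id": 2,   "amount": None,  "order_date": "2024-01-04"},
    {"order_id": "x", "amount": 5.00,  "order_date": "2024-01-05"},
]

def profile(records, schema):
    """Assess each column's completeness (nulls) and type conformity."""
    stats = {col: {"nulls": 0, "type_errors": 0} for col in schema}
    for rec in records:
        for col, expected in schema.items():
            value = rec.get(col)
            if value is None:
                stats[col]["nulls"] += 1
            elif not isinstance(value, expected):
                stats[col]["type_errors"] += 1
    return stats

result = profile(records, schema)
```

Running this flags one null `amount` and one mistyped `order_id` – exactly the kind of profiling result that belongs in the metadata repository from day one.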
The third principle of big data governance is scorecard-driven prioritization. Not all data needs strict governance over quality and access. In fact, the assumption is that most raw data loaded won’t be used – so enriching its metadata is a waste of time. Instead, scorecards are created for the various uses of the data: compliance reporting, marketing analytics, supply chain analysis, and so on. Some policies apply to all scorecards (PII data must be masked), while others are very specific (data lineage is required for all compliance reports). With a metadata foundation, scorecards are easy to create for any data set. These scorecards are then used to identify and prioritize governance efforts to make the most important data production-ready.
Where to start? If you have a data lake with poor metadata, I recommend starting with an assessment of the quality and content of your existing assets. Automated tools can populate a metadata repository as a foundation for creating scorecards. Making the content and quality of the lake transparent is the first step toward big data governance.