More than 250 data management professionals dialed in to our October webinar on Data Lake Architecture. The webinar is part of the Data Insights & Analytics series, which is produced in partnership with DATAVERSITY and takes place on the first Thursday of each month.
First San Francisco Partners’ John Ladley and Kelle O’Neal were the webinar presenters. They shared their perspectives on data lake architecture and started the call with a poll, asking the audience if their organization had a data lake. About 20% of the webinar attendees said they did, and a third said their organization didn’t have a data lake.
Here are high-level takeaways from the webinar, but be sure to read on to learn how you can get the full webinar recording and our presentation material.
Benefits of a Data Lake
- The data lake enables “production-izing” of advanced analytics, making them readily available and acceptable.
- It offers a flexible environment and cost-effective scalability, which can reduce long-term cost of ownership.
- The data lake can deliver meaningful value for the organization — e.g., to be more competitive, increase productivity, reduce costs, etc.
Risks of a Data Lake
- If the data lake is a disorganized “swamp” that can’t be used or managed effectively, there may be a loss of trust in its analytics.
- Without adequate security and access controls, a data lake may increase organizational risk — and this may impact data privacy and increase compliance risk/exposure.
- There are long-term costs associated with a data lake, which may come under scrutiny if the organization isn’t deriving benefits from it.
Data Lake Reference Architecture
- The modern data lake is more robust than early iterations, due to increased storage availability, new data management tools and the ease with which data can be managed.
- Today’s data lake architecture includes these components:
  - Landing Zone – area closest to the original conception of a data lake, where raw data is stored and is available for consumption
  - Standardization Zone – for standardized, cleaned data, which is preferred for downstream consumers and the Analytics Sandbox
  - Analytics Sandbox – where Data Scientists work to create new analytical models
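To make the flow between the three zones concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not something shown in the webinar: the `DataLake` class, its method names and the `customer` and `amount` fields are all hypothetical, and a real lake would use object storage or a distributed file system rather than Python lists.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy lake with the three zones described above (hypothetical design)."""
    landing: list = field(default_factory=list)       # raw payloads, stored as received
    standardized: list = field(default_factory=list)  # cleaned, consistently typed records
    sandbox: list = field(default_factory=list)       # working copies for data scientists

    def ingest(self, raw_line: str) -> None:
        # Landing Zone: keep the raw data untouched.
        self.landing.append(raw_line)

    def standardize(self) -> None:
        # Standardization Zone: parse, trim and normalize types.
        # The field names are assumptions made for this example.
        self.standardized = []
        for raw_line in self.landing:
            record = json.loads(raw_line)
            self.standardized.append({
                "customer": str(record["customer"]).strip().lower(),
                "amount": float(record["amount"]),
            })

    def checkout_to_sandbox(self) -> None:
        # Analytics Sandbox: copy records so experiments do not alter standardized data.
        self.sandbox = [dict(r) for r in self.standardized]


lake = DataLake()
lake.ingest('{"customer": " ACME ", "amount": "12.50"}')
lake.ingest('{"customer": "Globex", "amount": 7}')
lake.standardize()
lake.checkout_to_sandbox()
lake.sandbox[0]["amount"] = 0.0  # an experiment that stays inside the sandbox
```

Because the sandbox works on copies, the experiment in the last line leaves both the raw landing data and the standardized records unchanged.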
The Lab and the Factory
- A key to successful data lake management is understanding whether the environment is a Lab, a Factory or a combination of the two.
- Lab characteristics:
  - Allows for experimentation, testing new models and proofs of concept
  - Offers a more flexible architecture, even an ad hoc or non-persistent environment
  - Rarely documented
  - Schema on read
  - Run by main users or departments and is more informal
  - By nature, a lab’s results should be evaluated for relevance
- Factory characteristics:
  - Addresses specific requirements and produces regular outputs associated with a business service, product or action
  - Architecture needs to be defined, so its use and limits are understood
  - Published rules of engagement
  - Data quality is monitored and known
  - Lineage and metadata support navigation and use of content
  - May need scheduled access and loading
  - Publishing results will require some form of quality control and approval
  - Models executed on a scheduled basis will require administrative and maintenance capabilities
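The contrast between a Lab's schema on read and a Factory's monitored data quality can be sketched in a few lines of Python. This is a hedged illustration rather than the presenters' method: the sample events, the `lab_read` and `factory_publish` functions and the `FACTORY_SCHEMA` contract are all hypothetical.

```python
import json

# Hypothetical raw events, as they might sit in a landing area.
RAW_EVENTS = [
    '{"user": "a1", "clicks": "3"}',
    '{"user": "b2", "clicks": 5, "referrer": "ad"}',
    '{"user": "c3"}',
]


def lab_read(raw_lines, fields):
    """Lab style (schema on read): pick fields at query time; missing values become None."""
    return [{f: json.loads(line).get(f) for f in fields} for line in raw_lines]


# Hypothetical published contract for a Factory output.
FACTORY_SCHEMA = {"user": str, "clicks": int}


def factory_publish(raw_lines, schema=FACTORY_SCHEMA):
    """Factory style: validate each record and report a quality ratio with the output."""
    accepted, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        # A record passes only if every declared field is present with the declared type.
        ok = all(isinstance(record.get(name), kind) for name, kind in schema.items())
        if ok:
            accepted.append({name: record[name] for name in schema})
        else:
            rejected.append(record)
    quality = len(accepted) / len(raw_lines) if raw_lines else 0.0
    return accepted, rejected, quality


exploration = lab_read(RAW_EVENTS, ["user", "referrer"])
accepted, rejected, quality = factory_publish(RAW_EVENTS)
```

In this sketch the Lab reader tolerates any record shape, which suits experimentation, while the Factory function accepts only the one record that meets the contract and exposes the share of accepted records as a simple, known quality measure.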
Base Environment for Batch Analytics, Streaming and Real-time Data
- Foster an effective data supply chain — getting the right, quality data to where it’s supposed to go.
- Plan for rapid ingestion of data, as latencies are always being driven down.
- Build for flexibility, so you encourage experimentation that doesn’t pollute the lake.
Critical Governance Components for Data Lakes
- Consider the lake environment — is it a Lab, a Factory or both — as you manage governance needs.
- A Lab’s flexibility is its hallmark, but governance is required to ensure appropriate use.
- With a Factory, governance ensures its operational use, its legitimacy, compliance and alignment with business needs.
- As with most business functions, developing processes, training staff and monitoring on an ongoing basis to ensure compliance are critical.
Download Data Lake Architecture Webinar
Here, we highlighted key parts of John and Kelle’s Data Lake Architecture presentation. To learn more from the session, listen to the full recording in DATAVERSITY’s on-demand library. You can also download the presentation material from DATAVERSITY’s SlideShare page.
Up Next Month: Our Data Viz Webinar
We’re looking forward to the November 2 Data Insights & Analytics webinar, Keys to Effective Visualization, and hope you can join us. Learn more and RSVP.
Want to stay in the loop on future FSFP webinars? Let us know!