Chemical Distribution Company

The Challenge

Timeliness of Data

  1. No ability to process real-time data
  2. Multiple data hops
  3. Sequential data load process

Source of Truth

  1. The same data is copied to multiple servers
  2. Each business unit applies its own logic, calculations, and transformations

Self-Service BI Solution and Support

  1. Limited self-service capability and supporting framework

The Solution

An AWS data lake solution can help organizations reduce costs, improve efficiency, boost productivity, and improve customer acquisition and retention by enabling the following capabilities:

  1. Provides a single source of truth for all data needs, with tighter data integrity, improved accuracy, and reduced data redundancy
  2. Scales to high data volumes in a cost-effective manner
  3. Provides advanced analytics capabilities:
    A. Supports different types of analytics, such as machine learning, ad hoc queries, big data analytics, full-text search, and real-time analytics, over multiple data sources stored in the data lake
    B. Generates insights, including reporting on historical data and predictive analytics that use machine learning models for forecasting, recommendations, etc.
  4. Allows roles across the organization, such as data scientists, data analysts, and business analysts, to access data with their choice of analytic tools

 

The AWS data lake architecture is built on the following key considerations:

 

  • S3 acts as the data hub for serving data
  • Glue crawlers scan S3 objects to generate schema definitions in the Glue Data Catalog, which integrates with EMR, Athena, and Redshift Spectrum
  • DynamoDB stores S3 object index values as well as process control metadata
  • Lambda and EMR serve as the data integration layer, delivering data to data marts and targets such as RDS, Aurora, SQL Server, Redshift, and SFDC
  • Athena is used for in-place queries on S3 data
  • S3 data is organized so that it is readily accessible to machine learning services such as SageMaker
  • CloudWatch and CloudTrail provide audit and logging
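
The DynamoDB bullet above describes an index of S3 objects together with process control metadata. As a minimal sketch of what one such index record might look like, the Python below turns an S3 event notification (the shape a Lambda function receives) into plain dictionaries. The attribute names (`object_key`, `layer`, `process_status`), the `PENDING` status, and the bucket name are illustrative assumptions, not details from the case study. Writing the items to a table is deliberately left out so the sketch runs without AWS access.

```python
from urllib.parse import unquote_plus


def build_index_items(event):
    """Map each record of an S3 event notification to an index item."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        items.append({
            "bucket": s3["bucket"]["name"],
            "object_key": key,
            # Assumption: the first path segment names the data lake layer.
            "layer": key.split("/", 1)[0],
            "size_bytes": s3["object"].get("size", 0),
            "event_time": record.get("eventTime"),
            # Hypothetical process-control field for downstream jobs.
            "process_status": "PENDING",
        })
    return items


# Hand-written sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [{
        "eventTime": "2020-01-01T00:00:00.000Z",
        "s3": {
            "bucket": {"name": "example-data-lake"},
            "object": {"key": "raw/sfdc/accounts+2020.csv", "size": 1024},
        },
    }]
}

items = build_index_items(sample_event)
print(items[0]["object_key"], items[0]["layer"], items[0]["process_status"])
```

In a real deployment the returned dictionaries would be persisted by the Lambda function, and later jobs would update the status field as each object moves through the pipeline.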

 

Data flows through the following layers on S3:

  1. Raw: This layer contains the “as-is” data from various sources such as SFDC, SAP, and files.
  2. Semi-Transformed: This layer is created by applying light modelling to the raw data, which involves light transformations and corporate-level naming conventions.
  3. Transformed: In this layer the data is transformed according to business unit or functional requirements, and it serves as the access point for each unit's data sets.
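
One way to picture the three layers is as top-level prefixes in the S3 key space. The sketch below is only an illustration under stated assumptions: the prefix names, the `source/dataset/load_date=` layout, and the helper names `object_key` and `promote` are hypothetical and do not come from the case study.

```python
from datetime import date
from enum import Enum


class Layer(Enum):
    """The three data lake layers, mapped to assumed S3 prefixes."""
    RAW = "raw"
    SEMI_TRANSFORMED = "semi-transformed"
    TRANSFORMED = "transformed"


def object_key(layer, source, dataset, load_date, filename):
    """Build an S3 object key under the given layer (assumed layout)."""
    return (
        f"{layer.value}/{source}/{dataset}/"
        f"load_date={load_date.isoformat()}/{filename}"
    )


def promote(key, target):
    """Return the same key relocated under another layer's prefix."""
    _, rest = key.split("/", 1)
    return f"{target.value}/{rest}"


raw_key = object_key(Layer.RAW, "sfdc", "accounts", date(2020, 1, 1), "part-0000.csv")
semi_key = promote(raw_key, Layer.SEMI_TRANSFORMED)
print(raw_key)
print(semi_key)
```

Keeping the path below the layer prefix identical makes it easy to trace a data set from its raw form to its transformed form. A transformed layer organized by business unit would likely need its own segment in the key, which this sketch does not model.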

The Outcome