Data Engineering Bootcamp - Series 1
🔥 Your Data Engineering Journey Starts Now!
What You'll Learn
01-Context Setup
This foundational section lays the groundwork for your data engineering journey by introducing you to the Modern Data Stack and how it powers data-driven companies. Using a ride-hailing application use case, you’ll gain a solid understanding of data sources, OLTP systems, and the architecture of modern data platforms. This section ensures you’re well-prepared to dive into the more advanced topics covered in the bootcamp.
Key Features of This Section
Comprehensive Bootcamp Onboarding
Start your journey with a step-by-step onboarding process that provides clear guidance on expectations, resources, and tips to succeed throughout the bootcamp.
Learn the Foundations of Data Engineering
Understand the role of data sources and OLTP systems in the data ecosystem, providing critical context for designing and managing data workflows.
Explore the Modern Data Stack
Get introduced to the Modern Data Stack architecture, its components, and how leading data-driven companies implement it to unlock business value.
Real-World Use Case Integration
Learn through the lens of a ride-hailing application to see how these foundational concepts apply to real-world scenarios.
Foundation for Advanced Learning
Build a solid conceptual understanding of data platforms and pipelines that sets the stage for mastering complex topics such as storage, processing, orchestration, and analytics in later sections.
What Makes This Section Unique
Provides a big-picture view of how data flows through modern organizations.
Focuses on practical relevance by tying every concept to an industry-standard use case.
Empowers students with a clear roadmap to navigate the bootcamp with confidence.
What You Will Take Away
A clear understanding of data sources and OLTP systems as the backbone of data engineering.
Insight into how the Modern Data Stack operates in real-world organizations.
The confidence and preparation needed to tackle advanced topics throughout the bootcamp.
This section ensures you start your bootcamp journey with a strong foundation, helping you connect theoretical concepts to practical applications from the very beginning!
02-Data Lake Essentials
This section of the bootcamp is designed to give you a comprehensive understanding of Data Lake Design through a hands-on, real-world use case of a ride-hailing company's data. By the end of this section, you’ll have mastered essential concepts and practical skills needed to design, manage, and optimize scalable data lakes using AWS S3.
Key Features of This Section
Complete Data Lake Architecture Blueprint
Understand the foundational layers, data partitioning strategies, and file formats for designing scalable and efficient data lakes.
Hands-On Labs with Real-World Scenarios
Perform practical exercises like creating buckets, implementing data partitioning, setting up event notifications, and more using AWS S3.
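To give you a flavor of these labs, here is a minimal Boto3 sketch that creates a bucket and lands a raw file under a Hive-style partitioned prefix. The bucket name, region, and key layout are illustrative placeholders, not the exact resources used in the course.

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")

# Create a bucket for the data lake (outside us-east-1 a LocationConstraint is required).
s3.create_bucket(
    Bucket="ride-hailing-data-lake-demo",
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-1"},
)

# Land a raw bookings file under a Hive-style partitioned prefix (dt=YYYY-MM-DD).
s3.upload_file(
    Filename="bookings_2024-01-01.parquet",
    Bucket="ride-hailing-data-lake-demo",
    Key="raw/bookings/dt=2024-01-01/bookings_2024-01-01.parquet",
)
```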
Schema Evolution and Data Management
Learn to handle schema changes seamlessly while ensuring data consistency and reliability in your data lake.
Robust Data Security and Access Controls
Implement IAM policies, S3 ACLs, and encryption techniques to safeguard your data with enterprise-grade security measures.
Lifecycle Management & Cost Optimization
Gain expertise in managing data lifecycles, S3 storage classes, and backup strategies to optimize costs and maintain efficiency.
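As a taste of what lifecycle management looks like in practice, here is a hedged Boto3 sketch that tiers older raw data to cheaper storage classes and expires it after a retention window. The bucket name, prefix, and day counts are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Transition raw bookings to cheaper storage over time, then expire them after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="ride-hailing-data-lake-demo",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-bookings",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/bookings/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```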
Advanced Data Lake Monitoring and Insights
Use tools like S3 Storage Lens and S3 Metadata to monitor, analyze, and improve the performance of your data lake.
Disaster Recovery Best Practices
Learn strategies for backup and recovery to ensure your data lake is resilient and always available.
Event-Driven Workflows with S3 Events
Set up and trigger workflows using S3 Event Notifications to automate and streamline data processing.
API-Driven Data Lake Operations
Master Boto3 S3 APIs for programmatically managing your data lakes, empowering you to automate operations effectively.
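Tying the last two points together, here is a sketch of using Boto3 to configure an S3 event notification so that new objects under a raw prefix trigger downstream processing. The bucket name and SQS queue ARN are placeholders for whatever you wire up in the labs.

```python
import boto3

s3 = boto3.client("s3")

# Fire an event to an SQS queue whenever a new object lands under raw/bookings/.
s3.put_bucket_notification_configuration(
    Bucket="ride-hailing-data-lake-demo",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:ap-southeast-1:123456789012:raw-bookings-events",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": "raw/bookings/"}]
                    }
                },
            }
        ]
    },
)
```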
What You Will Build
A fully functional data lake for a ride-hailing use case, designed to handle large-scale data ingestion, processing, and analytics with best practices.
Why This Section Stands Out
Combines theory, architecture, and hands-on labs into a seamless learning experience.
Focuses on real-world challenges and equips you with the tools to solve them effectively.
Teaches cost-efficient, secure, and scalable solutions for modern data engineering.
By the end of this section, you'll not only know how to design a data lake but also have the confidence to implement it in a production environment!
03-Data Modeling
This section dives deep into the art and science of data modeling, equipping you with the skills needed to design and implement robust data models for real-world applications. Through hands-on labs and the ride-hailing company use case, you will master the intricacies of dimension modeling, fact modeling, and ETL development, building a strong foundation for creating efficient and scalable data marts.
Key Features of This Section
Master Data Modeling Fundamentals
Learn the core principles of data modeling and the different types of data models, building a theoretical foundation for designing impactful data structures.
Design Star Schema for Real-World Applications
Use a ride-hailing company use case to design a star schema data model, helping you understand how to structure data for analytics and reporting.
Dimension Modeling Techniques
Deep dive into Slowly Changing Dimensions (SCD) with hands-on labs to implement SCD Type 1 and SCD Type 2 dimension tables.
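To make the distinction concrete, here is a compact sketch of the SCD idea using SQL kept in Python strings: Type 2 closes out the current row and inserts a new version when an attribute changes, while Type 1 simply overwrites in place. Table and column names are illustrative, not the exact schema used in the labs, and the SQL would be submitted through whatever engine the lab uses.

```python
# SCD Type 2: preserve history by versioning rows.
SCD2_CLOSE_CURRENT_ROW = """
UPDATE dim_driver
SET    effective_end_date = CURRENT_DATE,
       is_current         = FALSE
WHERE  driver_id = '{driver_id}'
  AND  is_current = TRUE
  AND  city <> '{new_city}'
"""

SCD2_INSERT_NEW_VERSION = """
INSERT INTO dim_driver
       (driver_id, city, effective_start_date, effective_end_date, is_current)
VALUES ('{driver_id}', '{new_city}', CURRENT_DATE, DATE '9999-12-31', TRUE)
"""

# SCD Type 1: no history, just overwrite the attribute.
SCD1_OVERWRITE = """
UPDATE dim_driver
SET    city = '{new_city}'
WHERE  driver_id = '{driver_id}'
"""
```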
Fact Modeling and Data Marts
Learn to design and build fact tables that capture measurable business events and create data marts optimized for business insights.
End-to-End ETL Pipeline Development
Write and execute ETL scripts to load and manage dimension and fact tables, reinforcing your skills in transforming and managing data.
Hands-On Labs for Real-World Experience
Lab 1: Implement SCD Type 1 to manage dimension updates.
Lab 2: Master SCD Type 2 for historical data tracking.
Lab 3: Build and populate fact tables for analytical insights.
What Makes This Section Unique
Focuses on practical, real-world implementations using a relatable business scenario (ride-hailing).
Covers both conceptual and hands-on aspects of data modeling, ensuring a well-rounded learning experience.
Provides a step-by-step approach to designing, implementing, and managing data models and ETL pipelines.
What You Will Take Away
A clear understanding of dimension modeling and fact modeling principles.
The ability to design and implement SCD Type 1 and SCD Type 2 dimension tables.
Expertise in creating star schema data models and data marts for business intelligence.
Practical experience writing ETL scripts to load and transform data into analytical-ready structures.
This section bridges the gap between theoretical knowledge and hands-on application, empowering you to design and implement data models like a pro!
04-Data Quality
This section focuses on the critical importance of data quality in ensuring trustworthy and reliable data pipelines. Leveraging the ride-hailing company use case, you will learn how to implement comprehensive data quality checks for dimension and fact tables, ensuring accuracy, consistency, and reliability in the star schema data model you’ve built in the previous section.
Key Features of This Section
Master Data Quality Fundamentals
Understand the principles of data quality and why it is essential for building trustworthy data pipelines in modern data systems.
Explore Different Types of Data Quality Checks
Learn about various data quality dimensions such as accuracy, completeness, consistency, uniqueness, and timeliness to ensure a 360-degree approach to data validation.
Hands-On Implementation of Data Quality Checks (DQC)
Implement custom data quality checks for both dimension tables and fact tables in a star schema data model, ensuring the integrity of your data pipeline.
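Here is a minimal PySpark sketch of the style of checks this section builds: completeness (no null keys), uniqueness (one current row per natural key), and a simple referential check between fact and dimension. Table and column names are illustrative placeholders, not the course's exact schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dqc-sketch").getOrCreate()

dim_driver = spark.table("dim_driver")
fact_booking = spark.table("fact_booking")

# Completeness: the surrogate key must never be null.
null_keys = dim_driver.filter(dim_driver.driver_key.isNull()).count()
assert null_keys == 0, f"dim_driver has {null_keys} null driver_key values"

# Uniqueness: exactly one current row per natural key.
dupes = (
    dim_driver.filter("is_current = true")
    .groupBy("driver_id")
    .count()
    .filter("count > 1")
    .count()
)
assert dupes == 0, f"dim_driver has {dupes} duplicated current rows"

# Referential integrity: every booking must reference a known driver.
orphans = fact_booking.join(dim_driver, "driver_key", "left_anti").count()
assert orphans == 0, f"fact_booking has {orphans} bookings with unknown driver_key"
```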
Learn About DQC Tools and Data Contracts
Discover the best data quality tools and frameworks available and understand how data contracts can enforce accountability and reliability in data delivery.
Practical Lab Exercises for Real-World Application
Lab 1: Implement data quality checks (DQC) step-by-step for dimension and fact tables, applying the concepts learned to a real-world scenario.
What Makes This Section Unique
Provides a real-world focus by applying data quality checks to the ride-hailing company’s star schema model.
Combines theoretical understanding with practical implementation for a hands-on learning experience.
Highlights the role of data contracts in ensuring reliability in data-driven workflows.
What You Will Take Away
A solid understanding of data quality fundamentals and best practices.
Proficiency in implementing data quality checks for dimension and fact tables in a star schema.
Knowledge of industry-standard DQC tools and their application in modern data pipelines.
Insights into how data contracts create trust and accountability in data systems.
This section empowers you to identify and address data quality issues proactively, ensuring you build data pipelines that are not only scalable but also reliable and error-free!
05-Athena
This section dives deep into AWS Athena, the widely used serverless SQL query engine on top of AWS S3 data lakes, providing you with the skills to query and analyze massive datasets efficiently. Leveraging the ride-hailing company use case, you'll design and query partitioned tables to unlock the full potential of your data lake.
Key Features of This Section
Master AWS Athena Fundamentals
Understand the architecture of AWS Athena and how it integrates with AWS S3 and the Glue Data Catalog to query data lakes seamlessly.
Comprehensive Comparison of PrestoDB, Trino, and Athena
Learn the differences between PrestoDB, Trino, and Athena, helping you choose the right tool for your data lake use cases.
Hands-On SQL Querying with Athena
Explore Data Definition Language (DDL) concepts and create partitioned tables using the AWS Glue Catalog to optimize query performance for large datasets.
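As an example of the kind of DDL covered here, the sketch below registers a partitioned external table in the Glue Data Catalog by submitting the statement through Athena with Boto3. The database, bucket, output location, and columns are illustrative placeholders; in practice you would then register partitions (for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION).

```python
import boto3

athena = boto3.client("athena")

CREATE_BOOKINGS_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_bookings (
    booking_id   string,
    driver_id    string,
    fare_amount  double
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://ride-hailing-data-lake-demo/raw/bookings/'
"""

athena.start_query_execution(
    QueryString=CREATE_BOOKINGS_DDL,
    QueryExecutionContext={"Database": "ride_hailing"},
    ResultConfiguration={"OutputLocation": "s3://ride-hailing-athena-results/"},
)
```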
Workgroup Management for Efficiency
Learn how to manage Athena workgroups for cost control, query optimization, and team collaboration.
Automation with Boto3 APIs
Use Boto3 Athena APIs to programmatically interact with Athena, automate query execution, and build scalable data processing pipelines.
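A typical automation loop looks like the sketch below: submit a query, poll until Athena reports a terminal state, then page through the results. Database, table, and output-location names are illustrative assumptions.

```python
import time
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT dt, count(*) AS bookings FROM raw_bookings GROUP BY dt",
    QueryExecutionContext={"Database": "ride_hailing"},
    ResultConfiguration={"OutputLocation": "s3://ride-hailing-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll the query status until it reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for page in athena.get_paginator("get_query_results").paginate(QueryExecutionId=query_id):
        for row in page["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])
```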
Best Practices for Maximum Performance
Get insights into Athena best practices for partitioning, query optimization, and cost management to ensure efficient and cost-effective usage.
What Makes This Section Unique
Real-world focus on partitioned and non-partitioned table design for the ride-hailing use case, showcasing how to build and query optimized data lake architectures.
Covers both theoretical concepts and practical implementation, offering a balanced and in-depth learning experience.
Automation focus with Boto3 APIs, enabling you to integrate Athena into modern data workflows effortlessly.
What You Will Take Away
A solid understanding of AWS Athena architecture and its integration with S3 Data Lakes.
Skills to create and query partitioned tables for enhanced performance in data lake architectures.
Proficiency in managing Athena workgroups for cost-efficient querying.
Knowledge of Boto3 Athena APIs to automate query executions and scale workflows.
Best practices to ensure cost-efficiency, query optimization, and scalability in Athena.
This section empowers you to harness AWS Athena to query large datasets efficiently, a vital skill for any modern data engineer working with data lakes!
06-Spark
This section provides a comprehensive guide to mastering Apache Spark, the cornerstone of modern data processing. Using the ride-hailing use case, you will learn to design and implement scalable, production-ready data pipelines. From understanding Spark architecture to deploying data pipelines on AWS EMR, this section equips you with the skills needed to process large-scale datasets efficiently.
Key Features of This Section
In-Depth Spark Fundamentals
Understand Spark architecture and its components to build a strong foundation in distributed data processing.
Hands-On PySpark Labs
Explore PySpark APIs and write transformation logic through practical labs, ensuring hands-on experience with real-world data scenarios.
Custom Transformations with UDFs
Learn to write and use User-Defined Functions (UDFs) for handling complex data transformations.
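To illustrate both points, here is a small PySpark sketch in the spirit of these labs: a built-in transformation plus a UDF for logic that DataFrame functions don't cover. Column names and the surge rule are illustrative assumptions, not the course's exact logic.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pyspark-udf-sketch").getOrCreate()

bookings = spark.table("raw_bookings")

# Custom logic wrapped as a UDF: apply a surge multiplier during peak hours.
@F.udf(returnType=DoubleType())
def surge_adjusted_fare(fare, hour):
    if fare is None:
        return None
    multiplier = 1.5 if hour in (8, 9, 18, 19) else 1.0
    return float(fare) * multiplier

enriched = (
    bookings
    .withColumn("pickup_hour", F.hour("pickup_ts"))
    .withColumn("adjusted_fare", surge_adjusted_fare(F.col("fare_amount"), F.col("pickup_hour")))
    .filter(F.col("adjusted_fare") > 0)
)
```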
Real-World Data Pipeline Design
Mimic a real-world transformation for the fact_booking table in a ride-hailing application, covering data quality checks and pipeline design.
Master the WAP Pattern
Implement the WAP (Write, Audit, Publish) pattern, a key principle for designing auditable and scalable data pipelines for enterprise systems.
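The flow itself is simple to sketch, even if the production version has more moving parts. Below is a hedged PySpark outline of Write, Audit, Publish; the paths, table names, and audit rules are illustrative, and the course's actual implementation may differ in detail.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wap-sketch").getOrCreate()

STAGING_PATH = "s3://ride-hailing-data-lake-demo/staging/fact_booking/dt=2024-01-01/"
PUBLISHED_TABLE = "fact_booking"

# 1. WRITE: land the freshly transformed partition in a staging location only.
fact_df = spark.table("stg_bookings_transformed")
fact_df.write.mode("overwrite").parquet(STAGING_PATH)

# 2. AUDIT: run data quality checks against the staged data, never the live table.
staged = spark.read.parquet(STAGING_PATH)
assert staged.count() > 0, "staged partition is empty"
assert staged.filter(staged.booking_id.isNull()).count() == 0, "null booking_id found"

# 3. PUBLISH: only after the audits pass, expose the partition to consumers
#    (in production this is typically an atomic partition swap or metadata update).
staged.write.insertInto(PUBLISHED_TABLE, overwrite=True)
```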
AWS EMR Integration
Gain expertise in creating AWS EMR clusters and running Spark pipelines in a cost-effective, scalable cloud environment.
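For orientation, here is a sketch of spinning up a transient EMR cluster with Boto3, submitting a Spark step, and letting the cluster terminate itself. Instance types and counts, the script path, the release label, and the IAM role names are illustrative assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-1")

response = emr.run_job_flow(
    Name="ride-hailing-fact-booking",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "driver", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "workers", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient cluster: shut down once all steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "fact_booking_transform",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://ride-hailing-code/fact_booking.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```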
Best Practices for Spark Pipelines
Learn industry-standard best practices for optimizing Spark jobs, managing resources, and building production-grade pipelines.
What Makes This Section Unique
Real-world focus: Apply your knowledge to a ride-hailing application use case, mimicking industry-scale challenges.
Comprehensive pipeline training: Learn to design, implement, and deploy end-to-end pipelines, including data quality checks and transformation logic.
Cloud deployment: Gain practical experience with AWS EMR, making you proficient in running Spark workloads in the cloud.
WAP pattern mastery: Learn a proven framework for building scalable and auditable pipelines, a skill in high demand for large-scale systems.
What You Will Take Away
A solid understanding of Apache Spark architecture and its role in the modern data stack.
Hands-on experience with PySpark APIs, UDFs, and designing complex transformation logic.
The ability to implement data pipelines with WAP (Write, Audit, Publish) patterns for scalability and reliability.
Confidence in deploying Spark pipelines on AWS EMR and managing cloud infrastructure.
Knowledge of best practices for building efficient, production-grade Spark pipelines.
This section empowers you to harness the full potential of Apache Spark, enabling you to process large-scale data efficiently and deploy robust, cloud-based pipelines that meet modern business needs.
07-Airflow
This section provides a complete guide to mastering Apache Airflow, the industry-standard tool for orchestrating data pipelines in the modern data stack. Through hands-on labs and real-world examples, you’ll learn to design, schedule, and manage dynamic workflows with Airflow. By the end of this section, you will have the confidence to build robust, production-grade pipelines that include custom Airflow plugins for extended functionality.
Key Features of This Section
Comprehensive Airflow Fundamentals
Deep dive into Airflow architecture, components, and how they interact to orchestrate workflows.
Local Setup and Hands-On Labs
Learn how to set up Airflow locally, ensuring you can practice and experiment with its powerful features.
Work through hands-on labs that cover real-world data ingestion and transformation pipelines.
Mastering DAGs and Tasks
Understand Directed Acyclic Graphs (DAGs), Tasks, Operators, and their role in scheduling and managing workflows.
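Here is a minimal DAG sketch showing the pieces named above: a DAG, tasks built from operators, and a dependency between them. The task logic is a placeholder, and the example assumes a recent Airflow 2.x installation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_bookings(**context):
    print("pretend to pull bookings for", context["ds"])


def load_bookings(**context):
    print("pretend to load bookings for", context["ds"])


with DAG(
    dag_id="bookings_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_bookings", python_callable=extract_bookings)
    load = PythonOperator(task_id="load_bookings", python_callable=load_bookings)

    extract >> load  # load runs only after extract succeeds
```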
Custom Plugin Design for AWS EMR
Build a custom AWS EMR plugin to automate critical tasks such as creating EMR clusters, submitting PySpark jobs, and terminating clusters.
Improve code modularity and reusability with Airflow plugins, a must-have skill for advanced data pipeline design.
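As a rough sketch of what a custom operator inside such a plugin might look like, the example below wraps the Boto3 EMR call behind a reusable Airflow operator. The class name, parameters, defaults, and role names are illustrative, not the exact plugin built in the labs.

```python
import boto3
from airflow.models import BaseOperator


class CreateEmrClusterOperator(BaseOperator):
    """Create a transient EMR cluster and push its ID to XCom for downstream tasks."""

    def __init__(self, cluster_name, release_label="emr-7.1.0", num_workers=2, **kwargs):
        super().__init__(**kwargs)
        self.cluster_name = cluster_name
        self.release_label = release_label
        self.num_workers = num_workers

    def execute(self, context):
        emr = boto3.client("emr")
        response = emr.run_job_flow(
            Name=self.cluster_name,
            ReleaseLabel=self.release_label,
            Applications=[{"Name": "Spark"}],
            Instances={
                "InstanceGroups": [
                    {"Name": "driver", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                    {"Name": "workers", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": self.num_workers},
                ],
                "KeepJobFlowAliveWhenNoSteps": True,
            },
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        cluster_id = response["JobFlowId"]
        self.log.info("Created EMR cluster %s", cluster_id)
        return cluster_id  # returned values are pushed to XCom automatically
```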
End-to-End Pipeline Automation
Automate the entire Spark pipeline process (from the previous section) using Airflow, including cluster creation, job submission, and termination.
Real-World Use Cases
Design and implement a data ingestion pipeline and a data transformation pipeline, both tailored to mimic enterprise-scale workflows.
What Makes This Section Unique
End-to-End Orchestration: Learn to integrate Airflow with tools like AWS EMR and PySpark, automating real-world data pipelines.
Hands-On Plugin Development: Get hands-on experience building a custom AWS EMR plugin, a skill highly sought after in data engineering roles.
Practical Labs: Apply your learning immediately through structured labs, ensuring you understand core Airflow concepts in depth.
Advanced Scheduling Concepts: Go beyond the basics to master task dependencies, retries, parallel execution, and dynamic workflows.
What You Will Take Away
A solid understanding of Airflow architecture and its role in the modern data stack.
The ability to design and implement DAGs for complex workflows using tasks, operators, and dependencies.
Expertise in building custom Airflow plugins for specialized use cases, enabling extended functionality and code reuse.
Real-world experience in automating data pipelines with Airflow, including running Spark jobs on AWS EMR.
Confidence in managing, debugging, and optimizing Airflow workflows for large-scale systems.
This section equips you to orchestrate data workflows seamlessly using Airflow, making you a skilled data engineer ready to tackle complex, production-grade pipelines.
What's included
32 Video Lectures
23 Hands-on Exercises
Community Space for Interaction
Tools Installation Guide
Certification
Andalib Ansari
Hi, I'm Andalib Ansari, a seasoned Data Engineer with over 11 years of experience across the online gaming, ride-hailing, SaaS, and telecom industries.
🔹 At Grab, Singapore, the largest ride-hailing company in Southeast Asia, I designed and developed a centralized Data Warehouse and large-scale data pipelines that processed data across all of Grab's verticals, including bookings, payments, food, and delivery. These pipelines powered analytics and mission-critical dashboards used daily by the CEO and other C-level executives, enabling data-driven decision-making at the highest level.
🔹 At Microgaming, Singapore & Australia, a global leader in online gaming, I led the development of scalable data platforms and pipelines that powered daily executive dashboards, operational and product analytics, and finance-driven reporting for billing and revenue insights. My work ensured data accuracy, efficiency, and strategic decision-making at the highest levels of the organization.
💡 Teaching & Course Creation Experience
Seven years ago, I launched a Big Data & Hadoop course on Udemy that attracted 28,000+ students from 145+ countries. I taught Big Data fundamentals, Hadoop, MapReduce, Hive, and Pig, which were the leading technologies at the time. Over the years, I have expanded my expertise to modern data engineering tools like Apache Spark, Presto, AWS Athena, Redshift, data quality frameworks, and workflow orchestration with Airflow. My journey has also involved leading teams, managing stakeholders, and architecting large-scale data platforms.
🚀 What This Bootcamp Offers You
From architecting cutting-edge data solutions to mentoring aspiring engineers, I have lived and breathed data engineering throughout my career. This bootcamp is your gateway to mastering real-world skills, where I will personally guide you through hands-on projects, industry insights, and practical expertise.
Let’s unlock your potential and shape the future of data together!