BuildCPG Labs

A modern, scalable data engineering platform for managing multiple independent labs using dbt, DuckDB, and Python.

Overview

BuildCPG Labs enables you to:

  • Run independent labs - Each lab has its own database, data, and Python environment
  • Share utilities - Common code used by all labs without duplication
  • Scale easily - Create new labs in minutes using templates
  • Maintain quality - Built-in data inspection and automated quality checks
  • Work efficiently - Per-lab virtual environments with standardized workflows

Architecture

graph TB
    subgraph "BuildCPG Labs Platform"
        S[shared/<br/>Utilities & Templates]

        subgraph "Lab 1 - Sales Performance"
            L1[dbt Models]
            L1D[DuckDB Database]
            L1E[lab1_env]
        end

        subgraph "Lab 2 - Market Sentiment"
            L2[dbt Models]
            L2D[DuckDB Database]
            L2E[lab2_env]
        end

        subgraph "Lab 3 - Customer Segmentation"
            L3[dbt Models]
            L3D[DuckDB Database]
            L3E[lab3_env]
        end
    end

    S -.->|Shared Utils| L1
    S -.->|Shared Utils| L2
    S -.->|Shared Utils| L3

    style S fill:#e0e7ff
    style L1 fill:#dbeafe
    style L2 fill:#dcfce7
    style L3 fill:#fef3c7

Key Features

Multi-Lab Architecture with Per-Lab Environments

Each lab is completely independent:

buildcpg-labs/
├── shared/                     # Reusable utilities (ALL labs)
│   ├── utils/                  # DataInspector, CSVMonitor
│   └── config/                 # Central configuration
│
├── lab1_sales_performance/     # Lab 1 (Independent)
│   ├── lab1_env/               # Own virtual environment
│   ├── data/                   # Own database
│   └── dbt/                    # Own models
│
├── lab2_market_sentiment/      # Lab 2 (Independent)
│   ├── lab2_env/               # Own virtual environment
│   ├── data/                   # Own database
│   └── dbt/                    # Own models
│
└── lab3_customer_segmentation/ # Lab 3 (Independent)
    ├── lab3_env/               # Own virtual environment
    ├── data/                   # Own database
    └── dbt/                    # Own models

Architecture Benefits:

  • ✅ Complete isolation - Each lab has a dedicated environment
  • ✅ Independent data - Each lab has its own DuckDB database
  • ✅ Dependency freedom - Labs can use different package versions
  • ✅ No conflicts - Work on multiple labs simultaneously
  • ✅ Shared utilities - Common code available to all labs

Current Labs

Lab 1: Sales Performance Analysis

Status: ✅ Active
Purpose: Analyze sales data with medallion architecture (Bronze → Silver → Gold)
Features:

  • Sales data processing
  • Performance metrics
  • Trend analysis

Documentation: See Lab 1 Overview


Lab 2: Market Sentiment Analysis

Status: ✅ Active Development
Purpose: Real-time CPG brand reputation monitoring
Features:

  • Reddit & News sentiment ingestion
  • 5 dbt models with incremental processing
  • 14 automated data quality tests
  • Anomaly detection via z-scores
  • Daily sentiment aggregations
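
To make the "anomaly detection via z-scores" feature concrete, here is a minimal Python sketch of the idea. It is not the Lab 2 implementation (which lives in dbt models); the function name `zscore_anomalies`, the sample values, and the threshold of 2.0 are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=2.0):
    """Return (index, value, z) for points whose |z-score| exceeds threshold.

    Hypothetical helper: the threshold and the use of the population
    standard deviation are illustrative choices, not project settings.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no spread, so nothing can be an outlier.
        return []
    scored = ((i, v, (v - mu) / sigma) for i, v in enumerate(values))
    return [(i, v, z) for i, v, z in scored if abs(z) > threshold]


# Made-up daily sentiment averages with one obvious spike at the end.
daily_sentiment = [0.10, 0.12, 0.09, 0.11, 0.10, 0.95]
for index, value, z in zscore_anomalies(daily_sentiment):
    print(f"day {index}: value={value:.2f} z={z:.2f}")
```

With very short series the largest possible z-score is small, so a lower threshold (or a longer history window) is needed before anything gets flagged.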

Key Metrics:

  • 📊 800 sentiment events
  • 🧪 14/14 tests passing
  • 🎯 5 CPG brands monitored
  • ⚡ ~3 second build time

Technology Stack:

graph LR
    A[Python 3.11] --> B[dbt 1.7.0]
    B --> C[DuckDB 0.9.1]
    C --> D[Data Quality Tests]

    style A fill:#3b82f6
    style B fill:#10b981
    style C fill:#f59e0b
    style D fill:#8b5cf6

Documentation:

  • Lab 2 Overview - Architecture & features
  • Lab 2 Setup - Installation guide
  • Lab 2 Data Models - Complete model reference
  • Lab 2 Troubleshooting - Issue solutions
  • Lab 2 Quick Reference - Common commands


Lab 3: Customer Segmentation

Status: 📋 Planned
Purpose: Customer behavior analysis and segmentation
Features: (Coming soon)

Platform Statistics

| Metric | Value |
|---|---|
| Active Labs | 2 |
| Total Data Models | 10+ |
| Data Quality Tests | 20+ |
| Shared Utilities | 5 |
| Documentation Pages | 15+ |

Quick Start

1. Clone Repository

git clone https://github.com/narensham/buildcpg-labs.git
cd buildcpg-labs

2. Work with Lab 2 (Example)

# Navigate to lab
cd lab2_market_sentiment

# Create virtual environment
python3 -m venv lab2_env
source lab2_env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Setup dbt
cd dbt
dbt deps
dbt debug

# Generate sample data
cd ..
python pipelines/ingest_sentiment.py

# Run pipeline
cd dbt
dbt build

3. Verify Success

# All tests should pass
dbt test

# Expected output:
# ✅ PASS=14 WARN=0 ERROR=0 SKIP=0 TOTAL=14
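
If you want to check that summary from a script rather than by eye, the sketch below parses the counters out of captured dbt output. The helper names `parse_dbt_summary` and `all_tests_passed` are hypothetical, and the parser only assumes the `KEY=number` shape shown in the expected output above; other dbt versions may print additional counters, which this sketch ignores.

```python
import re

# Matches the counters in a summary line such as
# "PASS=14 WARN=0 ERROR=0 SKIP=0 TOTAL=14".
_COUNTER_RE = re.compile(r"\b(PASS|WARN|ERROR|SKIP|TOTAL)=(\d+)")


def parse_dbt_summary(output: str) -> dict[str, int]:
    """Extract the summary counters from captured dbt output (hypothetical helper)."""
    counts = {key.lower(): int(value) for key, value in _COUNTER_RE.findall(output)}
    if "total" not in counts:
        raise ValueError("no dbt summary line found in output")
    return counts


def all_tests_passed(summary: dict[str, int]) -> bool:
    """True when nothing errored and every counted test passed."""
    return summary.get("error", 0) == 0 and summary.get("pass", 0) == summary["total"]


summary = parse_dbt_summary("PASS=14 WARN=0 ERROR=0 SKIP=0 TOTAL=14")
print(summary, all_tests_passed(summary))
```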

Platform Structure

buildcpg-labs/
│
├── shared/                          # Shared utilities
│   ├── utils/
│   │   ├── data_inspector.py       # Database inspection
│   │   ├── csv_monitor.py          # Data change detection
│   │   └── config_loader.py        # Configuration management
│   ├── config/
│   │   ├── labs_config.yaml        # Lab registry
│   │   └── paths.py                # Path helpers
│   └── templates/                  # Lab templates
│
├── lab1_sales_performance/         # Independent Lab 1
│   ├── lab1_env/                   # Virtual environment
│   ├── data/
│   │   ├── raw/                    # Source data
│   │   └── lab1.duckdb             # Database
│   ├── dbt/
│   │   ├── models/                 # Transformations
│   │   ├── tests/                  # Quality tests
│   │   └── profiles.yml            # DB connection
│   ├── pipelines/                  # Data ingestion
│   └── requirements.txt            # Dependencies
│
├── lab2_market_sentiment/          # Independent Lab 2
│   ├── lab2_env/                   # Virtual environment
│   ├── data/
│   │   ├── raw/                    # Source data
│   │   └── lab2.duckdb             # Database
│   ├── dbt/
│   │   ├── models/
│   │   │   ├── staging/            # Bronze layer
│   │   │   ├── intermediate/       # Silver layer
│   │   │   └── mart/               # Gold layer
│   │   ├── macros/                 # Reusable SQL
│   │   ├── tests/                  # Quality tests
│   │   └── schema.yml              # Contracts
│   ├── pipelines/
│   │   └── ingest_sentiment.py     # Data generation
│   └── requirements.txt            # Dependencies
│
├── docs/                           # Documentation (MkDocs)
│   ├── architecture/
│   ├── getting-started/
│   ├── labs/
│   │   ├── lab1-*.md
│   │   └── lab2-*.md               # Lab 2 documentation
│   └── utilities/
│
├── .github/
│   └── workflows/
│       └── docs.yml                # Auto-deploy docs
│
├── mkdocs.yml                      # Documentation config
└── README.md
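
The tree lists `csv_monitor.py` as "data change detection" for files under each lab's `data/raw/`. One common way to detect such changes is to fingerprint each file and compare against the previous run. The sketch below shows that approach using only the standard library; it is an illustration of the technique, not the actual `CSVMonitor` code, and the function names are assumptions.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(raw_dir: Path, previous: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Compare CSV fingerprints in raw_dir with a previous snapshot.

    Returns the names of new or modified files plus the fresh snapshot,
    which the caller can persist and pass back in on the next run.
    """
    current = {p.name: file_fingerprint(p) for p in sorted(raw_dir.glob("*.csv"))}
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    return changed, current
```

A typical use would be to call `changed_files(Path("lab2_market_sentiment/data/raw"), snapshot)` before a dbt run and skip the run when the returned list is empty.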

Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| Database | DuckDB 0.9.1 | Embedded, Mac-compatible, no Docker |
| Transformation | dbt 1.7.0 | Data modeling & testing |
| Language | Python 3.11+ | Scripting & ingestion |
| Testing | dbt_expectations | Data quality validation |
| Documentation | MkDocs Material | Auto-generated docs |
| CI/CD | GitHub Actions | Automated deployments |
| Orchestration | Manual (Airflow planned) | Pipeline scheduling |

Design Principles

1. Lab Independence

graph LR
    L1[Lab 1] -.->|Uses| S[Shared Utils]
    L2[Lab 2] -.->|Uses| S
    L3[Lab 3] -.->|Uses| S

    L1 x--x L2
    L2 x--x L3
    L3 x--x L1

    style S fill:#e0e7ff
    style L1 fill:#dbeafe
    style L2 fill:#dcfce7
    style L3 fill:#fef3c7

  • Each lab has its own database
  • Each lab has its own environment
  • Labs don't interfere with each other

2. Medallion Architecture

All labs follow the Bronze → Silver → Gold pattern:

  • Bronze (Staging): Raw data, minimal transformation
  • Silver (Intermediate): Cleaned, enriched, business logic
  • Gold (Marts): Analytics-ready aggregates
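
In the labs these layers are dbt SQL models, but the layering itself can be shown in a few lines of Python. The sketch below is only an illustration: the field names `brand` and `score`, the cleaning rules, and the function names are assumptions, not the project's schema.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Silver: clean raw rows (drop blanks, normalise brand names, cast scores)."""
    silver = []
    for row in bronze_rows:
        brand = (row.get("brand") or "").strip()
        if not brand:
            continue  # rows without a brand cannot be attributed
        try:
            score = float(row["score"])
        except (KeyError, TypeError, ValueError):
            continue  # unparseable scores are dropped in this sketch
        silver.append({"brand": brand.title(), "score": score})
    return silver


def to_gold(silver_rows):
    """Gold: analytics-ready aggregate, here the average score per brand."""
    scores_by_brand = defaultdict(list)
    for row in silver_rows:
        scores_by_brand[row["brand"]].append(row["score"])
    return {brand: sum(s) / len(s) for brand, s in scores_by_brand.items()}


# Bronze: raw records exactly as ingested, including messy values.
bronze = [
    {"brand": " acme ", "score": "0.5"},
    {"brand": "ACME", "score": "1.5"},
    {"brand": "", "score": "0.9"},
    {"brand": "Zest", "score": "n/a"},
]
print(to_gold(to_silver(bronze)))
```

Keeping the cleaning step separate from the aggregation means a bad raw value is handled once, in Silver, and every Gold model can trust its input.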

3. Data Quality First

  • Automated tests on every run
  • Contract enforcement where needed
  • Quality flags and validation
  • Comprehensive test coverage

4. Documentation Driven

  • Every lab fully documented
  • Architecture diagrams
  • Troubleshooting guides
  • Quick reference cards
  • Auto-deployed to GitHub Pages

Workflow Example (Lab 2)

sequenceDiagram
    participant D as Developer
    participant P as Python Script
    participant CSV as CSV Files
    participant DBT as dbt
    participant DB as DuckDB
    participant T as Tests

    D->>P: python ingest_sentiment.py
    P->>CSV: Generate data
    D->>DBT: dbt run
    DBT->>CSV: Read raw data
    DBT->>DB: Transform (5 models)
    D->>DBT: dbt test
    DBT->>T: Run 14 tests
    T-->>D: ✅ All pass
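
The same sequence can be scripted. The sketch below chains the three commands from the diagram with `subprocess.run` and stops at the first failure. It assumes the lab's virtual environment is already active (so `python` and `dbt` resolve to the lab's own installs); the function name `run_lab2_workflow` and the injectable `runner` parameter are illustrative choices, not part of the repository.

```python
import subprocess
from pathlib import Path


def run_lab2_workflow(lab_dir: Path, runner=subprocess.run):
    """Run ingestion, dbt run, and dbt test in order (hypothetical helper).

    `runner` defaults to subprocess.run; passing a stub makes the
    sequence easy to test without dbt installed.
    """
    dbt_dir = lab_dir / "dbt"
    steps = [
        (lab_dir, ["python", "pipelines/ingest_sentiment.py"]),  # generate CSV data
        (dbt_dir, ["dbt", "run"]),                               # build the models
        (dbt_dir, ["dbt", "test"]),                              # run the quality tests
    ]
    for cwd, argv in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against a failed earlier step.
        runner(argv, cwd=cwd, check=True)
    return steps
```

From the repository root, `run_lab2_workflow(Path("lab2_market_sentiment"))` would execute the full sequence.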

Common Tasks

Check Lab Status

# From buildcpg-labs root
ls -la | grep lab

# Lab-specific status (run from the lab's dbt project directory)
cd lab2_market_sentiment
source lab2_env/bin/activate
cd dbt
dbt debug
dbt list
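
For a scripted alternative to `ls -la | grep lab`, the sketch below lists lab directories from the repository root. It assumes, based on the layout shown in this README, that a lab is any top-level `lab*` directory containing a `dbt/` subdirectory; the function name `find_labs` is hypothetical.

```python
from pathlib import Path


def find_labs(root: Path) -> list[str]:
    """Return the names of top-level lab directories under root.

    A directory counts as a lab when its name starts with "lab" and it
    contains a dbt/ project folder (an assumption from the README layout).
    """
    return sorted(p.name for p in root.glob("lab*") if (p / "dbt").is_dir())


if __name__ == "__main__":
    for lab in find_labs(Path(".")):
        print(lab)
```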

Run Specific Lab

cd lab2_market_sentiment
source lab2_env/bin/activate
cd dbt
dbt run
dbt test

View Documentation

# Serve locally
mkdocs serve

# Visit: http://127.0.0.1:8000

Create New Lab (Future)

./setup_new_lab.sh lab3_customer_segmentation

Project Status

Phase 1: Foundation ✅ COMPLETE

  • ✅ Shared utilities (DataInspector, CSVMonitor)
  • ✅ Central configuration
  • ✅ Lab1 working
  • ✅ Lab2 working with full documentation

Phase 2: Enhanced Lab 2 ✅ COMPLETE

  • ✅ Market sentiment analysis pipeline
  • ✅ 5 dbt models with incremental processing
  • ✅ 14 automated data quality tests
  • ✅ Comprehensive documentation with diagrams
  • ✅ Troubleshooting guide
  • ✅ Quick reference card
  • ✅ Reference architecture diagrams

Phase 3: Platform Expansion 🔄 IN PROGRESS

  • 🔄 Lab 3 (Customer Segmentation)
  • 🔄 Streamlit dashboards
  • 🔄 Real API integrations (PRAW, NewsAPI)
  • 📋 Advanced sentiment analysis (Hugging Face)

Phase 4: Production Readiness 📋 PLANNED

  • 📋 Airflow orchestration
  • 📋 CI/CD for data pipelines
  • 📋 Monitoring & alerting
  • 📋 Data quality gates
  • 📋 Performance optimization

What's New in Lab 2

Key Achievements

  1. Complete Sentiment Pipeline
     • Reddit + News data ingestion
     • 5 models across three layers (Staging → Intermediate → Marts)
     • Incremental processing for efficiency

  2. Robust Data Quality
     • 14 automated tests (100% passing)
     • Contract enforcement on critical models
     • Quality flags and validation
     • Anomaly detection

  3. Comprehensive Documentation
     • Full architecture diagrams (Mermaid)
     • Step-by-step setup guide
     • Complete model reference with schemas
     • Troubleshooting for all issues encountered
     • Quick reference card

  4. Battle-Tested Solutions
     • Resolved duplicate key issues
     • Fixed contract enforcement problems
     • Solved CTE reference errors
     • Optimized incremental logic
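
Two of the items above, incremental processing and duplicate keys, are closely related. The sketch below shows one way to combine them in plain Python: only rows at or after the latest loaded timestamp are considered, and they are merged by key so a re-delivered row replaces its earlier copy instead of duplicating it. This illustrates the general technique rather than the Lab 2 dbt models; the field names `event_id` and `created_at` are assumptions.

```python
def incremental_load(target, source, key="event_id", ts="created_at"):
    """Merge new source rows into target without creating duplicate keys.

    Rows older than the target's high-water mark are skipped; rows at or
    after it are upserted by key (hypothetical field names).
    """
    high_water = max((row[ts] for row in target), default=None)
    merged = {row[key]: row for row in target}
    for row in source:
        if high_water is None or row[ts] >= high_water:
            merged[row[key]] = row  # insert new key or overwrite existing one
    return sorted(merged.values(), key=lambda row: (row[ts], row[key]))


already_loaded = [
    {"event_id": 1, "created_at": "2025-11-01"},
    {"event_id": 2, "created_at": "2025-11-02"},
]
incoming = [
    {"event_id": 2, "created_at": "2025-11-02"},  # re-delivered row
    {"event_id": 3, "created_at": "2025-11-03"},  # genuinely new row
]
print(incremental_load(already_loaded, incoming))
```

Using "at or after" rather than "strictly after" avoids missing late rows that share the high-water timestamp, and the keyed merge keeps that overlap from producing duplicates.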

Lab 2 Metrics

  • Models: 5 (2 staging, 1 intermediate, 2 marts)
  • Tests: 14 (all passing)
  • Data Quality: 100%
  • Documentation: 5 comprehensive guides
  • Build Time: ~3 seconds
  • Code Coverage: All models documented

Learning Path

Beginners

  1. Read Quick Start
  2. Complete Lab 1 setup
  3. Understand Medallion Architecture

Intermediate

  1. Complete Lab 2 setup
  2. Study Lab 2 Data Models
  3. Learn dbt best practices

Advanced

  1. Customize Lab 2 for real data sources
  2. Build Lab 3 from scratch
  3. Implement Airflow orchestration
  4. Create production dashboards

Contributing

Report Issues

Found a bug or have a suggestion?

  • Create a GitHub issue
  • Include error messages and steps to reproduce
  • Reference relevant documentation

Improve Documentation

  • Fix typos or unclear sections
  • Add examples or diagrams
  • Share your use cases

Requirements

  • Python: 3.11+
  • OS: macOS 11+, Linux, or Windows with WSL2
  • Memory: 2GB minimum
  • Disk: 1GB base + data per lab
  • Skills: Basic terminal, SQL, Python

Quick Wins

Get started in 10 minutes:

# Clone
git clone https://github.com/narensham/buildcpg-labs.git
cd buildcpg-labs/lab2_market_sentiment

# Setup
python3 -m venv lab2_env && source lab2_env/bin/activate
pip install -r requirements.txt

# Run
python pipelines/ingest_sentiment.py
cd dbt && dbt deps && dbt build

# Success!
# ✅ 14/14 tests passing


Platform: Multi-Lab Data Engineering
Architecture: Independent labs + shared utilities
Current Labs: 2 active, 1 planned
Last Updated: November 2025
Maintainer: narensham
Repository: GitHub
Documentation: GitHub Pages