# BuildCPG Labs

A modern, scalable data engineering platform for managing multiple independent labs using dbt, DuckDB, and Python.
## Overview

BuildCPG Labs enables you to:

- **Run independent labs** - each lab has its own database, data, and Python environment
- **Share utilities** - common code used by all labs without duplication
- **Scale easily** - create new labs in minutes using templates
- **Maintain quality** - built-in data inspection and automated quality checks
- **Work efficiently** - per-lab virtual environments with standardized workflows
## Architecture

```mermaid
graph TB
    subgraph "BuildCPG Labs Platform"
        S[shared/<br/>Utilities & Templates]
        subgraph "Lab 1 - Sales Performance"
            L1[dbt Models]
            L1D[DuckDB Database]
            L1E[lab1_env]
        end
        subgraph "Lab 2 - Market Sentiment"
            L2[dbt Models]
            L2D[DuckDB Database]
            L2E[lab2_env]
        end
        subgraph "Lab 3 - Customer Segmentation"
            L3[dbt Models]
            L3D[DuckDB Database]
            L3E[lab3_env]
        end
    end
    S -.->|Shared Utils| L1
    S -.->|Shared Utils| L2
    S -.->|Shared Utils| L3
    style S fill:#e0e7ff
    style L1 fill:#dbeafe
    style L2 fill:#dcfce7
    style L3 fill:#fef3c7
```
## Key Features

### Multi-Lab Architecture with Per-Lab Environments

Each lab is completely independent:
```text
buildcpg-labs/
├── shared/                       # Reusable utilities (ALL labs)
│   ├── utils/                    # DataInspector, CSVMonitor
│   └── config/                   # Central configuration
│
├── lab1_sales_performance/       # Lab 1 (Independent)
│   ├── lab1_env/                 # Own virtual environment
│   ├── data/                     # Own database
│   └── dbt/                      # Own models
│
├── lab2_market_sentiment/        # Lab 2 (Independent)
│   ├── lab2_env/                 # Own virtual environment
│   ├── data/                     # Own database
│   └── dbt/                      # Own models
│
└── lab3_customer_segmentation/   # Lab 3 (Independent)
    ├── lab3_env/                 # Own virtual environment
    ├── data/                     # Own database
    └── dbt/                      # Own models
```
**Architecture Benefits:**

- ✅ **Complete isolation** - each lab has a dedicated environment
- ✅ **Independent data** - each lab has its own DuckDB database
- ✅ **Dependency freedom** - labs can use different package versions
- ✅ **No conflicts** - work on multiple labs simultaneously
- ✅ **Shared utilities** - common code available to all labs
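Because `shared/` sits beside the labs, a lab script only needs a small path shim to reach the common utilities. A minimal sketch of the idea — the helper name and the `sys.path` approach are illustrative assumptions, not the platform's actual mechanism:

```python
import sys
from pathlib import Path

def add_shared_to_path(lab_dir: str) -> Path:
    """Put the sibling shared/ directory on sys.path so a lab can import common utils."""
    shared = Path(lab_dir).resolve().parent / "shared"
    if str(shared) not in sys.path:
        sys.path.insert(0, str(shared))
    return shared

# From inside any lab, e.g. lab2_market_sentiment:
shared_dir = add_shared_to_path("lab2_market_sentiment")
```

With `shared/` on the path, an import such as `from utils.data_inspector import DataInspector` would then resolve (assuming the `utils` package is importable).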
## Current Labs

### Lab 1: Sales Performance Analysis

**Status:** ✅ Active

**Purpose:** Analyze sales data with a medallion architecture (Bronze → Silver → Gold)

**Features:**

- Sales data processing
- Performance metrics
- Trend analysis

**Documentation:** See Lab 1 Overview
### Lab 2: Market Sentiment Analysis

**Status:** 🚧 Active Development

**Purpose:** Real-time CPG brand reputation monitoring

**Features:**

- Reddit & News sentiment ingestion
- 5 dbt models with incremental processing
- 14 automated data quality tests
- Anomaly detection via z-scores
- Daily sentiment aggregations

**Key Metrics:**

- 800 sentiment events
- 14/14 tests passing
- 5 CPG brands monitored
- ~3 second build time
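The anomaly detection listed above uses z-scores. The lab's actual implementation lives in its dbt models, but the core idea fits in a few lines of plain Python (the threshold value here is an illustrative assumption):

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Return indices of values whose z-score magnitude exceeds the threshold."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]

# A sudden negative swing in daily sentiment stands out:
daily_sentiment = [0.10, 0.12, 0.09, 0.11, 0.10, -0.95, 0.13]
print(zscore_anomalies(daily_sentiment, threshold=2.0))  # → [5]
```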
**Technology Stack:**

```mermaid
graph LR
    A[Python 3.11] --> B[dbt 1.7.0]
    B --> C[DuckDB 0.9.1]
    C --> D[Data Quality Tests]
    style A fill:#3b82f6
    style B fill:#10b981
    style C fill:#f59e0b
    style D fill:#8b5cf6
```
**Documentation:**

- Lab 2 Overview - architecture & features
- Lab 2 Setup - installation guide
- Lab 2 Data Models - complete model reference
- Lab 2 Troubleshooting - issue solutions
- Lab 2 Quick Reference - common commands
### Lab 3: Customer Segmentation

**Status:** 📋 Planned

**Purpose:** Customer behavior analysis and segmentation

**Features:** Coming soon
## Platform Statistics
| Metric | Value |
|---|---|
| Active Labs | 2 |
| Total Data Models | 10+ |
| Data Quality Tests | 20+ |
| Shared Utilities | 5 |
| Documentation Pages | 15+ |
## Quick Start

### 1. Clone Repository

### 2. Work with Lab 2 (Example)
```bash
# Navigate to the lab
cd lab2_market_sentiment

# Create a virtual environment
python3 -m venv lab2_env
source lab2_env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set up dbt
cd dbt
dbt deps
dbt debug

# Generate sample data
cd ..
python pipelines/ingest_sentiment.py

# Run the pipeline
cd dbt
dbt build
```
### 3. Verify Success

## Platform Structure
```text
buildcpg-labs/
│
├── shared/                          # Shared utilities
│   ├── utils/
│   │   ├── data_inspector.py        # Database inspection
│   │   ├── csv_monitor.py           # Data change detection
│   │   └── config_loader.py         # Configuration management
│   ├── config/
│   │   ├── labs_config.yaml         # Lab registry
│   │   └── paths.py                 # Path helpers
│   └── templates/                   # Lab templates
│
├── lab1_sales_performance/          # Independent Lab 1
│   ├── lab1_env/                    # Virtual environment
│   ├── data/
│   │   ├── raw/                     # Source data
│   │   └── lab1.duckdb              # Database
│   ├── dbt/
│   │   ├── models/                  # Transformations
│   │   ├── tests/                   # Quality tests
│   │   └── profiles.yml             # DB connection
│   ├── pipelines/                   # Data ingestion
│   └── requirements.txt             # Dependencies
│
├── lab2_market_sentiment/           # Independent Lab 2
│   ├── lab2_env/                    # Virtual environment
│   ├── data/
│   │   ├── raw/                     # Source data
│   │   └── lab2.duckdb              # Database
│   ├── dbt/
│   │   ├── models/
│   │   │   ├── staging/             # Bronze layer
│   │   │   ├── intermediate/        # Silver layer
│   │   │   └── mart/                # Gold layer
│   │   ├── macros/                  # Reusable SQL
│   │   ├── tests/                   # Quality tests
│   │   └── schema.yml               # Contracts
│   ├── pipelines/
│   │   └── ingest_sentiment.py      # Data generation
│   └── requirements.txt             # Dependencies
│
├── docs/                            # Documentation (MkDocs)
│   ├── architecture/
│   ├── getting-started/
│   ├── labs/
│   │   ├── lab1-*.md
│   │   └── lab2-*.md                # Lab 2 documentation
│   └── utilities/
│
├── .github/
│   └── workflows/
│       └── docs.yml                 # Auto-deploy docs
│
├── mkdocs.yml                       # Documentation config
└── README.md
```
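`csv_monitor.py` handles data change detection; its real interface isn't shown on this page, but content hashing is one straightforward way to implement the idea. A hedged sketch (function names are illustrative, not the utility's actual API):

```python
import hashlib
from pathlib import Path

def file_fingerprint(path) -> str:
    """Hash a file's bytes so any change to the CSV changes the fingerprint."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(directory: str, previous: dict) -> list:
    """Compare current fingerprints of *.csv files against a saved snapshot."""
    current = {p.name: file_fingerprint(p) for p in Path(directory).glob("*.csv")}
    return [name for name, fp in current.items() if previous.get(name) != fp]
```

A pipeline could save the fingerprint dict after each run and re-ingest only the files this function reports as changed.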
## Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Database | DuckDB 0.9.1 | Embedded, Mac-compatible, no Docker |
| Transformation | dbt 1.7.0 | Data modeling & testing |
| Language | Python 3.11+ | Scripting & ingestion |
| Testing | dbt_expectations | Data quality validation |
| Documentation | MkDocs Material | Auto-generated docs |
| CI/CD | GitHub Actions | Automated deployments |
| Orchestration | Manual (Airflow planned) | Pipeline scheduling |
## Design Principles

### 1. Lab Independence
```mermaid
graph LR
    L1[Lab 1] -.->|Uses| S[Shared Utils]
    L2[Lab 2] -.->|Uses| S
    L3[Lab 3] -.->|Uses| S
    L1 x--x L2
    L2 x--x L3
    L3 x--x L1
    style S fill:#e0e7ff
    style L1 fill:#dbeafe
    style L2 fill:#dcfce7
    style L3 fill:#fef3c7
```
- Each lab has its own database
- Each lab has its own environment
- Labs don't interfere with each other
### 2. Medallion Architecture

All labs follow the Bronze → Silver → Gold pattern:

- **Bronze (Staging):** raw data, minimal transformation
- **Silver (Intermediate):** cleaned, enriched, business logic applied
- **Gold (Marts):** analytics-ready aggregates
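In miniature, the pattern looks something like this pure-Python stand-in (the field names and cleaning rules are invented for illustration; the real layers are dbt models):

```python
# Bronze: raw rows, loaded as-is (strings, whitespace, bad values and all)
bronze = [
    {"brand": "  Acme ", "score": "0.8", "date": "2025-01-01"},
    {"brand": "acme",    "score": "0.4", "date": "2025-01-01"},
    {"brand": "Zenith",  "score": "0.5", "date": "2025-01-02"},
    {"brand": "Zenith",  "score": "bad", "date": "2025-01-02"},  # rejected in Silver
]

# Silver: cleaned and typed; rows that fail validation are dropped
def to_silver(rows):
    out = []
    for r in rows:
        try:
            out.append({"brand": r["brand"].strip().lower(),
                        "score": float(r["score"]),
                        "date": r["date"]})
        except ValueError:
            pass  # a real pipeline would quarantine or flag these
    return out

# Gold: analytics-ready aggregate (average score per brand)
def to_gold(rows):
    agg = {}
    for r in rows:
        agg.setdefault(r["brand"], []).append(r["score"])
    return {brand: sum(v) / len(v) for brand, v in agg.items()}

print(to_gold(to_silver(bronze)))  # average score per brand
```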
### 3. Data Quality First
- Automated tests on every run
- Contract enforcement where needed
- Quality flags and validation
- Comprehensive test coverage
### 4. Documentation Driven
- Every lab fully documented
- Architecture diagrams
- Troubleshooting guides
- Quick reference cards
- Auto-deployed to GitHub Pages
## Workflow Example (Lab 2)

```mermaid
sequenceDiagram
    participant D as Developer
    participant P as Python Script
    participant CSV as CSV Files
    participant DBT as dbt
    participant DB as DuckDB
    participant T as Tests
    D->>P: python ingest_sentiment.py
    P->>CSV: Generate data
    D->>DBT: dbt run
    DBT->>CSV: Read raw data
    DBT->>DB: Transform (5 models)
    D->>DBT: dbt test
    DBT->>T: Run 14 tests
    T-->>D: ✅ All pass
```
## Common Tasks

### Check Lab Status

```bash
# From the buildcpg-labs root
ls -la | grep lab

# Lab-specific status
cd lab2_market_sentiment
dbt debug
dbt list
```
### Run a Specific Lab

### View Documentation

### Create New Lab (Future)

## Documentation

### Getting Started

### Architecture

### Lab Documentation
- **Lab 1:**
  - Overview
  - Setup
- **Lab 2:** ⭐ NEW
  - Overview
  - Setup
  - Data Models
  - Troubleshooting
  - Quick Reference
### Utilities

### Support
## Project Status

### Phase 1: Foundation ✅ COMPLETE

- ✅ Shared utilities (DataInspector, CSVMonitor)
- ✅ Central configuration
- ✅ Lab 1 working
- ✅ Lab 2 working with full documentation

### Phase 2: Enhanced Lab 2 ✅ COMPLETE

- ✅ Market sentiment analysis pipeline
- ✅ 5 dbt models with incremental processing
- ✅ 14 automated data quality tests
- ✅ Comprehensive documentation with diagrams
- ✅ Troubleshooting guide
- ✅ Quick reference card
- ✅ Reference architecture diagrams

### Phase 3: Platform Expansion 🚧 IN PROGRESS

- Lab 3 (Customer Segmentation)
- Streamlit dashboards
- Real API integrations (PRAW, NewsAPI)
- Advanced sentiment analysis (Hugging Face)

### Phase 4: Production Readiness 📋 PLANNED

- Airflow orchestration
- CI/CD for data pipelines
- Monitoring & alerting
- Data quality gates
- Performance optimization
## What's New in Lab 2

### Key Achievements

1. **Complete Sentiment Pipeline**
    - Reddit + News data ingestion
    - 5-layer transformation (Staging → Intermediate → Marts)
    - Incremental processing for efficiency
2. **Robust Data Quality**
    - 14 automated tests (100% passing)
    - Contract enforcement on critical models
    - Quality flags and validation
    - Anomaly detection
3. **Comprehensive Documentation**
    - Full architecture diagrams (Mermaid)
    - Step-by-step setup guide
    - Complete model reference with schemas
    - Troubleshooting for all issues encountered
    - Quick reference card
4. **Battle-Tested Solutions**
    - Resolved duplicate key issues
    - Fixed contract enforcement problems
    - Solved CTE reference errors
    - Optimized incremental logic
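The duplicate-key fix above is lab-specific, but the general pattern — keep only the latest record per key during incremental loads — can be sketched in plain Python (column names are assumptions; dbt would typically do this with `row_number()` in SQL):

```python
def dedupe_latest(rows, key="event_id", order_by="loaded_at"):
    """Keep only the most recently loaded row for each key."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"event_id": 1, "loaded_at": "2025-01-01", "score": 0.2},
    {"event_id": 1, "loaded_at": "2025-01-02", "score": 0.3},  # newer duplicate wins
    {"event_id": 2, "loaded_at": "2025-01-01", "score": 0.9},
]
print(dedupe_latest(rows))  # two rows survive; event 1 keeps the newer score 0.3
```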
### Lab 2 Metrics
- Models: 5 (2 staging, 1 intermediate, 2 marts)
- Tests: 14 (all passing)
- Data Quality: 100%
- Documentation: 5 comprehensive guides
- Build Time: ~3 seconds
- Code Coverage: All models documented
## Learning Path

### Beginners
- Read Quick Start
- Complete Lab 1 setup
- Understand Medallion Architecture
### Intermediate
- Complete Lab 2 setup
- Study Lab 2 Data Models
- Learn dbt best practices
### Advanced
- Customize Lab 2 for real data sources
- Build Lab 3 from scratch
- Implement Airflow orchestration
- Create production dashboards
## Contributing

### Report Issues

Found a bug or have a suggestion?

- Create a GitHub issue
- Include error messages and steps to reproduce
- Reference relevant documentation
### Improve Documentation
- Fix typos or unclear sections
- Add examples or diagrams
- Share your use cases
## Support

- Read the docs
- Check the FAQ
- Troubleshooting guides
- GitHub Issues
## Requirements

- **Python:** 3.11+
- **OS:** macOS 11+, Linux, or Windows with WSL2
- **Memory:** 2 GB minimum
- **Disk:** 1 GB base, plus data per lab
- **Skills:** basic terminal, SQL, Python
## Quick Wins

Get started in 10 minutes:

```bash
# Clone
git clone https://github.com/narensham/buildcpg-labs.git
cd buildcpg-labs/lab2_market_sentiment

# Setup
python3 -m venv lab2_env && source lab2_env/bin/activate
pip install -r requirements.txt

# Run
python pipelines/ingest_sentiment.py
cd dbt && dbt deps && dbt build

# Success: ✅ 14/14 tests passing
```
---

**Platform:** Multi-Lab Data Engineering
**Architecture:** Independent labs + shared utilities
**Current Labs:** 2 active, 1 planned
**Last Updated:** November 2025
**Maintainer:** narensham
**Repository:** GitHub
**Documentation:** GitHub Pages