Architecture Overview

System Design

BuildCPG Labs uses a multi-lab, shared-utilities architecture designed for scalability and independence.

┌─────────────────────────────────────────────────────────┐
│         ORCHESTRATION LAYER (Airflow/Prefect)           │
│  Schedules pipelines, monitors health, handles alerts   │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│              DATA INGESTION LAYER                        │
│  • CSV monitoring • API polling • Data validation       │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│         dbt TRANSFORMATION LAYER (Per Lab)              │
│  • Bronze (Raw) → Silver (Cleaned) → Gold (Analytics)  │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│              QUALITY ASSURANCE LAYER                     │
│  • dbt tests • Data freshness • Quality scoring         │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│         SHARED UTILITIES LAYER (Root Level)             │
│  • DataInspector • CSVMonitor • Path Helpers            │
└─────────────────────────────────────────────────────────┘

Directory Structure (Current Setup)

buildcpg-labs/
│
├── .venv/                           # SINGLE venv (shared by all labs)
│   ├── bin/
│   ├── lib/
│   └── ...
│
├── shared/                          # Shared across ALL labs
│   ├── __init__.py
│   ├── utils/
│   │   ├── data_inspector.py       # Inspect databases
│   │   ├── csv_monitor.py          # Detect new data
│   │   └── config_loader.py        # Load configurations
│   │
│   ├── config/                      # Config inside shared (not at root)
│   │   ├── labs_config.yaml        # Central lab registry
│   │   └── paths.py                # Path helpers
│   │
│   ├── data_quality/
│   │   ├── validators.py           # Quality validators
│   │   └── expectations.py         # Data expectations
│   │
│   └── templates/
│       ├── Makefile_template       # Template Makefile
│       ├── requirements_template.txt
│       ├── dbt_project_template.yml
│       └── .gitignore_template
│
├── lab1_sales_performance/         # LAB 1 (Independent data, shares venv)
│   ├── dbt/
│   │   ├── dbt_project.yml
│   │   ├── profiles.yml            # Manually edited when switching labs
│   │   ├── models/
│   │   │   ├── staging/
│   │   │   ├── intermediate/
│   │   │   └── marts/
│   │   └── tests/
│   │
│   ├── data/
│   │   ├── raw/
│   │   └── lab1_sales_performance.duckdb
│   │
│   ├── scripts/
│   │   ├── inspect_data.py
│   │   └── check_for_new_data.py
│   │
│   ├── pipelines/
│   │   ├── data_ingestion.py
│   │   └── data_quality.py
│   │
│   ├── requirements.txt            # Shared dependencies
│   └── .gitignore
│
├── lab2_forecast_model/            # LAB 2 (Independent data, shares venv)
│   └── (Same structure as lab1)
│
├── lab3_customer_segmentation/     # LAB 3 (Independent data, shares venv)
│   └── (Same structure as lab1)
│
├── orchestration/
│   └── airflow_dags.py             # Multi-lab orchestration
│
├── docs/                           # This documentation
├── setup_new_lab.sh                # Bootstrap new labs
├── mkdocs.yml                      # Documentation config
├── .gitignore
└── README.md

Key Concepts

Single Virtual Environment Approach

Current Setup:
- ONE .venv/ at root, shared by all labs
- All labs use the same Python packages
- Switch between labs by manually changing dbt profiles

Advantages:
- Less disk space (~500 MB total vs. ~500 MB per lab)
- Consistent package versions across all labs
- Simpler initial setup

Trade-offs:
- Labs cannot have conflicting dependencies
- profiles.yml must be edited manually when switching labs
- Risk of profile-switching errors (writing to the wrong database)
- No concurrent work on different labs
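The dependency-conflict trade-off can be checked mechanically before a lab is added to the shared venv. The following is a minimal sketch, assuming each lab's requirements.txt uses exact `name==version` pins; `parse_pins` and `find_pin_conflicts` are hypothetical helpers that do not exist in the repo, and the example package versions are illustrative only.

```python
from collections import defaultdict


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # unpinned or range-based lines are ignored in this sketch
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_pin_conflicts(lab_requirements: dict[str, str]) -> dict[str, dict[str, str]]:
    """Map each package pinned to different versions to {lab: version}."""
    versions_by_package = defaultdict(dict)
    for lab, text in lab_requirements.items():
        for package, version in parse_pins(text).items():
            versions_by_package[package][lab] = version
    return {
        package: labs
        for package, labs in versions_by_package.items()
        if len(set(labs.values())) > 1
    }


if __name__ == "__main__":
    # Illustrative requirements contents, not taken from the repo.
    example = {
        "lab1_sales_performance": "dbt-duckdb==1.7.0\npandas==2.1.0\n",
        "lab2_forecast_model": "dbt-duckdb==1.7.0\npandas==1.5.3\n",
    }
    print(find_pin_conflicts(example))
```

A non-empty result means the labs cannot share one venv as pinned, which is one of the signals for moving to per-lab venvs.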

⚠️ See Current Setup Details for full pros/cons analysis

Labs Have Independent Data

Each lab has its own database (e.g., lab1_sales_performance.duckdb, lab2_forecast_model.duckdb)
Each lab has its own dbt project (dbt/models/, dbt/dbt_project.yml)
Each lab has its own raw data (data/raw/)
Labs share Python environment but NOT data

Shared Utilities

DataInspector - Check database quality (used by all labs)
CSVMonitor - Detect new data in CSVs (used by all labs)
Config Paths - Get paths for any lab (used by all labs)
Written once in shared/, used by all labs
Bug fix in shared code fixes all labs
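To make "detect new data in CSVs" concrete, here is a minimal sketch of how such a monitor could work. It assumes the monitor compares file modification times against a small JSON state file; the class name `SimpleCSVMonitor` and the state-file approach are hypothetical, and the real shared/utils/csv_monitor.py may work differently.

```python
import json
from pathlib import Path


class SimpleCSVMonitor:
    """Hypothetical stand-in for CSVMonitor: reports CSVs that are new or changed."""

    def __init__(self, raw_dir: Path, state_file: Path):
        self.raw_dir = Path(raw_dir)
        self.state_file = Path(state_file)

    def _load_state(self) -> dict[str, int]:
        # The state maps file name -> modification time seen at the last check.
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {}

    def check(self) -> list[str]:
        """Return names of CSVs added or modified since the last check."""
        seen = self._load_state()
        current = {
            p.name: p.stat().st_mtime_ns for p in sorted(self.raw_dir.glob("*.csv"))
        }
        changed = [name for name, mtime in current.items() if seen.get(name) != mtime]
        self.state_file.write_text(json.dumps(current))  # remember for the next run
        return changed
```

Because the class takes the raw-data directory as a parameter, the same code serves every lab, which is the point of keeping it in shared/.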

Configuration Location

Config is inside shared/config/ (not at root level):

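# Import pattern for lab scripts (e.g., lab1_sales_performance/scripts/inspect_data.py)
import sys
from pathlib import Path

# Add the repo root (two levels above scripts/) to the import path.
# Resolving from __file__ works regardless of the current working directory.
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from shared.config.paths import get_lab_db_path  # Note: shared.config

# NOT this:
# from config.paths import get_lab_db_path  # ❌ Wrong path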

# shared/config/labs_config.yaml
labs:
  lab1_sales_performance:
    path: lab1_sales_performance
    db_path: lab1_sales_performance/data/lab1_sales_performance.duckdb
    dbt_path: lab1_sales_performance/dbt

  lab2_forecast_model:
    path: lab2_forecast_model
    db_path: lab2_forecast_model/data/lab2_forecast_model.duckdb
    dbt_path: lab2_forecast_model/dbt

This registry tells the system where each lab is and how to find its resources.
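As an illustration of how a path helper could use this registry, the sketch below resolves a lab's database path from the parsed config. It assumes the YAML has already been loaded into a dictionary (inlined here to avoid a YAML dependency); `resolve_lab_db_path` and its `repo_root` parameter are hypothetical and only approximate what `get_lab_db_path` in shared/config/paths.py might do.

```python
from pathlib import Path

# Same shape as shared/config/labs_config.yaml after parsing; inlined here
# so the sketch has no YAML dependency.
LABS_CONFIG = {
    "labs": {
        "lab1_sales_performance": {
            "path": "lab1_sales_performance",
            "db_path": "lab1_sales_performance/data/lab1_sales_performance.duckdb",
            "dbt_path": "lab1_sales_performance/dbt",
        },
        "lab2_forecast_model": {
            "path": "lab2_forecast_model",
            "db_path": "lab2_forecast_model/data/lab2_forecast_model.duckdb",
            "dbt_path": "lab2_forecast_model/dbt",
        },
    }
}


def resolve_lab_db_path(lab_name: str, repo_root: Path, config: dict = LABS_CONFIG) -> Path:
    """Resolve a lab's database path relative to the repo root (assumed signature)."""
    labs = config["labs"]
    if lab_name not in labs:
        known = ", ".join(sorted(labs))
        raise KeyError(f"Unknown lab {lab_name!r}; registered labs: {known}")
    return Path(repo_root) / labs[lab_name]["db_path"]
```

Failing loudly on an unregistered lab name is a deliberate choice in this sketch: a typo should raise an error rather than silently point a script at a non-existent database.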

Data Flow

Single Lab Example

CSV Input
    ↓
dbt Load (raw schema)
    ↓
dbt Transform (bronze → silver → gold)
    ↓
DuckDB Tables
    ↓
DataInspector (quality check)
    ↓
BI Tool (Tableau/Looker)

Multiple Labs Orchestrated

Airflow DAG (Daily 12 AM)
    ├── Lab1: Check CSV → Load → Transform → Test → Inspect
    ├── Lab2: Check CSV → Load → Transform → Test → Inspect
    └── Lab3: Check CSV → Load → Transform → Test → Inspect
    ↓
All results aggregated
    ↓
Alert team if any lab fails
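The run, aggregate, and alert logic above can be sketched in plain Python, independent of any scheduler. This is an illustration only: the step names mirror the diagram, while `run_lab`, `run_all_labs`, and the `run_step` and `alert` callables are hypothetical and do not reflect the contents of orchestration/airflow_dags.py.

```python
from typing import Callable

# Step order taken from the flow: Check CSV -> Load -> Transform -> Test -> Inspect
STEPS = ["check_csv", "load", "transform", "test", "inspect"]


def run_lab(lab: str, run_step: Callable[[str, str], None]) -> dict:
    """Run the steps in order for one lab; stop at the first failure."""
    for step in STEPS:
        try:
            run_step(lab, step)
        except Exception as exc:  # any failure marks the lab as failed
            return {"lab": lab, "ok": False, "failed_step": step, "error": str(exc)}
    return {"lab": lab, "ok": True, "failed_step": None, "error": None}


def run_all_labs(
    labs: list[str],
    run_step: Callable[[str, str], None],
    alert: Callable[[list[dict]], None],
) -> list[dict]:
    """Run every lab independently, aggregate results, and alert if any failed."""
    results = [run_lab(lab, run_step) for lab in labs]
    failures = [r for r in results if not r["ok"]]
    if failures:
        alert(failures)
    return results
```

A failure in one lab does not stop the others, which matches the data-independence goal: each lab's result is recorded separately and the team receives a single alert listing every failed lab.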

Workflow: Switching Between Labs

# 1. Activate shared venv (once per session)
cd buildcpg-labs
source .venv/bin/activate

# 2. Work on lab1
cd lab1_sales_performance/dbt
# profiles.yml should point to: ../data/lab1_sales_performance.duckdb
dbt debug  # Verify correct database
dbt run

# 3. Switch to lab2
cd ../../lab2_forecast_model/dbt
# Edit profiles.yml to point to: ../data/lab2_forecast_model.duckdb
vim profiles.yml  # Update path
dbt debug  # Verify correct database
dbt run

⚠️ Critical: Always run dbt debug before dbt run to verify you're pointing to the correct database.
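A scripted guard can complement the manual dbt debug check. The sketch below assumes the layout shown earlier (`<lab>/dbt/profiles.yml` and `<lab>/data/<lab>.duckdb`) and uses a plain text search rather than YAML parsing; `profile_points_to_lab` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path


def profile_points_to_lab(dbt_dir: Path) -> bool:
    """True if profiles.yml in dbt_dir references only this lab's .duckdb file.

    Assumes the layout <lab>/dbt/profiles.yml and <lab>/data/<lab>.duckdb.
    """
    dbt_dir = Path(dbt_dir).resolve()
    expected_db = f"{dbt_dir.parent.name}.duckdb"
    db_lines = [
        line
        for line in (dbt_dir / "profiles.yml").read_text().splitlines()
        if ".duckdb" in line and not line.lstrip().startswith("#")
    ]
    # Every non-comment line naming a .duckdb file must name this lab's file.
    return bool(db_lines) and all(expected_db in line for line in db_lines)
```

Called from a lab script before dbt run, a False result would indicate that profiles.yml still points at another lab's database. The check is deliberately strict: a profile with several targets pointing at different database files would also be rejected.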

Technology Stack

Layer	Technology
Database	DuckDB (embedded, macOS-compatible)
Transformation	dbt (data build tool)
Scripting	Python 3.11+
Environment	Single venv (shared)
Orchestration	Airflow (optional, future)
Version Control	Git
Documentation	MkDocs

Design Principles

1. Data Independence

Labs have separate databases and raw data. One lab's data corruption doesn't affect others.

2. Shared Python Environment

All labs use the same venv for consistency and space efficiency (with trade-offs).

3. Reusability

Code written for shared utilities is used by all labs without duplication.

4. Scalability

Adding lab 10 takes the same effort as adding lab 2 (although dependency conflicts in the shared venv may limit this).

5. Clarity

Each lab's purpose is clear. Shared code's purpose is clear.

6. Manual Coordination

Profile switching requires discipline and verification steps.

Comparison: Current vs Alternative Architectures

Current Setup (Single venv)

✅ Space efficient (one venv)
✅ Consistent packages
✅ Simple setup
❌ Dependency conflicts possible
❌ Manual profile switching
❌ No concurrent work

Alternative: Per-Lab venvs

❌ More disk space
❌ More complex setup
✅ Complete isolation
✅ Different dependencies per lab
✅ Automatic profile management (when added alongside per-lab venvs)
✅ Concurrent work safe

When to migrate: See Current Setup Analysis

Migration Path

Current State: Phase 1 (Single venv)

Shared utilities ✅
Central configuration ✅
Lab1 working ✅
Single venv ✅

Future: Phase 2 (Optional migration to per-lab venvs)

When you hit:
- 3+ labs
- Dependency conflicts
- Multiple team members
- Production requirements

Then consider: Per-lab venvs + automated profile management