New Data Pipeline Design
A redesign of the pipeline from a file-only (Parquet-on-S3) architecture to a hybrid SQL + S3 architecture.
See PIPELINE_ARCHITECTURE.md for the current pipeline reference.
Table of Contents
| Section | Description |
|---|---|
| 1. Motivation | Why this redesign exists — performance bottlenecks, RAM limits, dual-source conflicts, multi-tenancy, and data cyclicity. |
| 2. Design Principles | 24 design principles governing the new architecture — from idempotent transforms to aggressive DB-level constraints. |
| 3–4. Architecture | High-level data flow diagram showing ERP → S3 → Transform → SQL Database → ML Output, plus database schema organization. |
| 5–6. Transform & Load (Legacy) | Detailed Step 3 (Transform & Load) flowchart and the legacy Phase 2 aggregation / enrichment reference. |
| 7–8. Storage & Recovery | S3 storage layout conventions and disaster recovery strategies (fast restore vs. full rebuild). |
| 9a. ERD — Core | Entity-relationship diagrams for core entities: products, brands, categories, suppliers, locations, pricing, customers, and settings. |
| 9b. ERD — Orders | ERD for order entities: sales orders, purchase orders, receiving orders, store orders, and transfer orders. |
| 9c. ERD — Behaviors | ERD for behaviors (product, supplier, location, category), demand forecasting, inventory targets, and PO/TO recommendations. |
| 9d. ERD — Global | ERD for the global schema (cw_global): geography, weather, currencies, client configuration, users, and roles. |
| 10. Statistics | UI statistics schema — 16 tables (4 metrics × 4 granularities) in the _stats schema. |
| 11. Pipeline Steps | Pipeline chronology — 13 steps from data ingestion through uplift monitoring, with the master flowchart. |
| 12–13. Operations | UI change impact analysis (cascade / moderate / local / gate severity levels) and recommended database indexes. |
Resources
- Interactive ERD Viewer — full entity-relationship diagram (D2-based, 115 tables, 176 relationships)
- Source Markdown — single-file version of this documentation