New Data Pipeline Design

Redesign of the data pipeline from a file-only (Parquet-on-S3) architecture to a hybrid SQL+S3 architecture.

See PIPELINE_ARCHITECTURE.md for the current pipeline reference.

Table of Contents

Section	Description
1. Motivation	Why this redesign exists — performance bottlenecks, RAM limits, dual-source conflicts, multi-tenancy, and data cyclicity.
2. Design Principles	24 design principles governing the new architecture — from idempotent transforms to aggressive DB-level constraints.
3–4. Architecture	High-level data flow diagram showing ERP → S3 → Transform → SQL Database → ML Output, plus database schema organization.
5–6. Transform & Load (Legacy)	Detailed Step 3 (Transform & Load) flowchart and the legacy Phase 2 aggregation / enrichment reference.
7–8. Storage & Recovery	S3 storage layout conventions and disaster recovery strategies (fast restore vs. full rebuild).
9a. ERD — Core	Entity-relationship diagrams for core entities: products, brands, categories, suppliers, locations, pricing, customers, and settings.
9b. ERD — Orders	ERD for order entities: sales orders, purchase orders, receiving orders, store orders, and transfer orders.
9c. ERD — Behaviors	ERD for behaviors (product, supplier, location, category), demand forecasting, inventory targets, and PO/TO recommendations.
9d. ERD — Global	ERD for the global schema (cw_global): geography, weather, currencies, client configuration, users, and roles.
10. Statistics	UI statistics schema — 16 tables (4 metrics × 4 granularities) in the _stats schema.
11. Pipeline Steps	Pipeline chronology — 13 steps from data ingestion through uplift monitoring, with the master flowchart.
12–13. Operations	UI change impact analysis (cascade / moderate / local / gate severity levels) and recommended database indexes.

Resources

Interactive ERD Viewer — full entity-relationship diagram (D2-based, 115 tables, 176 relationships)
Source Markdown — single-file version of this documentation
