CatWing Pipeline Design

New Data Pipeline Design

Redesign of the data pipeline from a file-only (Parquet-on-S3) architecture to a hybrid SQL+S3 architecture.

See PIPELINE_ARCHITECTURE.md for the current pipeline reference.

Table of Contents

| Section | Description |
| --- | --- |
| 1. Motivation | Why this redesign exists: performance bottlenecks, RAM limits, dual-source conflicts, multi-tenancy, and data cyclicity. |
| 2. Design Principles | 24 design principles governing the new architecture, from idempotent transforms to aggressive DB-level constraints. |
| 3–4. Architecture | High-level data flow diagram showing ERP → S3 → Transform → SQL Database → ML Output, plus database schema organization. |
| 5–6. Transform & Load (Legacy) | Detailed Step 3 (Transform & Load) flowchart and the legacy Phase 2 aggregation/enrichment reference. |
| 7–8. Storage & Recovery | S3 storage layout conventions and disaster recovery strategies (fast restore vs. full rebuild). |
| 9a. ERD: Core | Entity-relationship diagrams for core entities: products, brands, categories, suppliers, locations, pricing, customers, and settings. |
| 9b. ERD: Orders | ERD for order entities: sales orders, purchase orders, receiving orders, store orders, and transfer orders. |
| 9c. ERD: Behaviors | ERD for behaviors (product, supplier, location, category), demand forecasting, inventory targets, and PO/TO recommendations. |
| 9d. ERD: Global | ERD for the global schema (cw_global): geography, weather, currencies, client configuration, users, and roles. |
| 10. Statistics | UI statistics schema: 16 tables (4 metrics × 4 granularities) in the _stats schema. |
| 11. Pipeline Steps | Pipeline chronology: 13 steps from data ingestion through uplift monitoring, with the master flowchart. |
| 12–13. Operations | UI change impact analysis (cascade, moderate, local, and gate severity levels) and recommended database indexes. |

Resources