# Plan: LIST Partition Support for generic_loader

## Objective

Produce an implementation-ready design for adding PostgreSQL LIST partitioning to the generic loader flows in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`, driven by YAML configuration and compatible with the current streaming `COPY` load path.

## Current context

- `generic_loader/load_sas.py`
  - `LoaderConfig` currently parses `filename`, `schemaname`, `tablename`, `if_exists`, `include`, and `exclude`.
  - Schema inference is based on `read_sas_preview()` plus `infer_schema()`.
  - `render_create_table()` emits one non-partitioned `CREATE TABLE` statement.
  - `create_table()` handles `if_exists=fail|replace|append` for a single table.
  - `copy_dataframes()` streams rows into the target table with `COPY ... FROM STDIN`.
- `generic_loader/load_folder.py`
  - `FolderConfig` carries folder defaults.
  - `ClusterSpec` carries resolved per-cluster load settings.
  - `_ExplicitPattern` stores optional per-cluster overrides for explicit matches.
  - `discover_clusters()` resolves cluster inheritance for `if_exists`, `include`, and `exclude`.
  - `load_cluster()` infers schema from the first file, creates one table, then streams all files into it.
- Existing warnings are emitted to stderr as `[warn] ...`; the codebase does not currently use the `logging` module.
- This task is design-only in architect mode; no production Python changes are being made here.

## Assumptions and constraints

- PostgreSQL native LIST partitioning only; no range/hash partitioning and no automatic creation of missing partitions during append.
- Data continues to be copied to the parent table so PostgreSQL performs routing; `copy_dataframes()` behavior should remain unchanged.
- Accurate partition creation requires a complete discovery pass across all incoming rows unless the preview is known to already contain the full dataset.
- Folder-level partition settings should resolve into concrete per-cluster settings using the same inheritance style as current folder defaults.
- `if_exists=append` must validate compatibility and skip partition creation.
- Documentation must be detailed enough for an implementer to modify code, a QA lead to derive test scenarios, and a docs lead to update user-facing instructions without guessing.

## Files and systems likely affected

- `generic_loader/load_sas.py`
- `generic_loader/load_folder.py`
- `generic_loader/sample_config.yaml`
- `generic_loader/sample_folder_config.yaml`
- `generic_loader/PARTITION_DESIGN.md`
- Potentially module/CLI docstrings in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`

## Implementation approach

1. Extend YAML and dataclass config surfaces with `partition_by` and `max_partitions`.
2. Add partition-planning helpers that:
   - validate partition columns,
   - normalize partition values consistently with the existing `COPY` preparation rules,
   - discover cascading unique values across one file or an entire cluster,
   - count resulting child partition tables and emit a threshold warning.
3. Extend DDL rendering so the parent table can be declared with `PARTITION BY LIST (...)` and child tables can be emitted recursively with `CREATE TABLE ... PARTITION OF ... FOR VALUES IN (...)`.
4. Extend table creation rules:
   - `replace` drops the parent with `CASCADE` when partitioning is enabled and recreates the full tree,
   - `fail` errors if the parent exists,
   - `append` validates schema plus partition-key compatibility and does not create partitions.
5. Extend dry-run output so partitioned loads print the full ordered DDL set and perform the required partition discovery pass.
6. Extend folder orchestration so per-cluster partition settings inherit or override folder defaults in the same style as current config resolution.
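
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the loader's actual code: the helper names `build_value_tree`, `sql_literal`, `render_children`, and `count_partitions` are hypothetical, rows are assumed to be plain dicts with values already normalized, and child names are built by naive suffixing without the sanitization a real implementation would need.

```python
from typing import Any, Dict, Iterable, List, Mapping, Optional, Sequence

# Nested mapping: each key is a distinct value of the current partition
# column; each value is the subtree for the remaining partition columns.
Tree = Dict[Optional[str], Any]


def build_value_tree(rows: Iterable[Mapping[str, Optional[str]]],
                     partition_by: Sequence[str]) -> Tree:
    """Collect cascading unique values: level N only holds combinations seen in the data."""
    tree: Tree = {}
    for row in rows:
        node = tree
        for col in partition_by:
            node = node.setdefault(row[col], {})
    return tree


def sql_literal(value: Optional[str]) -> str:
    """Render a partition bound; None becomes the SQL keyword NULL."""
    if value is None:
        return "NULL"
    return "'" + value.replace("'", "''") + "'"


def render_children(parent: str, partition_by: Sequence[str], tree: Tree) -> List[str]:
    """Emit child DDL depth-first so each table is created before its own children."""
    statements: List[str] = []
    remaining = partition_by[1:]
    # Sort for deterministic output, placing the NULL partition last.
    for value in sorted(tree, key=lambda v: (v is None, v or "")):
        # Assumption: naive naming and unquoted identifiers, for illustration only.
        suffix = "null" if value is None else value.lower()
        child = f"{parent}_{suffix}"
        ddl = f"CREATE TABLE {child} PARTITION OF {parent} FOR VALUES IN ({sql_literal(value)})"
        if remaining:
            ddl += f" PARTITION BY LIST ({remaining[0]})"
        statements.append(ddl + ";")
        if remaining:
            statements.extend(render_children(child, remaining, tree[value]))
    return statements


def count_partitions(tree: Tree) -> int:
    """Count every child table in the tree, including intermediate levels."""
    return sum(1 + count_partitions(sub) for sub in tree.values())


if __name__ == "__main__":
    sample = [
        {"region": "EU", "year": "2023"},
        {"region": "EU", "year": "2024"},
        {"region": None, "year": "2023"},
    ]
    keys = ["region", "year"]
    for statement in render_children("myschema.mytable", keys, build_value_tree(sample, keys)):
        print(statement)
```

The count from `count_partitions` is the number that would be compared against `max_partitions` before emitting a `[warn] ...` line. Building the tree from observed combinations rather than a cross product keeps the child count as small as the data allows.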
## Risks and edge cases

- High-cardinality partition columns can generate very large partition trees, long DDL output, and slow Postgres planning.
- Empty strings in text columns currently become `NULL` on load because of `COPY ... NULL ''`; partition discovery must mirror that behavior or routing will be wrong.
- Different raw values can collide after sanitization or truncation; deterministic disambiguation is required.
- `NULL` partition values need explicit support in both DDL generation and child-table naming.
- Partitioned dry-runs become more expensive because they require scanning full source data rather than using only the schema preview.
- Multi-file clusters can still fail later on schema differences outside partition columns unless compatibility checks are broadened deliberately.

## Acceptance criteria

- The design document specifies exact YAML changes, dataclass changes, new helper functions, modified functions, algorithms, error handling, dry-run behavior, and `if_exists` semantics.
- Single-file and folder flows are both covered, including per-cluster inheritance/override behavior.
- Child-table naming, literal rendering, warning semantics, and append-mode validation are precise enough to implement directly.
- The design explicitly identifies what remains unchanged, especially the `COPY` routing path.

## Validation strategy

- Cross-check the plan against the current call graph and responsibilities in `load_sas.py` and `load_folder.py`.
- Prefer minimal-regression changes that preserve existing non-partitioned behavior.
- Include pseudocode and concrete examples for recursive partition DDL generation, cascading value discovery, null handling, and dry-run output.

## Documentation updates required

- Create `generic_loader/PARTITION_DESIGN.md` as the primary implementation-ready design artifact.
- Include exact sample YAML snippets for single-file and folder loaders.
- Document the dry-run cost change for partitioned loads and the `append` limitation that partitions are not auto-created.
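
As one example of the kind of concrete illustration `PARTITION_DESIGN.md` could carry, the null-handling and name-collision risks listed earlier might be sketched like this. Everything here is an assumption for illustration: the names `normalize_value`, `sanitize`, `child_names`, and `MAX_IDENT` are hypothetical, the slug rules and hash suffix are one possible disambiguation scheme rather than a decided one, and truncation is done by character count on the assumption that names are ASCII.

```python
import hashlib
import re
from typing import Dict, Iterable, Optional

# PostgreSQL limits identifiers to 63 bytes by default.
MAX_IDENT = 63


def normalize_value(raw: object) -> Optional[str]:
    """Mirror COPY ... NULL '': None and the empty string both load as NULL."""
    if raw is None:
        return None
    text = str(raw)
    return None if text == "" else text


def sanitize(value: Optional[str]) -> str:
    """Reduce a partition value to a lowercase identifier fragment."""
    if value is None:
        return "null"
    slug = re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_")
    # Assumption: values with no usable characters fall back to a fixed word.
    return slug or "empty"


def child_names(parent: str, values: Iterable[Optional[str]]) -> Dict[Optional[str], str]:
    """Map each distinct value to a unique child-table name, deterministically."""
    names: Dict[Optional[str], str] = {}
    used = set()
    # Sorting fixes which value keeps the plain name when two values collide.
    for value in sorted(set(values), key=lambda v: (v is None, v or "")):
        name = f"{parent}_{sanitize(value)}"[:MAX_IDENT]
        if name in used:
            # Disambiguate with a short hash of the raw value. A second
            # collision on the hashed name is not handled in this sketch.
            digest = hashlib.sha1(repr(value).encode("utf-8")).hexdigest()[:8]
            name = f"{name[:MAX_IDENT - 9]}_{digest}"
        used.add(name)
        names[value] = name
    return names


if __name__ == "__main__":
    raw_values = ["A-B", "a b", "", "null"]
    normalized = [normalize_value(v) for v in raw_values]
    for value, name in child_names("mytable", normalized).items():
        print(repr(value), "->", name)
```

In this sketch the literal string `"null"` and a true `NULL` both sanitize to the same fragment, so one of them receives a hash suffix; a real design might instead reserve a dedicated name for the `NULL` partition.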