# Plan: LIST Partition Support for generic_loader

## Objective

Produce an implementation-ready design for adding PostgreSQL LIST partitioning to the generic loader flows in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`, driven by YAML configuration and compatible with the current streaming `COPY` load path.

## Current context

- `generic_loader/load_sas.py`
  - `LoaderConfig` currently parses `filename`, `schemaname`, `tablename`, `if_exists`, `include`, and `exclude`.
  - Schema inference is based on `read_sas_preview()` plus `infer_schema()`.
  - `render_create_table()` emits one non-partitioned `CREATE TABLE` statement.
  - `create_table()` handles `if_exists=fail|replace|append` for a single table.
  - `copy_dataframes()` streams rows into the target table with `COPY ... FROM STDIN`.
- `generic_loader/load_folder.py`
  - `FolderConfig` carries folder defaults.
  - `ClusterSpec` carries resolved per-cluster load settings.
  - `_ExplicitPattern` stores optional per-cluster overrides for explicit matches.
  - `discover_clusters()` resolves cluster inheritance for `if_exists`, `include`, and `exclude`.
  - `load_cluster()` infers schema from the first file, creates one table, then streams all files into it.
- Existing warnings are emitted to stderr as `[warn] ...`; the codebase does not currently use the `logging` module.
- This task is design-only in architect mode; no production Python changes are being made here.

## Assumptions and constraints

- PostgreSQL native LIST partitioning only; no range/hash partitioning and no automatic creation of missing partitions during append.
- Data continues to be copied to the parent table so PostgreSQL performs routing; `copy_dataframes()` behavior should remain unchanged.
- Accurate partition creation requires a complete discovery pass across all incoming rows unless the preview is known to already contain the full dataset.
- Folder-level partition settings should resolve into concrete per-cluster settings using the same inheritance style as current folder defaults.
- `if_exists=append` must validate schema and partition-key compatibility against the existing parent table and must not create partitions.
- Documentation must be detailed enough for an implementer to modify code, a QA lead to derive test scenarios, and a docs lead to update user-facing instructions without guessing.
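
The folder-to-cluster inheritance constraint above can be illustrated with a minimal sketch. `PartitionSettings` and `resolve_partition_settings` are hypothetical names, not existing code. The sketch also assumes that an absent per-cluster key arrives as `None` (inherit) and that an explicit empty list means "no partitioning for this cluster"; the design may choose a different convention.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class PartitionSettings:
    """Hypothetical container for resolved partition settings."""
    partition_by: Tuple[str, ...] = ()
    max_partitions: Optional[int] = None


def resolve_partition_settings(
    folder: PartitionSettings,
    partition_by: Optional[Sequence[str]] = None,
    max_partitions: Optional[int] = None,
) -> PartitionSettings:
    """Merge per-cluster overrides onto folder defaults."""
    # None is assumed to mean "key absent in the cluster block": inherit.
    # An explicit empty list is assumed to disable partitioning for the cluster.
    resolved_by = folder.partition_by if partition_by is None else tuple(partition_by)
    resolved_max = folder.max_partitions if max_partitions is None else max_partitions
    return PartitionSettings(resolved_by, resolved_max)


folder_defaults = PartitionSettings(("region", "year"), 500)
inherited = resolve_partition_settings(folder_defaults)
overridden = resolve_partition_settings(folder_defaults, partition_by=["country"])
disabled = resolve_partition_settings(folder_defaults, partition_by=[])
```

Here `inherited` equals the folder defaults, `overridden` keeps `max_partitions=500` but partitions by `country`, and `disabled` has an empty `partition_by`.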

## Files and systems likely affected

- `generic_loader/load_sas.py`
- `generic_loader/load_folder.py`
- `generic_loader/sample_config.yaml`
- `generic_loader/sample_folder_config.yaml`
- `generic_loader/PARTITION_DESIGN.md`
- Potentially module/CLI docstrings in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`

## Implementation approach

1. Extend YAML and dataclass config surfaces with `partition_by` and `max_partitions`.
2. Add partition-planning helpers that:
   - validate partition columns,
   - normalize partition values consistently with the existing `COPY` preparation rules,
   - discover cascading unique values across one file or an entire cluster,
   - count resulting child partition tables and emit a warning when the count exceeds the `max_partitions` threshold.
3. Extend DDL rendering so the parent table can be declared with `PARTITION BY LIST (...)` and child tables can be emitted recursively with `CREATE TABLE ... PARTITION OF ... FOR VALUES IN (...)`.
4. Extend table creation rules:
   - `replace` drops the parent with `CASCADE` when partitioning is enabled and recreates the full tree,
   - `fail` errors if the parent exists,
   - `append` validates schema plus partition-key compatibility and does not create partitions.
5. Extend dry-run output so partitioned loads print the full ordered DDL set and perform the required partition discovery pass.
6. Extend folder orchestration so per-cluster partition settings inherit or override folder defaults in the same style as current config resolution.
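
Step 3 can be sketched as a small recursive renderer. This is an illustration under stated assumptions, not the loader's actual code: `render_partition_ddl`, `quote_ident`, and `render_literal` are hypothetical helpers, the discovered values are represented as a nested dict (value to sub-tree, `{}` at the leaves), every value is rendered as a text literal, and child names are formed by naive suffixing without the sanitization the design calls for elsewhere.

```python
from typing import Dict, List, Optional, Sequence

# Nested mapping: partition value -> sub-tree for the next partition column.
Tree = Dict[Optional[str], "Tree"]


def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def render_literal(value: Optional[str]) -> str:
    """Render a partition bound; None becomes the SQL keyword NULL."""
    if value is None:
        return "NULL"
    return "'" + value.replace("'", "''") + "'"


def render_partition_ddl(
    schema: str, parent: str, columns: Sequence[str], tree: Tree, depth: int = 0
) -> List[str]:
    """Emit child DDL depth-first so each parent precedes its children."""
    statements: List[str] = []
    for value, subtree in tree.items():
        suffix = "null" if value is None else value.lower()  # naive naming (assumption)
        child = f"{parent}_{suffix}"
        stmt = (
            f"CREATE TABLE {quote_ident(schema)}.{quote_ident(child)} "
            f"PARTITION OF {quote_ident(schema)}.{quote_ident(parent)} "
            f"FOR VALUES IN ({render_literal(value)})"
        )
        # Intermediate levels are themselves partitioned by the next column.
        if depth + 1 < len(columns):
            stmt += f" PARTITION BY LIST ({quote_ident(columns[depth + 1])})"
        statements.append(stmt + ";")
        statements.extend(
            render_partition_ddl(schema, child, columns, subtree, depth + 1)
        )
    return statements


example_tree: Tree = {"EU": {"2023": {}, None: {}}, "US": {"2023": {}}}
ddl = render_partition_ddl("public", "sales", ["region", "year"], example_tree)
```

For this input the first statement creates `"public"."sales_eu"` as a partition of `"public"."sales"` and declares it `PARTITION BY LIST ("year")`; its two leaf children follow before `"public"."sales_us"` is emitted, which keeps the DDL in an order PostgreSQL can execute top to bottom.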

## Risks and edge cases

- High-cardinality partition columns can generate very large partition trees, long DDL output, and slow PostgreSQL query planning.
- Empty strings in text columns currently become `NULL` on load because of `COPY ... NULL ''`; partition discovery must mirror that behavior or routing will be wrong.
- Different raw values can produce the same child-table name after sanitization or truncation; deterministic disambiguation is required.
- `NULL` partition values need explicit support in both DDL generation and child-table naming.
- Partitioned dry-runs become more expensive because they require scanning full source data rather than using only the schema preview.
- Multi-file clusters can still fail later on schema differences outside partition columns unless compatibility checks are broadened deliberately.
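
Three of the risks above (empty strings loading as `NULL`, name collisions, and `NULL` naming) can be made concrete with a short sketch. All function names are hypothetical, values are assumed to be text, the parent name is assumed to be ASCII, and the hash-suffix scheme is only one possible deterministic disambiguation. The 63-byte limit is PostgreSQL's default maximum identifier length.

```python
import hashlib
import re
from typing import Dict, Iterable, Optional

MAX_IDENT_BYTES = 63  # PostgreSQL default identifier limit


def normalize_partition_value(value: Optional[str]) -> Optional[str]:
    """Mirror COPY ... NULL '': empty text is treated as NULL."""
    return None if value is None or value == "" else value


def sanitize(value: Optional[str]) -> str:
    """Reduce a value to a lowercase [a-z0-9_] slug; None maps to 'null'."""
    if value is None:
        return "null"
    slug = re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_")
    return slug or "empty"


def child_table_names(
    parent: str, values: Iterable[Optional[str]]
) -> Dict[Optional[str], str]:
    """Assign a unique child-table name to each distinct partition value."""
    names: Dict[Optional[str], str] = {}
    used = set()
    # Sort so the same input set always yields the same names (None first).
    for value in sorted(set(values), key=lambda v: (v is not None, v or "")):
        base = f"{parent}_{sanitize(value)}"
        candidate = base[:MAX_IDENT_BYTES]
        if candidate in used:
            # Collision: append a short hash of the raw value (assumed scheme).
            digest = hashlib.sha1(repr(value).encode("utf-8")).hexdigest()[:8]
            candidate = base[: MAX_IDENT_BYTES - 9] + "_" + digest
        used.add(candidate)
        names[value] = candidate
    return names


names = child_table_names("sales", ["A-B", "a b", None, "null"])
```

In this example `None` maps to `sales_null` and `"A-B"` maps to `sales_a_b`, while `"a b"` and the literal string `"null"` each receive a hash-suffixed name because their plain slugs are already taken.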

## Acceptance criteria

- The design document specifies exact YAML changes, dataclass changes, new helper functions, modified functions, algorithms, error handling, dry-run behavior, and `if_exists` semantics.
- Single-file and folder flows are both covered, including per-cluster inheritance/override behavior.
- Child-table naming, literal rendering, warning semantics, and append-mode validation are precise enough to implement directly.
- The design explicitly identifies what remains unchanged, especially the `COPY` routing path.
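
As an illustration of how precise the `if_exists` semantics need to be, the rules for an already existing parent table can be written as a pure decision function. This is a sketch only: `plan_for_existing_parent` and the action labels are hypothetical, and the case where the parent does not yet exist is deliberately left out because this plan does not specify it.

```python
from typing import List, Sequence


def plan_for_existing_parent(
    if_exists: str,
    partition_by: Sequence[str],
    existing_partition_by: Sequence[str],
) -> List[str]:
    """Return ordered action labels for a parent table that already exists."""
    requested = tuple(partition_by)
    if if_exists == "fail":
        raise RuntimeError("target table already exists")
    if if_exists == "replace":
        # CASCADE is needed so child partitions are dropped with the parent.
        actions = ["drop_parent_cascade" if requested else "drop_parent", "create_parent"]
        if requested:
            actions.append("create_partitions")
        return actions
    if if_exists == "append":
        if tuple(existing_partition_by) != requested:
            raise ValueError(
                f"partition key mismatch: table has {tuple(existing_partition_by)}, "
                f"config requests {requested}"
            )
        # Append validates only; it never creates partitions.
        return ["validate_schema"]
    raise ValueError(f"unknown if_exists value: {if_exists!r}")


replace_plan = plan_for_existing_parent("replace", ["region"], ["region"])
append_plan = plan_for_existing_parent("append", ["region"], ["region"])
```

Keeping the decision separate from the SQL execution makes each branch easy to cover with a unit test, which is one way a QA lead could derive scenarios from the design.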

## Validation strategy

- Cross-check the plan against the current call graph and responsibilities in `load_sas.py` and `load_folder.py`.
- Prefer minimal-regression changes that preserve existing non-partitioned behavior.
- Include pseudocode and concrete examples for recursive partition DDL generation, cascading value discovery, null handling, and dry-run output.
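
A minimal sketch of cascading value discovery, of the kind the design document is expected to include, might look like the following. It assumes rows are available as mappings of column name to text value (the real flow streams data frames), that empty strings are folded to `None` to match the `COPY ... NULL ''` behavior, and that the partition count includes intermediate tables as well as leaves. The function names are hypothetical; the `[warn]` prefix on stderr follows the existing convention.

```python
import sys
from typing import Dict, Iterable, Mapping, Optional, Sequence

# Nested mapping: partition value -> sub-tree for the next partition column.
Tree = Dict[Optional[str], "Tree"]


def discover_partition_tree(
    rows: Iterable[Mapping[str, Optional[str]]], partition_by: Sequence[str]
) -> Tree:
    """Record, per parent value, which values occur in the next column."""
    tree: Tree = {}
    for row in rows:
        node = tree
        for column in partition_by:
            value = row[column]
            if value == "":
                value = None  # empty text loads as NULL
            node = node.setdefault(value, {})
    return tree


def count_partitions(tree: Tree) -> int:
    """Count every child table, intermediate and leaf."""
    return sum(1 + count_partitions(subtree) for subtree in tree.values())


def warn_if_too_many(tree: Tree, max_partitions: Optional[int]) -> int:
    """Emit a stderr warning when the tree exceeds the threshold."""
    total = count_partitions(tree)
    if max_partitions is not None and total > max_partitions:
        print(
            f"[warn] {total} partitions exceed max_partitions={max_partitions}",
            file=sys.stderr,
        )
    return total


rows = [
    {"region": "EU", "year": "2023"},
    {"region": "EU", "year": ""},
    {"region": "US", "year": "2023"},
    {"region": "EU", "year": "2023"},
]
tree = discover_partition_tree(rows, ["region", "year"])
total = warn_if_too_many(tree, max_partitions=3)
```

For these rows the tree is `{"EU": {"2023": {}, None: {}}, "US": {"2023": {}}}`, which corresponds to five child tables, so the call above prints one warning. Because second-level values are tracked per first-level value, `US` gets no `NULL` year partition.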

## Documentation updates required

- Create `generic_loader/PARTITION_DESIGN.md` as the primary implementation-ready design artifact.
- Include exact sample YAML snippets for single-file and folder loaders.
- Document the dry-run cost change for partitioned loads and the `append` limitation that partitions are not auto-created.
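
To show what the documented YAML keys would need to look like once parsed, here is a hedged validation sketch. It assumes `partition_by` may be written either as a single column name or as a list of names and that `max_partitions` is an optional positive integer; neither shape is fixed by this plan, and `parse_partition_config` is a hypothetical helper.

```python
from typing import Any, List, Mapping, Optional, Tuple


def parse_partition_config(raw: Mapping[str, Any]) -> Tuple[List[str], Optional[int]]:
    """Validate the partition keys of an already-parsed YAML mapping."""
    value = raw.get("partition_by")
    if value is None:
        columns: List[str] = []
    elif isinstance(value, str):
        columns = [value]  # single column shorthand (assumed)
    elif isinstance(value, list) and all(isinstance(c, str) for c in value):
        columns = list(value)
    else:
        raise ValueError("partition_by must be a column name or a list of column names")
    if len(set(columns)) != len(columns):
        raise ValueError("partition_by contains duplicate columns")

    max_partitions = raw.get("max_partitions")
    if max_partitions is not None:
        # bool is a subclass of int, so reject it explicitly.
        if (
            isinstance(max_partitions, bool)
            or not isinstance(max_partitions, int)
            or max_partitions <= 0
        ):
            raise ValueError("max_partitions must be a positive integer")
    return columns, max_partitions


parsed = parse_partition_config(
    {"tablename": "sales", "partition_by": ["region", "year"], "max_partitions": 200}
)
```

A configuration without either key parses to `([], None)`, which leaves existing non-partitioned configs unaffected.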