# Plan: LIST Partition Support for generic_loader

## Objective

Produce an implementation-ready design for adding PostgreSQL LIST partitioning to the generic loader flows in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`, driven by YAML configuration and compatible with the current streaming `COPY` load path.

## Current context

- `generic_loader/load_sas.py`
  - `LoaderConfig` currently parses `filename`, `schemaname`, `tablename`, `if_exists`, `include`, and `exclude`.
  - Schema inference is based on `read_sas_preview()` plus `infer_schema()`.
  - `render_create_table()` emits one non-partitioned `CREATE TABLE` statement.
  - `create_table()` handles `if_exists=fail|replace|append` for a single table.
  - `copy_dataframes()` streams rows into the target table with `COPY ... FROM STDIN`.
- `generic_loader/load_folder.py`
  - `FolderConfig` carries folder defaults.
  - `ClusterSpec` carries resolved per-cluster load settings.
  - `_ExplicitPattern` stores optional per-cluster overrides for explicit matches.
  - `discover_clusters()` resolves cluster inheritance for `if_exists`, `include`, and `exclude`.
  - `load_cluster()` infers schema from the first file, creates one table, then streams all files into it.
- Existing warnings are emitted to stderr as `[warn] ...`; the codebase does not currently use the `logging` module.
- This task is design-only in architect mode; no production Python changes are being made here.

## Assumptions and constraints

- PostgreSQL native LIST partitioning only; no range/hash partitioning and no automatic creation of missing partitions during append.
- Data continues to be copied to the parent table so PostgreSQL performs routing; `copy_dataframes()` behavior should remain unchanged.
- Accurate partition creation requires a complete discovery pass across all incoming rows unless the preview is known to already contain the full dataset.
- Folder-level partition settings should resolve into concrete per-cluster settings using the same inheritance style as current folder defaults.
- `if_exists=append` must validate compatibility and skip partition creation.
- Documentation must be detailed enough for an implementer to modify code, a QA lead to derive test scenarios, and a docs lead to update user-facing instructions without guessing.

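
The inheritance rule for folder-level partition settings could be sketched roughly as follows. This is a minimal illustration, not the loader's actual code: the class names `FolderPartitionDefaults`, `ClusterPartitionOverride`, and `ResolvedPartitionSettings`, the helper `resolve_partition_settings()`, and the default threshold of 100 are all hypothetical. Only the `partition_by` and `max_partitions` keys come from this plan. The sketch also assumes that an explicit empty `partition_by` list on a cluster disables partitioning for that cluster, which the design document would need to confirm.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FolderPartitionDefaults:
    """Hypothetical folder-level defaults (key names follow this plan)."""
    partition_by: list[str] = field(default_factory=list)
    max_partitions: int = 100  # assumed default; the real value is a design decision


@dataclass
class ClusterPartitionOverride:
    """Hypothetical per-cluster override; None means 'inherit the folder default'."""
    partition_by: Optional[list[str]] = None
    max_partitions: Optional[int] = None


@dataclass(frozen=True)
class ResolvedPartitionSettings:
    """Concrete settings a cluster load would actually use."""
    partition_by: tuple[str, ...]
    max_partitions: int


def resolve_partition_settings(
    defaults: FolderPartitionDefaults,
    override: ClusterPartitionOverride,
) -> ResolvedPartitionSettings:
    # Assumption: only None inherits. An explicit [] is kept as-is,
    # which switches partitioning off for that cluster.
    partition_by = (
        defaults.partition_by if override.partition_by is None else override.partition_by
    )
    max_partitions = (
        defaults.max_partitions if override.max_partitions is None else override.max_partitions
    )
    return ResolvedPartitionSettings(tuple(partition_by), max_partitions)
```

Distinguishing `None` from an empty list keeps "not specified" separate from "explicitly disabled", so a single cluster can opt out without changing the folder defaults.
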
## Files and systems likely affected

- `generic_loader/load_sas.py`
- `generic_loader/load_folder.py`
- `generic_loader/sample_config.yaml`
- `generic_loader/sample_folder_config.yaml`
- `generic_loader/PARTITION_DESIGN.md`
- Potentially module/CLI docstrings in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`

## Implementation approach

1. Extend YAML and dataclass config surfaces with `partition_by` and `max_partitions`.
2. Add partition-planning helpers that:
   - validate partition columns,
   - normalize partition values consistently with the existing `COPY` preparation rules,
   - discover cascading unique values across one file or an entire cluster,
   - count resulting child partition tables and emit a threshold warning.
3. Extend DDL rendering so the parent table can be declared with `PARTITION BY LIST (...)` and child tables can be emitted recursively with `CREATE TABLE ... PARTITION OF ... FOR VALUES IN (...)`.
4. Extend table creation rules:
   - `replace` drops the parent with `CASCADE` when partitioning is enabled and recreates the full tree,
   - `fail` errors if the parent exists,
   - `append` validates schema plus partition-key compatibility and does not create partitions.
5. Extend dry-run output so partitioned loads print the full ordered DDL set and perform the required partition discovery pass.
6. Extend folder orchestration so per-cluster partition settings inherit or override folder defaults in the same style as current config resolution.

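
Steps 2 and 3 above (cascading value discovery and recursive DDL rendering) might look roughly like the sketch below. It rests on several assumptions: rows arrive as plain dictionaries with already-normalized values rather than as the loader's DataFrames, identifiers are emitted unquoted, and child tables get positional placeholder names such as `_p0` instead of the naming scheme the design document will define. The function names `discover_partition_tree()`, `sql_literal()`, `render_children()`, and `render_partition_ddl()` are hypothetical.

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping

# Nested mapping: partition value -> subtree for the next partition column.
PartitionTree = dict


def discover_partition_tree(
    rows: Iterable[Mapping[str, Any]], partition_by: list[str]
) -> PartitionTree:
    """Collect distinct value combinations, one tree level per partition column."""
    tree: PartitionTree = {}
    for row in rows:
        node = tree
        for column in partition_by:
            node = node.setdefault(row[column], {})
    return tree


def sql_literal(value: Any) -> str:
    """Render a partition value as a SQL literal (simplified: NULL, integers, text)."""
    if value is None:
        return "NULL"
    if isinstance(value, int) and not isinstance(value, bool):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def _sort_key(item: tuple[Any, PartitionTree]) -> tuple[bool, str]:
    value, _ = item
    # NULL sorts last; other values sort by string form so output is deterministic.
    return (value is None, str(value))


def render_children(
    parent: str, partition_by: list[str], depth: int, tree: PartitionTree
) -> list[str]:
    """Emit child partitions depth-first so each parent precedes its children."""
    statements: list[str] = []
    for index, (value, subtree) in enumerate(sorted(tree.items(), key=_sort_key)):
        child = f"{parent}_p{index}"  # placeholder naming only
        ddl = (
            f"CREATE TABLE {child} PARTITION OF {parent} "
            f"FOR VALUES IN ({sql_literal(value)})"
        )
        if depth + 1 < len(partition_by):
            # Intermediate levels are themselves partitioned by the next column.
            ddl += f" PARTITION BY LIST ({partition_by[depth + 1]})"
        statements.append(ddl + ";")
        statements.extend(render_children(child, partition_by, depth + 1, subtree))
    return statements


def render_partition_ddl(
    parent: str, columns_sql: str, partition_by: list[str], tree: PartitionTree
) -> list[str]:
    """Return the ordered DDL list: the parent table first, then the partition tree."""
    head = f"CREATE TABLE {parent} ({columns_sql}) PARTITION BY LIST ({partition_by[0]});"
    return [head] + render_children(parent, partition_by, 0, tree)
```

Because the statements come back as an ordered list, the same output could serve both execution and dry-run printing.
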
## Risks and edge cases

- High-cardinality partition columns can generate very large partition trees, long DDL output, and slow PostgreSQL query planning.
- Empty strings in text columns currently become `NULL` on load because of `COPY ... NULL ''`; partition discovery must mirror that behavior or routing will be wrong.
- Different raw values can collide after sanitization or truncation; deterministic disambiguation is required.
- `NULL` partition values need explicit support in both DDL generation and child-table naming.
- Partitioned dry-runs become more expensive because they require scanning full source data rather than using only the schema preview.
- Multi-file clusters can still fail later on schema differences outside partition columns unless compatibility checks are broadened deliberately.

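
The empty-string and name-collision risks above can be made concrete with a small sketch. Everything here is illustrative: `normalize_partition_value()` and `child_table_name()` are hypothetical helpers, the hash-suffix scheme is one possible disambiguation strategy rather than the one the design must adopt, and the sketch assumes ASCII parent names and PostgreSQL's default 63-byte identifier limit.

```python
from __future__ import annotations

import hashlib
import re
from typing import Any

PG_MAX_IDENTIFIER = 63  # default PostgreSQL identifier limit, in bytes


def normalize_partition_value(value: Any) -> Any:
    """Mirror COPY ... NULL '': an empty text value is loaded as NULL."""
    if value is None or (isinstance(value, str) and value == ""):
        return None
    return value


def child_table_name(parent: str, value: Any, taken: set[str]) -> str:
    """Build a child-table name, adding a hash suffix when names would collide.

    `taken` holds names already assigned under this parent. Results are
    deterministic as long as values are processed in a fixed order.
    """
    raw = "null" if value is None else str(value)
    # Sanitize to lowercase ASCII letters, digits, and underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", raw.lower()).strip("_") or "empty"
    candidate = f"{parent}_{slug}"[:PG_MAX_IDENTIFIER]
    if candidate in taken:
        # Collision after sanitization or truncation: append a short digest
        # of the raw value, leaving room for "_" plus 8 hex characters.
        digest = hashlib.sha1(repr(value).encode("utf-8")).hexdigest()[:8]
        candidate = f"{parent}_{slug}"[: PG_MAX_IDENTIFIER - 9] + "_" + digest
    taken.add(candidate)
    return candidate
```

With this scheme, `"New York"` and `"new-york"` both sanitize to the same slug, so the second one processed receives a digest suffix; the literal string `"null"` and a real `NULL` value are kept apart the same way.
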
## Acceptance criteria

- The design document specifies exact YAML changes, dataclass changes, new helper functions, modified functions, algorithms, error handling, dry-run behavior, and `if_exists` semantics.
- Single-file and folder flows are both covered, including per-cluster inheritance/override behavior.
- Child-table naming, literal rendering, warning semantics, and append-mode validation are precise enough to implement directly.
- The design explicitly identifies what remains unchanged, especially the `COPY` routing path.

## Validation strategy

- Cross-check the plan against the current call graph and responsibilities in `load_sas.py` and `load_folder.py`.
- Prefer minimal-regression changes that preserve existing non-partitioned behavior.
- Include pseudocode and concrete examples for recursive partition DDL generation, cascading value discovery, null handling, and dry-run output.

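
As one example of the dry-run illustration the design document could include, the sketch below counts child partitions and prints the DDL with a threshold warning. It assumes the partition plan is a nested dictionary mapping each partition value to the subtree for the next column, that the warning fires only when the count strictly exceeds `max_partitions`, and that the warning is advisory rather than fatal. `count_child_partitions()` and `print_dry_run()` are hypothetical names; the `[warn] ...` prefix on stderr follows the existing convention noted in this plan.

```python
from __future__ import annotations

import sys
from typing import Optional, TextIO


def count_child_partitions(tree: dict) -> int:
    """Count every child table in the tree, intermediate levels included."""
    return sum(1 + count_child_partitions(subtree) for subtree in tree.values())


def print_dry_run(
    statements: list[str],
    tree: dict,
    max_partitions: int,
    out: Optional[TextIO] = None,
    err: Optional[TextIO] = None,
) -> int:
    """Print the ordered DDL to stdout and a threshold warning to stderr."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    total = count_child_partitions(tree)
    if total > max_partitions:  # assumed comparison; the design may choose >=
        print(
            f"[warn] partition plan creates {total} child tables, "
            f"above max_partitions={max_partitions}",
            file=err,
        )
    for statement in statements:
        print(statement, file=out)
    return total
```

Keeping the warning on stderr leaves stdout as clean DDL that can be redirected to a file or piped into review tooling.
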
## Documentation updates required

- Create `generic_loader/PARTITION_DESIGN.md` as the primary implementation-ready design artifact.
- Include exact sample YAML snippets for single-file and folder loaders.
- Document the dry-run cost change for partitioned loads and the `append` limitation that partitions are not auto-created.