Plan: LIST Partition Support for generic_loader
Objective
Produce an implementation-ready design for adding PostgreSQL LIST partitioning to the generic loader flows in generic_loader/load_sas.py and generic_loader/load_folder.py, driven by YAML configuration and compatible with the current streaming COPY load path.
Current context
- generic_loader/load_sas.py
  - LoaderConfig currently parses filename, schemaname, tablename, if_exists, include, and exclude.
  - Schema inference is based on read_sas_preview() plus infer_schema().
  - render_create_table() emits one non-partitioned CREATE TABLE statement.
  - create_table() handles if_exists=fail|replace|append for a single table.
  - copy_dataframes() streams rows into the target table with COPY ... FROM STDIN.
- generic_loader/load_folder.py
  - FolderConfig carries folder defaults.
  - ClusterSpec carries resolved per-cluster load settings.
  - _ExplicitPattern stores optional per-cluster overrides for explicit matches.
  - discover_clusters() resolves cluster inheritance for if_exists, include, and exclude.
  - load_cluster() infers schema from the first file, creates one table, then streams all files into it.
- Existing warnings are emitted to stderr as [warn] ...; the codebase does not currently use the logging module.
- This task is design-only in architect mode; no production Python changes are being made here.
Assumptions and constraints
- PostgreSQL native LIST partitioning only; no range/hash partitioning and no automatic creation of missing partitions during append.
- Data continues to be copied to the parent table so PostgreSQL performs routing; copy_dataframes() behavior should remain unchanged.
- Accurate partition creation requires a complete discovery pass across all incoming rows unless the preview is known to already contain the full dataset.
- Folder-level partition settings should resolve into concrete per-cluster settings using the same inheritance style as current folder defaults.
- if_exists=append must validate compatibility and skip partition creation.
- Documentation must be detailed enough for an implementer to modify code, a QA lead to derive test scenarios, and a docs lead to update user-facing instructions without guessing.
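The inheritance rule for folder-level partition settings can be sketched as follows. This is a minimal illustration, not the loader's actual code: PartitionSettings and resolve_partition_settings() are hypothetical names, and the convention that a cluster value of None means "inherit" while an explicit empty list disables partitioning for that cluster is an assumption the design would need to confirm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class PartitionSettings:
    """Hypothetical container; not an existing class in generic_loader."""
    partition_by: Tuple[str, ...] = ()
    max_partitions: Optional[int] = None


def resolve_partition_settings(
    folder_partition_by: Optional[Sequence[str]],
    folder_max_partitions: Optional[int],
    cluster_partition_by: Optional[Sequence[str]] = None,
    cluster_max_partitions: Optional[int] = None,
) -> PartitionSettings:
    """Resolve concrete per-cluster settings from folder defaults.

    Assumption: a cluster value of None inherits the folder default;
    any other value (including an empty list) overrides it.
    """
    partition_by = (
        cluster_partition_by
        if cluster_partition_by is not None
        else folder_partition_by
    )
    max_partitions = (
        cluster_max_partitions
        if cluster_max_partitions is not None
        else folder_max_partitions
    )
    return PartitionSettings(tuple(partition_by or ()), max_partitions)
```

Under these assumptions, a cluster that sets nothing receives the folder's partition_by and max_partitions unchanged, and a cluster that sets only one of the two keeps the folder default for the other.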
Files and systems likely affected
- generic_loader/load_sas.py
- generic_loader/load_folder.py
- generic_loader/sample_config.yaml
- generic_loader/sample_folder_config.yaml
- generic_loader/PARTITION_DESIGN.md
- Potentially module/CLI docstrings in generic_loader/load_sas.py and generic_loader/load_folder.py
Implementation approach
- Extend YAML and dataclass config surfaces with partition_by and max_partitions.
- Add partition-planning helpers that:
  - validate partition columns,
  - normalize partition values consistently with the existing COPY preparation rules,
  - discover cascading unique values across one file or an entire cluster,
  - count resulting child partition tables and emit a threshold warning.
- Extend DDL rendering so the parent table can be declared with PARTITION BY LIST (...) and child tables can be emitted recursively with CREATE TABLE ... PARTITION OF ... FOR VALUES IN (...).
- Extend table creation rules:
  - replace drops the parent with CASCADE when partitioning is enabled and recreates the full tree,
  - fail errors if the parent exists,
  - append validates schema plus partition-key compatibility and does not create partitions.
- Extend dry-run output so partitioned loads print the full ordered DDL set and perform the required partition discovery pass.
- Extend folder orchestration so per-cluster partition settings inherit or override folder defaults in the same style as current config resolution.
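The recursive DDL rendering step might look roughly like the sketch below. Every function name here (quote_ident, render_literal, child_name, render_partition_ddl) is hypothetical, the value tree is assumed to be a nested dict keyed by partition value, only str, int, and None values are handled, and the child naming is deliberately naive (no truncation or collision handling).

```python
import re
from typing import Dict, List, Sequence, Union

Value = Union[str, int, None]
ValueTree = Dict[Value, "ValueTree"]  # nested dict: value -> subtree


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def render_literal(value: Value) -> str:
    """Render a partition bound; only str, int, and None are handled."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def child_name(parent: str, value: Value) -> str:
    """Naive child-table name; real naming needs collision handling."""
    if value is None:
        token = "null"
    else:
        token = re.sub(r"[^a-z0-9]+", "_", str(value).lower()).strip("_")
    return f"{parent}_{token or 'empty'}"


def _render_children(schema: str, parent_table: str,
                     partition_by: Sequence[str], depth: int,
                     tree: ValueTree) -> List[str]:
    out: List[str] = []
    for value, subtree in tree.items():
        name = child_name(parent_table, value)
        stmt = (
            f"CREATE TABLE {quote_ident(schema)}.{quote_ident(name)} "
            f"PARTITION OF {quote_ident(schema)}.{quote_ident(parent_table)} "
            f"FOR VALUES IN ({render_literal(value)})"
        )
        is_leaf = depth + 1 >= len(partition_by)
        if not is_leaf:
            # Intermediate levels are themselves partitioned by the next column.
            stmt += f" PARTITION BY LIST ({quote_ident(partition_by[depth + 1])})"
        out.append(stmt + ";")
        if not is_leaf:
            out.extend(
                _render_children(schema, name, partition_by, depth + 1, subtree)
            )
    return out


def render_partition_ddl(schema: str, table: str, columns_sql: str,
                         partition_by: Sequence[str],
                         tree: ValueTree) -> List[str]:
    """Return ordered DDL: the parent first, then each child before its own children."""
    parent = f"{quote_ident(schema)}.{quote_ident(table)}"
    statements = [
        f"CREATE TABLE {parent} ({columns_sql}) "
        f"PARTITION BY LIST ({quote_ident(partition_by[0])});"
    ]
    statements.extend(_render_children(schema, table, partition_by, 0, tree))
    return statements
```

Because each child statement is appended before the recursion descends into its subtree, the returned list can be executed top to bottom, and the same list could be printed as the dry-run output.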
Risks and edge cases
- High-cardinality partition columns can generate very large partition trees, long DDL output, and slow Postgres planning.
- Empty strings in text columns currently become NULL on load because of COPY ... NULL ''; partition discovery must mirror that behavior or routing will be wrong.
- Different raw values can collide after sanitization or truncation; deterministic disambiguation is required.
- NULL partition values need explicit support in both DDL generation and child-table naming.
- Partitioned dry-runs become more expensive because they require scanning full source data rather than using only the schema preview.
- Multi-file clusters can still fail later on schema differences outside partition columns unless compatibility checks are broadened deliberately.
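The empty-string, collision, and NULL-naming risks above can be illustrated with a small sketch. The helper names (normalize_partition_value, assign_child_names) are hypothetical, the hash-suffix scheme is one possible disambiguation strategy rather than a settled design decision, and the 63-character cap reflects PostgreSQL's default identifier limit under the assumption that the parent name is ASCII.

```python
import hashlib
import re
from typing import Dict, Iterable

MAX_IDENT = 63  # PostgreSQL's default identifier length limit (bytes)


def normalize_partition_value(value):
    """Mirror COPY ... NULL '': an empty string is loaded as NULL."""
    return None if value is None or value == "" else value


def _slug(value) -> str:
    """Reduce a value to lowercase ASCII letters, digits, and underscores."""
    if value is None:
        return "null"
    return re.sub(r"[^a-z0-9]+", "_", str(value).lower()).strip("_") or "x"


def assign_child_names(parent: str, values: Iterable) -> Dict[object, str]:
    """Map each distinct normalized value to a unique child-table name.

    Values are processed in repr() order so the result does not depend
    on the order in which rows were discovered. On a name collision, a
    short hash of the raw value is appended (assumed strategy).
    """
    distinct = {normalize_partition_value(v) for v in values}
    names: Dict[object, str] = {}
    used = set()
    for value in sorted(distinct, key=repr):
        candidate = f"{parent}_{_slug(value)}"[:MAX_IDENT]
        if candidate in used:
            digest = hashlib.sha1(repr(value).encode("utf-8")).hexdigest()[:8]
            # Leave room for "_" plus the 8-character digest.
            candidate = f"{candidate[:MAX_IDENT - 9]}_{digest}"
        used.add(candidate)
        names[value] = candidate
    return names
```

For example, the raw values "A-B" and "a b" both sanitize to the same token, so the second one processed receives a hash suffix, while "" and None collapse into a single NULL partition. A production version would also need to re-check the suffixed name against the used set, since this sketch treats a digest clash as negligible.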
Acceptance criteria
- The design document specifies exact YAML changes, dataclass changes, new helper functions, modified functions, algorithms, error handling, dry-run behavior, and if_exists semantics.
- Single-file and folder flows are both covered, including per-cluster inheritance/override behavior.
- Child-table naming, literal rendering, warning semantics, and append-mode validation are precise enough to implement directly.
- The design explicitly identifies what remains unchanged, especially the COPY routing path.
Validation strategy
- Cross-check the plan against the current call graph and responsibilities in load_sas.py and load_folder.py.
- Prefer minimal-regression changes that preserve existing non-partitioned behavior.
- Include pseudocode and concrete examples for recursive partition DDL generation, cascading value discovery, null handling, and dry-run output.
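As a starting point for the cascading value discovery example, the sketch below accumulates a nested value tree across batches and files, counts the resulting child tables, and prints a threshold warning in the existing [warn] style. The function names are hypothetical, rows are modeled as plain dicts instead of the loader's real dataframes, and the assumption that every tree node (not only leaves) counts as one child table should be confirmed in the design.

```python
import sys
from typing import Dict, Iterable, Mapping, Optional, Sequence

ValueTree = Dict[object, "ValueTree"]  # nested dict: value -> subtree


def discover_partition_tree(batches: Iterable[Iterable[Mapping]],
                            partition_by: Sequence[str],
                            tree: Optional[ValueTree] = None) -> ValueTree:
    """Accumulate cascading unique values; pass `tree` back in to
    continue the same discovery across the next file of a cluster."""
    tree = {} if tree is None else tree
    for batch in batches:
        for row in batch:
            node = tree
            for column in partition_by:
                value = row[column]
                if value == "":
                    value = None  # mirror COPY ... NULL ''
                node = node.setdefault(value, {})
    return tree


def count_child_tables(tree: ValueTree) -> int:
    """Count every node: intermediate and leaf partitions are all tables."""
    return sum(1 + count_child_tables(sub) for sub in tree.values())


def warn_if_too_many(tree: ValueTree, max_partitions: Optional[int],
                     stream=None) -> int:
    """Emit a [warn] line when the plan exceeds max_partitions."""
    total = count_child_tables(tree)
    if max_partitions is not None and total > max_partitions:
        print(
            f"[warn] partition plan creates {total} child tables "
            f"(max_partitions={max_partitions})",
            file=stream if stream is not None else sys.stderr,
        )
    return total
```

A usage example with two files in one cluster, partitioned by region and then year:

```python
file_a = [[{"region": "EU", "year": 2023}, {"region": "EU", "year": 2024}]]
file_b = [[{"region": "US", "year": 2024}, {"region": "", "year": 2024}]]

tree = discover_partition_tree(file_a, ["region", "year"])
tree = discover_partition_tree(file_b, ["region", "year"], tree)
total = warn_if_too_many(tree, max_partitions=5)  # 7 child tables, so a warning is printed
```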
Documentation updates required
- Create generic_loader/PARTITION_DESIGN.md as the primary implementation-ready design artifact.
- Include exact sample YAML snippets for single-file and folder loaders.
- Document the dry-run cost change for partitioned loads and the append limitation that partitions are not auto-created.