foxtrot/PLAN.md
2026-04-20 09:56:00 -05:00

Plan: LIST Partition Support for generic_loader

Objective

Produce an implementation-ready design for adding PostgreSQL LIST partitioning to the generic loader flows in generic_loader/load_sas.py and generic_loader/load_folder.py, driven by YAML configuration and compatible with the current streaming COPY load path.

Current context

  • generic_loader/load_sas.py
    • LoaderConfig currently parses filename, schemaname, tablename, if_exists, include, and exclude.
    • Schema inference is based on read_sas_preview() plus infer_schema().
    • render_create_table() emits one non-partitioned CREATE TABLE statement.
    • create_table() handles if_exists=fail|replace|append for a single table.
    • copy_dataframes() streams rows into the target table with COPY ... FROM STDIN.
  • generic_loader/load_folder.py
    • FolderConfig carries folder defaults.
    • ClusterSpec carries resolved per-cluster load settings.
    • _ExplicitPattern stores optional per-cluster overrides for explicit matches.
    • discover_clusters() resolves cluster inheritance for if_exists, include, and exclude.
    • load_cluster() infers schema from the first file, creates one table, then streams all files into it.
  • Existing warnings are emitted to stderr as [warn] ...; the codebase does not currently use the logging module.
  • This task is design-only in architect mode; no production Python changes are being made here.

Assumptions and constraints

  • PostgreSQL native LIST partitioning only; no range/hash partitioning and no automatic creation of missing partitions during append.
  • Data continues to be copied to the parent table so PostgreSQL performs routing; copy_dataframes() behavior should remain unchanged.
  • Accurate partition creation requires a complete discovery pass across all incoming rows unless the preview is known to already contain the full dataset.
  • Folder-level partition settings should resolve into concrete per-cluster settings using the same inheritance style as current folder defaults.
  • if_exists=append must validate schema and partition-key compatibility against the existing parent table and must skip partition creation.
  • Documentation must be detailed enough for an implementer to modify code, a QA lead to derive test scenarios, and a docs lead to update user-facing instructions without guessing.
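
The inheritance rule for folder-level partition settings can be sketched as follows. This is a minimal illustration, not the loader's actual code: `PartitionSettings` and `resolve_partition_settings` are hypothetical names, plain dicts stand in for the parsed YAML, and two behaviors are assumptions (a key present on the cluster always wins, even when it is null or empty, and `max_partitions` defaults to no threshold).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PartitionSettings:
    """Hypothetical resolved partition settings for one cluster."""
    partition_by: tuple[str, ...] = ()
    max_partitions: Optional[int] = None  # assumption: None means no warning threshold


def resolve_partition_settings(
    folder_cfg: dict[str, Any], cluster_cfg: dict[str, Any]
) -> PartitionSettings:
    """Resolve cluster settings: a key present on the cluster overrides the folder default."""

    def pick(key: str) -> Any:
        # Presence of the key (not truthiness) decides, so a cluster can
        # explicitly disable partitioning with `partition_by: null` or `[]`.
        if key in cluster_cfg:
            return cluster_cfg[key]
        return folder_cfg.get(key)

    raw = pick("partition_by") or []
    if isinstance(raw, str):
        raw = [raw]  # accept a single column name as shorthand (assumption)
    return PartitionSettings(tuple(raw), pick("max_partitions"))
```

Resolving to an immutable value before the load starts keeps `load_cluster()` free of any fallback logic: it only ever sees concrete settings.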

Files and systems likely affected

  • generic_loader/load_sas.py
  • generic_loader/load_folder.py
  • generic_loader/sample_config.yaml
  • generic_loader/sample_folder_config.yaml
  • generic_loader/PARTITION_DESIGN.md
  • Potentially module/CLI docstrings in generic_loader/load_sas.py and generic_loader/load_folder.py

Implementation approach

  1. Extend YAML and dataclass config surfaces with partition_by and max_partitions.
  2. Add partition-planning helpers that:
    • validate partition columns,
    • normalize partition values consistently with the existing COPY preparation rules,
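    • discover the distinct values of each partition column, nested under the values of the preceding partition columns (cascading), across one file or an entire cluster,
    • count the resulting child partition tables and emit a warning when the count exceeds max_partitions.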
  3. Extend DDL rendering so the parent table can be declared with PARTITION BY LIST (...) and child tables can be emitted recursively with CREATE TABLE ... PARTITION OF ... FOR VALUES IN (...).
  4. Extend table creation rules:
    • replace drops the parent with CASCADE when partitioning is enabled and recreates the full tree,
    • fail errors if the parent exists,
    • append validates schema plus partition-key compatibility and does not create partitions.
  5. Extend dry-run output so partitioned loads print the full ordered DDL set and perform the required partition discovery pass.
  6. Extend folder orchestration so per-cluster partition settings inherit or override folder defaults in the same style as current config resolution.
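
Step 3 can be sketched as a recursive renderer over a nested value tree. This is only an illustration under stated assumptions: `render_partition_ddl`, `quote_ident`, and `quote_literal` are hypothetical helpers, the tree is a nested dict mapping each partition value (or `None` for NULL) to the subtree for the next column, all values are rendered as text literals, and child names use a placeholder `_p<index>` scheme rather than the final naming rules.

```python
from __future__ import annotations

from typing import Optional


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: Optional[str]) -> str:
    """Render a partition bound; None becomes the NULL keyword."""
    if value is None:
        return "NULL"
    # Assumes standard_conforming_strings=on, so only single quotes need escaping.
    return "'" + value.replace("'", "''") + "'"


def render_partition_ddl(
    schema: str, table: str, columns_sql: str, partition_by: list[str], tree: dict
) -> list[str]:
    """Return ordered DDL: the parent first, then each child before its own children."""
    parent_sql = f"{quote_ident(schema)}.{quote_ident(table)}"
    ddl = [
        f"CREATE TABLE {parent_sql} ({columns_sql}) "
        f"PARTITION BY LIST ({quote_ident(partition_by[0])});"
    ]

    def walk(parent_table: str, parent_ref: str, subtree: dict, depth: int) -> None:
        # Sort for deterministic output; NULL sorts last.
        ordered = sorted(subtree, key=lambda v: (v is None, "" if v is None else v))
        for index, value in enumerate(ordered):
            child_table = f"{parent_table}_p{index}"  # placeholder naming only
            child_ref = f"{quote_ident(schema)}.{quote_ident(child_table)}"
            stmt = (
                f"CREATE TABLE {child_ref} PARTITION OF {parent_ref} "
                f"FOR VALUES IN ({quote_literal(value)})"
            )
            has_next_level = depth + 1 < len(partition_by)
            if has_next_level:
                stmt += f" PARTITION BY LIST ({quote_ident(partition_by[depth + 1])})"
            ddl.append(stmt + ";")
            if has_next_level:
                walk(child_table, child_ref, subtree[value], depth + 1)

    walk(table, parent_sql, tree, 0)
    return ddl
```

Because every statement is appended before the statements for its children, the returned list can be executed or printed in order, which is the same ordering a dry run would need.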

Risks and edge cases

  • High-cardinality partition columns can generate very large partition trees, long DDL output, and slow PostgreSQL query planning.
  • Empty strings in text columns currently become NULL on load because of COPY ... NULL ''; partition discovery must mirror that behavior or routing will be wrong.
  • Different raw values can collide after sanitization or truncation; deterministic disambiguation is required.
  • NULL partition values need explicit support in both DDL generation and child-table naming.
  • Partitioned dry-runs become more expensive because they require scanning full source data rather than using only the schema preview.
  • Multi-file clusters can still fail later on schema differences outside partition columns unless compatibility checks are broadened deliberately.
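
The empty-string, NULL, and name-collision risks above can be illustrated with a small sketch. The helper names are hypothetical, and the naming scheme (sanitized label plus a short hash of the raw value) is one possible deterministic disambiguation, not a decision the plan has made. The 63-byte limit is PostgreSQL's default maximum identifier length; the sketch assumes ASCII parent names, so characters and bytes coincide.

```python
import hashlib
import re
from typing import Optional

MAX_IDENT_BYTES = 63  # PostgreSQL default: NAMEDATALEN - 1


def normalize_partition_value(value: object) -> Optional[str]:
    """Mirror COPY ... NULL '': None and the empty string both route as NULL."""
    if value is None:
        return None
    text = str(value)
    return None if text == "" else text


def child_table_name(parent: str, value: Optional[str]) -> str:
    """Build a child name that stays unique after sanitization and truncation."""
    if value is None:
        label = "null"
    else:
        # Collapse anything outside [a-z0-9] to underscores.
        label = re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_") or "x"
    # Hash the raw value so 'A-B' and 'a b' (both sanitized to 'a_b') differ;
    # repr() also keeps None distinct from the string 'None'.
    digest = hashlib.sha1(repr(value).encode("utf-8")).hexdigest()[:8]
    suffix = f"_{digest}"
    stem = f"{parent}_{label}"
    # Truncate the stem, never the hash, so the suffix always survives.
    return stem[: MAX_IDENT_BYTES - len(suffix)] + suffix
```

For example, `child_table_name("sales", "A-B")` and `child_table_name("sales", "a b")` share the readable prefix `sales_a_b_` but end in different hash suffixes, and the same input always yields the same name across runs.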

Acceptance criteria

  • The design document specifies exact YAML changes, dataclass changes, new helper functions, modified functions, algorithms, error handling, dry-run behavior, and if_exists semantics.
  • Single-file and folder flows are both covered, including per-cluster inheritance/override behavior.
  • Child-table naming, literal rendering, warning semantics, and append-mode validation are precise enough to implement directly.
  • The design explicitly identifies what remains unchanged, especially the COPY routing path.

Validation strategy

  • Cross-check the plan against the current call graph and responsibilities in load_sas.py and load_folder.py.
  • Prefer minimal-regression changes that preserve existing non-partitioned behavior.
  • Include pseudocode and concrete examples for recursive partition DDL generation, cascading value discovery, null handling, and dry-run output.
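
As a starting point for the cascading value discovery pseudocode, here is a minimal sketch. It makes several assumptions: rows are plain mappings grouped into chunks (the real loader would presumably iterate the same chunks it streams to COPY), empty strings are treated as NULL, the count includes every child table at every level, and the function names are hypothetical. The warning uses the `[warn]` stderr convention already present in the codebase.

```python
from __future__ import annotations

import sys
from typing import Iterable, Mapping, Optional


def discover_partition_tree(
    chunks: Iterable[Iterable[Mapping[str, object]]], partition_by: list[str]
) -> dict:
    """Collect distinct values per level, nested under their parent values."""
    tree: dict = {}
    for chunk in chunks:
        for row in chunk:
            node = tree
            for column in partition_by:
                value = row[column]
                # Empty string and None both map to the NULL partition.
                key = None if value is None or value == "" else str(value)
                node = node.setdefault(key, {})
    return tree


def count_child_tables(tree: dict) -> int:
    """Count every child table, including intermediate sub-partitioned ones."""
    return sum(1 + count_child_tables(sub) for sub in tree.values())


def warn_if_over_threshold(tree: dict, max_partitions: Optional[int]) -> int:
    """Emit a [warn] line on stderr when the plan exceeds max_partitions."""
    total = count_child_tables(tree)
    if max_partitions is not None and total > max_partitions:
        print(
            f"[warn] partition plan creates {total} child tables "
            f"(max_partitions={max_partitions})",
            file=sys.stderr,
        )
    return total
```

Because the tree is built incrementally with `setdefault`, memory grows with the number of distinct value combinations rather than with the number of rows, which suits a full streaming pass over a multi-file cluster.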

Documentation updates required

  • Create generic_loader/PARTITION_DESIGN.md as the primary implementation-ready design artifact.
  • Include exact sample YAML snippets for single-file and folder loaders.
  • Document the dry-run cost change for partitioned loads and the append limitation that partitions are not auto-created.