michael-corey 0d955eeab1 adding partition flag

2026-04-20 09:56:00 -05:00

Partition Feature Design for generic_loader

1. Objective

Add PostgreSQL LIST partitioning support to load_sas.py and load_folder.py without changing the existing streaming COPY data path in copy_dataframes(). The feature must be YAML-driven, must support cascading partition levels, and must keep non-partitioned behavior unchanged.

2. Current baseline

Single-file loader

The single-file path is centered in generic_loader/load_sas.py:

LoaderConfig stores file path, target schema/table, if_exists, and column filters.
load_config() parses YAML.
read_sas_preview() reads a bounded preview for schema inference.
infer_schema() infers Postgres column types.
render_create_table() renders one non-partitioned CREATE TABLE statement.
create_table() executes table creation or append/replace checks.
copy_dataframes() streams chunks into the target table via COPY ... FROM STDIN.

Folder loader

The folder path is centered in generic_loader/load_folder.py:

ClusterSpec stores resolved per-cluster load settings.
_ExplicitPattern stores raw per-cluster YAML overrides.
FolderConfig stores folder defaults.
load_folder_config() parses folder YAML.
discover_clusters() resolves inheritance and groups files.
load_cluster() creates a table from the first file and streams every file in the cluster into it.

Important current behaviors to preserve

copy_dataframes() copies into exactly one qualified table name and should remain unchanged.
create_table() owns if_exists semantics and should remain the single gate for fail/replace/append behavior.
Warnings are currently emitted to stderr as [warn] ..., for example in _assert_schema_compatible(), and the feature should follow that pattern instead of introducing a repository-wide logging refactor.

3. Scope and non-goals

In scope

Optional YAML partition_by support.
Configurable max_partitions threshold with default 10000.
Single-level and multi-level cascading LIST partitions.
Partition value discovery from the incoming dataset at runtime.
Recursive DDL generation for parent and child partitions.
Folder-level defaults plus per-cluster overrides.
Dry-run output for the full DDL tree.

Explicitly out of scope for this implementation

RANGE or HASH partitioning.
Expression-based partition keys.
Changing row-routing behavior in copy_dataframes().
Automatically creating missing partitions in append mode.
Reworking manifest validation in validate_against_manifest().

4. YAML schema changes

4.1 Single-file config

Update the sample shape documented by generic_loader/sample_config.yaml to include partition_by and max_partitions.

Proposed exact example

filename: samples/sample_kitchensink.xpt
schemaname: public
tablename: kitchensink

# Optional. If set, only these columns are loaded. Mutually exclusive with exclude.
# include:
#   - ID
#   - INTCOL
#   - DATECOL

# Optional. Columns to drop.
# exclude:
#   - ALLNULL

# Optional. Create cascading LIST partitions in this order.
# Omit or set [] for no partitioning.
partition_by:
  - state
  - zip

# Optional. Warn if the load would create more than this many partition tables.
# The load continues. Default: 10000.
max_partitions: 10000

# What to do if the target table already exists: fail | replace | append
# Defaults to fail.
if_exists: append

Parsing and validation rules

partition_by is optional.
Omitted, null, or [] means "not partitioned".
When present and non-empty, it must be a YAML sequence of non-empty strings.
Order matters. ['state', 'zip'] means state is level 1 and zip is level 2.
Duplicate names are invalid.
If include is present, every partition_by column must be included.
If exclude is present, no partition_by column may be excluded.
max_partitions is optional and defaults to 10000.
max_partitions must be an integer greater than 0.
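
The rules above could be enforced roughly as follows. This is a minimal sketch, not the loader's actual code: the function names, the keep_empty flag, and the exact-string comparison of column names are assumptions made for illustration. It operates on values already parsed from YAML.

```python
from typing import Any, List, Optional

DEFAULT_MAX_PARTITIONS = 10000


def parse_partition_by(raw: Any, *, keep_empty: bool = False) -> Optional[List[str]]:
    """Validate a parsed partition_by value.

    keep_empty=True preserves [] (hypothetical flag for cluster-level
    opt-out); otherwise an empty list is normalized to None.
    """
    if raw is None:
        return None
    if not isinstance(raw, list):
        raise ValueError("partition_by must be a list of column names")
    if not raw:
        return [] if keep_empty else None
    seen = set()
    for item in raw:
        if not isinstance(item, str) or not item.strip():
            raise ValueError(f"partition_by entries must be non-empty strings: {item!r}")
        if item in seen:
            raise ValueError(f"duplicate partition_by column: {item!r}")
        seen.add(item)
    return list(raw)


def parse_max_partitions(raw: Any) -> int:
    """Return the threshold, defaulting to 10000 when omitted."""
    if raw is None:
        return DEFAULT_MAX_PARTITIONS
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, int) or raw <= 0:
        raise ValueError(f"max_partitions must be an integer greater than 0: {raw!r}")
    return raw


def check_filter_conflicts(partition_by, include=None, exclude=None) -> None:
    """Reject include/exclude filters that would remove a partition column."""
    for col in partition_by or []:
        if include is not None and col not in include:
            raise ValueError(f"partition column {col!r} is missing from include")
        if exclude is not None and col in exclude:
            raise ValueError(f"partition column {col!r} is listed in exclude")
```

Rejecting bool separately matters because YAML `max_partitions: true` would otherwise pass an isinstance(raw, int) check.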

4.2 Folder config

Update the sample shape documented by generic_loader/sample_folder_config.yaml to include folder defaults and per-cluster overrides.

Proposed exact example

folder: samples/folder_test
schemaname: public

# Applied when creating the first file of each cluster.
# One of: fail | replace | append. Default: fail.
if_exists: replace

# When true (default), any file not matched by an explicit pattern below is
# auto-grouped with its peers.
auto_detect: true

# Optional folder-level column filter.
# include:
#   - ID
#   - INTCOL
# exclude:
#   - ALLNULL

# Optional folder default for LIST partitioning.
partition_by:
  - state
  - zip

# Optional folder default threshold. Default: 10000.
max_partitions: 10000

clusters:
  - pattern: '^group_a\d+\.xpt$'
    tablename: group_a
    # Inherits folder-level partition_by and max_partitions.

  - pattern: '^group_b\d+\.xpt$'
    tablename: group_b
    partition_by:
      - state
    max_partitions: 2000

  - pattern: '^standalone\.xpt$'
    tablename: standalone
    partition_by: []   # Explicit opt-out of the folder default.

Folder override rules

Folder-level partition_by and max_partitions behave as defaults.
In an explicit cluster entry:
- if partition_by is omitted, inherit the folder-level value;
- if partition_by is a non-empty list, replace the folder-level value;
- if partition_by: [], explicitly disable partitioning for that cluster.
Cluster-level max_partitions overrides the folder-level threshold when present.
The resolved per-cluster rules should follow the same pattern already used by discover_clusters() for if_exists, include, and exclude.
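
As an illustration of these override rules, the resolution step might look like the sketch below. The function name and its flat argument list are hypothetical; in the real code the values would come from _ExplicitPattern and FolderConfig.

```python
from typing import List, Optional, Tuple


def resolve_partition_settings(
    cluster_partition_by: Optional[List[str]],
    cluster_max_partitions: Optional[int],
    folder_partition_by: Optional[List[str]],
    folder_max_partitions: int,
) -> Tuple[Optional[List[str]], int]:
    """Apply cluster overrides on top of folder defaults."""
    # None means "omitted in YAML": inherit the folder value.
    # [] means "explicitly disabled": it wins over the folder value.
    if cluster_partition_by is not None:
        partition_by = cluster_partition_by
    else:
        partition_by = folder_partition_by

    if cluster_max_partitions is not None:
        max_partitions = cluster_max_partitions
    else:
        max_partitions = folder_max_partitions

    # Collapse [] to None so downstream code sees a single
    # "not partitioned" representation.
    return (partition_by or None), max_partitions


# The three clusters from the example config above:
folder_defaults = (["state", "zip"], 10000)
group_a = resolve_partition_settings(None, None, *folder_defaults)
group_b = resolve_partition_settings(["state"], 2000, *folder_defaults)
standalone = resolve_partition_settings([], None, *folder_defaults)
```

The `is not None` test is the important detail: a plain truthiness check would treat the explicit opt-out [] the same as an omitted key and wrongly inherit the folder default.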

5. Dataclass changes

5.1 Existing public config dataclasses

`LoaderConfig`

Add:

partition_by: Optional[List[str]] = None
max_partitions: int = 10000

`ClusterSpec`

Add resolved fields:

partition_by: Optional[List[str]]
max_partitions: int

`_ExplicitPattern`

Add raw optional override fields:

partition_by: Optional[List[str]] = None
max_partitions: Optional[int] = None

Notes:

Preserve partition_by=[] when it appears in the YAML so discover_clusters() can distinguish explicit disable from inheritance.
max_partitions remains None when omitted so folder inheritance can resolve it later.

`FolderConfig`

Add:

partition_by: Optional[List[str]] = None
max_partitions: int = 10000

5.2 Recommended new internal helper dataclasses

These are not required to be public, but they make the implementation substantially safer and clearer.

Recommended `PartitionNode`

Suggested fields:

field_name: str
value: Any  # the normalized value Postgres will see during COPY; use None for SQL NULL
table_name: str
children: List[PartitionNode] = field(default_factory=list)

Recommended `PartitionPlan`

Suggested fields:

fields: List[str]
roots: List[PartitionNode]
total_partition_tables: int

The implementation can use nested dicts instead, but an explicit plan object reduces naming, recursion, and dry-run bugs.
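
A possible shape for the two helper dataclasses, using only the fields suggested above. The count_partition_tables() helper is an addition for illustration and is not part of the proposed design.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class PartitionNode:
    field_name: str
    value: Any  # normalized value; None stands for SQL NULL
    table_name: str
    children: List["PartitionNode"] = field(default_factory=list)


@dataclass
class PartitionPlan:
    fields: List[str]
    roots: List[PartitionNode] = field(default_factory=list)
    total_partition_tables: int = 0


def count_partition_tables(nodes: List[PartitionNode]) -> int:
    """Count every node in the tree (hypothetical helper, for cross-checking)."""
    return sum(1 + count_partition_tables(n.children) for n in nodes)


# Example: one state partition with one zip sub-partition.
leaf = PartitionNode("zip", "60601", "customers_ca_60601")
state = PartitionNode("state", "CA", "customers_ca", children=[leaf])
plan = PartitionPlan(
    fields=["state", "zip"],
    roots=[state],
    total_partition_tables=count_partition_tables([state]),
)
```

Using default_factory=list rather than a shared default avoids every node pointing at the same children list.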

6. New functions needed

The exact names may vary, but the design should introduce helpers with the responsibilities below.

6.1 Config parsing helpers

Recommended `_parse_partition_by()`

Purpose:

Parse partition_by from YAML.
Enforce list-of-strings validation.
Normalize omitted/empty top-level values to None.
Preserve cluster-level empty list [] long enough for override resolution.

Recommended `_parse_max_partitions()`

Purpose:

Parse and validate max_partitions.
Enforce positive integer semantics.

6.2 Partition validation helpers

Recommended `_validate_partition_columns()`

Purpose:

Ensure every requested partition column exists after apply_column_filter().
Fail early if a partition column was removed by include or exclude.
Produce context-rich errors that name the config, file, or cluster.

Recommended `_assert_partition_compatible()`

Purpose:

In append mode, verify that the existing parent table is LIST-partitioned on the same ordered keys.
Reuse SchemaCompatibilityError for incompatibility.

Expected catalog check:

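Query pg_partitioned_table for partstrat and partattrs.
Query pg_attribute in partattrs order to get the key columns.
Require partstrat = 'l' at every level inspected.
Require the ordered key list to exactly equal the resolved partition_by list. With cascading LIST partitions, the root table's partattrs holds only the first field; each later field is the key of the sub-partitioned children at that depth. The ordered list therefore has to be assembled by following pg_inherits from the root through one child per level, not read from the root alone.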

6.3 Partition discovery helpers

Recommended `discover_partition_values()`

Purpose:

Scan an iterable of filtered DataFrames.
Normalize the partition columns the same way _prepare_for_copy() will normalize them for COPY.
Build a cascading partition tree scoped by parent value.
Count the child partition tables that will be created.

Suggested input shape:

dfs: Iterable[pd.DataFrame]
columns: Dict[str, ColumnSpec]
partition_by: List[str]
root_table_name: str

Suggested output shape:

PartitionPlan

Recommended `_warn_if_partition_count_exceeds()`

Purpose:

Emit [warn] ... to stderr if plan.total_partition_tables > max_partitions.
Never abort the load.

6.4 Naming and literal helpers

Recommended `_sanitize_partition_token()`

Purpose:

Convert a normalized partition value into a safe, deterministic table-name suffix.

Recommended `_build_partition_table_name()`

Purpose:

Combine parent name and sanitized token.
Enforce Postgres identifier-length limits.
Resolve collisions deterministically.

Recommended `_render_partition_literal()`

Purpose:

Render one value for FOR VALUES IN (...).
Preserve the exact routed value Postgres will see during COPY.

6.5 DDL rendering helpers

Recommended `render_partition_ddl()`

Purpose:

Render child CREATE TABLE ... PARTITION OF ... statements recursively.

Recommended `render_create_table_statements()`

Purpose:

Return the full ordered statement list for dry-run and actual execution.
Keep the root statement first.
Append recursive child statements afterward.

6.6 Optional shared warning helper

Recommended `_warn()`

Purpose:

Centralize the existing [warn] ... stderr behavior.
Let both existing schema warnings and new partition warnings share one implementation.

7. Modified functions

7.1 `load_config()`

Modify to:

Parse partition_by.
Parse max_partitions.
Validate include/exclude conflicts with partition_by.
Return the new fields in LoaderConfig.

7.2 `render_create_table()`

Modify signature to accept optional partition metadata:

partition_by: Optional[List[str]] = None

Behavior:

If partition_by is falsy, keep current output unchanged.
If partition_by is present, append PARTITION BY LIST (<first field>) to the parent statement.
This function should still render only the parent statement; child statements belong in render_partition_ddl().

Example parent output:

CREATE TABLE "public"."customers" (
    "state" TEXT,
    "zip" TEXT,
    "name" TEXT
) PARTITION BY LIST ("state");

7.3 `_drop_table()`

Add an optional cascade: bool = False parameter.

Behavior:

Non-partitioned replace keeps current plain DROP TABLE behavior.
Partitioned replace uses DROP TABLE <qualified> CASCADE so the parent drop removes all partitions.

7.4 `create_table()`

Extend signature to accept:

partition_by: Optional[List[str]] = None
partition_plan: Optional[PartitionPlan] = None

Behavior:

Preserve current if_exists validation.
For non-partitioned loads, preserve current behavior.
For partitioned loads:
- fail: if the parent table exists, raise TableExistsError.
- replace: if the parent exists, drop it with CASCADE, then recreate the full tree.
- append: run _assert_schema_compatible() plus _assert_partition_compatible(), then return without creating any partitions.
When creation is needed, execute the full statement list returned by render_create_table_statements().
Reject partition_by without a computed partition_plan when creation or dry-run rendering needs it.

7.5 `_prepare_for_copy()`

Recommended refactor:

Extract or share the per-column normalization logic so partition discovery can use the same conversion rules.
Do not change external behavior of the returned DataFrame.

Reason:

The partition discovery pass must reason about the same values Postgres will actually receive.
The most important special case is text columns, where empty strings currently become SQL NULL because copy_dataframes() uses NULL ''.

7.6 `main()` in load_sas.py

Modify the single-file flow as follows:

Load config.
Read preview and infer schema exactly as today.
Validate that partition columns exist after filtering.
If partition_by is set and the operation needs creation or dry-run rendering, run a full discovery pass over the file to build a PartitionPlan.
In dry-run mode, print the full DDL statement list rather than only the parent statement.
In live mode, pass partition_by and partition_plan into create_table().
Keep copy_dataframes() unchanged so data is copied to the parent table and Postgres routes rows automatically.

7.7 `load_folder_config()`

Modify to:

Parse folder-level partition_by and max_partitions.
Parse per-cluster partition_by and max_partitions.
Validate include/exclude conflicts against the applicable partition list where possible.
Preserve explicit partition_by: [] so cluster discovery can treat it as "disable inheritance".

7.8 `discover_clusters()`

Modify to resolve per-cluster partition settings.

For each resolved ClusterSpec:

partition_by = patt.partition_by if patt.partition_by is not None else cfg.partition_by
max_partitions = patt.max_partitions if patt.max_partitions is not None else cfg.max_partitions
normalize resolved empty list to None before storing on the final ClusterSpec

7.9 `load_cluster()`

Modify the cluster load order to:

Infer schema from the first file exactly as today.
Validate partition columns against that schema.
If the cluster is partitioned and the tree has to be created (that is, the load is not an append into an existing parent, which only needs compatibility verification), scan all files in the cluster to build one shared PartitionPlan.
Call create_table() with resolved partition_by and partition_plan.
Stream all files into the parent table exactly as today.

7.10 `main()` in load_folder.py

Modify dry-run behavior:

keep cluster discovery output;
for each loadable cluster, print full DDL, not only one CREATE TABLE statement;
when a cluster is partitioned, perform partition discovery across every file in that cluster, not only the first file.

Also update the --dry-run help text because the current wording says the schema is inferred from only the first file of the cluster.

8. Partition value discovery algorithm

8.1 High-level rules

Discovery operates on filtered data, meaning after the same column filter logic used by apply_column_filter().
Discovery must use the same semantic values that Postgres will see during COPY, not raw pandas object identity.
The scan should be streaming and chunk-based to avoid materializing the full file or cluster in memory.
The resulting tree must scope each level under its parent so deeper values are not treated as globally unique.

8.2 Normalization rules for partition keys

Partition discovery should normalize each partition column using the same type-aware logic already embodied in _prepare_for_copy(), with the following behavior:

Integer-like columns (INTEGER, BIGINT, SMALLINT): coerce object values through numeric conversion, treat blank strings and NaN as NULL.
Floating/numeric columns (DOUBLE PRECISION, REAL, NUMERIC): numeric conversion, NaN becomes NULL.
Date columns: normalize to datetime.date or NULL.
Timestamp columns: normalize to datetime.datetime or NULL.
Time columns: normalize through the existing time conversion path or NULL.
Text-like columns: None, pandas nulls, and '' all become semantic NULL, because copy_dataframes() runs COPY with NULL '' and Postgres therefore reads an empty string as NULL.
Boolean columns: normalize to True, False, or NULL.

This means partition discovery deduplicates on the routed value, not the raw source representation. For example, '00123' and 123 in an integer partition column should produce one partition value 123, not two separate partitions.
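
To make the deduplication rule concrete, here is a per-value sketch covering only the integer and text cases. The function name and the type sets are assumptions; the real logic should be shared with _prepare_for_copy() rather than reimplemented, and pandas-specific nulls such as pd.NA and NaT are not handled here.

```python
import math
from typing import Any, Optional

INTEGER_TYPES = {"INTEGER", "BIGINT", "SMALLINT"}
TEXT_TYPES = {"TEXT"}  # assumption: the inferred text type is TEXT


def _is_missing(value: Any) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def normalize_partition_value(value: Any, pg_type: str) -> Optional[Any]:
    """Return the value Postgres would route on, or None for SQL NULL."""
    if _is_missing(value):
        return None
    if pg_type in INTEGER_TYPES:
        if isinstance(value, str):
            text = value.strip()
            if not text:
                return None  # blank string is NULL
            return int(text)  # int() accepts leading zeros
        if isinstance(value, float) and not value.is_integer():
            raise ValueError(f"non-integer value in integer partition column: {value!r}")
        return int(value)
    if pg_type in TEXT_TYPES:
        text = str(value)
        return text if text != "" else None  # COPY uses NULL ''
    return value  # other types are out of scope for this sketch


# '00123', 123, and 123.0 collapse to one routed value.
routed = {normalize_partition_value(v, "INTEGER") for v in ["00123", 123, 123.0]}
```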

8.3 Discovery pseudocode

def discover_partition_values(dfs, columns, partition_by, root_table_name):
    validate_partition_columns(columns, partition_by)

    plan = PartitionPlan(fields=partition_by, roots=[], total_partition_tables=0)
    root_index = {}   # normalized value -> PartitionNode for depth 0
    child_index = {}  # id(node) -> {normalized value -> child PartitionNode}

    for df in dfs:
        if df.empty:
            continue

        part_df = df[partition_by].copy()
        # Must map every missing value (NaN, NaT, pd.NA, '') to None so that
        # NULL keys compare equal across chunks.
        part_df = normalize_partition_frame(part_df, columns)
        unique_paths = part_df.drop_duplicates()

        for path in unique_paths.itertuples(index=False, name=None):
            parent_table = root_table_name
            parent_children = plan.roots
            parent_index = root_index

            for depth, value in enumerate(path):
                node = parent_index.get(value)
                if node is None:
                    node = PartitionNode(
                        field_name=partition_by[depth],
                        value=value,
                        table_name=build_partition_table_name(parent_table, value),
                    )
                    parent_index[value] = node
                    child_index[id(node)] = {}  # persistent per-node index
                    parent_children.append(node)
                    plan.total_partition_tables += 1

                # Descend: the next level is scoped to this node only.
                parent_table = node.table_name
                parent_children = node.children
                parent_index = child_index[id(node)]

    sort_every_node_deterministically(plan)  # stable DDL order across runs
    return plan

8.4 Efficient implementation notes

The scan should retain only the partition columns for the current chunk after filtering.
The in-memory structure should grow only with the number of unique partition nodes, not the number of rows.
Reading partition values from the preview frame is only valid when that frame is known to contain the entire dataset. In the current CLI flow, the preview is normally not exhaustive, so partitioned loads should perform a full chunked scan.
A future optimization may add optional reader-level column pruning to iter_sas_chunks() and read_sas_preview(), but that is not required for correctness.
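
The memory claim above can be illustrated with the nested-dict alternative mentioned in section 5.2. This sketch assumes each chunk has already been reduced to normalized key tuples (plain Python sequences rather than DataFrames), and the name discover_tree is hypothetical.

```python
from typing import Any, Dict, Iterable, Sequence, Tuple

Tree = Dict[Any, dict]


def discover_tree(
    chunks: Iterable[Sequence[Tuple[Any, ...]]],
) -> Tuple[Tree, int]:
    """Build a parent-scoped tree of partition values from streamed chunks.

    Returns the tree and the number of partition tables it implies.
    """
    tree: Tree = {}
    total = 0
    for chunk in chunks:
        # Deduplicate within the chunk first, then merge into the tree.
        for path in set(chunk):
            level = tree
            for value in path:
                if value not in level:
                    level[value] = {}
                    total += 1
                level = level[value]  # descend: values are scoped by parent
    return tree, total


chunks = [
    [("CA", "90001"), ("CA", "90001"), ("IL", "60601")],
    [("CA", "60601"), ("IL", "60601")],
]
tree, total = discover_tree(chunks)
```

Only the tree survives between chunks, so memory grows with the number of distinct nodes. Note that "60601" is recorded separately under CA and IL, which is the parent scoping required by section 8.1.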

9. DDL generation algorithm

9.1 Root table

If partition_by is set, the parent statement produced by render_create_table() must end with:

PARTITION BY LIST ("<first partition field>")

The parent still contains the full column list.

9.2 Child tables

For each discovered node:

if it is not the last partition level, create a child partition that is itself subpartitioned by the next field;
if it is the last partition level, create a leaf partition with no further PARTITION BY clause.

Examples for partition_by: [state, zip]:

CREATE TABLE "public"."customers_ca"
PARTITION OF "public"."customers"
FOR VALUES IN ('CA')
PARTITION BY LIST ("zip");

CREATE TABLE "public"."customers_ca_60601"
PARTITION OF "public"."customers_ca"
FOR VALUES IN ('60601');

9.3 DDL rendering pseudocode

def render_create_table_statements(schema, table, columns, partition_by, plan):
    statements = [render_create_table(schema, table, columns, partition_by=partition_by)]
    if partition_by:
        statements.extend(render_partition_ddl(schema, table, columns, partition_by, plan.roots, depth=0))
    return statements


def render_partition_ddl(schema, parent_table, columns, partition_by, nodes, depth):
    field_name = partition_by[depth]
    next_field = partition_by[depth + 1] if depth + 1 < len(partition_by) else None
    field_spec = columns[field_name]
    statements = []

    for node in nodes:
        literal = render_partition_literal(node.value, field_spec)
        if next_field is None:
            statements.append(
                f'CREATE TABLE {qualified(schema, node.table_name)} '
                f'PARTITION OF {qualified(schema, parent_table)} '
                f'FOR VALUES IN ({literal});'
            )
        else:
            statements.append(
                f'CREATE TABLE {qualified(schema, node.table_name)} '
                f'PARTITION OF {qualified(schema, parent_table)} '
                f'FOR VALUES IN ({literal}) '
                f'PARTITION BY LIST ({quote_ident(next_field)});'
            )
            statements.extend(
                render_partition_ddl(
                    schema,
                    node.table_name,
                    columns,
                    partition_by,
                    node.children,
                    depth + 1,
                )
            )

    return statements

9.4 Statement order

Emit statements in this order:

parent table;
each level-1 child;
that child’s descendants before moving to the next sibling.

This depth-first order guarantees that every parent exists before its children are created.
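
The ordering can be checked with a small runnable version of the child renderer. This is a sketch limited to text partition values; Node, quote_ident(), qualified(), and text_literal() are stand-ins written for this example, not the loader's existing helpers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    value: Optional[str]
    table_name: str
    children: List["Node"] = field(default_factory=list)


def quote_ident(name: str) -> str:
    return '"' + name.replace('"', '""') + '"'


def qualified(schema: str, table: str) -> str:
    return f"{quote_ident(schema)}.{quote_ident(table)}"


def text_literal(value: Optional[str]) -> str:
    return "NULL" if value is None else "'" + value.replace("'", "''") + "'"


def render_children(schema, parent_table, partition_by, nodes, depth=0):
    """Render PARTITION OF statements depth-first, parents before children."""
    next_field = partition_by[depth + 1] if depth + 1 < len(partition_by) else None
    statements = []
    for node in nodes:
        stmt = (
            f"CREATE TABLE {qualified(schema, node.table_name)}\n"
            f"PARTITION OF {qualified(schema, parent_table)}\n"
            f"FOR VALUES IN ({text_literal(node.value)})"
        )
        if next_field is not None:
            stmt += f"\nPARTITION BY LIST ({quote_ident(next_field)})"
        statements.append(stmt + ";")
        if next_field is not None:
            statements.extend(
                render_children(schema, node.table_name, partition_by, node.children, depth + 1)
            )
    return statements


roots = [
    Node("CA", "customers_ca", [Node("60601", "customers_ca_60601")]),
    Node("NY", "customers_ny", [Node("10001", "customers_ny_10001")]),
]
ddl = render_children("public", "customers", ["state", "zip"], roots)
created = [s.splitlines()[0] for s in ddl]
```

Here created lists customers_ca, customers_ca_60601, customers_ny, customers_ny_10001 in that order, so no statement references a table that has not been created yet.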

10. Table-name sanitization rules

The child-table name rule must be deterministic and explicit.

10.1 Base token generation

For each normalized partition value:

Convert to a display token:
- None -> null
- datetime.date, datetime.time, datetime.datetime -> isoformat() string
- everything else -> str(value)
Lowercase the token.
Replace every run of one or more characters outside a-z and 0-9 with _.
Trim leading and trailing _.
If the result is empty, use the literal fallback token value.

Examples:

CA -> ca
New York -> new_york
60601-1234 -> 60601_1234
NULL -> null
*** -> value
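
These steps translate almost directly into code. The sketch below reads "alphanumeric" as ASCII a-z and 0-9, in line with the ASCII-only suffixes assumed in section 10.3; the function name follows the recommendation in section 6.4 without the leading underscore.

```python
import datetime
import re
from typing import Any

_NON_ALNUM_RUN = re.compile(r"[^a-z0-9]+")


def sanitize_partition_token(value: Any) -> str:
    """Turn a normalized partition value into a table-name suffix."""
    if value is None:
        token = "null"
    elif isinstance(value, (datetime.date, datetime.time, datetime.datetime)):
        token = value.isoformat()
    else:
        token = str(value)
    # Lowercase, collapse every non-alphanumeric run to "_", trim the ends.
    token = _NON_ALNUM_RUN.sub("_", token.lower()).strip("_")
    return token or "value"  # fallback when nothing usable is left
```

Because the mapping is lossy (for example, SQL NULL and the text 'null' both yield null), the result is only a candidate suffix; uniqueness is the job of the collision rule in section 10.4.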

10.2 Final child name

Child names are:

{parent_table}_{sanitized_token}

Examples:

customers + CA -> customers_ca
customers_ca + 60601 -> customers_ca_60601

10.3 Length limit

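Postgres identifiers are limited to 63 bytes in a default build (NAMEDATALEN - 1); longer names are truncated by the server, which could silently merge two child names. The implementation should treat 63 characters as the working limit because the loader currently emits ASCII-only sanitized suffixes. That equivalence holds only while the parent table name is ASCII as well.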

Rules:

If len(parent_table) >= 62, fail fast with a clear error because there is no room for _x.
Otherwise, reserve len(parent_table) + 1 characters for the prefix and underscore.
Truncate only the sanitized token, not the parent prefix.
If truncation makes two child names collide, append a deterministic short hash.

10.4 Collision handling

Different raw values can sanitize to the same token, for example:

A-B -> a_b
A B -> a_b

Recommended collision rule:

First candidate: parent_a_b
On collision, append _<hash8> derived from the exact normalized value for that node.
Re-truncate the base token as needed so the final name still fits the 63-character limit.

Example:

parent_a_b
parent_a_b_f15c2d19

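This keeps names deterministic across runs, provided colliding siblings are named in a stable order (for example, sorted by normalized value) rather than in scan order; otherwise which value receives the un-hashed name depends on discovery order.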
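
The length and collision rules from sections 10.3 and 10.4 can be combined as in the sketch below. Several details are assumptions: the signature, the use of SHA-1 for the 8-character hash, the caller-supplied set of names already taken, and failing when the hash suffix itself does not fit.

```python
import hashlib
from typing import Set

MAX_IDENTIFIER_LENGTH = 63


def build_partition_table_name(
    parent_table: str, token: str, normalized_repr: str, taken: Set[str]
) -> str:
    """Build '<parent>_<token>', truncating and hashing as needed.

    normalized_repr is a stable string form of the exact normalized value;
    taken holds child names already assigned in this plan.
    """
    room = MAX_IDENTIFIER_LENGTH - len(parent_table) - 1  # after "<parent>_"
    if room < 1:
        raise ValueError(f"parent table name too long for child suffixes: {parent_table!r}")

    candidate = f"{parent_table}_{token[:room]}"
    if candidate in taken:
        digest = hashlib.sha1(normalized_repr.encode("utf-8")).hexdigest()[:8]
        base_room = room - len(digest) - 1  # leave space for "_<hash8>"
        if base_room < 1:
            raise ValueError(f"no room for a collision hash under {parent_table!r}")
        candidate = f"{parent_table}_{token[:base_room]}_{digest}"

    taken.add(candidate)
    return candidate


taken: Set[str] = set()
first = build_partition_table_name("parent", "a_b", "'A-B'", taken)
second = build_partition_table_name("parent", "a_b", "'A B'", taken)
```

The hash is derived from the value rather than from a counter, so the hashed name for a given value is the same on every run. This sketch does not guard against two hashed names colliding with each other.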

11. Partition literal rendering rules

The FOR VALUES IN (...) clause must use the exact routed value Postgres will receive after loader normalization.

Recommended rendering rules:

NULL -> NULL
text -> single-quoted with internal quotes escaped
integers / numerics -> unquoted numeric literal
boolean -> TRUE or FALSE
date -> DATE 'YYYY-MM-DD'
timestamp -> TIMESTAMP 'YYYY-MM-DD HH:MM:SS'
time -> TIME 'HH:MM:SS'

Important special case:

text '' must not render as ''; it must render as NULL because copy_dataframes() uses NULL ''.
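
One way to express these rendering rules is to dispatch on the Python type of the normalized value, as sketched below. The single-argument signature is a simplification for illustration (section 9.3 passes a column spec as well), and non-finite floats are rejected on the assumption that NaN has already been normalized to NULL.

```python
import datetime
import math
from decimal import Decimal
from typing import Any


def render_partition_literal(value: Any) -> str:
    """Render one normalized value for FOR VALUES IN (...)."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        if value == "":
            return "NULL"  # COPY uses NULL '', so '' routes as NULL
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, bool):  # must precede int: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, datetime.datetime):  # must precede date: subclass
        return "TIMESTAMP '" + value.isoformat(sep=" ", timespec="seconds") + "'"
    if isinstance(value, datetime.date):
        return "DATE '" + value.isoformat() + "'"
    if isinstance(value, datetime.time):
        return "TIME '" + value.isoformat(timespec="seconds") + "'"
    if isinstance(value, float) and not math.isfinite(value):
        raise ValueError(f"cannot render non-finite partition value: {value!r}")
    if isinstance(value, (int, float, Decimal)):
        return str(value)
    raise ValueError(f"cannot render partition value of type {type(value).__name__}")
```

The two ordering comments matter: checking int before bool would render True as the string True via str(), and checking date before datetime would drop the time portion.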

12. `if_exists` interaction

12.1 `fail`

If the parent table exists, behavior is unchanged: raise TableExistsError.
No partition compatibility inspection is needed because the operation stops immediately.

12.2 `replace`

If the parent table exists and the config is partitioned, execute DROP TABLE <parent> CASCADE.
Recreate the parent plus every partition statement in one transaction.
If any statement fails, let the outer transaction's rollback preserve atomicity, so the drop and the partial recreate are undone together.

12.3 `append`

Required behavior:

Run _assert_schema_compatible() on the parent table exactly as today.
If partition_by is configured, also verify that the parent is LIST-partitioned on the same ordered keys.
Do not create any partitions.
Copy rows to the parent table and let Postgres route them.

Accepted limitation for v1:

If the existing partition tree does not contain a leaf partition for some incoming value, Postgres will fail during COPY with a native partition-routing error.
This design does not require preflight catalog validation of every leaf partition because that adds significant scope and catalog-parsing complexity.

13. Dry-run behavior

13.1 Single-file loader

Current dry-run behavior in main() prints only one statement from render_create_table(). For partitioned configs it should change to:

infer schema from the preview as today;
run full partition discovery over the file;
warn on stderr if total_partition_tables > max_partitions;
print the full ordered DDL statement list to stdout;
open no database connection.

Output format recommendation:

print statements separated by one blank line for readability;
do not print extra prose on stdout, so the output remains easy to paste into SQL tooling.

13.2 Folder loader

Current dry-run behavior in main() prints one CREATE TABLE per cluster based on the first file only. For partitioned clusters it should change to:

keep printing the discovered cluster summary;
for each loadable cluster, print a header such as --- DDL for cluster 'group_a' ---;
infer schema from the first file as today;
if the cluster is partitioned, scan all files in that cluster to build one shared PartitionPlan;
print the full ordered DDL statement list.

Important documentation note:

Partitioned dry-runs are now full-data scans over the partition columns and can take much longer than non-partitioned dry-runs.

14. Error handling

The implementation should handle failures at the earliest safe point with clear messages.

14.1 Config-time errors

Raise ValueError from load_config() or load_folder_config() for:

partition_by not being a list
empty or non-string items inside partition_by
duplicate partition column names
max_partitions <= 0
include omitting a partition column
exclude removing a partition column
cluster config specifying an invalid override shape

14.2 Runtime validation errors before DDL

Raise ValueError with file/cluster context for:

partition column not present after filtering
partition column absent from the inferred schema
parent table name too long to create child suffixes safely
a partition value that cannot be normalized or rendered into SQL

14.3 Append-time compatibility errors

Raise SchemaCompatibilityError for:

parent column mismatch detected by _assert_schema_compatible()
existing parent not being partitioned when partition_by is configured
existing parent using a partition strategy other than LIST
existing parent using a different ordered key list

14.4 Warning-only conditions

Emit [warn] ... to stderr, but continue, for:

total_partition_tables > max_partitions
existing warnings already emitted by _assert_schema_compatible()

Recommended warning message:

[warn] partition plan for public.customers will create 12,431 partition tables, exceeding max_partitions=10,000
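
A helper producing that message could look like the sketch below. The function name follows the recommendation in section 6.3 without the leading underscore; the boolean return value is an addition for testability, not part of the design.

```python
import sys


def warn_if_partition_count_exceeds(
    qualified_name: str, total_partition_tables: int, max_partitions: int
) -> bool:
    """Write a [warn] line to stderr when the plan exceeds the threshold.

    Never raises and never aborts the load; returns True if it warned.
    """
    if total_partition_tables <= max_partitions:
        return False
    print(
        f"[warn] partition plan for {qualified_name} will create "
        f"{total_partition_tables:,} partition tables, "
        f"exceeding max_partitions={max_partitions:,}",
        file=sys.stderr,
    )
    return True
```

Writing to stderr keeps the dry-run stdout stream limited to SQL, as section 13.1 requires.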

14.5 Postgres runtime errors left to bubble

Do not swallow driver/database exceptions for:

DDL execution failures
COPY failures caused by missing append-mode partitions
any transaction failure during live loading

The outer transaction handling in the main() functions of load_sas.py and load_folder.py should remain responsible for rollback.

15. Detailed single-file flow after the change

load_config
-> read_sas_preview
-> apply_column_filter
-> infer_schema
-> validate partition columns
-> if validate flag: run manifest validation
-> if partitioned and (dry-run or create needed): discover partition values from full file
-> if dry-run: print full DDL and exit
-> connect
-> create_table (with partition metadata)
-> copy_dataframes to parent table
-> commit / rollback exactly as today

Notes:

A partitioned live load usually requires one preview read, one full discovery pass, and one full load pass.
This is a deliberate tradeoff to ensure the full partition tree exists before any row is copied.

16. Detailed folder flow after the change

For each cluster in load_cluster():

infer schema from first file preview
-> validate partition columns
-> if partitioned and creation is needed: discover partition values across all files in the cluster
-> create_table (with partition metadata)
-> stream every file to the parent table
-> for later files, keep the existing append-mode schema compatibility check

Notes:

The partition plan is cluster-wide, not file-by-file.
All files in the cluster must route into one shared partition tree under the same parent table.

17. What remains unchanged

infer_schema() keeps its current type-inference behavior.
copy_dataframes() remains unchanged and still copies to the parent table.
assert_schema_compatible() remains the public wrapper for append compatibility.
Non-partitioned configs should continue to produce exactly one CREATE TABLE statement and the same load behavior as today.

18. Implementation sequencing

Recommended implementation order:

Extend config dataclasses and parsers.
Add partition parsing/validation helpers.
Add internal partition plan data structure.
Add partition discovery and literal/name helpers.
Extend DDL rendering.
Extend create_table() and _drop_table().
Wire the single-file flow.
Wire the folder flow and inheritance rules.
Update dry-run/help text and sample YAML files.

19. QA and validation matrix

The implementation should be validated against at least these scenarios:

Non-partitioned single-file load still behaves exactly as before.
Single-level text partitioning creates one child per unique value.
Multi-level cascading partitioning scopes child values to their parent.
NULL partition values create FOR VALUES IN (NULL) partitions.
Text empty strings route to the NULL partition, not ''.
Sanitization collision (A-B vs A B) resolves deterministically.
Very long child names truncate correctly and still remain unique.
max_partitions warning appears but the load continues.
replace drops the parent with CASCADE and recreates the full tree.
append rejects a parent with the wrong partition strategy or key order.
Folder-level partition_by is inherited by auto-detected clusters.
Explicit cluster partition_by overrides folder defaults.
Explicit cluster partition_by: [] disables a folder default.
Dry-run prints the full DDL tree and opens no connection.
Partitioned folder dry-run scans all files in the cluster, not just the first one.

20. Documentation updates required

In addition to implementing the code, update:

generic_loader/sample_config.yaml with partition_by and max_partitions comments and examples.
generic_loader/sample_folder_config.yaml with folder defaults, cluster overrides, and explicit opt-out examples.
The module-level usage text in load_sas.py so dry-run docs mention full DDL for partitioned tables.
The module-level usage text in load_folder.py so dry-run docs mention cluster-wide partition discovery.

21. Final design summary

The safest low-regression approach is:

keep the current schema inference path unchanged;
add a separate full-data partition discovery pass for partitioned loads;
render one parent CREATE TABLE plus recursive PARTITION OF child statements;
create or replace the full tree before copying any data;
leave copy_dataframes() unchanged so PostgreSQL handles routing;
keep append mode strict about parent compatibility and intentionally do not auto-create missing partitions.

That approach satisfies the feature requirements while containing code churn to config parsing, DDL rendering, runtime planning, and folder integration.
