adding partition flag

michael-corey 2026-04-20 09:56:00 -05:00
parent e39eb47a90
commit 0d955eeab1
6 changed files with 2033 additions and 26 deletions

84
PLAN.md Normal file

@ -0,0 +1,84 @@
# Plan: LIST Partition Support for generic_loader
## Objective
Produce an implementation-ready design for adding PostgreSQL LIST partitioning to the generic loader flows in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`, driven by YAML configuration and compatible with the current streaming `COPY` load path.
## Current context
- `generic_loader/load_sas.py`
  - `LoaderConfig` currently parses `filename`, `schemaname`, `tablename`, `if_exists`, `include`, and `exclude`.
  - Schema inference is based on `read_sas_preview()` plus `infer_schema()`.
  - `render_create_table()` emits one non-partitioned `CREATE TABLE` statement.
  - `create_table()` handles `if_exists=fail|replace|append` for a single table.
  - `copy_dataframes()` streams rows into the target table with `COPY ... FROM STDIN`.
- `generic_loader/load_folder.py`
  - `FolderConfig` carries folder defaults.
  - `ClusterSpec` carries resolved per-cluster load settings.
  - `_ExplicitPattern` stores optional per-cluster overrides for explicit matches.
  - `discover_clusters()` resolves cluster inheritance for `if_exists`, `include`, and `exclude`.
  - `load_cluster()` infers schema from the first file, creates one table, then streams all files into it.
- Existing warnings are emitted to stderr as `[warn] ...`; the codebase does not currently use the `logging` module.
- This task is design-only in architect mode; no production Python changes are being made here.
## Assumptions and constraints
- PostgreSQL native LIST partitioning only; no range/hash partitioning and no automatic creation of missing partitions during append.
- Data continues to be copied to the parent table so PostgreSQL performs routing; `copy_dataframes()` behavior should remain unchanged.
- Accurate partition creation requires a complete discovery pass across all incoming rows unless the preview is known to already contain the full dataset.
- Folder-level partition settings should resolve into concrete per-cluster settings using the same inheritance style as current folder defaults.
- `if_exists=append` must validate compatibility and skip partition creation.
- Documentation must be detailed enough for an implementer to modify code, a QA lead to derive test scenarios, and a docs lead to update user-facing instructions without guessing.
## Files and systems likely affected
- `generic_loader/load_sas.py`
- `generic_loader/load_folder.py`
- `generic_loader/sample_config.yaml`
- `generic_loader/sample_folder_config.yaml`
- `generic_loader/PARTITION_DESIGN.md`
- Potentially module/CLI docstrings in `generic_loader/load_sas.py` and `generic_loader/load_folder.py`
## Implementation approach
1. Extend YAML and dataclass config surfaces with `partition_by` and `max_partitions`.
2. Add partition-planning helpers that:
   - validate partition columns,
   - normalize partition values consistently with the existing `COPY` preparation rules,
   - discover cascading unique values across one file or an entire cluster,
   - count resulting child partition tables and emit a threshold warning.
3. Extend DDL rendering so the parent table can be declared with `PARTITION BY LIST (...)` and child tables can be emitted recursively with `CREATE TABLE ... PARTITION OF ... FOR VALUES IN (...)`.
4. Extend table creation rules:
   - `replace` drops the parent with `CASCADE` when partitioning is enabled and recreates the full tree,
   - `fail` errors if the parent exists,
   - `append` validates schema plus partition-key compatibility and does not create partitions.
5. Extend dry-run output so partitioned loads print the full ordered DDL set and perform the required partition discovery pass.
6. Extend folder orchestration so per-cluster partition settings inherit or override folder defaults in the same style as current config resolution.
## Risks and edge cases
- High-cardinality partition columns can generate very large partition trees, long DDL output, and slow Postgres planning.
- Empty strings in text columns currently become `NULL` on load because of `COPY ... NULL ''`; partition discovery must mirror that behavior or routing will be wrong.
- Different raw values can collide after sanitization or truncation; deterministic disambiguation is required.
- `NULL` partition values need explicit support in both DDL generation and child-table naming.
- Partitioned dry-runs become more expensive because they require scanning full source data rather than using only the schema preview.
- Multi-file clusters can still fail later on schema differences outside partition columns unless compatibility checks are broadened deliberately.
## Acceptance criteria
- The design document specifies exact YAML changes, dataclass changes, new helper functions, modified functions, algorithms, error handling, dry-run behavior, and `if_exists` semantics.
- Single-file and folder flows are both covered, including per-cluster inheritance/override behavior.
- Child-table naming, literal rendering, warning semantics, and append-mode validation are precise enough to implement directly.
- The design explicitly identifies what remains unchanged, especially the `COPY` routing path.
## Validation strategy
- Cross-check the plan against the current call graph and responsibilities in `load_sas.py` and `load_folder.py`.
- Prefer minimal-regression changes that preserve existing non-partitioned behavior.
- Include pseudocode and concrete examples for recursive partition DDL generation, cascading value discovery, null handling, and dry-run output.
## Documentation updates required
- Create `generic_loader/PARTITION_DESIGN.md` as the primary implementation-ready design artifact.
- Include exact sample YAML snippets for single-file and folder loaders.
- Document the dry-run cost change for partitioned loads and the `append` limitation that partitions are not auto-created.


@ -0,0 +1,938 @@
# Partition Feature Design for generic_loader
## 1. Objective
Add PostgreSQL LIST partitioning support to [`load_sas.py`](generic_loader/load_sas.py) and [`load_folder.py`](generic_loader/load_folder.py) without changing the existing streaming `COPY` data path in [`copy_dataframes()`](generic_loader/load_sas.py:1028). The feature must be YAML-driven, must support cascading partition levels, and must keep non-partitioned behavior unchanged.
## 2. Current baseline
### Single-file loader
The single-file path is centered in [`generic_loader/load_sas.py`](generic_loader/load_sas.py):
- [`LoaderConfig`](generic_loader/load_sas.py:273) stores file path, target schema/table, `if_exists`, and column filters.
- [`load_config()`](generic_loader/load_sas.py:350) parses YAML.
- [`read_sas_preview()`](generic_loader/load_sas.py:430) reads a bounded preview for schema inference.
- [`infer_schema()`](generic_loader/load_sas.py:637) infers Postgres column types.
- [`render_create_table()`](generic_loader/load_sas.py:756) renders one non-partitioned `CREATE TABLE` statement.
- [`create_table()`](generic_loader/load_sas.py:890) executes table creation or append/replace checks.
- [`copy_dataframes()`](generic_loader/load_sas.py:1028) streams chunks into the target table via `COPY ... FROM STDIN`.
### Folder loader
The folder path is centered in [`generic_loader/load_folder.py`](generic_loader/load_folder.py):
- [`ClusterSpec`](generic_loader/load_folder.py:137) stores resolved per-cluster load settings.
- [`_ExplicitPattern`](generic_loader/load_folder.py:148) stores raw per-cluster YAML overrides.
- [`FolderConfig`](generic_loader/load_folder.py:160) stores folder defaults.
- [`load_folder_config()`](generic_loader/load_folder.py:200) parses folder YAML.
- [`discover_clusters()`](generic_loader/load_folder.py:295) resolves inheritance and groups files.
- [`load_cluster()`](generic_loader/load_folder.py:385) creates a table from the first file and streams every file in the cluster into it.
### Important current behaviors to preserve
- [`copy_dataframes()`](generic_loader/load_sas.py:1028) copies into exactly one qualified table name and should remain unchanged.
- [`create_table()`](generic_loader/load_sas.py:890) owns `if_exists` semantics and should remain the single gate for fail/replace/append behavior.
- Warnings are currently emitted to stderr as `[warn] ...`, for example in [`_assert_schema_compatible()`](generic_loader/load_sas.py:826), and the feature should follow that pattern instead of introducing a repository-wide logging refactor.
## 3. Scope and non-goals
### In scope
- Optional YAML `partition_by` support.
- Configurable `max_partitions` threshold with default `10000`.
- Single-level and multi-level cascading LIST partitions.
- Partition value discovery from the incoming dataset at runtime.
- Recursive DDL generation for parent and child partitions.
- Folder-level defaults plus per-cluster overrides.
- Dry-run output for the full DDL tree.
### Explicitly out of scope for this implementation
- RANGE or HASH partitioning.
- Expression-based partition keys.
- Changing row-routing behavior in [`copy_dataframes()`](generic_loader/load_sas.py:1028).
- Automatically creating missing partitions in `append` mode.
- Reworking manifest validation in [`validate_against_manifest()`](generic_loader/load_sas.py:1102).
## 4. YAML schema changes
## 4.1 Single-file config
Update the sample shape documented by [`generic_loader/sample_config.yaml`](generic_loader/sample_config.yaml) to include `partition_by` and `max_partitions`.
### Proposed exact example
```yaml
filename: samples/sample_kitchensink.xpt
schemaname: public
tablename: kitchensink
# Optional. If set, only these columns are loaded. Mutually exclusive with exclude.
# include:
# - ID
# - INTCOL
# - DATECOL
# Optional. Columns to drop.
# exclude:
# - ALLNULL
# Optional. Create cascading LIST partitions in this order.
# Omit or set [] for no partitioning.
partition_by:
  - state
  - zip
# Optional. Warn if the load would create more than this many partition tables.
# The load continues. Default: 10000.
max_partitions: 10000
# What to do if the target table already exists: fail | replace | append
# Defaults to fail.
if_exists: append
```
### Parsing and validation rules
1. `partition_by` is optional.
2. Omitted, `null`, or `[]` means "not partitioned".
3. When present and non-empty, it must be a YAML sequence of non-empty strings.
4. Order matters. `['state', 'zip']` means `state` is level 1 and `zip` is level 2.
5. Duplicate names are invalid.
6. If `include` is present, every `partition_by` column must be included.
7. If `exclude` is present, no `partition_by` column may be excluded.
8. `max_partitions` is optional and defaults to `10000`.
9. `max_partitions` must be an integer greater than `0`.
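The rules above can be sketched as three small validators. This is only an illustration under stated assumptions: the function names, the `DEFAULT_MAX_PARTITIONS` constant, and the use of `ValueError` are hypothetical (the real loader may raise its own config error type), and column names are compared by exact string match, which may not reflect how the loader treats case.

```python
from typing import Any, List, Optional

DEFAULT_MAX_PARTITIONS = 10000  # assumed constant name; the default value comes from rule 8


def parse_partition_by(raw: Any) -> Optional[List[str]]:
    """Return the ordered partition columns, or None when not partitioned."""
    if raw is None or raw == []:
        return None  # rule 2: omitted, null, or [] means "not partitioned"
    if not isinstance(raw, list):
        raise ValueError("partition_by must be a list of column names")
    for name in raw:
        if not isinstance(name, str) or not name.strip():
            raise ValueError("partition_by entries must be non-empty strings")
    if len(set(raw)) != len(raw):
        raise ValueError("partition_by contains duplicate column names")
    return list(raw)  # order is preserved: index 0 is partition level 1


def parse_max_partitions(raw: Any) -> int:
    """Return the warning threshold, applying the default when omitted."""
    if raw is None:
        return DEFAULT_MAX_PARTITIONS
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, int) or raw <= 0:
        raise ValueError("max_partitions must be an integer greater than 0")
    return raw


def check_filter_conflicts(
    partition_by: Optional[List[str]],
    include: Optional[List[str]],
    exclude: Optional[List[str]],
) -> None:
    """Reject configs whose column filters would drop a partition column."""
    if not partition_by:
        return
    if include is not None:
        missing = [c for c in partition_by if c not in include]
        if missing:
            raise ValueError(f"partition_by columns not in include: {missing}")
    if exclude:
        dropped = [c for c in partition_by if c in exclude]
        if dropped:
            raise ValueError(f"partition_by columns listed in exclude: {dropped}")
```

The folder loader needs one variation on this: at cluster level an explicit `[]` has to survive parsing, so a cluster-level parser would return the empty list rather than `None`.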
## 4.2 Folder config
Update the sample shape documented by [`generic_loader/sample_folder_config.yaml`](generic_loader/sample_folder_config.yaml) to include folder defaults and per-cluster overrides.
### Proposed exact example
```yaml
folder: samples/folder_test
schemaname: public
# Applied when creating the first file of each cluster.
# One of: fail | replace | append. Default: fail.
if_exists: replace
# When true (default), any file not matched by an explicit pattern below is
# auto-grouped with its peers.
auto_detect: true
# Optional folder-level column filter.
# include:
# - ID
# - INTCOL
# exclude:
# - ALLNULL
# Optional folder default for LIST partitioning.
partition_by:
  - state
  - zip
# Optional folder default threshold. Default: 10000.
max_partitions: 10000
clusters:
  - pattern: '^group_a\d+\.xpt$'
    tablename: group_a
    # Inherits folder-level partition_by and max_partitions.
  - pattern: '^group_b\d+\.xpt$'
    tablename: group_b
    partition_by:
      - state
    max_partitions: 2000
  - pattern: '^standalone\.xpt$'
    tablename: standalone
    partition_by: []  # Explicit opt-out of the folder default.
```
### Folder override rules
1. Folder-level `partition_by` and `max_partitions` behave as defaults.
2. In an explicit cluster entry:
   - if `partition_by` is omitted, inherit the folder-level value;
   - if `partition_by` is a non-empty list, replace the folder-level value;
   - if `partition_by: []`, explicitly disable partitioning for that cluster.
3. Cluster-level `max_partitions` overrides the folder-level threshold when present.
4. The resolved per-cluster rules should follow the same pattern already used by [`discover_clusters()`](generic_loader/load_folder.py:295) for `if_exists`, `include`, and `exclude`.
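A minimal sketch of the override rules, assuming the raw cluster values arrive as `None` when omitted and as `[]` when explicitly emptied. The function name `resolve_partition_settings` and its argument names are hypothetical and do not come from the existing code.

```python
from typing import List, Optional, Tuple


def resolve_partition_settings(
    cluster_partition_by: Optional[List[str]],
    cluster_max_partitions: Optional[int],
    folder_partition_by: Optional[List[str]],
    folder_max_partitions: int,
) -> Tuple[Optional[List[str]], int]:
    """Resolve one cluster's effective partition settings."""
    # "is not None" (not truthiness) keeps an explicit [] from falling
    # through to the folder default.
    if cluster_partition_by is not None:
        partition_by = cluster_partition_by
    else:
        partition_by = folder_partition_by
    if cluster_max_partitions is not None:
        max_partitions = cluster_max_partitions
    else:
        max_partitions = folder_max_partitions
    # An empty resolved list means "not partitioned", stored as None.
    return (list(partition_by) if partition_by else None, max_partitions)
```

Applied to the example config above, `group_a` would resolve to `(['state', 'zip'], 10000)`, `group_b` to `(['state'], 2000)`, and `standalone` to `(None, 10000)`.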
## 5. Dataclass changes
## 5.1 Existing public config dataclasses
### [`LoaderConfig`](generic_loader/load_sas.py:273)
Add:
- `partition_by: Optional[List[str]] = None`
- `max_partitions: int = 10000`
### [`ClusterSpec`](generic_loader/load_folder.py:137)
Add resolved fields:
- `partition_by: Optional[List[str]]`
- `max_partitions: int`
### [`_ExplicitPattern`](generic_loader/load_folder.py:148)
Add raw optional override fields:
- `partition_by: Optional[List[str]] = None`
- `max_partitions: Optional[int] = None`
Notes:
- Preserve `partition_by=[]` when it appears in the YAML so [`discover_clusters()`](generic_loader/load_folder.py:295) can distinguish explicit disable from inheritance.
- `max_partitions` remains `None` when omitted so folder inheritance can resolve it later.
### [`FolderConfig`](generic_loader/load_folder.py:160)
Add:
- `partition_by: Optional[List[str]] = None`
- `max_partitions: int = 10000`
## 5.2 Recommended new internal helper dataclasses
These are not required to be public, but they make the implementation substantially safer and clearer.
### Recommended [`PartitionNode`](generic_loader/PARTITION_DESIGN.md)
Suggested fields:
- `field_name: str`
- `value: Any`: the normalized value Postgres will see during `COPY`; use `None` for SQL `NULL`.
- `table_name: str`
- `children: List[PartitionNode] = field(default_factory=list)`
### Recommended [`PartitionPlan`](generic_loader/PARTITION_DESIGN.md)
Suggested fields:
- `fields: List[str]`
- `roots: List[PartitionNode]`
- `total_partition_tables: int`
The implementation can use nested dicts instead, but an explicit plan object reduces naming, recursion, and dry-run bugs.
## 6. New functions needed
The exact names may vary, but the design should introduce helpers with the responsibilities below.
## 6.1 Config parsing helpers
### Recommended [`_parse_partition_by()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Parse `partition_by` from YAML.
- Enforce list-of-strings validation.
- Normalize omitted/empty top-level values to `None`.
- Preserve cluster-level empty list `[]` long enough for override resolution.
### Recommended [`_parse_max_partitions()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Parse and validate `max_partitions`.
- Enforce positive integer semantics.
## 6.2 Partition validation helpers
### Recommended [`_validate_partition_columns()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Ensure every requested partition column exists after [`apply_column_filter()`](generic_loader/load_sas.py:468).
- Fail early if a partition column was removed by `include` or `exclude`.
- Produce context-rich errors that name the config, file, or cluster.
### Recommended [`_assert_partition_compatible()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- In `append` mode, verify that the existing parent table is LIST-partitioned on the same ordered keys.
- Reuse [`SchemaCompatibilityError`](generic_loader/load_sas.py:307) for incompatibility.
Expected catalog check:
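- Query `pg_partitioned_table` for `partstrat` and `partattrs`.
- Resolve `partattrs`, in order, to column names through `pg_attribute`.
- Require `partstrat = 'l'`.
- Require the parent's key to be exactly the first entry of the resolved `partition_by` list. PostgreSQL LIST partitioning accepts only a single key column per table, so in a cascading tree the parent's catalog row exposes only the first level.
- For multi-level configs, verify the remaining keys in order by following `pg_inherits` to the sub-partitioned children and repeating the same check one level at a time.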
## 6.3 Partition discovery helpers
### Recommended [`discover_partition_values()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Scan an iterable of filtered DataFrames.
- Normalize the partition columns the same way [`_prepare_for_copy()`](generic_loader/load_sas.py:943) will normalize them for `COPY`.
- Build a cascading partition tree scoped by parent value.
- Count the child partition tables that will be created.
Suggested input shape:
- `dfs: Iterable[pd.DataFrame]`
- `columns: Dict[str, ColumnSpec]`
- `partition_by: List[str]`
- `root_table_name: str`
Suggested output shape:
- `PartitionPlan`
### Recommended [`_warn_if_partition_count_exceeds()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Emit `[warn] ...` to stderr if `plan.total_partition_tables > max_partitions`.
- Never abort the load.
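As a rough sketch of the warning helper: the signature, the `target` label, and the message wording below are assumptions; only the `[warn] ...` prefix, the stderr destination, and the "never abort" rule come from this design.

```python
import sys
from typing import Optional, TextIO


def warn_if_partition_count_exceeds(
    total_partition_tables: int,
    max_partitions: int,
    target: str,
    stream: Optional[TextIO] = None,
) -> bool:
    """Emit a [warn] line when the plan is larger than the threshold.

    Returns True when a warning was written. It never raises, so the load
    continues either way.
    """
    out = stream if stream is not None else sys.stderr
    if total_partition_tables <= max_partitions:
        return False
    print(
        f"[warn] {target}: load would create {total_partition_tables} partition "
        f"tables, exceeding max_partitions={max_partitions}; continuing",
        file=out,
    )
    return True
```

The optional `stream` argument exists only to make the helper easy to test; production callers would leave it unset.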
## 6.4 Naming and literal helpers
### Recommended [`_sanitize_partition_token()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Convert a normalized partition value into a safe, deterministic table-name suffix.
### Recommended [`_build_partition_table_name()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Combine parent name and sanitized token.
- Enforce Postgres identifier-length limits.
- Resolve collisions deterministically.
### Recommended [`_render_partition_literal()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Render one value for `FOR VALUES IN (...)`.
- Preserve the exact routed value Postgres will see during `COPY`.
## 6.5 DDL rendering helpers
### Recommended [`render_partition_ddl()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Render child `CREATE TABLE ... PARTITION OF ...` statements recursively.
### Recommended [`render_create_table_statements()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Return the full ordered statement list for dry-run and actual execution.
- Keep the root statement first.
- Append recursive child statements afterward.
## 6.6 Optional shared warning helper
### Recommended [`_warn()`](generic_loader/PARTITION_DESIGN.md)
Purpose:
- Centralize the existing `[warn] ...` stderr behavior.
- Let both existing schema warnings and new partition warnings share one implementation.
## 7. Modified functions
## 7.1 [`load_config()`](generic_loader/load_sas.py:350)
Modify to:
1. Parse `partition_by`.
2. Parse `max_partitions`.
3. Validate include/exclude conflicts with `partition_by`.
4. Return the new fields in [`LoaderConfig`](generic_loader/load_sas.py:273).
## 7.2 [`render_create_table()`](generic_loader/load_sas.py:756)
Modify signature to accept optional partition metadata:
- `partition_by: Optional[List[str]] = None`
Behavior:
- If `partition_by` is falsy, keep current output unchanged.
- If `partition_by` is present, append `PARTITION BY LIST (<first field>)` to the parent statement.
- This function should still render only the parent statement; child statements belong in [`render_partition_ddl()`](generic_loader/PARTITION_DESIGN.md).
Example parent output:
```sql
CREATE TABLE "public"."customers" (
    "state" TEXT,
    "zip" TEXT,
    "name" TEXT
) PARTITION BY LIST ("state");
```
## 7.3 [`_drop_table()`](generic_loader/load_sas.py:771)
Add an optional `cascade: bool = False` parameter.
Behavior:
- Non-partitioned replace keeps current plain `DROP TABLE` behavior.
- Partitioned replace uses `DROP TABLE <qualified> CASCADE` so the parent drop removes all partitions.
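A small sketch of how the statement text could vary with the new flag. The helper names `quote_ident` and `render_drop_table` are hypothetical, and the existing `_drop_table()` may build its SQL differently (for example with `IF EXISTS` or a trailing semicolon).

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def render_drop_table(schema: str, table: str, cascade: bool = False) -> str:
    """Render the DROP statement; CASCADE is only added for partitioned replace."""
    statement = f"DROP TABLE {quote_ident(schema)}.{quote_ident(table)}"
    if cascade:
        statement += " CASCADE"
    return statement
```

Keeping `cascade` off by default means non-partitioned loads cannot silently drop dependent objects they did not drop before.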
## 7.4 [`create_table()`](generic_loader/load_sas.py:890)
Extend signature to accept:
- `partition_by: Optional[List[str]] = None`
- `partition_plan: Optional[PartitionPlan] = None`
Behavior:
1. Preserve current `if_exists` validation.
2. For non-partitioned loads, preserve current behavior.
3. For partitioned loads:
   - `fail`: if the parent table exists, raise [`TableExistsError`](generic_loader/load_sas.py:303).
   - `replace`: if the parent exists, drop it with `CASCADE`, then recreate the full tree.
   - `append`: run [`_assert_schema_compatible()`](generic_loader/load_sas.py:826) plus [`_assert_partition_compatible()`](generic_loader/PARTITION_DESIGN.md), then return without creating any partitions.
4. When creation is needed, execute the full statement list returned by [`render_create_table_statements()`](generic_loader/PARTITION_DESIGN.md).
5. Reject `partition_by` without a computed `partition_plan` when creation or dry-run rendering needs it.
## 7.5 [`_prepare_for_copy()`](generic_loader/load_sas.py:943)
Recommended refactor:
- Extract or share the per-column normalization logic so partition discovery can use the same conversion rules.
- Do not change external behavior of the returned DataFrame.
Reason:
- The partition discovery pass must reason about the same values Postgres will actually receive.
- The most important special case is text columns, where empty strings currently become SQL `NULL` because [`copy_dataframes()`](generic_loader/load_sas.py:1028) uses `NULL ''`.
## 7.6 [`main()`](generic_loader/load_sas.py:1192)
Modify the single-file flow as follows:
1. Load config.
2. Read preview and infer schema exactly as today.
3. Validate that partition columns exist after filtering.
4. If `partition_by` is set and the operation needs creation or dry-run rendering, run a full discovery pass over the file to build a `PartitionPlan`.
5. In dry-run mode, print the full DDL statement list rather than only the parent statement.
6. In live mode, pass `partition_by` and `partition_plan` into [`create_table()`](generic_loader/load_sas.py:890).
7. Keep [`copy_dataframes()`](generic_loader/load_sas.py:1028) unchanged so data is copied to the parent table and Postgres routes rows automatically.
## 7.7 [`load_folder_config()`](generic_loader/load_folder.py:200)
Modify to:
1. Parse folder-level `partition_by` and `max_partitions`.
2. Parse per-cluster `partition_by` and `max_partitions`.
3. Validate include/exclude conflicts against the applicable partition list where possible.
4. Preserve explicit `partition_by: []` so cluster discovery can treat it as "disable inheritance".
## 7.8 [`discover_clusters()`](generic_loader/load_folder.py:295)
Modify to resolve per-cluster partition settings.
For each resolved [`ClusterSpec`](generic_loader/load_folder.py:137):
- `partition_by = patt.partition_by if patt.partition_by is not None else cfg.partition_by`
- `max_partitions = patt.max_partitions if patt.max_partitions is not None else cfg.max_partitions`
- normalize resolved empty list to `None` before storing on the final [`ClusterSpec`](generic_loader/load_folder.py:137)
## 7.9 [`load_cluster()`](generic_loader/load_folder.py:385)
Modify the cluster load order to:
1. Infer schema from the first file exactly as today.
2. Validate partition columns against that schema.
3. If the cluster is partitioned and the operation is not append-only verification, scan all files in the cluster to build one shared `PartitionPlan`.
4. Call [`create_table()`](generic_loader/load_sas.py:890) with resolved `partition_by` and `partition_plan`.
5. Stream all files into the parent table exactly as today.
## 7.10 [`main()`](generic_loader/load_folder.py:496)
Modify dry-run behavior:
- keep cluster discovery output;
- for each loadable cluster, print full DDL, not only one `CREATE TABLE` statement;
- when a cluster is partitioned, perform partition discovery across every file in that cluster, not only the first file.
Also update the `--dry-run` help text because the current wording says the schema is inferred from only the first file of the cluster.
## 8. Partition value discovery algorithm
## 8.1 High-level rules
1. Discovery operates on filtered data, meaning after the same column filter logic used by [`apply_column_filter()`](generic_loader/load_sas.py:468).
2. Discovery must use the same semantic values that Postgres will see during `COPY`, not raw pandas object identity.
3. The scan should be streaming and chunk-based to avoid materializing the full file or cluster in memory.
4. The resulting tree must scope each level under its parent so deeper values are not treated as globally unique.
## 8.2 Normalization rules for partition keys
Partition discovery should normalize each partition column using the same type-aware logic already embodied in [`_prepare_for_copy()`](generic_loader/load_sas.py:943), with the following behavior:
- Integer-like columns (`INTEGER`, `BIGINT`, `SMALLINT`): coerce object values through numeric conversion, treat blank strings and NaN as `NULL`.
- Floating/numeric columns (`DOUBLE PRECISION`, `REAL`, `NUMERIC`): numeric conversion, NaN becomes `NULL`.
- Date columns: normalize to `datetime.date` or `NULL`.
- Timestamp columns: normalize to `datetime.datetime` or `NULL`.
- Time columns: normalize through the existing time conversion path or `NULL`.
- Text-like columns: `None`, pandas nulls, and `''` all become semantic `NULL`, because [`copy_dataframes()`](generic_loader/load_sas.py:1028) sends empty strings with `NULL ''`.
- Boolean columns: normalize to `True`, `False`, or `NULL`.
This means partition discovery deduplicates on the routed value, not the raw source representation. For example, `'00123'` and `123` in an integer partition column should produce one partition value `123`, not two separate partitions.
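To make the deduplication rule concrete, here is a scalar sketch covering only the integer, floating, and text cases. It is not the loader's `_prepare_for_copy()` logic: the function name, the type-name sets, and the per-value (rather than per-column pandas) approach are all assumptions made for illustration, and date, time, and boolean values are passed through untouched.

```python
import math
from typing import Any

INTEGER_TYPES = {"INTEGER", "BIGINT", "SMALLINT"}
FLOAT_TYPES = {"DOUBLE PRECISION", "REAL", "NUMERIC"}


def normalize_partition_value(value: Any, pg_type: str) -> Any:
    """Map a raw source value to the value Postgres would route on."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    kind = pg_type.upper()
    if kind in INTEGER_TYPES or kind in FLOAT_TYPES:
        if isinstance(value, str):
            value = value.strip()
            if not value:
                return None  # blank strings are treated as NULL
        if kind in INTEGER_TYPES:
            # int() accepts "00123" and 123.0; a string such as "12.5"
            # raises ValueError, which this sketch leaves to the caller.
            return int(value)
        number = float(value)
        return None if math.isnan(number) else number
    # Text-like (and any other) types: '' becomes NULL because COPY runs
    # with NULL ''.
    if isinstance(value, str) and value == "":
        return None
    return value
```

With this rule, `'00123'` and `123` in an `INTEGER` column both normalize to `123`, and an empty string in a `TEXT` column normalizes to `None`, so each pair lands in a single partition.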
## 8.3 Discovery pseudocode
```python
def discover_partition_values(dfs, columns, partition_by, root_table_name):
    validate_partition_columns(columns, partition_by)
    root = PartitionPlan(fields=partition_by, roots=[], total_partition_tables=0)
    root_index = {}     # normalized value -> PartitionNode for depth 0
    child_indexes = {}  # node.table_name -> {normalized value -> child PartitionNode}
    for df in dfs:
        if df.empty:
            continue
        part_df = df[partition_by].copy()
        # Every null must come back as None: NaN is not equal to itself and
        # would defeat the index lookups below.
        part_df = normalize_partition_frame(part_df, columns)
        unique_paths = part_df.drop_duplicates()
        for path in unique_paths.itertuples(index=False, name=None):
            parent_table = root_table_name
            parent_children = root.roots
            parent_index = root_index
            for depth, value in enumerate(path):
                field_name = partition_by[depth]
                if value not in parent_index:
                    child_table = build_partition_table_name(parent_table, value)
                    node = PartitionNode(
                        field_name=field_name,
                        value=value,
                        table_name=child_table,
                    )
                    parent_index[value] = node
                    parent_children.append(node)
                    root.total_partition_tables += 1
                node = parent_index[value]
                parent_table = node.table_name
                parent_children = node.children
                # Persist each node's lookup index; a throwaway dict here would
                # re-create existing children on every later path.
                parent_index = child_indexes.setdefault(node.table_name, {})
    sort_every_node_deterministically(root)
    return root
```
## 8.4 Efficient implementation notes
- The scan should retain only the partition columns for the current chunk after filtering.
- The in-memory structure should grow only with the number of unique partition nodes, not the number of rows.
- Reading partition values from the preview frame is only valid when that frame is known to contain the entire dataset. In the current CLI flow, the preview is normally not exhaustive, so partitioned loads should perform a full chunked scan.
- A future optimization may add optional reader-level column pruning to [`iter_sas_chunks()`](generic_loader/load_sas.py:447) and [`read_sas_preview()`](generic_loader/load_sas.py:430), but that is not required for correctness.
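The scoping and memory-growth points can be illustrated without pandas. In this sketch each chunk is a plain list of already-normalized partition paths, and the tree is a nested dict rather than `PartitionNode` objects; the name `discover_paths` and the input shape are assumptions for illustration only.

```python
from typing import Any, Dict, Iterable, Sequence, Tuple

Tree = Dict[Any, dict]


def discover_paths(chunks: Iterable[Iterable[Sequence[Any]]]) -> Tuple[Tree, int]:
    """Build a nested {value: {child_value: ...}} tree and count its nodes.

    Memory grows with the number of distinct partition nodes, not with the
    number of rows scanned.
    """
    tree: Tree = {}
    total_partition_tables = 0
    for chunk in chunks:
        # Deduplicate inside the chunk first, mirroring drop_duplicates().
        for path in {tuple(row) for row in chunk}:
            level = tree
            for value in path:
                if value not in level:
                    level[value] = {}
                    total_partition_tables += 1
                level = level[value]
    return tree, total_partition_tables


chunks = [
    [("CA", "90001"), ("CA", "90001"), ("IL", "60601")],
    [("CA", "90002"), ("IL", "60601"), (None, "60601")],
]
tree, total = discover_paths(chunks)
print(total)  # 3 first-level nodes + 4 second-level nodes = 7
```

Note that `"60601"` is counted once under `IL` and once under `None`: each level is scoped by its parent, so the same value under two parents yields two child tables.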
## 9. DDL generation algorithm
## 9.1 Root table
If `partition_by` is set, the parent statement produced by [`render_create_table()`](generic_loader/load_sas.py:756) must end with:
```sql
PARTITION BY LIST ("<first partition field>")
```
The parent still contains the full column list.
## 9.2 Child tables
For each discovered node:
- if it is not the last partition level, create a child partition that is itself subpartitioned by the next field;
- if it is the last partition level, create a leaf partition with no further `PARTITION BY` clause.
Examples for `partition_by: [state, zip]`:
```sql
CREATE TABLE "public"."customers_ca"
    PARTITION OF "public"."customers"
    FOR VALUES IN ('CA')
    PARTITION BY LIST ("zip");

CREATE TABLE "public"."customers_ca_60601"
    PARTITION OF "public"."customers_ca"
    FOR VALUES IN ('60601');
```
## 9.3 DDL rendering pseudocode
```python
def render_create_table_statements(schema, table, columns, partition_by, plan):
    statements = [render_create_table(schema, table, columns, partition_by=partition_by)]
    if partition_by:
        statements.extend(
            render_partition_ddl(schema, table, columns, partition_by, plan.roots, depth=0)
        )
    return statements


def render_partition_ddl(schema, parent_table, columns, partition_by, nodes, depth):
    field_name = partition_by[depth]
    next_field = partition_by[depth + 1] if depth + 1 < len(partition_by) else None
    field_spec = columns[field_name]
    statements = []
    for node in nodes:
        literal = render_partition_literal(node.value, field_spec)
        if next_field is None:
            # Leaf partition: no further PARTITION BY clause.
            statements.append(
                f'CREATE TABLE {qualified(schema, node.table_name)} '
                f'PARTITION OF {qualified(schema, parent_table)} '
                f'FOR VALUES IN ({literal});'
            )
        else:
            # Intermediate partition: sub-partitioned by the next field.
            statements.append(
                f'CREATE TABLE {qualified(schema, node.table_name)} '
                f'PARTITION OF {qualified(schema, parent_table)} '
                f'FOR VALUES IN ({literal}) '
                f'PARTITION BY LIST ({quote_ident(next_field)});'
            )
            # Recurse only for non-leaf nodes, so depth + 1 is always valid.
            statements.extend(
                render_partition_ddl(
                    schema,
                    node.table_name,
                    columns,
                    partition_by,
                    node.children,
                    depth + 1,
                )
            )
    return statements
```
## 9.4 Statement order
Emit statements in this order:
1. parent table;
2. each level-1 child;
3. that child's descendants before moving to the next sibling.
This depth-first order guarantees that every parent exists before its children are created.
## 10. Table-name sanitization rules
The child-table name rule must be deterministic and explicit.
## 10.1 Base token generation
For each normalized partition value:
1. Convert to a display token:
   - `None` -> `null`
   - `datetime.date`, `datetime.time`, `datetime.datetime` -> `isoformat()` string
   - everything else -> `str(value)`
2. Lowercase the token.
3. Replace every run of one or more characters outside `[a-z0-9]` with `_`.
4. Trim leading and trailing `_`.
5. If the result is empty, use `value`.
Examples:
- `CA` -> `ca`
- `New York` -> `new_york`
- `60601-1234` -> `60601_1234`
- `NULL` -> `null`
- `***` -> `value`
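The five steps map onto a few lines of Python. This is a sketch rather than the final helper: the name `sanitize_partition_token` is hypothetical, and "non-alphanumeric" is interpreted here as anything outside ASCII `a-z0-9`, which is an assumption consistent with the ASCII-only suffixes mentioned in section 10.3.

```python
import datetime
import re
from typing import Any


def sanitize_partition_token(value: Any) -> str:
    """Turn a normalized partition value into a table-name suffix."""
    # Step 1: display token.
    if value is None:
        token = "null"
    elif isinstance(value, (datetime.date, datetime.time, datetime.datetime)):
        token = value.isoformat()
    else:
        token = str(value)
    # Steps 2-4: lowercase, collapse unsafe runs to "_", trim the edges.
    token = re.sub(r"[^a-z0-9]+", "_", token.lower()).strip("_")
    # Step 5: fall back to a fixed placeholder when nothing survives.
    return token or "value"


for raw in ("CA", "New York", "60601-1234", None, "***"):
    print(repr(raw), "->", sanitize_partition_token(raw))
```

Because SQL `NULL` and the text value `'null'` both sanitize to `null`, this step alone cannot guarantee unique names; that is left to the collision handling in section 10.4.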
## 10.2 Final child name
Child names are:
```text
{parent_table}_{sanitized_token}
```
Examples:
- `customers` + `CA` -> `customers_ca`
- `customers_ca` + `60601` -> `customers_ca_60601`
## 10.3 Length limit
Postgres identifiers are limited to 63 bytes. The implementation should treat 63 characters as the working limit, which is safe as long as the parent table name is ASCII, because the loader currently emits ASCII-only sanitized suffixes.
Rules:
1. If `len(parent_table) >= 62`, fail fast with a clear error because there is no room for `_x`.
2. Otherwise, reserve `len(parent_table) + 1` characters for the prefix and underscore.
3. Truncate only the sanitized token, not the parent prefix.
4. If truncation makes two child names collide, append a deterministic short hash.
## 10.4 Collision handling
Different raw values can sanitize to the same token, for example:
- `A-B` -> `a_b`
- `A B` -> `a_b`
Recommended collision rule:
1. First candidate: `parent_a_b`
2. On collision, append `_<hash8>` derived from the exact normalized value for that node.
3. Re-truncate the base token as needed so the final name still fits the 63-character limit.
Example:
- `parent_a_b`
- `parent_a_b_f15c2d19`
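This keeps names deterministic across runs, provided colliding siblings are assigned names in a stable order (for example, sorted by normalized value) or every colliding value receives the hash suffix. Otherwise the value that claims the plain `parent_a_b` name depends on discovery order.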
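The length and collision rules from sections 10.3 and 10.4 can be combined in one helper. The sketch below makes several assumptions that are not part of this design: the signature (including the `taken` set of names already used in the schema), SHA-1 over `repr(value)` as the source of `<hash8>`, and `ValueError` as the failure type.

```python
import hashlib
from typing import Any, Set

MAX_IDENTIFIER_LENGTH = 63  # Postgres limit in bytes; names here are ASCII


def build_partition_table_name(
    parent_table: str, token: str, value: Any, taken: Set[str]
) -> str:
    """Return a unique child table name that fits the identifier limit."""
    room = MAX_IDENTIFIER_LENGTH - len(parent_table) - 1  # space after "parent_"
    if room < 1:
        raise ValueError(f"parent table name too long to partition: {parent_table!r}")
    # Truncate only the token, never the parent prefix.
    candidate = f"{parent_table}_{token[:room]}"
    if candidate in taken:
        digest = hashlib.sha1(repr(value).encode("utf-8")).hexdigest()[:8]
        base_room = room - len(digest) - 1  # leave space for "_<hash8>"
        if base_room < 1:
            raise ValueError(f"no room for a collision suffix under {parent_table!r}")
        base = token[:base_room].rstrip("_")
        candidate = f"{parent_table}_{base}_{digest}"
        if candidate in taken:
            raise ValueError(f"unresolvable name collision for {value!r}")
    taken.add(candidate)
    return candidate
```

In this sketch the first caller keeps the plain name, so callers would need to name siblings in a stable sorted order for the output to be reproducible across runs.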
## 11. Partition literal rendering rules
The `FOR VALUES IN (...)` clause must use the exact routed value Postgres will receive after loader normalization.
Recommended rendering rules:
- `NULL` -> `NULL`
- text -> single-quoted with internal quotes escaped
- integers / numerics -> unquoted numeric literal
- boolean -> `TRUE` or `FALSE`
- date -> `DATE 'YYYY-MM-DD'`
- timestamp -> `TIMESTAMP 'YYYY-MM-DD HH:MM:SS'`
- time -> `TIME 'HH:MM:SS'`
Important special case:
- text `''` must not render as `''`; it must render as `NULL` because [`copy_dataframes()`](generic_loader/load_sas.py:1028) uses `NULL ''`.
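A sketch of the literal renderer follows. It dispatches on the Python type of the already-normalized value, whereas the design's `render_partition_literal()` also receives the column spec; that simplification, the function body, and the assumption that values arrive normalized (no NaN, no infinities) are all illustrative rather than prescribed.

```python
import datetime
from decimal import Decimal
from typing import Any


def render_partition_literal(value: Any) -> str:
    """Render one normalized value for a FOR VALUES IN (...) clause."""
    if value is None or value == "":
        return "NULL"  # '' is loaded as NULL because COPY uses NULL ''
    if isinstance(value, bool):  # must precede int: bool subclasses int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float, Decimal)):
        return str(value)
    if isinstance(value, datetime.datetime):  # must precede date: datetime subclasses date
        return "TIMESTAMP '" + value.isoformat(sep=" ", timespec="seconds") + "'"
    if isinstance(value, datetime.date):
        return "DATE '" + value.isoformat() + "'"
    if isinstance(value, datetime.time):
        return "TIME '" + value.isoformat(timespec="seconds") + "'"
    # Text: wrap in single quotes and double any embedded quote.
    return "'" + str(value).replace("'", "''") + "'"
```

The order of the `isinstance` checks matters in two places, as the comments note; swapping them would render `True` as `1` and timestamps as dates.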
## 12. `if_exists` interaction
## 12.1 `fail`
- If the parent table exists, behavior is unchanged: raise [`TableExistsError`](generic_loader/load_sas.py:303).
- No partition compatibility inspection is needed because the operation stops immediately.
## 12.2 `replace`
- If the parent table exists and the config is partitioned, execute `DROP TABLE <parent> CASCADE`.
- Recreate the parent plus every partition statement in one transaction.
- If any statement fails, let the outer transaction roll back so the drop and the re-creation remain atomic.
## 12.3 `append`
Required behavior:
1. Run [`_assert_schema_compatible()`](generic_loader/load_sas.py:826) on the parent table exactly as today.
2. If `partition_by` is configured, also verify that the parent is LIST-partitioned on the same ordered keys.
3. Do not create any partitions.
4. Copy rows to the parent table and let Postgres route them.
Accepted limitation for v1:
- If the existing partition tree does not contain a leaf partition for some incoming value, Postgres will fail during `COPY` with a native partition-routing error.
- This design does not require preflight catalog validation of every leaf partition because that adds significant scope and catalog-parsing complexity.
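Step 2 of the required behavior might be implemented by comparing the text returned by PostgreSQL's `pg_get_partkeydef()` for the parent against the configured first key. The sketch below covers only that top-level comparison. `check_parent_partitioning` is a hypothetical helper, the exception class is a stand-in for the loader's own, and the input is assumed to be `None` when the parent is not partitioned.

```python
import re
from typing import Optional


class SchemaCompatibilityError(Exception):
    """Stand-in for the loader's exception of the same name."""


# Text form assumed for pg_get_partkeydef(), e.g. 'LIST (state)'.
_PARTKEY_RE = re.compile(r"^(?P<strategy>[A-Z]+) \((?P<key>.*)\)$")


def check_parent_partitioning(partkeydef: Optional[str], expected_key: str) -> None:
    """Raise unless the parent is LIST-partitioned on ``expected_key``."""
    if partkeydef is None:
        raise SchemaCompatibilityError("existing parent is not partitioned")
    match = _PARTKEY_RE.match(partkeydef.strip())
    if match is None:
        raise SchemaCompatibilityError(f"unrecognized partition key: {partkeydef!r}")
    if match["strategy"] != "LIST":
        raise SchemaCompatibilityError(
            f"existing parent uses {match['strategy']} partitioning, expected LIST"
        )
    key = match["key"]
    # Postgres double-quotes identifiers that need it; undo that for comparison.
    if key.startswith('"') and key.endswith('"'):
        key = key[1:-1].replace('""', '"')
    if key != expected_key:
        raise SchemaCompatibilityError(
            f"existing parent is partitioned on {key!r}, expected {expected_key!r}"
        )
```

Because native LIST partitioning takes a single key per level, verifying the remaining ordered keys would mean repeating the same check on the sub-partitioned children, which this sketch does not attempt.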
## 13. Dry-run behavior
## 13.1 Single-file loader
Current dry-run behavior in [`main()`](generic_loader/load_sas.py:1192) prints only one statement from [`render_create_table()`](generic_loader/load_sas.py:756). For partitioned configs it should change to:
1. infer schema from the preview as today;
2. run full partition discovery over the file;
3. warn on stderr if `total_partition_tables > max_partitions`;
4. print the full ordered DDL statement list to stdout;
5. open no database connection.
Output format recommendation:
- print statements separated by one blank line for readability;
- do not print extra prose on stdout, so the output remains easy to paste into SQL tooling.
## 13.2 Folder loader
Current dry-run behavior in [`main()`](generic_loader/load_folder.py:496) prints one `CREATE TABLE` per cluster based on the first file only. For partitioned clusters it should change to:
1. keep printing the discovered cluster summary;
2. for each loadable cluster, print a header such as `--- DDL for cluster 'group_a' ---`;
3. infer schema from the first file as today;
4. if the cluster is partitioned, scan all files in that cluster to build one shared `PartitionPlan`;
5. print the full ordered DDL statement list.
Important documentation note:
- Partitioned dry-runs are now full-data scans over the partition columns and can take much longer than non-partitioned dry-runs.
## 14. Error handling
The implementation should handle failures at the earliest safe point with clear messages.
## 14.1 Config-time errors
Raise `ValueError` from [`load_config()`](generic_loader/load_sas.py:350) or [`load_folder_config()`](generic_loader/load_folder.py:200) for:
- `partition_by` not being a list
- empty or non-string items inside `partition_by`
- duplicate partition column names
- `max_partitions <= 0`
- `include` omitting a partition column
- `exclude` removing a partition column
- cluster config specifying an invalid override shape
## 14.2 Runtime validation errors before DDL
Raise `ValueError` with file/cluster context for:
- partition column not present after filtering
- partition column absent from the inferred schema
- parent table name too long to create child suffixes safely
- a partition value that cannot be normalized or rendered into SQL
## 14.3 Append-time compatibility errors
Raise [`SchemaCompatibilityError`](generic_loader/load_sas.py:307) for:
- parent column mismatch detected by [`_assert_schema_compatible()`](generic_loader/load_sas.py:826)
- existing parent not being partitioned when `partition_by` is configured
- existing parent using a partition strategy other than LIST
- existing parent using a different ordered key list
## 14.4 Warning-only conditions
Emit `[warn] ...` to stderr, but continue, for:
- `total_partition_tables > max_partitions`
- existing warnings already emitted by [`_assert_schema_compatible()`](generic_loader/load_sas.py:826)
Recommended warning message:
```text
[warn] partition plan for public.customers will create 12,431 partition tables, exceeding max_partitions=10,000
```
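A small sketch of how that message could be produced, assuming a hypothetical helper `warn_if_too_many_partitions`. The `stream` parameter exists only to make the example testable; by default it writes to stderr, matching the existing `[warn] ...` convention.

```python
import sys
from typing import Optional, TextIO


def warn_if_too_many_partitions(
    schema: str,
    table: str,
    total_partition_tables: int,
    max_partitions: int,
    stream: Optional[TextIO] = None,
) -> bool:
    """Emit the warning when the plan exceeds ``max_partitions``.

    Returns True when a warning was written. The caller continues either way.
    """
    if total_partition_tables <= max_partitions:
        return False
    out = stream if stream is not None else sys.stderr
    print(
        f"[warn] partition plan for {schema}.{table} will create "
        f"{total_partition_tables:,} partition tables, "
        f"exceeding max_partitions={max_partitions:,}",
        file=out,
    )
    return True
```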
## 14.5 Postgres runtime errors left to bubble
Do not swallow driver/database exceptions for:
- DDL execution failures
- `COPY` failures caused by missing append-mode partitions
- any transaction failure during live loading
The outer transaction handling in [`main()`](generic_loader/load_sas.py:1192) and [`main()`](generic_loader/load_folder.py:496) should remain responsible for rollback.
## 15. Detailed single-file flow after the change
```text
load_config
-> read_sas_preview
-> apply_column_filter
-> infer_schema
-> validate partition columns
-> if validate flag: run manifest validation
-> if partitioned and (dry-run or create needed): discover partition values from full file
-> if dry-run: print full DDL and exit
-> connect
-> create_table (with partition metadata)
-> copy_dataframes to parent table
-> commit / rollback exactly as today
```
Notes:
- A partitioned live load usually requires one preview read, one full discovery pass, and one full load pass.
- This is a deliberate tradeoff to ensure the full partition tree exists before any row is copied.
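The full discovery pass can be pictured with the simplified sketch below. It is a stand-in, not the loader's `discover_partition_values_chunked()`: it takes chunks of plain dict rows instead of DataFrames, and the nested-dict tree shape (value mapped to sub-tree, empty dict at a leaf) is an assumption of this example.

```python
from typing import Any, Dict, Iterable, List, Mapping

PartitionTree = Dict[Any, "PartitionTree"]


def discover_partition_tree(
    chunks: Iterable[Iterable[Mapping[str, Any]]],
    partition_by: List[str],
) -> PartitionTree:
    """Collect the unique partition values, scoped level by level.

    Only the tree of distinct values is kept in memory, never the rows.
    """
    tree: PartitionTree = {}
    for chunk in chunks:
        for row in chunk:
            node = tree
            for column in partition_by:
                value = row[column]
                if value == "":
                    value = None  # COPY ... NULL '' routes empty text to NULL
                # Descend into (or create) the sub-tree for this value.
                node = node.setdefault(value, {})
    return tree
```

For example, rows with `(state, zip)` of `("TX", "75001")` and `("TX", "75002")` yield `{"TX": {"75001": {}, "75002": {}}}`.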
## 16. Detailed folder flow after the change
For each cluster in [`load_cluster()`](generic_loader/load_folder.py:385):
```text
infer schema from first file preview
-> validate partition columns
-> if partitioned and creation is needed: discover partition values across all files in the cluster
-> create_table (with partition metadata)
-> stream every file to the parent table
-> for later files, keep the existing append-mode schema compatibility check
```
Notes:
- The partition plan is cluster-wide, not file-by-file.
- All files in the cluster must route into one shared partition tree under the same parent table.
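Building one cluster-wide plan amounts to merging the per-file value trees and counting the resulting tables. The helpers below are hypothetical sketches with names chosen for this example, assuming each tree is a nested dict that maps a partition value to its sub-tree.

```python
from typing import Any, Dict

PartitionTree = Dict[Any, "PartitionTree"]


def merge_partition_trees(merged: PartitionTree, other: PartitionTree) -> PartitionTree:
    """Merge ``other`` into ``merged`` in place and return ``merged``."""
    for value, subtree in other.items():
        # Reuse the existing node for this value, or create it, then recurse.
        merge_partition_trees(merged.setdefault(value, {}), subtree)
    return merged


def count_partition_tables(tree: PartitionTree) -> int:
    """Count every child table in the tree, intermediate levels included."""
    return sum(1 + count_partition_tables(subtree) for subtree in tree.values())
```

Counting intermediate nodes as well as leaves reflects that every level of the tree is a real table in PostgreSQL.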
## 17. What remains unchanged
- [`infer_schema()`](generic_loader/load_sas.py:637) keeps its current type-inference behavior.
- [`copy_dataframes()`](generic_loader/load_sas.py:1028) remains unchanged and still copies to the parent table.
- [`assert_schema_compatible()`](generic_loader/load_sas.py:874) remains the public wrapper for append compatibility.
- Non-partitioned configs should continue to produce exactly one `CREATE TABLE` statement and the same load behavior as today.
## 18. Implementation sequencing
Recommended implementation order:
1. Extend config dataclasses and parsers.
2. Add partition parsing/validation helpers.
3. Add internal partition plan data structure.
4. Add partition discovery and literal/name helpers.
5. Extend DDL rendering.
6. Extend [`create_table()`](generic_loader/load_sas.py:890) and [`_drop_table()`](generic_loader/load_sas.py:771).
7. Wire the single-file flow.
8. Wire the folder flow and inheritance rules.
9. Update dry-run/help text and sample YAML files.
## 19. QA and validation matrix
The implementation should be validated against at least these scenarios:
1. Non-partitioned single-file load still behaves exactly as before.
2. Single-level text partitioning creates one child per unique value.
3. Multi-level cascading partitioning scopes child values to their parent.
4. `NULL` partition values create `FOR VALUES IN (NULL)` partitions.
5. Text empty strings route to the `NULL` partition, not `''`.
6. Sanitization collision (`A-B` vs `A B`) resolves deterministically.
7. Very long child names truncate correctly and still remain unique.
8. `max_partitions` warning appears but the load continues.
9. `replace` drops the parent with `CASCADE` and recreates the full tree.
10. `append` rejects a parent with the wrong partition strategy or key order.
11. Folder-level `partition_by` is inherited by auto-detected clusters.
12. Explicit cluster `partition_by` overrides folder defaults.
13. Explicit cluster `partition_by: []` disables a folder default.
14. Dry-run prints the full DDL tree and opens no connection.
15. Partitioned folder dry-run scans all files in the cluster, not just the first one.
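Scenarios 11 to 13 could be covered by small unit tests like the ones sketched here. `resolve_partition_by` is a hypothetical stand-alone version of the inheritance rule (`None` inherits, any list overrides, `[]` disables); the test names are illustrative and written in pytest style.

```python
from typing import List, Optional


def resolve_partition_by(
    cluster_value: Optional[List[str]], folder_default: List[str]
) -> List[str]:
    """None inherits the folder default; any list, including [], overrides it."""
    return folder_default if cluster_value is None else cluster_value


def test_auto_cluster_inherits_folder_default():
    assert resolve_partition_by(None, ["state", "zip"]) == ["state", "zip"]


def test_explicit_cluster_overrides_folder_default():
    assert resolve_partition_by(["region"], ["state", "zip"]) == ["region"]


def test_explicit_empty_list_disables_partitioning():
    assert resolve_partition_by([], ["state", "zip"]) == []
```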
## 20. Documentation updates required
In addition to implementing the code, update:
- [`generic_loader/sample_config.yaml`](generic_loader/sample_config.yaml) with `partition_by` and `max_partitions` comments and examples.
- [`generic_loader/sample_folder_config.yaml`](generic_loader/sample_folder_config.yaml) with folder defaults, cluster overrides, and explicit opt-out examples.
- The module-level usage text in [`load_sas.py`](generic_loader/load_sas.py) so dry-run docs mention full DDL for partitioned tables.
- The module-level usage text in [`load_folder.py`](generic_loader/load_folder.py) so dry-run docs mention cluster-wide partition discovery.
## 21. Final design summary
The safest low-regression approach is:
1. keep the current schema inference path unchanged;
2. add a separate full-data partition discovery pass for partitioned loads;
3. render one parent `CREATE TABLE` plus recursive `PARTITION OF` child statements;
4. create or replace the full tree before copying any data;
5. leave [`copy_dataframes()`](generic_loader/load_sas.py:1028) unchanged so PostgreSQL handles routing;
6. keep `append` mode strict about parent compatibility and intentionally do not auto-create missing partitions.
That approach satisfies the feature requirements while containing code churn to config parsing, DDL rendering, runtime planning, and folder integration.
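To make item 3 concrete, here is a deliberately simplified sketch of recursive `PARTITION OF` rendering. It assumes text-only partition values, a nested-dict value tree, and naive child naming without truncation or collision handling. `render_partition_tree`, `_quote`, and `_literal` are hypothetical stand-ins, not the loader's helpers.

```python
from typing import Any, Dict, List

PartitionTree = Dict[Any, "PartitionTree"]


def _quote(ident: str) -> str:
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + ident.replace('"', '""') + '"'


def _literal(value: Any) -> str:
    """Render a text value or NULL (text-only, for this sketch)."""
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def render_partition_tree(
    schema: str,
    parent: str,
    partition_by: List[str],
    tree: PartitionTree,
    level: int = 0,
) -> List[str]:
    """Return child DDL, with each parent statement ahead of its own children."""
    statements: List[str] = []
    # Sort for stable output; None sorts first.
    for value in sorted(tree, key=lambda v: (v is not None, str(v))):
        child = f"{parent}_{'null' if value is None else str(value).lower()}"
        stmt = (
            f"CREATE TABLE {_quote(schema)}.{_quote(child)} "
            f"PARTITION OF {_quote(schema)}.{_quote(parent)} "
            f"FOR VALUES IN ({_literal(value)})"
        )
        # Non-leaf levels are themselves partitioned on the next key.
        if level + 1 < len(partition_by):
            stmt += f" PARTITION BY LIST ({_quote(partition_by[level + 1])})"
        statements.append(stmt + ";")
        statements.extend(
            render_partition_tree(schema, child, partition_by, tree[value], level + 1)
        )
    return statements
```

Emitting each parent before its children means the list can be executed top to bottom inside one transaction.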


@ -32,9 +32,19 @@ USAGE
# include: [ID, INTCOL]
# exclude: [ALLNULL]
# Optional folder default for LIST partitioning. Omit or set [] for no
# partitioning. Accepts a single string or a list of column names.
# partition_by:
# - state
# - zip
# Optional folder default threshold. Default: 10000.
# max_partitions: 10000
# Optional explicit cluster patterns. Each pattern is matched against the
# file *basename*. Matched files are pulled out of the auto-detect pool.
# Per-cluster if_exists/include/exclude override the folder-level defaults.
# Per-cluster if_exists/include/exclude/partition_by/max_partitions
# override the folder-level defaults.
clusters:
- pattern: '^group_a\\d+\\.sas7bdat$'
tablename: group_a
@ -51,9 +61,10 @@ USAGE
Flags:
--config PATH Required. Path to the YAML config above.
--dry-run Print the discovered clusters and the inferred CREATE
TABLE for each (schema from the first file of the
cluster). The database is never touched.
--dry-run Print the discovered clusters and the inferred DDL for
each (CREATE TABLE plus partition DDL when applicable).
For partitioned clusters all files are scanned to
discover partition values. The database is never touched.
--fail-fast Abort the whole run on the first cluster failure.
Default is to log the failure, roll that cluster back,
and keep going.
@ -113,15 +124,19 @@ from dotenv import load_dotenv
from load_sas import (
VALID_IF_EXISTS,
_count_partitions,
_merge_partition_trees,
apply_column_filter,
assert_schema_compatible,
connect,
copy_dataframes,
create_table,
discover_partition_values_chunked,
infer_schema,
iter_sas_chunks,
read_sas_preview,
render_create_table,
render_partition_ddl,
)
@ -135,6 +150,12 @@ SAS_EXTENSIONS = (".sas7bdat", ".xpt", ".xport")
@dataclass
class ClusterSpec:
"""Resolved per-cluster load settings.
``partition_by`` and ``max_partitions`` are resolved from the folder
defaults and any per-cluster overrides during :func:`discover_clusters`.
"""
tablename: str
files: List[Path]
if_exists: str
@ -142,11 +163,18 @@ class ClusterSpec:
exclude: Optional[List[str]]
source: str # "explicit" or "auto"
pattern: Optional[str] = None
partition_by: List[str] = field(default_factory=list)
max_partitions: int = 10_000
@dataclass
class _ExplicitPattern:
"""Parsed form of a single ``clusters[*]`` YAML entry."""
"""Parsed form of a single ``clusters[*]`` YAML entry.
``partition_by`` defaults to ``None`` meaning "inherit from folder level".
An explicit empty list ``[]`` means "disable partitioning for this cluster".
``max_partitions`` defaults to ``None`` meaning "inherit from folder level".
"""
pattern: re.Pattern
raw_pattern: str
@ -154,10 +182,18 @@ class _ExplicitPattern:
if_exists: Optional[str] = None
include: Optional[List[str]] = None
exclude: Optional[List[str]] = None
partition_by: Optional[List[str]] = None
max_partitions: Optional[int] = None
@dataclass
class FolderConfig:
"""Folder-level configuration parsed from YAML.
``partition_by`` and ``max_partitions`` serve as defaults for every
cluster unless overridden at the cluster level.
"""
folder: Path
schemaname: str
if_exists: str = "fail"
@ -165,6 +201,8 @@ class FolderConfig:
include: Optional[List[str]] = None
exclude: Optional[List[str]] = None
explicit: List[_ExplicitPattern] = field(default_factory=list)
partition_by: List[str] = field(default_factory=list)
max_partitions: int = 10_000
# ---------------------------------------------------------------------------
@ -197,8 +235,90 @@ def _parse_columns_filter(
return include_out, exclude_out
def _parse_partition_by(
raw_value: Any, where: str, *, allow_none: bool = False
) -> Optional[List[str]]:
"""Parse a ``partition_by`` value from YAML.
Returns a list of non-empty, unique column name strings. When
``allow_none`` is True (used for per-cluster entries), an omitted key
returns ``None`` to signal "inherit from folder level". An explicit
empty list ``[]`` always returns ``[]``.
"""
if raw_value is None:
return None if allow_none else []
if isinstance(raw_value, str):
if not raw_value.strip():
raise ValueError(f"{where}: 'partition_by' string must be non-empty.")
return [raw_value.strip()]
if isinstance(raw_value, list):
if len(raw_value) == 0:
return []
result: List[str] = []
for i, item in enumerate(raw_value):
if not isinstance(item, str) or not item.strip():
raise ValueError(
f"{where}: 'partition_by[{i}]' must be a non-empty string."
)
result.append(str(item).strip())
if len(result) != len(set(result)):
raise ValueError(
f"{where}: 'partition_by' contains duplicate column names."
)
return result
raise ValueError(
f"{where}: 'partition_by' must be a string or list of strings."
)
def _parse_max_partitions(
raw_value: Any, where: str, *, allow_none: bool = False
) -> Optional[int]:
"""Parse a ``max_partitions`` value from YAML.
Returns a positive integer. When ``allow_none`` is True (used for
per-cluster entries), an omitted key returns ``None`` to signal
"inherit from folder level".
"""
if raw_value is None:
return None if allow_none else 10_000
try:
value = int(raw_value)
except (TypeError, ValueError):
raise ValueError(
f"{where}: 'max_partitions' must be a positive integer, "
f"got {raw_value!r}"
)
if value <= 0:
raise ValueError(
f"{where}: 'max_partitions' must be a positive integer, "
f"got {value}"
)
return value
def _validate_partition_vs_columns(
partition_by: List[str],
exclude: Optional[List[str]],
where: str,
) -> None:
"""Raise if any ``partition_by`` column is in the ``exclude`` list."""
if not partition_by or exclude is None:
return
excluded_parts = [c for c in partition_by if c in exclude]
if excluded_parts:
raise ValueError(
f"{where}: 'exclude' removes partition_by columns: {excluded_parts}"
)
def load_folder_config(path: Path) -> FolderConfig:
"""Parse and validate the folder-level YAML config at ``path``."""
"""Parse and validate the folder-level YAML config at ``path``.
Supports optional ``partition_by`` and ``max_partitions`` at both the
folder level (defaults for all clusters) and per explicit cluster entry
(overrides the folder default).
"""
path = Path(path)
with path.open("r", encoding="utf-8") as f:
raw = yaml.safe_load(f)
@ -221,6 +341,15 @@ def load_folder_config(path: Path) -> FolderConfig:
include, exclude = _parse_columns_filter(raw, f"Config {path}")
# -- folder-level partition settings ------------------------------------
partition_by = _parse_partition_by(
raw.get("partition_by"), f"Config {path}"
)
max_partitions = _parse_max_partitions(
raw.get("max_partitions"), f"Config {path}"
)
_validate_partition_vs_columns(partition_by, exclude, f"Config {path}")
explicit: List[_ExplicitPattern] = []
clusters_raw = raw.get("clusters") or []
if not isinstance(clusters_raw, list):
@ -242,6 +371,19 @@ def load_folder_config(path: Path) -> FolderConfig:
else None
)
c_include, c_exclude = _parse_columns_filter(entry, where)
# -- per-cluster partition settings ---------------------------------
c_partition_by = _parse_partition_by(
entry.get("partition_by"), where, allow_none=True
)
c_max_partitions = _parse_max_partitions(
entry.get("max_partitions"), where, allow_none=True
)
# Validate partition_by vs the effective exclude for this cluster.
effective_exclude = c_exclude if c_exclude is not None else exclude
effective_pb = c_partition_by if c_partition_by is not None else partition_by
_validate_partition_vs_columns(effective_pb, effective_exclude, where)
explicit.append(
_ExplicitPattern(
pattern=compiled,
@ -250,6 +392,8 @@ def load_folder_config(path: Path) -> FolderConfig:
if_exists=c_if_exists,
include=c_include,
exclude=c_exclude,
partition_by=c_partition_by,
max_partitions=c_max_partitions,
)
)
@ -261,6 +405,8 @@ def load_folder_config(path: Path) -> FolderConfig:
include=include,
exclude=exclude,
explicit=explicit,
partition_by=partition_by,
max_partitions=max_partitions,
)
@ -300,6 +446,13 @@ def discover_clusters(cfg: FolderConfig) -> List[ClusterSpec]:
order; files matched by an earlier pattern are removed from the pool
before the next pattern runs. A file matching two patterns triggers a
hard error (that's almost always a config bug).
Partition settings are resolved per cluster:
* For explicit clusters, ``partition_by`` / ``max_partitions`` from the
cluster entry override the folder defaults when present. ``None``
means "inherit"; an explicit ``[]`` disables partitioning.
* For auto-detected clusters, folder defaults are inherited directly.
"""
if not cfg.folder.exists() or not cfg.folder.is_dir():
raise FileNotFoundError(f"Folder not found or not a directory: {cfg.folder}")
@ -320,6 +473,16 @@ def discover_clusters(cfg: FolderConfig) -> List[ClusterSpec]:
remaining = list(pool)
for patt in cfg.explicit:
# Resolve partition_by: None = inherit folder, [] = disable, list = override
resolved_pb = (
patt.partition_by if patt.partition_by is not None
else cfg.partition_by
)
resolved_mp = (
patt.max_partitions if patt.max_partitions is not None
else cfg.max_partitions
)
matched = [f for f in remaining if patt.pattern.search(f.name)]
if not matched:
# Not an error - the folder might legitimately not contain files
@ -333,6 +496,8 @@ def discover_clusters(cfg: FolderConfig) -> List[ClusterSpec]:
exclude=patt.exclude if patt.exclude is not None else cfg.exclude,
source="explicit",
pattern=patt.raw_pattern,
partition_by=resolved_pb,
max_partitions=resolved_mp,
)
)
continue
@ -346,6 +511,8 @@ def discover_clusters(cfg: FolderConfig) -> List[ClusterSpec]:
exclude=patt.exclude if patt.exclude is not None else cfg.exclude,
source="explicit",
pattern=patt.raw_pattern,
partition_by=resolved_pb,
max_partitions=resolved_mp,
)
)
@ -363,6 +530,8 @@ def discover_clusters(cfg: FolderConfig) -> List[ClusterSpec]:
include=cfg.include,
exclude=cfg.exclude,
source="auto",
partition_by=cfg.partition_by,
max_partitions=cfg.max_partitions,
)
)
@ -375,6 +544,7 @@ def discover_clusters(cfg: FolderConfig) -> List[ClusterSpec]:
def _infer_cluster_schema(path: Path, include, exclude):
"""Infer the Postgres column schema from a SAS file preview."""
preview_df, meta = read_sas_preview(path)
preview_df = apply_column_filter(preview_df, include, exclude)
total_rows = getattr(meta, "number_rows", None)
@ -382,9 +552,39 @@ def _infer_cluster_schema(path: Path, include, exclude):
return columns
def _discover_cluster_partitions(
cluster: ClusterSpec,
columns: Dict,
) -> dict:
"""Scan ALL files in ``cluster`` to discover partition values.
Returns a nested partition-value tree suitable for passing to
:func:`load_sas.render_partition_ddl` and :func:`load_sas.create_table`.
Each file is scanned chunk-by-chunk so the full dataset is never
materialized in memory.
"""
merged: dict = {}
for path in cluster.files:
def _filtered_chunks(p=path):
for chunk_df, _chunk_meta in iter_sas_chunks(p):
yield apply_column_filter(
chunk_df, cluster.include, cluster.exclude
)
file_tree = discover_partition_values_chunked(
_filtered_chunks(), cluster.partition_by, columns,
)
_merge_partition_trees(merged, file_tree)
return merged
def load_cluster(conn, cluster: ClusterSpec, schemaname: str) -> int:
"""Load every file in ``cluster`` into one table. Returns total rows loaded.
When ``cluster.partition_by`` is non-empty, partition values are
discovered across ALL files before table creation so the full partition
tree exists before any data is copied.
Commits happen per chunk inside :func:`load_sas.copy_dataframes`. If a
file mid-cluster fails, earlier chunks - including chunks from earlier
files in the cluster - stay committed; only the in-flight chunk is
@ -395,8 +595,49 @@ def load_cluster(conn, cluster: ClusterSpec, schemaname: str) -> int:
first, *rest = cluster.files
first_columns = _infer_cluster_schema(first, cluster.include, cluster.exclude)
# -- Partition support --------------------------------------------------
partition_values: Optional[dict] = None
if cluster.partition_by:
# Validate that all partition_by columns exist in the inferred schema.
missing_pcols = [
c for c in cluster.partition_by if c not in first_columns
]
if missing_pcols:
raise ValueError(
f"cluster {cluster.tablename!r}: partition_by references "
f"columns not present in the inferred schema: {missing_pcols}"
)
# Discover partition values across ALL files in the cluster.
# In append mode the partitions already exist, so skip the scan.
if cluster.if_exists == "append":
print(
" [info] append mode: skipping partition discovery "
"(partitions assumed to exist)",
file=sys.stderr,
)
else:
print(
f" discovering partition values across "
f"{len(cluster.files)} file(s)...",
file=sys.stderr,
)
partition_values = _discover_cluster_partitions(
cluster, first_columns,
)
total_parts = _count_partitions(partition_values)
print(
f" discovered {total_parts:,} partition table(s) "
f"across {len(cluster.partition_by)} level(s)",
file=sys.stderr,
)
create_table(
conn, schemaname, cluster.tablename, first_columns, cluster.if_exists
conn, schemaname, cluster.tablename, first_columns, cluster.if_exists,
partition_by=cluster.partition_by or None,
partition_values=partition_values,
max_partitions=cluster.max_partitions,
)
total = 0
@ -459,8 +700,10 @@ def _build_argparser() -> argparse.ArgumentParser:
"--dry-run",
action="store_true",
help=(
"Print discovered clusters and the inferred CREATE TABLE for "
"each; don't touch Postgres."
"Print discovered clusters and the inferred DDL for each "
"(CREATE TABLE plus partition DDL when applicable). For "
"partitioned clusters all files are scanned to discover "
"partition values. The database is never touched."
),
)
p.add_argument(
@ -487,9 +730,12 @@ def _describe_cluster(cluster: ClusterSpec) -> str:
if cluster.pattern:
src += f" pattern={cluster.pattern!r}"
files = ", ".join(f.name for f in cluster.files) or "(no matching files)"
parts = ""
if cluster.partition_by:
parts = f"\n partition_by: {cluster.partition_by}"
return (
f"cluster {cluster.tablename!r} [{src}] if_exists={cluster.if_exists}\n"
f" files: {files}"
f" files: {files}{parts}"
)
@ -522,9 +768,50 @@ def main(argv: Optional[List[str]] = None) -> int:
if args.dry_run:
print()
for c in loadable:
print(f"--- CREATE TABLE for cluster {c.tablename!r} ---")
print(f"--- DDL for cluster {c.tablename!r} ---")
columns = _infer_cluster_schema(c.files[0], c.include, c.exclude)
print(render_create_table(cfg.schemaname, c.tablename, columns))
# Print parent CREATE TABLE (with PARTITION BY if applicable).
print(
render_create_table(
cfg.schemaname, c.tablename, columns,
partition_by=c.partition_by or None,
)
)
# Print child partition DDL when the cluster is partitioned.
if c.partition_by:
# Validate partition columns exist in the schema.
missing_pcols = [
col for col in c.partition_by if col not in columns
]
if missing_pcols:
print(
f" [error] partition_by references columns not in "
f"schema: {missing_pcols}",
file=sys.stderr,
)
else:
print(
f" discovering partition values across "
f"{len(c.files)} file(s)...",
file=sys.stderr,
)
partition_values = _discover_cluster_partitions(
c, columns,
)
total_parts = _count_partitions(partition_values)
print(
f" discovered {total_parts:,} partition table(s) "
f"across {len(c.partition_by)} level(s)",
file=sys.stderr,
)
child_stmts = render_partition_ddl(
cfg.schemaname, c.tablename, c.partition_by,
partition_values, columns,
max_partitions=c.max_partitions,
)
for stmt in child_stmts:
print()
print(stmt)
print()
return 0


@ -217,9 +217,13 @@ from __future__ import annotations
import argparse
import datetime as dt
import getpass
import hashlib
import io
import json
import logging
import math
import os
import re
import sys
from dataclasses import dataclass, field
from pathlib import Path
@ -233,6 +237,9 @@ import yaml
from dotenv import load_dotenv
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Top-level tunables
# ---------------------------------------------------------------------------
@ -263,6 +270,9 @@ gentler on memory."""
VALID_IF_EXISTS = ("fail", "replace", "append")
_PG_IDENT_MAX_LEN = 63
"""PostgreSQL maximum identifier length in bytes (characters for ASCII)."""
# ---------------------------------------------------------------------------
# Dataclasses
@ -277,6 +287,8 @@ class LoaderConfig:
if_exists: str = "fail"
include: Optional[List[str]] = None
exclude: Optional[List[str]] = None
partition_by: List[str] = field(default_factory=list)
max_partitions: int = 10_000
@dataclass
@ -384,6 +396,65 @@ def load_config(path: Path) -> LoaderConfig:
if exclude is not None and not isinstance(exclude, list):
raise ValueError(f"Config {path}: 'exclude' must be a list of column names.")
# -- partition_by -------------------------------------------------------
raw_pb = raw.get("partition_by")
if raw_pb is None or (isinstance(raw_pb, list) and len(raw_pb) == 0):
partition_by: List[str] = []
elif isinstance(raw_pb, str):
if not raw_pb.strip():
raise ValueError(f"Config {path}: 'partition_by' string must be non-empty.")
partition_by = [raw_pb.strip()]
elif isinstance(raw_pb, list):
partition_by = []
for i, item in enumerate(raw_pb):
if not isinstance(item, str) or not item.strip():
raise ValueError(
f"Config {path}: 'partition_by[{i}]' must be a non-empty string."
)
partition_by.append(str(item).strip())
if len(partition_by) != len(set(partition_by)):
raise ValueError(
f"Config {path}: 'partition_by' contains duplicate column names."
)
else:
raise ValueError(
f"Config {path}: 'partition_by' must be a string or list of strings."
)
# Validate partition_by vs include/exclude
if partition_by:
inc_list = [str(c) for c in include] if include is not None else None
exc_list = [str(c) for c in exclude] if exclude is not None else None
if inc_list is not None:
missing_in_include = [c for c in partition_by if c not in inc_list]
if missing_in_include:
raise ValueError(
f"Config {path}: 'include' omits partition_by columns: "
f"{missing_in_include}"
)
if exc_list is not None:
excluded_parts = [c for c in partition_by if c in exc_list]
if excluded_parts:
raise ValueError(
f"Config {path}: 'exclude' removes partition_by columns: "
f"{excluded_parts}"
)
# -- max_partitions -----------------------------------------------------
raw_mp = raw.get("max_partitions", 10_000)
try:
max_partitions = int(raw_mp)
except (TypeError, ValueError):
raise ValueError(
f"Config {path}: 'max_partitions' must be a positive integer, "
f"got {raw_mp!r}"
)
if max_partitions <= 0:
raise ValueError(
f"Config {path}: 'max_partitions' must be a positive integer, "
f"got {max_partitions}"
)
return LoaderConfig(
filename=filename,
schemaname=schemaname,
@ -391,6 +462,8 @@ def load_config(path: Path) -> LoaderConfig:
if_exists=if_exists,
include=[str(c) for c in include] if include is not None else None,
exclude=[str(c) for c in exclude] if exclude is not None else None,
partition_by=partition_by,
max_partitions=max_partitions,
)
@ -753,24 +826,48 @@ def _table_exists(conn, schema: str, table: str) -> bool:
return cur.fetchone() is not None
def render_create_table(schema: str, table: str, columns: Dict[str, ColumnSpec]) -> str:
def render_create_table(
schema: str,
table: str,
columns: Dict[str, ColumnSpec],
*,
partition_by: Optional[List[str]] = None,
) -> str:
"""Render a ``CREATE TABLE`` statement.
When ``partition_by`` is provided and non-empty, appends a
``PARTITION BY LIST ("first_field")`` clause to the statement.
"""
lines = []
for spec in columns.values():
null_clause = "" if spec.nullable else " NOT NULL"
lines.append(f" {_quote_ident(spec.name)} {spec.postgres_type}{null_clause}")
body = ",\n".join(lines)
return f"CREATE TABLE {_qualified(schema, table)} (\n{body}\n);"
suffix = ""
if partition_by:
suffix = f"\nPARTITION BY LIST ({_quote_ident(partition_by[0])})"
return f"CREATE TABLE {_qualified(schema, table)} (\n{body}\n){suffix};"
def _create_table_sql(conn, schema: str, table: str, columns: Dict[str, ColumnSpec]) -> None:
sql = render_create_table(schema, table, columns)
def _create_table_sql(
conn,
schema: str,
table: str,
columns: Dict[str, ColumnSpec],
*,
partition_by: Optional[List[str]] = None,
) -> None:
"""Execute a ``CREATE TABLE`` statement, optionally with partitioning."""
sql = render_create_table(schema, table, columns, partition_by=partition_by)
with conn.cursor() as cur:
cur.execute(sql)
def _drop_table(conn, schema: str, table: str) -> None:
def _drop_table(conn, schema: str, table: str, *, cascade: bool = False) -> None:
"""Drop a table, optionally with CASCADE for partitioned tables."""
tail = " CASCADE" if cascade else ""
with conn.cursor() as cur:
cur.execute(f"DROP TABLE {_qualified(schema, table)}")
cur.execute(f"DROP TABLE {_qualified(schema, table)}{tail}")
# Normalization table: map both loader-emitted and Postgres-reported type
@ -815,8 +912,6 @@ def _normalize_type(pg_type: str) -> str:
stripped = pg_type.strip().upper()
# Remove trailing (n) / (p,s) before the space-separated tail.
# Examples: "VARCHAR(10)" -> "VARCHAR"; "TIMESTAMP(6) WITHOUT TIME ZONE" -> "TIMESTAMP WITHOUT TIME ZONE"
import re
stripped = re.sub(r"\(\s*\d+\s*(?:,\s*\d+\s*)?\)", "", stripped).strip()
# Collapse doubled whitespace after paren removal.
stripped = re.sub(r"\s+", " ", stripped)
@ -893,11 +988,28 @@ def create_table(
table_name: str,
columns: Dict[str, ColumnSpec],
if_exists: str,
*,
partition_by: Optional[List[str]] = None,
partition_values: Optional[dict] = None,
max_partitions: int = 10_000,
) -> None:
"""Create (or verify) the target table according to ``if_exists``."""
"""Create (or verify) the target table according to ``if_exists``.
When ``partition_by`` is provided and non-empty, the parent table is
created with ``PARTITION BY LIST`` and all child partition DDL from
:func:`render_partition_ddl` is executed immediately after.
For ``replace`` mode the existing table is dropped with ``CASCADE`` so
all child partitions are removed automatically.
For ``append`` mode partition creation is skipped entirely; the
partitions are assumed to already exist from the original creation.
"""
if if_exists not in VALID_IF_EXISTS:
raise ValueError(f"if_exists must be one of {VALID_IF_EXISTS}, got {if_exists!r}")
is_partitioned = bool(partition_by)
exists = _table_exists(conn, schema_name, table_name)
if exists:
if if_exists == "fail":
@ -905,14 +1017,502 @@ def create_table(
f"Table {schema_name}.{table_name} already exists and if_exists=fail"
)
if if_exists == "replace":
_drop_table(conn, schema_name, table_name)
_create_table_sql(conn, schema_name, table_name, columns)
_drop_table(conn, schema_name, table_name, cascade=is_partitioned)
_create_table_sql(
conn, schema_name, table_name, columns,
partition_by=partition_by,
)
if is_partitioned and partition_values is not None:
ddl_stmts = render_partition_ddl(
schema_name, table_name, partition_by, partition_values,
columns, max_partitions=max_partitions,
)
with conn.cursor() as cur:
for stmt in ddl_stmts:
cur.execute(stmt)
return
if if_exists == "append":
_assert_schema_compatible(conn, schema_name, table_name, columns)
return
else:
_create_table_sql(conn, schema_name, table_name, columns)
_create_table_sql(
conn, schema_name, table_name, columns,
partition_by=partition_by,
)
if is_partitioned and partition_values is not None:
ddl_stmts = render_partition_ddl(
schema_name, table_name, partition_by, partition_values,
columns, max_partitions=max_partitions,
)
with conn.cursor() as cur:
for stmt in ddl_stmts:
cur.execute(stmt)
# ---------------------------------------------------------------------------
# Partition support
# ---------------------------------------------------------------------------
def _sanitize_partition_value(value: Any, parent_table: str = "") -> str:
"""Convert a partition value into a safe, deterministic table-name suffix.
Rules:
- Convert to string, lowercase
- Replace non-alphanumeric runs with ``_``
- Collapse consecutive underscores, strip leading/trailing ``_``
- None/NaN -> ``null``; empty string -> ``empty``
- Truncate to fit within PostgreSQL's 63-character identifier limit
accounting for ``parent_table`` + ``_`` separator
"""
if value is None or (isinstance(value, float) and pd.isna(value)):
token = "null"
elif isinstance(value, (dt.date, dt.time)):
# dt.datetime subclasses dt.date, so this covers dates, datetimes, and times.
token = value.isoformat()
else:
token = str(value)
token = token.lower()
token = re.sub(r"[^a-z0-9]+", "_", token)
token = re.sub(r"_+", "_", token)
token = token.strip("_")
if not token:
if value is None or (isinstance(value, float) and pd.isna(value)):
token = "null"
elif isinstance(value, str) and value == "":
token = "empty"
else:
token = "value"
# Truncate to keep total table name within PG's 63-char limit.
if parent_table:
# Reserve room for parent + underscore separator.
max_token_len = _PG_IDENT_MAX_LEN - len(parent_table) - 1
if max_token_len < 1:
raise ValueError(
f"Parent table name {parent_table!r} is too long "
f"({len(parent_table)} chars) to create child partitions."
)
if len(token) > max_token_len:
token = token[:max_token_len].rstrip("_")
return token
def _render_partition_value_literal(value: Any, pg_type: str) -> str:
"""Render a Python value as a SQL literal for ``FOR VALUES IN (...)``.
- None/NaN -> ``NULL``
- Strings -> single-quoted with ``'`` escaped to ``''``
- Numbers -> plain numeric literal
- Booleans -> ``TRUE`` / ``FALSE``
- Dates -> ``DATE 'YYYY-MM-DD'``
- Timestamps -> ``TIMESTAMP 'YYYY-MM-DD HH:MM:SS'``
- Times -> ``TIME 'HH:MM:SS'``
"""
if value is None or (isinstance(value, float) and pd.isna(value)):
return "NULL"
pg_upper = pg_type.upper()
if pg_upper in ("BOOLEAN", "BOOL"):
return "TRUE" if value else "FALSE"
if pg_upper in ("INTEGER", "BIGINT", "SMALLINT", "INT", "INT4", "INT8", "INT2"):
return str(int(value))
if pg_upper in ("DOUBLE PRECISION", "REAL", "NUMERIC", "DECIMAL",
"FLOAT4", "FLOAT8"):
return str(value)
if pg_upper == "DATE":
if isinstance(value, (dt.date, dt.datetime)):
return f"DATE '{value.isoformat()}'"
return f"DATE '{value}'"
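if pg_upper in ("TIMESTAMP", "TIMESTAMP WITHOUT TIME ZONE",
"TIMESTAMP WITH TIME ZONE", "TIMESTAMPTZ"):
# A plain TIMESTAMP literal silently ignores any UTC offset, so use a
# TIMESTAMPTZ literal for zone-aware columns.
is_tz = pg_upper in ("TIMESTAMP WITH TIME ZONE", "TIMESTAMPTZ")
keyword = "TIMESTAMPTZ" if is_tz else "TIMESTAMP"
if isinstance(value, (dt.datetime, pd.Timestamp)):
return f"{keyword} '{value.isoformat()}'"
if isinstance(value, dt.date):
return f"{keyword} '{dt.datetime(value.year, value.month, value.day).isoformat()}'"
return f"{keyword} '{value}'"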
if pg_upper in ("TIME", "TIME WITHOUT TIME ZONE",
"TIME WITH TIME ZONE", "TIMETZ"):
if isinstance(value, dt.time):
return f"TIME '{value.isoformat()}'"
return f"TIME '{value}'"
# Default: treat as text — single-quote with escaping.
escaped = str(value).replace("'", "''")
return f"'{escaped}'"
def _normalize_partition_value(value: Any, pg_type: str) -> Any:
"""Normalize a raw partition value to its Python-native form.
Applies the same semantic normalization that :func:`_prepare_for_copy`
uses, so partition discovery deduplicates on the routed value rather
than the raw source representation.
"""
# Handle pandas null types
if value is None:
return None
if isinstance(value, float) and (pd.isna(value) or math.isnan(value)):
return None
try:
if pd.isna(value):
return None
except (TypeError, ValueError):
pass
pg_upper = pg_type.upper()
if pg_upper in ("INTEGER", "BIGINT", "SMALLINT", "INT", "INT4", "INT8", "INT2"):
if isinstance(value, str):
value = value.strip()
if value == "":
return None
try:
return int(float(value))
except (TypeError, ValueError):
return None
if pg_upper in ("DOUBLE PRECISION", "REAL", "NUMERIC", "DECIMAL",
"FLOAT4", "FLOAT8"):
if isinstance(value, str):
value = value.strip()
if value == "":
return None
try:
result = float(value)
return None if math.isnan(result) else result
except (TypeError, ValueError):
return None
if pg_upper == "DATE":
if isinstance(value, dt.datetime):
return value.date()
if isinstance(value, dt.date):
return value
if isinstance(value, str):
if value.strip() == "":
return None
try:
return dt.date.fromisoformat(value.strip())
except (ValueError, TypeError):
return None
return None
if pg_upper in ("TIMESTAMP", "TIMESTAMP WITHOUT TIME ZONE",
"TIMESTAMP WITH TIME ZONE", "TIMESTAMPTZ"):
if isinstance(value, dt.datetime):
return value
if isinstance(value, pd.Timestamp):
return value.to_pydatetime() if not pd.isna(value) else None
if isinstance(value, dt.date):
return dt.datetime(value.year, value.month, value.day)
if isinstance(value, str):
if value.strip() == "":
return None
try:
return dt.datetime.fromisoformat(value.strip())
except (ValueError, TypeError):
return None
return None
if pg_upper in ("TIME", "TIME WITHOUT TIME ZONE",
"TIME WITH TIME ZONE", "TIMETZ"):
return _seconds_to_time(value)
if pg_upper in ("BOOLEAN", "BOOL"):
if isinstance(value, bool):
return value
if isinstance(value, (int, float)):
return bool(value)
if isinstance(value, str):
return value.strip().lower() in ("true", "1", "t", "yes")
return None
# Text-like types: None, pandas nulls, and '' all become None
# because copy_dataframes() sends empty strings with NULL ''.
if pg_upper in ("TEXT", "VARCHAR", "CHARACTER VARYING", "CHAR", "CHARACTER", "BPCHAR"):
if isinstance(value, str):
if value == "":
return None
return value
return str(value)
# Fallback: return as-is converted to native Python type
if hasattr(value, "item"):
return value.item()
return value
def discover_partition_values(
df: pd.DataFrame,
partition_by: list[str],
columns: Optional[Dict[str, ColumnSpec]] = None,
) -> dict:
"""Build a nested structure of unique partition values from a DataFrame.
For ``partition_by = ['state', 'zip']`` returns::
{
'MO': {'63101': {}, '63102': {}},
'IL': {'62001': {}, '62002': {}}
}
When ``columns`` is provided, values are normalized through
:func:`_normalize_partition_value` to match the routed values Postgres
will see during ``COPY``.
None/NaN values are included as a distinct partition value (``None`` key).
Values are converted to Python native types (not numpy types).
"""
if not partition_by:
return {}
def _to_native(val: Any) -> Any:
"""Convert numpy scalars to Python native types."""
if val is None:
return None
if isinstance(val, float) and pd.isna(val):
return None
if hasattr(val, "item"):
return val.item()
return val
def _build_level(sub_df: pd.DataFrame, fields: list[str]) -> dict:
if not fields or sub_df.empty:
return {}
field = fields[0]
remaining = fields[1:]
result: dict = {}
# Get unique values, handling NaN
unique_vals = sub_df[field].unique()
for raw_val in unique_vals:
val = _to_native(raw_val)
# Normalize if column spec is available
if columns and field in columns:
val = _normalize_partition_value(val, columns[field].postgres_type)
if remaining:
# Filter rows matching this value
if val is None:
# Match every raw value that normalizes to NULL, for any column
# type (not only text columns holding empty strings).
mask = sub_df[field].isna() | sub_df[field].map(
lambda v: _matches(v, None, field)
)
else:
mask = sub_df[field].map(lambda v, target=val: _matches(v, target, field))
child_df = sub_df[mask]
result[val] = _build_level(child_df, remaining)
else:
result[val] = {}
return result
def _matches(raw_val: Any, target: Any, field_name: str) -> bool:
"""Check if a raw value normalizes to the target."""
native = _to_native(raw_val)
if columns and field_name in columns:
native = _normalize_partition_value(native, columns[field_name].postgres_type)
if target is None:
return native is None
return native == target
return _build_level(df, list(partition_by))
def discover_partition_values_chunked(
chunk_iter: Iterable[pd.DataFrame],
partition_by: list[str],
columns: Optional[Dict[str, ColumnSpec]] = None,
) -> dict:
"""Discover partition values across an iterable of DataFrame chunks.
Scans the entire file chunk-by-chunk, collecting unique partition
column values and merging them into a single nested partition tree.
This avoids materializing the full file in memory.
"""
if not partition_by:
return {}
merged: dict = {}
for chunk_df in chunk_iter:
if chunk_df.empty:
continue
# Only keep partition columns to minimize memory
part_cols = [c for c in partition_by if c in chunk_df.columns]
if len(part_cols) != len(partition_by):
missing = [c for c in partition_by if c not in chunk_df.columns]
raise ValueError(
f"Partition columns not found in data: {missing}"
)
sub_df = chunk_df[part_cols]
chunk_tree = discover_partition_values(sub_df, partition_by, columns)
_merge_partition_trees(merged, chunk_tree)
return merged
def _merge_partition_trees(target: dict, source: dict) -> None:
"""Merge ``source`` partition tree into ``target`` in place.
Both trees are nested dicts where keys are partition values and values
are either empty dicts (leaf) or nested dicts (intermediate levels).
"""
for key, subtree in source.items():
if key not in target:
target[key] = subtree
else:
# Merge children recursively
if subtree and target[key]:
_merge_partition_trees(target[key], subtree)
elif subtree:
target[key] = subtree
def _count_partitions(tree: dict) -> int:
"""Count total partition tables in a nested partition tree."""
count = 0
for _key, children in tree.items():
count += 1
if children:
count += _count_partitions(children)
return count
def render_partition_ddl(
schema: str,
parent_table: str,
partition_by: list[str],
partition_values: dict,
column_specs: Dict[str, ColumnSpec],
*,
max_partitions: int = 10_000,
) -> list[str]:
"""Generate all child partition DDL statements for the partition tree.
Returns a list of SQL strings to execute in order (depth-first).
The parent ``CREATE TABLE`` is NOT included; it is rendered separately
by :func:`render_create_table`.
Logs a warning if the total partition count exceeds ``max_partitions``,
but continues.
"""
if not partition_by or not partition_values:
return []
total = _count_partitions(partition_values)
if total > max_partitions:
logger.warning(
"Partition count (%d) exceeds threshold (%d). "
"This may impact database performance.",
total, max_partitions,
)
print(
f"[warn] partition plan for {schema}.{parent_table} will create "
f"{total:,} partition tables, exceeding max_partitions={max_partitions:,}",
file=sys.stderr,
)
# Statements are collected depth-first; child-name collisions are resolved
# per parent level inside _render_partition_ddl_recursive().
statements: list[str] = []
_render_partition_ddl_recursive(
schema, parent_table, partition_by, partition_values,
column_specs, 0, statements,
)
return statements
def _render_partition_ddl_recursive(
schema: str,
parent_table: str,
partition_by: list[str],
values: dict,
column_specs: Dict[str, ColumnSpec],
depth: int,
statements: list[str],
) -> None:
"""Recursively generate partition DDL statements (depth-first)."""
field_name = partition_by[depth]
next_field = partition_by[depth + 1] if depth + 1 < len(partition_by) else None
field_spec = column_specs.get(field_name)
pg_type = field_spec.postgres_type if field_spec else "TEXT"
# Track names used at this level under this parent to handle collisions
used_names: Dict[str, Any] = {}
# Sort values deterministically: None first, then by string representation
def _sort_key(val: Any) -> Tuple[int, str]:
if val is None:
return (0, "")
return (1, str(val))
sorted_values = sorted(values.keys(), key=_sort_key)
for val in sorted_values:
children = values[val]
token = _sanitize_partition_value(val, parent_table)
child_name = f"{parent_table}_{token}"
# Handle collisions
if child_name in used_names and used_names[child_name] is not val:
# Append a short hash of the value to disambiguate
val_hash = hashlib.sha256(repr(val).encode()).hexdigest()[:8]
# Re-truncate token to make room for _hash
max_token_len = _PG_IDENT_MAX_LEN - len(parent_table) - 1 - 9 # _hash8
if max_token_len < 1:
max_token_len = 1
truncated_token = token[:max_token_len].rstrip("_")
child_name = f"{parent_table}_{truncated_token}_{val_hash}"
# Final length check
if len(child_name) > _PG_IDENT_MAX_LEN:
child_name = child_name[:_PG_IDENT_MAX_LEN]
used_names[child_name] = val
literal = _render_partition_value_literal(val, pg_type)
if next_field is not None:
# Intermediate partition: itself partitioned by the next field
stmt = (
f"CREATE TABLE {_qualified(schema, child_name)} "
f"PARTITION OF {_qualified(schema, parent_table)} "
f"FOR VALUES IN ({literal}) "
f"PARTITION BY LIST ({_quote_ident(next_field)});"
)
statements.append(stmt)
# Recurse into children
if children:
_render_partition_ddl_recursive(
schema, child_name, partition_by, children,
column_specs, depth + 1, statements,
)
else:
# Leaf partition
stmt = (
f"CREATE TABLE {_qualified(schema, child_name)} "
f"PARTITION OF {_qualified(schema, parent_table)} "
f"FOR VALUES IN ({literal});"
)
statements.append(stmt)
# ---------------------------------------------------------------------------
@ -1208,6 +1808,15 @@ def main(argv: Optional[List[str]] = None) -> int:
preview_df = apply_column_filter(preview_df, cfg.include, cfg.exclude)
columns = infer_schema(preview_df, meta)
# Validate partition columns exist in the schema after filtering.
if cfg.partition_by:
missing_pcols = [c for c in cfg.partition_by if c not in columns]
if missing_pcols:
raise ValueError(
f"partition_by references columns not present in the "
f"(filtered) schema: {missing_pcols}"
)
if args.validate:
manifest_path = cfg.filename.with_suffix("").with_suffix(".expected.json")
# The above strips .xpt then appends .expected.json, e.g.
@ -1220,8 +1829,51 @@ def main(argv: Optional[List[str]] = None) -> int:
return 1
print(f"validation OK ({len(columns)} columns match {manifest_path.name})")
# -- Partition value discovery ------------------------------------------
# If partitioned, scan the ENTIRE file to discover all unique partition
# values. The preview is only the first N rows and may miss values.
# In append mode the partitions already exist, so skip the costly scan.
partition_values: Optional[dict] = None
if cfg.partition_by and cfg.if_exists != "append":
print(" discovering partition values (full file scan)...", file=sys.stderr)
def _discovery_chunks():
for chunk_df, _chunk_meta in iter_sas_chunks(cfg.filename):
yield apply_column_filter(chunk_df, cfg.include, cfg.exclude)
partition_values = discover_partition_values_chunked(
_discovery_chunks(), cfg.partition_by, columns,
)
total_parts = _count_partitions(partition_values)
print(
f" discovered {total_parts:,} partition tables "
f"across {len(cfg.partition_by)} level(s)",
file=sys.stderr,
)
elif cfg.partition_by and cfg.if_exists == "append":
print(
" [info] append mode: skipping partition discovery "
"(partitions assumed to exist)",
file=sys.stderr,
)
if args.dry_run:
print(render_create_table(cfg.schemaname, cfg.tablename, columns))
# Print the parent CREATE TABLE (with PARTITION BY if applicable).
parent_ddl = render_create_table(
cfg.schemaname, cfg.tablename, columns,
partition_by=cfg.partition_by or None,
)
print(parent_ddl)
# Print child partition DDL if partitioned.
if cfg.partition_by and partition_values:
child_stmts = render_partition_ddl(
cfg.schemaname, cfg.tablename, cfg.partition_by,
partition_values, columns,
max_partitions=cfg.max_partitions,
)
for stmt in child_stmts:
print()
print(stmt)
return 0
# Release the preview frame before opening the stream - lets the GC reclaim
@ -1244,7 +1896,12 @@ def main(argv: Optional[List[str]] = None) -> int:
conn = connect(user=db_user, password=db_password)
conn.autocommit = False
try:
create_table(conn, cfg.schemaname, cfg.tablename, columns, cfg.if_exists)
create_table(
conn, cfg.schemaname, cfg.tablename, columns, cfg.if_exists,
partition_by=cfg.partition_by or None,
partition_values=partition_values,
max_partitions=cfg.max_partitions,
)
inserted = copy_dataframes(
conn, cfg.schemaname, cfg.tablename, _filtered_chunks(), columns
)
@ -1259,6 +1916,9 @@ def main(argv: Optional[List[str]] = None) -> int:
f"loaded {inserted} rows into {cfg.schemaname}.{cfg.tablename} "
f"({len(columns)} columns)"
)
if cfg.partition_by and partition_values:
total_parts = _count_partitions(partition_values)
print(f"partitioned by {cfg.partition_by} ({total_parts:,} partition tables)")
print("final schema:")
print(_format_columns_summary(columns))
return 0
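
The nested LIST-partition DDL that this change generates can be illustrated with a small, self-contained sketch. It is not the code from `load_sas.py`: the helper names (`quote_ident`, `quote_literal`, `render_list_partitions`) are hypothetical, keys are assumed to be plain strings or `None`, and the name sanitizing, 63-character truncation, collision handling, and typed literals from the diff are deliberately left out.

```python
from typing import Dict, List, Optional


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: Optional[str]) -> str:
    """Render a text value (or None) as a SQL literal."""
    if value is None:
        return "NULL"
    return "'" + value.replace("'", "''") + "'"


def render_list_partitions(
    schema: str,
    parent: str,
    partition_by: List[str],
    tree: Dict[Optional[str], dict],
    depth: int = 0,
) -> List[str]:
    """Return CREATE TABLE ... PARTITION OF statements, depth-first."""
    next_field = partition_by[depth + 1] if depth + 1 < len(partition_by) else None
    statements: List[str] = []
    # None sorts first, then keys in string order, so output is deterministic.
    for key in sorted(tree, key=lambda k: (k is not None, str(k))):
        suffix = "null" if key is None else key.lower()
        child = f"{parent}_{suffix}"
        stmt = (
            f"CREATE TABLE {quote_ident(schema)}.{quote_ident(child)} "
            f"PARTITION OF {quote_ident(schema)}.{quote_ident(parent)} "
            f"FOR VALUES IN ({quote_literal(key)})"
        )
        if next_field is not None:
            # Intermediate level: the child is itself partitioned.
            stmt += f" PARTITION BY LIST ({quote_ident(next_field)})"
        statements.append(stmt + ";")
        if next_field is not None and tree[key]:
            statements.extend(
                render_list_partitions(schema, child, partition_by, tree[key], depth + 1)
            )
    return statements


tree = {"MO": {"63101": {}, "63102": {}}, "IL": {"62001": {}}}
statements = render_list_partitions("public", "addr", ["state", "zip"], tree)
for statement in statements:
    print(statement)
```

Each intermediate partition is emitted before its children, which matters because PostgreSQL needs the parent of a `PARTITION OF` clause to exist when the child is created.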


@ -15,3 +15,17 @@ tablename: kitchensink
# What to do if the target table already exists: fail | replace | append
# Defaults to fail.
if_exists: append
# partition_by: Partition the table by unique values of these columns.
# Columns are applied in cascading order (first column = top-level partition).
# Initial creation requires if_exists: replace or fail. With append, partition
# discovery is skipped and the partitions must already exist.
# Single field:
# partition_by: state
# Multiple fields (cascading):
# partition_by:
# - state
# - zip
#
# max_partitions: Warning threshold for total partition count (default: 10000).
# If the number of partitions exceeds this, a warning is logged but loading continues.
# max_partitions: 10000
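
How a loader might normalize these two keys once the YAML has been parsed can be sketched as follows. This is a minimal illustration under stated assumptions, not the parsing code in this commit: the function name `parse_partition_settings` and its validation rules are hypothetical, and it operates on an already-parsed `dict` rather than reading the file.

```python
from typing import Any, Dict, List, Tuple

DEFAULT_MAX_PARTITIONS = 10_000  # the default documented in the config comments


def parse_partition_settings(raw: Dict[str, Any]) -> Tuple[List[str], int]:
    """Normalize ``partition_by`` to a list and validate ``max_partitions``."""
    value = raw.get("partition_by")
    if value is None:
        partition_by: List[str] = []
    elif isinstance(value, str):
        # Single-field form: ``partition_by: state``
        partition_by = [value]
    elif isinstance(value, list) and all(isinstance(v, str) for v in value):
        # Cascading form: first entry is the top-level partition column.
        partition_by = list(value)
    else:
        raise ValueError("partition_by must be a string or a list of strings")

    if len(set(partition_by)) != len(partition_by):
        raise ValueError(f"partition_by contains duplicate columns: {partition_by}")

    max_partitions = raw.get("max_partitions", DEFAULT_MAX_PARTITIONS)
    # bool is a subclass of int, so reject it explicitly.
    if (
        isinstance(max_partitions, bool)
        or not isinstance(max_partitions, int)
        or max_partitions < 1
    ):
        raise ValueError("max_partitions must be a positive integer")

    return partition_by, max_partitions


print(parse_partition_settings({"partition_by": "state"}))
print(parse_partition_settings({"partition_by": ["state", "zip"], "max_partitions": 500}))
```

Normalizing to a list up front lets the rest of the code treat the single-field and cascading forms identically.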


@ -31,6 +31,20 @@ auto_detect: true
# exclude:
# - ALLNULL
# Folder-level partition_by: Partition every cluster's table by unique values
# of these columns. Inherited by all clusters unless overridden per-cluster.
# Initial creation requires if_exists: replace or fail. With append, partition
# discovery is skipped and the partitions must already exist.
# Single field:
# partition_by: state
# Multiple fields (cascading):
# partition_by:
# - state
# - zip
#
# Folder-level max_partitions: Warning threshold for total partition count
# (default: 10000). Inherited by all clusters unless overridden per-cluster.
# max_partitions: 10000
# Explicit cluster patterns. Each pattern is matched against the file
# *basename*. Files matched by a pattern are pulled out of the auto-detect
# pool, so explicit and auto clusters compose cleanly.
@ -48,6 +62,16 @@ clusters:
# tablename: group_b
# if_exists: append
# Per-cluster partition_by / max_partitions override. These take precedence
# over the folder-level defaults above.
#
# - pattern: '^group_c\d+\.xpt$'
# tablename: group_c
# partition_by:
# - region
# - year
# max_partitions: 500
# With only the gq pattern explicit, auto_detect: true will still bucket
# group_b1.xpt + group_b2.xpt into a "group_b" cluster and the lone
# standalone.xpt into a "standalone" cluster. See generate_sample_folder.py