Commit Graph

27 Commits

Author SHA1 Message Date
David Peterson
0632e110e5 Implement parallel processing for partition discovery in load_folder.py and enhance column filtering in load_sas.py
Added support for parallel processing using ProcessPoolExecutor in the _discover_cluster_partitions function, so partition values are discovered across multiple files concurrently. Scans now read only the columns they need, which reduces I/O overhead. Additionally, updated the iter_sas_chunks and iter_text_chunks functions to accept a usecols parameter, enabling selective column parsing during data loading. These enhancements aim to optimize resource usage and speed up the data processing pipeline.
2026-04-22 15:35:19 +00:00
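The parallel partition scan described in this commit could look roughly like the sketch below. It is not the repository's code: it uses the standard-library csv module as a stand-in for the real SAS and text readers, and the names scan_partition_values, discover_partition_values, and executor_cls are hypothetical.

```python
import csv
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial


def scan_partition_values(path, column):
    """Read one delimited file and return the distinct values of one column."""
    values = set()
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Only the partition column is kept. Note that csv still parses
            # each full row; a reader with real column selection (such as the
            # usecols parameter named in the commit) would skip the rest.
            values.add(row[column])
    return values


def discover_partition_values(paths, column, workers=4,
                              executor_cls=ProcessPoolExecutor):
    """Union the per-file partition values, scanning files in parallel."""
    scan = partial(scan_partition_values, column=column)
    merged = set()
    with executor_cls(max_workers=workers) as pool:
        # map() yields one result set per input file, in input order.
        for values in pool.map(scan, paths):
            merged |= values
    return sorted(merged)
```

Passing executor_cls=ThreadPoolExecutor runs the same logic in threads, which can be convenient in tests where spawning processes is awkward.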
michael-corey
1197846d10 adding text file support 2026-04-21 20:05:26 -05:00
David Peterson
64e7ff0b0a Enhance error reporting in load_folder.py and load_sas.py for better debugging
Updated error handling in the _worker_load_append_file function to include full tracebacks in exception messages, improving context for failures during file loading. Additionally, modified the _safe_numeric_to_datetime function to provide detailed warnings when conversion errors occur, ensuring users are informed of potential data issues. These changes aim to facilitate easier debugging and enhance the robustness of the data loading process.
2026-04-21 16:56:27 -05:00
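One way a worker can carry a full traceback back to the parent process is to format it as a string inside the worker. The sketch below illustrates that idea only; load_one_file and worker_load_file are hypothetical names, not the functions in load_folder.py.

```python
import traceback


def load_one_file(path):
    """Hypothetical stand-in for the real per-file load step; always fails."""
    raise ValueError(f"cannot parse {path}")


def worker_load_file(path):
    """Return (path, error_text); error_text is None on success."""
    try:
        load_one_file(path)
        return path, None
    except Exception as exc:
        # format_exc() renders the full stack as plain text, so it survives
        # being pickled and sent back from a worker process.
        detail = traceback.format_exc()
        return path, f"{type(exc).__name__}: {exc}\n{detail}"
```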
David Peterson
eff82c73ce Add all_nullable configuration option in load_folder.py and load_sas.py for flexible schema management
Introduced an `all_nullable` boolean option in both `load_folder.py` and `load_sas.py`, allowing users to specify whether all columns should be treated as nullable during schema inference. This feature addresses scenarios where the data sampling may incorrectly suggest that columns are non-nullable, preventing potential errors during data loading. Updated YAML configuration files to include examples of this new option, enhancing usability and providing clearer documentation for users.
2026-04-21 16:48:37 -05:00
David Peterson
c283b42876 Add safe numeric to datetime conversion in load_sas.py to handle edge cases
Implemented the _safe_numeric_to_datetime function to convert numeric SAS-epoch series to datetime64[ns] while managing potential overflow and non-finite values. This enhancement improves error handling during data processing by masking invalid entries before conversion, ensuring robust handling of SAS date formats in the _prepare_for_copy function.
2026-04-21 15:55:25 -05:00
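The masking step in this commit can be illustrated without pandas. The sketch below is a pure-Python stand-in for the vectorized version: it assumes SAS datetimes are seconds since 1960-01-01 and treats any value that is non-finite, or whose nanosecond count would not fit in a signed 64-bit integer, as invalid. The function name mask_and_convert is hypothetical.

```python
import math

# Seconds between the SAS epoch (1960-01-01) and the Unix epoch (1970-01-01):
# 10 years including 3 leap days = 3653 days.
SAS_TO_UNIX_SECONDS = 3653 * 86400
INT64_MAX = 2**63 - 1


def mask_and_convert(sas_seconds):
    """Convert SAS-epoch seconds to nanoseconds since the Unix epoch.

    Returns (values, masked_count). Invalid entries become None, mirroring
    how a datetime64[ns] column would hold NaT.
    """
    out = []
    masked = 0
    for value in sas_seconds:
        if value is None or not math.isfinite(value):
            out.append(None)
            masked += 1
            continue
        nanos = round((value - SAS_TO_UNIX_SECONDS) * 1_000_000_000)
        # Anything outside the signed 64-bit range cannot be stored as
        # datetime64[ns], so it is masked rather than converted.
        if abs(nanos) > INT64_MAX:
            out.append(None)
            masked += 1
            continue
        out.append(nanos)
    return out, masked
```

Returning the masked count alongside the values makes it straightforward to emit a warning when any entries were dropped.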
David Peterson
a46f0518f6 Suppress PerformanceWarning in load_sas.py to reduce noise during processing of wide SAS files. This change filters out warnings related to DataFrame fragmentation, which are irrelevant for our pipeline as we directly convert DataFrames to pyarrow tables. 2026-04-21 13:40:38 -05:00
David Peterson
969a442775 Refactor numeric column type inference in load_sas.py for improved data handling
Updated the logic for determining column types in the union_column_types function. Changed the default type from BIGINT to DOUBLE PRECISION for numeric columns without explicit format hints, ensuring better handling of both integer and float values. This adjustment prevents loading failures due to format discrepancies and maintains consistent data processing across various SAS formats.
2026-04-21 13:17:01 -05:00
David Peterson
ae65140390 Add column type overrides in load_folder.py and load_sas.py for enhanced schema control
Implemented a new feature allowing users to specify explicit column type mappings via a `column_types` configuration in both `load_folder.py` and `load_sas.py`. This addition enables users to bypass automatic type inference for specific columns, ensuring correct data types are used when loading datasets. Updated the YAML configuration files to include examples of the new `column_types` option, enhancing usability and flexibility in handling varying data formats across files.
2026-04-21 12:14:44 -05:00
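Conceptually, a column_types override is a merge of explicit types over inferred ones. The sketch below assumes both are plain dicts keyed by column name and that naming an unknown column should be an error; the function name and that error behavior are assumptions, not taken from the repository.

```python
def apply_column_type_overrides(inferred, overrides):
    """Return a new mapping where explicit overrides replace inferred types."""
    unknown = sorted(set(overrides) - set(inferred))
    if unknown:
        # Assumed behavior: fail loudly instead of silently ignoring typos.
        raise KeyError(f"column_types names unknown columns: {unknown}")
    merged = dict(inferred)   # leave the caller's mapping untouched
    merged.update(overrides)  # explicit types win over inference
    return merged
```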
David Peterson
0c5e6e31f0 Enhance memory management in load_folder.py and load_sas.py for improved performance
Added memory management optimizations in the _worker_load_append_file function to release unused memory from pyarrow's pool and trigger Python's garbage collection. Implemented explicit memory trimming using glibc's malloc_trim to ensure efficient memory usage during long-running processes. Updated the copy_dataframes function in load_sas.py to release pyarrow's memory pool between chunks, preventing high memory usage in long-lived workers. These changes aim to reduce memory footprint and improve overall performance during large dataset processing.
2026-04-21 10:46:54 -05:00
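The three release steps named in this commit (garbage collection, pyarrow's pool, and glibc's malloc_trim) might be combined as in the sketch below. It is a best-effort illustration: release_memory is a hypothetical name, pyarrow is treated as optional, and the malloc_trim call is skipped on platforms where libc.so.6 is not available.

```python
import ctypes
import gc


def release_memory():
    """Best-effort memory release; returns (objects_collected, heap_trimmed)."""
    # Collect unreachable Python objects first so their buffers are freed.
    collected = gc.collect()

    try:
        import pyarrow as pa  # optional in this sketch
        # Ask Arrow's default allocator to hand unused memory back to the OS.
        pa.default_memory_pool().release_unused()
    except (ImportError, AttributeError):
        pass

    trimmed = False
    try:
        libc = ctypes.CDLL("libc.so.6")  # glibc only
        # malloc_trim(0) returns 1 if memory was released, 0 otherwise.
        trimmed = bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        pass
    return collected, trimmed
```

Calling a helper like this between chunks or between files is most useful in long-lived worker processes, where freed memory otherwise tends to stay with the allocator.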
David Peterson
9afb52aecb Add --chunk-rows option to load_folder.py for customizable memory management
Introduced a new command-line argument, --chunk-rows, allowing users to specify the number of rows per chunk for pyreadstat streaming and COPY operations. This option overrides the GENERIC_LOADER_CHUNK_ROWS environment variable and auto-scaling behavior when using multiple workers. Enhanced memory management by providing detailed information on peak memory usage based on the specified chunk size, improving performance and usability during large dataset processing.
2026-04-21 10:05:21 -05:00
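The precedence described here (command-line flag, then GENERIC_LOADER_CHUNK_ROWS, then auto-scaling) can be sketched as follows. The default of 100,000 rows, the divide-by-workers scaling rule, the --workers flag, and the function names are all assumptions made for illustration; only --chunk-rows and the environment variable name come from the commit message.

```python
import argparse

DEFAULT_CHUNK_ROWS = 100_000  # hypothetical default, not from the repository


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--chunk-rows", type=int, default=None)
    parser.add_argument("--workers", type=int, default=1)  # assumed flag
    return parser


def resolve_chunk_rows(cli_value, environ, workers):
    """Pick the chunk size: CLI flag first, then env var, then auto-scale."""
    if cli_value is not None:
        return cli_value
    env_value = environ.get("GENERIC_LOADER_CHUNK_ROWS")
    if env_value:
        return int(env_value)
    # Assumed auto-scaling rule: split the default budget across workers.
    return max(1, DEFAULT_CHUNK_ROWS // max(1, workers))
```

Taking the environment as a parameter (for example os.environ) keeps the resolution logic easy to test.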
David Peterson
1265489276 Enhance date and timestamp handling in _prepare_for_copy function in load_sas.py
Added support for numeric date and datetime conversions from SAS formats. Implemented logic to handle float64 representations of dates (days since 1960-01-01) and datetimes (seconds since 1960-01-01), ensuring proper parsing and preventing errors during data copying to Postgres. This enhancement improves compatibility with various SAS date formats.
2026-04-21 08:16:17 -05:00
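For reference, the two SAS encodings mentioned in this commit convert as shown in the minimal sketch below. It uses only the standard library rather than the vectorized code in load_sas.py, and the function names are hypothetical.

```python
import math
from datetime import datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)


def sas_date_to_date(days):
    """Convert a SAS date (days since 1960-01-01) to a datetime.date."""
    if days is None or not math.isfinite(days):
        return None  # missing values stay missing
    return (SAS_EPOCH + timedelta(days=days)).date()


def sas_datetime_to_datetime(seconds):
    """Convert a SAS datetime (seconds since 1960-01-01) to a datetime."""
    if seconds is None or not math.isfinite(seconds):
        return None
    return SAS_EPOCH + timedelta(seconds=seconds)
```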
David Peterson
fe7dc4d5a1 Enhance load_cluster function for parallel processing and progress tracking
Refactored the load_cluster function in load_folder.py to support parallel file loading using ProcessPoolExecutor, improving performance during the append phase. Added workers parameter for controlling parallelism and integrated a progress_queue for real-time progress updates. Introduced read_sas_metadata function in load_sas.py to efficiently read metadata from SAS files, optimizing the pre-scan process for global progress tracking.
2026-04-20 22:02:55 -05:00
David Peterson
96f2d6fe79 Update requirements and enhance SAS file processing with progress tracking
Updated the pyarrow version in requirements.txt to improve compatibility. Enhanced the _infer_cluster_schema and _stream_file functions in load_folder.py and load_sas.py to return total row counts for better progress tracking during data streaming. Integrated tqdm for visual feedback on row processing, improving user experience during large data loads.
2026-04-20 21:44:49 -05:00
David Peterson
7beb44ac4d Add pyarrow dependency and optimize DataFrame serialization in load_sas.py
Included pyarrow as a new dependency in requirements.txt for improved CSV serialization performance. Refactored the _prepare_for_copy function to utilize vectorized operations for date and timestamp conversions, reducing CPU overhead. Introduced a new _serialize_chunk_csv function leveraging pyarrow for faster CSV writing, enhancing efficiency during data copying to Postgres.
2026-04-20 21:32:56 -05:00
David Peterson
5e347f50ef Add widening compatibility checks in load_sas.py for type inference
Introduced a new set of widening compatible type pairs to allow for accepting narrower inferred types when they fit within wider target types during schema compatibility checks. This change enhances the type inference process by preventing unnecessary mismatches and improving handling of varying integer ranges in cluster loads. Updated warning messages to inform users of accepted type adjustments.
2026-04-20 21:08:13 -05:00
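A widening check of this kind can be expressed as a set of (inferred, target) pairs. The pairs below are illustrative guesses based on Postgres integer types; the actual set in load_sas.py is not shown in the commit message, and the names here are hypothetical.

```python
# Hypothetical (narrower inferred type, wider target type) pairs.
WIDENING_COMPATIBLE = {
    ("SMALLINT", "INTEGER"),
    ("SMALLINT", "BIGINT"),
    ("INTEGER", "BIGINT"),
}


def is_schema_compatible(inferred_type, target_type):
    """True if a column inferred as inferred_type fits the target column."""
    if inferred_type == target_type:
        return True
    # A narrower inferred type is accepted when the target is wider,
    # but never the other way around.
    return (inferred_type, target_type) in WIDENING_COMPATIBLE
```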
David Peterson
f84e127796 Update type inference behavior in load_sas.py to scan entire files by default
Changed the default setting for TYPE_INFERENCE_SAMPLE_ROWS to None, allowing type and nullability inference to consider all rows in a SAS file. This adjustment ensures accurate handling of null values and integer ranges, addressing issues observed in production with large datasets. Updated documentation to reflect the implications of this change and the risks associated with using an integer cap for sampling.
2026-04-20 20:43:27 -05:00
michael-corey
b3d7a9d440 adding index field 2026-04-20 10:18:09 -05:00
michael-corey
0d955eeab1 adding partition flag 2026-04-20 09:56:00 -05:00
michael-corey
e39eb47a90 altering so that commits happen per batch 2026-04-20 08:38:38 -05:00
michael-corey
2d95711d9d Updating python reference 2026-04-18 13:43:29 -05:00
michael-corey
1bbe0d4cd6 removing latin encoding, adding usage notes 2026-04-18 13:06:01 -05:00
michael-corey
3b913b2ca6 adding user prompt for db creds 2026-04-18 12:37:22 -05:00
David Peterson
5b48872dd7 Add generate_sample_folder.py and load_folder.py for clustered SAS file generation and loading
Introduce generate_sample_folder.py to create a test folder with clustered SAS XPORT files, including configurations for schema compatibility checks. Implement load_folder.py to facilitate loading entire directories of SAS files into Postgres, supporting explicit and auto-detect clustering. Update sample_folder_config.yaml for usage examples and configuration structure. Enhance load_sas.py with a public schema compatibility check function for orchestrators.
2026-04-18 11:25:04 -05:00
David Peterson
5645ff5597 Update load_sas.py to support streaming data loads with iter_sas_chunks and copy_dataframes. Enhance documentation for schema inference and type detection, clarifying the use of read_sas_preview and the implications of sampling. Add __pycache__ to .gitignore. 2026-04-18 10:44:32 -05:00
David Peterson
3a0537270c Implement type inference sampling in load_sas.py to improve performance on large SAS files. Introduce TYPE_INFERENCE_SAMPLE_ROWS to limit the number of rows scanned for type detection while ensuring nullability checks cover the entire column. Update documentation to reflect these changes. 2026-04-18 10:28:37 -05:00
David Peterson
4f7ded09c6 Enhance load_sas.py with detailed usage instructions, YAML config structure, and command-line interface documentation for loading SAS files. 2026-04-18 10:20:07 -05:00
michael-corey
f681f1012a Adding generic loader 2026-04-18 09:34:48 -05:00