foxtrot

Author	SHA1	Message	Date
David Peterson	212218fb67	Enhance error handling and abort functionality in load_folder.py for parallel file loading. Implemented an `--abort-on-first-failure` option in the `_load_remaining_files_parallel` function, allowing users to cancel all pending tasks immediately upon the first worker failure. This change improves user experience by providing real-time feedback on errors through stderr, ensuring that users are promptly informed of issues without waiting for all tasks to complete. Additionally, refined error reporting to maintain accurate summaries of successes and failures, even during interruptions.	2026-04-21 12:54:05 -05:00
David Peterson	ae65140390	Add column type overrides in load_folder.py and load_sas.py for enhanced schema control. Implemented a new feature allowing users to specify explicit column type mappings via a `column_types` configuration in both `load_folder.py` and `load_sas.py`. This addition enables users to bypass automatic type inference for specific columns, ensuring correct data types are used when loading datasets. Updated the YAML configuration files to include examples of the new `column_types` option, enhancing usability and flexibility in handling varying data formats across files.	2026-04-21 12:14:44 -05:00
David Peterson	0c5e6e31f0	Enhance memory management in load_folder.py and load_sas.py for improved performance. Added memory management optimizations in the _worker_load_append_file function to release unused memory from pyarrow's pool and trigger Python's garbage collection. Implemented explicit memory trimming using glibc's malloc_trim to ensure efficient memory usage during long-running processes. Updated the copy_dataframes function in load_sas.py to release pyarrow's memory pool between chunks, preventing high memory usage in long-lived workers. These changes aim to reduce memory footprint and improve overall performance during large dataset processing.	2026-04-21 10:46:54 -05:00
David Peterson	9afb52aecb	Add --chunk-rows option to load_folder.py for customizable memory management. Introduced a new command-line argument, --chunk-rows, allowing users to specify the number of rows per chunk for pyreadstat streaming and COPY operations. This option overrides the GENERIC_LOADER_CHUNK_ROWS environment variable and auto-scaling behavior when using multiple workers. Enhanced memory management by providing detailed information on peak memory usage based on the specified chunk size, improving performance and usability during large dataset processing.	2026-04-21 10:05:21 -05:00
David Peterson	eac75cbb26	Refactor load_cluster function in load_folder.py for improved parallel file loading. Updated the load_cluster function to enhance parallel processing by committing the table creation before dispatching all files to worker processes. This change allows for more efficient handling of large datasets by reducing the serial workload and ensuring schema compatibility checks can access the committed table. The logic for streaming files has been clarified, maintaining progress tracking throughout the loading process.	2026-04-21 08:31:48 -05:00
David Peterson	2dd247b067	Add --no-prescan option to load_folder.py for skipping metadata scan. Introduced a new command-line argument, --no-prescan, allowing users to bypass the per-file metadata scan during the loading process. This enhancement is particularly useful for large folders where the pre-scan may be time-consuming. The progress bar will still display rows loaded, rate, and elapsed time, but without an estimated time of arrival (ETA) for completion. Updated the main function to handle this new option and adjusted the progress tracking accordingly.	2026-04-21 08:12:39 -05:00
David Peterson	052fb0e087	Refactor pre-scan process in load_folder.py to utilize ThreadPoolExecutor for improved performance. Updated the main function to replace sequential file processing with a threaded approach using ThreadPoolExecutor. This change enhances the efficiency of reading row counts from SAS files, particularly for large datasets, by allowing concurrent I/O operations. Added progress tracking with tqdm for better user feedback during the pre-scan phase.	2026-04-20 22:43:02 -05:00
David Peterson	fe7dc4d5a1	Enhance load_cluster function for parallel processing and progress tracking. Refactored the load_cluster function in load_folder.py to support parallel file loading using ProcessPoolExecutor, improving performance during the append phase. Added workers parameter for controlling parallelism and integrated a progress_queue for real-time progress updates. Introduced read_sas_metadata function in load_sas.py to efficiently read metadata from SAS files, optimizing the pre-scan process for global progress tracking.	2026-04-20 22:02:55 -05:00
David Peterson	96f2d6fe79	Update requirements and enhance SAS file processing with progress tracking. Updated the pyarrow version in requirements.txt to improve compatibility. Enhanced the _infer_cluster_schema and _stream_file functions in load_folder.py and load_sas.py to return total row counts for better progress tracking during data streaming. Integrated tqdm for visual feedback on row processing, improving user experience during large data loads.	2026-04-20 21:44:49 -05:00
David Peterson	b78f6d648f	Enhance file clustering by implementing numeric sorting for last digit groups in stems and updating documentation for embedded-digit handling in auto-detection.	2026-04-20 11:48:22 -05:00
michael-corey	b3d7a9d440	adding index field	2026-04-20 10:18:09 -05:00
michael-corey	0d955eeab1	adding partition flag	2026-04-20 09:56:00 -05:00
michael-corey	e39eb47a90	altering such that commit is by batch	2026-04-20 08:38:38 -05:00
michael-corey	1bbe0d4cd6	removing latin encoding, adding usage notes	2026-04-18 13:06:01 -05:00
michael-corey	3b913b2ca6	adding user prompt for db creds	2026-04-18 12:37:22 -05:00
David Peterson	5b48872dd7	Add generate_sample_folder.py and load_folder.py for clustered SAS file generation and loading. Introduce generate_sample_folder.py to create a test folder with clustered SAS XPORT files, including configurations for schema compatibility checks. Implement load_folder.py to facilitate loading entire directories of SAS files into Postgres, supporting explicit and auto-detect clustering. Update sample_folder_config.yaml for usage examples and configuration structure. Enhance load_sas.py with a public schema compatibility check function for orchestrators.	2026-04-18 11:25:04 -05:00

16 Commits
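
The 212218fb67 entry describes cancelling pending tasks on the first worker failure while still reporting accurate success and failure counts. The repository code is not shown here, so the following is only a rough sketch of that pattern using the standard `concurrent.futures` API. The names `load_all` and `flaky_load` are hypothetical, and a `ThreadPoolExecutor` stands in for the `ProcessPoolExecutor` mentioned in fe7dc4d5a1 so the example stays self-contained; the `Future` methods used are the same for both executors.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def flaky_load(path):
    """Hypothetical stand-in for a per-file loader; fails on names containing 'bad'."""
    if "bad" in path:
        raise ValueError(f"cannot load {path}")
    time.sleep(0.1)  # simulate work so later tasks are still queued
    return path


def load_all(paths, load_one, workers=1, abort_on_first_failure=False):
    """Run load_one over paths; return (succeeded, failed, cancelled) path lists."""
    succeeded, failed, cancelled = [], [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(load_one, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            if fut.cancelled():
                # Check this first: exception() raises on a cancelled future.
                cancelled.append(path)
                continue
            exc = fut.exception()
            if exc is None:
                succeeded.append(path)
                continue
            failed.append(path)
            # Report the error right away instead of waiting for the summary.
            print(f"FAILED {path}: {exc}", file=sys.stderr)
            if abort_on_first_failure:
                # cancel() only affects tasks that have not started;
                # running or finished tasks are left untouched.
                for other in futures:
                    other.cancel()
    return succeeded, failed, cancelled
```

Tasks that are already running when the failure is seen cannot be cancelled this way, so they still finish and are counted in the summary; only queued tasks end up in the cancelled list.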