Commit Graph

36 Commits

Author SHA1 Message Date
David Peterson
eac75cbb26 Refactor load_cluster function in load_folder.py for improved parallel file loading
Updated the load_cluster function to enhance parallel processing by committing the table creation before dispatching all files to worker processes. This change allows for more efficient handling of large datasets by reducing the serial workload and ensuring schema compatibility checks can access the committed table. The logic for streaming files has been clarified, maintaining progress tracking throughout the loading process.
2026-04-21 08:31:48 -05:00
David Peterson
1265489276 Enhance date and timestamp handling in _prepare_for_copy function in load_sas.py
Added support for numeric date and datetime conversions from SAS formats. Implemented logic to handle float64 representations of dates (days since 1960-01-01) and datetimes (seconds since 1960-01-01), ensuring proper parsing and preventing errors during data copying to Postgres. This enhancement improves compatibility with various SAS date formats.
2026-04-21 08:16:17 -05:00
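The numeric conversion described in this commit can be illustrated with a minimal standard-library sketch. It assumes only what the message states (dates as days since 1960-01-01, datetimes as seconds since 1960-01-01). The function names are hypothetical and are not taken from `load_sas.py`, which may well use vectorized operations instead.

```python
import math
from datetime import date, datetime, timedelta

# SAS counts dates and datetimes from 1960-01-01.
SAS_EPOCH_DATE = date(1960, 1, 1)
SAS_EPOCH_DATETIME = datetime(1960, 1, 1)


def sas_days_to_date(value):
    """Convert a float day count since 1960-01-01 to a date (None if missing)."""
    if value is None or math.isnan(value):
        return None
    return SAS_EPOCH_DATE + timedelta(days=math.floor(value))


def sas_seconds_to_datetime(value):
    """Convert a float second count since 1960-01-01 to a datetime (None if missing)."""
    if value is None or math.isnan(value):
        return None
    return SAS_EPOCH_DATETIME + timedelta(seconds=float(value))
```

Missing values (NaN in a float64 column) map to `None` here so they can be written as NULL rather than raising during the copy.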
David Peterson
2dd247b067 Add --no-prescan option to load_folder.py for skipping metadata scan
Introduced a new command-line argument, --no-prescan, allowing users to bypass the per-file metadata scan during the loading process. This enhancement is particularly useful for large folders where the pre-scan may be time-consuming. The progress bar will still display rows loaded, rate, and elapsed time, but without an estimated time of arrival (ETA) for completion. Updated the main function to handle this new option and adjusted the progress tracking accordingly.
2026-04-21 08:12:39 -05:00
David Peterson
052fb0e087 Refactor pre-scan process in load_folder.py to utilize ThreadPoolExecutor for improved performance
Updated the main function to replace sequential file processing with a threaded approach using ThreadPoolExecutor. This change enhances the efficiency of reading row counts from SAS files, particularly for large datasets, by allowing concurrent I/O operations. Added progress tracking with tqdm for better user feedback during the pre-scan phase.
2026-04-20 22:43:02 -05:00
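A threaded pre-scan of the kind this commit describes might look roughly like the sketch below. The function `prescan_row_counts` and its `read_row_count` parameter are hypothetical: the caller supplies whatever callable reads a row count from one file's metadata, and the tqdm progress bar mentioned in the message is omitted to keep the example dependency-free.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def prescan_row_counts(paths, read_row_count, max_workers=8):
    """Return {path: row_count}, reading the counts concurrently.

    read_row_count is a caller-supplied callable (an assumption of this
    sketch) that takes one path and returns its row count.
    """
    counts = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every file up front, then collect results as they finish.
        futures = {pool.submit(read_row_count, p): p for p in paths}
        for future in as_completed(futures):
            counts[futures[future]] = future.result()
    return counts
```

Threads suit this step because reading file metadata is I/O-bound; summing the returned counts gives the total needed for a global progress estimate.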
David Peterson
fe7dc4d5a1 Enhance load_cluster function for parallel processing and progress tracking
Refactored the load_cluster function in load_folder.py to support parallel file loading using ProcessPoolExecutor, improving performance during the append phase. Added workers parameter for controlling parallelism and integrated a progress_queue for real-time progress updates. Introduced read_sas_metadata function in load_sas.py to efficiently read metadata from SAS files, optimizing the pre-scan process for global progress tracking.
2026-04-20 22:02:55 -05:00
David Peterson
96f2d6fe79 Update requirements and enhance SAS file processing with progress tracking
Updated the pyarrow version in requirements.txt to improve compatibility. Enhanced the _infer_cluster_schema and _stream_file functions in load_folder.py and load_sas.py to return total row counts for better progress tracking during data streaming. Integrated tqdm for visual feedback on row processing, improving user experience during large data loads.
2026-04-20 21:44:49 -05:00
David Peterson
7beb44ac4d Add pyarrow dependency and optimize DataFrame serialization in load_sas.py
Included pyarrow as a new dependency in requirements.txt for improved CSV serialization performance. Refactored the _prepare_for_copy function to utilize vectorized operations for date and timestamp conversions, reducing CPU overhead. Introduced a new _serialize_chunk_csv function leveraging pyarrow for faster CSV writing, enhancing efficiency during data copying to Postgres.
2026-04-20 21:32:56 -05:00
David Peterson
5e347f50ef Add widening compatibility checks in load_sas.py for type inference
Introduced a new set of widening compatible type pairs to allow for accepting narrower inferred types when they fit within wider target types during schema compatibility checks. This change enhances the type inference process by preventing unnecessary mismatches and improving handling of varying integer ranges in cluster loads. Updated warning messages to inform users of accepted type adjustments.
2026-04-20 21:08:13 -05:00
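As an illustration of the widening idea, the sketch below accepts a narrower inferred type when the target column is wider. The specific pairs, the Postgres type names, and the function name are assumptions for the example; the commit message does not list the actual pairs used in `load_sas.py`.

```python
# Hypothetical (inferred, target) pairs where the inferred type fits in the target.
WIDENING_COMPATIBLE = {
    ("SMALLINT", "INTEGER"),
    ("SMALLINT", "BIGINT"),
    ("INTEGER", "BIGINT"),
}


def check_type_compatible(inferred, target):
    """Return (ok, warning). warning is set only when a widening pair is accepted."""
    if inferred == target:
        return True, None
    if (inferred, target) in WIDENING_COMPATIBLE:
        return True, f"accepting inferred {inferred} for wider target {target}"
    return False, None
```

The check is deliberately one-directional: a `BIGINT` file column is still rejected against an `INTEGER` target, because values could overflow.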
David Peterson
f84e127796 Update type inference behavior in load_sas.py to scan entire files by default
Changed the default setting for TYPE_INFERENCE_SAMPLE_ROWS to None, allowing type and nullability inference to consider all rows in a SAS file. This adjustment ensures accurate handling of null values and integer ranges, addressing issues observed in production with large datasets. Updated documentation to reflect the implications of this change and the risks associated with using an integer cap for sampling.
2026-04-20 20:43:27 -05:00
David Peterson
a94ab68f4d Refine partition name patterns in sas_profiler.py
Updated the regular expression for partition name patterns to improve matching accuracy for state-related columns. The new pattern captures variations like `state`, `state_code`, and `statecode` while avoiding false positives from unrelated terms. This change enhances the precision of partition candidate selection.
2026-04-20 19:27:01 -05:00
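The commit does not show the regular expression itself. One pattern with the described behavior, offered purely as an assumption, anchors `state` at the start of the name or after an underscore and allows an optional `code` suffix:

```python
import re

# Hypothetical pattern, not the one in sas_profiler.py.
STATE_COLUMN_RE = re.compile(r"(?:^|_)state(?:_?code)?$", re.IGNORECASE)


def is_state_column(name):
    """True if the column name looks like a state / state_code / statecode column."""
    return STATE_COLUMN_RE.search(name) is not None
```

With this pattern `state`, `state_code`, and `statecode` match, while names such as `statement`, `estate`, and `status` do not. Whether prefixed names like `res_state` should match is a design choice this sketch makes on its own.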
David Peterson
4fc85081c8 Enhance SAS profiling performance in sas_profiler.py
Added a new constant for profiling chunk size to optimize memory usage during profiling operations. Refactored the update method in the _ColumnStats class to improve efficiency in handling missing values and calculating statistics for numeric and string data types. This update includes vectorized operations for better performance and clarity in the implementation.
2026-04-20 19:03:40 -05:00
David Peterson
5449a25b44 Refactor partition candidate logic in sas_profiler.py
Updated the partition candidate selection process to restrict candidates to columns matching specific name patterns, improving accuracy and reducing noise. Removed outdated distinct value constraints and clarified documentation for partitioning behavior. Enhanced handling of pre-sharded columns and refined the classification logic for better performance.
2026-04-20 18:49:23 -05:00
David Peterson
b3b968edf2 Add openpyxl dependency to requirements.txt for Excel file handling 2026-04-20 18:38:24 -05:00
David Peterson
f1af1136dc Add standalone SAS profiling utility
Introduced a new script `sas_profiler.py` that profiles local SAS files and generates an Excel report with recommendations for drops, partitions, and indexes, along with type-inference warnings. The utility supports command-line overrides for configuration and is compatible with Python 3.10+. This addition enhances the existing tools for SAS file management.
2026-04-20 18:38:01 -05:00
michael-corey
e48038f3c6 updating for sas 2026-04-20 16:30:35 -05:00
michael-corey
2390ce1e0c adding explorer 2026-04-20 16:27:54 -05:00
David Peterson
384103f489 Update pyreadstat version constraint in requirements.txt to allow for version 2.0 2026-04-20 14:10:08 -05:00
David Peterson
03b97999dc Add S3 download utility and example configuration
Introduced a new script `s3_download.py` for downloading files from S3 based on a YAML configuration. The script supports recursive listing, file clustering, and customizable download behavior. Also added a sample configuration file `sample_s3_download_config.yaml` to demonstrate usage.
2026-04-20 13:14:42 -05:00
David Peterson
b78f6d648f Enhance file clustering by implementing numeric sorting for last digit groups in stems and updating documentation for embedded-digit handling in auto-detection. 2026-04-20 11:48:22 -05:00
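Numeric sorting on the last digit group of a file stem can be sketched as follows. The helper names and sample file names are hypothetical, and treating stems with no digits as sorting first is an assumption of this example rather than documented behavior.

```python
import re
from pathlib import Path


def last_digit_group_key(path):
    """Sort key: the last run of digits in the stem as an int, then the stem itself."""
    stem = Path(path).stem
    groups = re.findall(r"\d+", stem)
    number = int(groups[-1]) if groups else -1  # stems without digits sort first
    return (number, stem)


def sort_cluster_files(paths):
    """Order files in a cluster numerically rather than lexicographically."""
    return sorted(paths, key=last_digit_group_key)
```

A plain string sort would place `claims_10` before `claims_2`; comparing the trailing digit group as an integer keeps the files in their natural order even when earlier digits are embedded in the stem.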
michael-corey
b3d7a9d440 adding index field 2026-04-20 10:18:09 -05:00
michael-corey
0d955eeab1 adding partition flag 2026-04-20 09:56:00 -05:00
michael-corey
e39eb47a90 altering such that commit is by batch 2026-04-20 08:38:38 -05:00
michael-corey
508cc974ea adding local check 2026-04-20 08:25:27 -05:00
michael-corey
2d95711d9d Updating python reference 2026-04-18 13:43:29 -05:00
michael-corey
f1e99d887d altering invalid arguments 2026-04-18 13:41:54 -05:00
michael-corey
f101eacffd Merging main 2026-04-18 13:39:37 -05:00
michael-corey
edb9146682 moving files 2026-04-18 13:35:32 -05:00
michael-corey
1bbe0d4cd6 removing latin encoding, adding usage notes 2026-04-18 13:06:01 -05:00
David Peterson
c1e1fec10b Update requirements.txt to support new package versions and add boto3 dependency 2026-04-18 12:41:02 -05:00
michael-corey
3b913b2ca6 adding user prompt for db creds 2026-04-18 12:37:22 -05:00
David Peterson
5b48872dd7 Add generate_sample_folder.py and load_folder.py for clustered SAS file generation and loading
Introduce generate_sample_folder.py to create a test folder with clustered SAS XPORT files, including configurations for schema compatibility checks. Implement load_folder.py to facilitate loading entire directories of SAS files into Postgres, supporting explicit and auto-detect clustering. Update sample_folder_config.yaml for usage examples and configuration structure. Enhance load_sas.py with a public schema compatibility check function for orchestrators.
2026-04-18 11:25:04 -05:00
michael-corey
6b12ab969b adding file_viewer 2026-04-18 11:19:38 -05:00
David Peterson
5645ff5597 Update load_sas.py to support streaming data loads with iter_sas_chunks and copy_dataframes. Enhance documentation for schema inference and type detection, clarifying the use of read_sas_preview and the implications of sampling. Add __pycache__ to .gitignore. 2026-04-18 10:44:32 -05:00
David Peterson
3a0537270c Implement type inference sampling in load_sas.py to improve performance on large SAS files. Introduce TYPE_INFERENCE_SAMPLE_ROWS to limit the number of rows scanned for type detection while ensuring nullability checks cover the entire column. Update documentation to reflect these changes. 2026-04-18 10:28:37 -05:00
David Peterson
4f7ded09c6 Enhance load_sas.py with detailed usage instructions, YAML config structure, and command-line interface documentation for loading SAS files. 2026-04-18 10:20:07 -05:00
michael-corey
f681f1012a Adding generic loader 2026-04-18 09:34:48 -05:00