foxtrot

Author	SHA1	Message	Date
David Peterson	1265489276	Enhance date and timestamp handling in _prepare_for_copy function in load_sas.py. Added support for numeric date and datetime conversions from SAS formats. Implemented logic to handle float64 representations of dates (days since 1960-01-01) and datetimes (seconds since 1960-01-01), ensuring proper parsing and preventing errors during data copying to Postgres. This enhancement improves compatibility with various SAS date formats.	2026-04-21 08:16:17 -05:00
David Peterson	2dd247b067	Add --no-prescan option to load_folder.py for skipping metadata scan. Introduced a new command-line argument, --no-prescan, allowing users to bypass the per-file metadata scan during the loading process. This enhancement is particularly useful for large folders where the pre-scan may be time-consuming. The progress bar will still display rows loaded, rate, and elapsed time, but without an estimated time to completion (ETA). Updated the main function to handle this new option and adjusted the progress tracking accordingly.	2026-04-21 08:12:39 -05:00
David Peterson	052fb0e087	Refactor pre-scan process in load_folder.py to utilize ThreadPoolExecutor for improved performance. Updated the main function to replace sequential file processing with a threaded approach using ThreadPoolExecutor. This change enhances the efficiency of reading row counts from SAS files, particularly for large datasets, by allowing concurrent I/O operations. Added progress tracking with tqdm for better user feedback during the pre-scan phase.	2026-04-20 22:43:02 -05:00
David Peterson	fe7dc4d5a1	Enhance load_cluster function for parallel processing and progress tracking. Refactored the load_cluster function in load_folder.py to support parallel file loading using ProcessPoolExecutor, improving performance during the append phase. Added workers parameter for controlling parallelism and integrated a progress_queue for real-time progress updates. Introduced read_sas_metadata function in load_sas.py to efficiently read metadata from SAS files, optimizing the pre-scan process for global progress tracking.	2026-04-20 22:02:55 -05:00
David Peterson	96f2d6fe79	Update requirements and enhance SAS file processing with progress tracking. Updated the pyarrow version in requirements.txt to improve compatibility. Enhanced the _infer_cluster_schema and _stream_file functions in load_folder.py and load_sas.py to return total row counts for better progress tracking during data streaming. Integrated tqdm for visual feedback on row processing, improving user experience during large data loads.	2026-04-20 21:44:49 -05:00
David Peterson	7beb44ac4d	Add pyarrow dependency and optimize DataFrame serialization in load_sas.py. Included pyarrow as a new dependency in requirements.txt for improved CSV serialization performance. Refactored the _prepare_for_copy function to utilize vectorized operations for date and timestamp conversions, reducing CPU overhead. Introduced a new _serialize_chunk_csv function leveraging pyarrow for faster CSV writing, enhancing efficiency during data copying to Postgres.	2026-04-20 21:32:56 -05:00
David Peterson	5e347f50ef	Add widening compatibility checks in load_sas.py for type inference. Introduced a new set of widening compatible type pairs to allow for accepting narrower inferred types when they fit within wider target types during schema compatibility checks. This change enhances the type inference process by preventing unnecessary mismatches and improving handling of varying integer ranges in cluster loads. Updated warning messages to inform users of accepted type adjustments.	2026-04-20 21:08:13 -05:00
David Peterson	f84e127796	Update type inference behavior in load_sas.py to scan entire files by default. Changed the default setting for TYPE_INFERENCE_SAMPLE_ROWS to None, allowing type and nullability inference to consider all rows in a SAS file. This adjustment ensures accurate handling of null values and integer ranges, addressing issues observed in production with large datasets. Updated documentation to reflect the implications of this change and the risks associated with using an integer cap for sampling.	2026-04-20 20:43:27 -05:00
michael-corey	2390ce1e0c	adding explorer	2026-04-20 16:27:54 -05:00
David Peterson	b78f6d648f	Enhance file clustering by implementing numeric sorting for last digit groups in stems and updating documentation for embedded-digit handling in auto-detection.	2026-04-20 11:48:22 -05:00
michael-corey	b3d7a9d440	adding index field	2026-04-20 10:18:09 -05:00
michael-corey	0d955eeab1	adding partition flag	2026-04-20 09:56:00 -05:00
michael-corey	e39eb47a90	altering such that commit is by batch	2026-04-20 08:38:38 -05:00
michael-corey	2d95711d9d	Updating python reference	2026-04-18 13:43:29 -05:00
michael-corey	f101eacffd	Merging main	2026-04-18 13:39:37 -05:00
michael-corey	edb9146682	moving files	2026-04-18 13:35:32 -05:00
michael-corey	1bbe0d4cd6	removing latin encoding, adding usage notes	2026-04-18 13:06:01 -05:00
David Peterson	c1e1fec10b	Update requirements.txt to support new package versions and add boto3 dependency	2026-04-18 12:41:02 -05:00
michael-corey	3b913b2ca6	adding user prompt for db creds	2026-04-18 12:37:22 -05:00
David Peterson	5b48872dd7	Add generate_sample_folder.py and load_folder.py for clustered SAS file generation and loading. Introduce generate_sample_folder.py to create a test folder with clustered SAS XPORT files, including configurations for schema compatibility checks. Implement load_folder.py to facilitate loading entire directories of SAS files into Postgres, supporting explicit and auto-detect clustering. Update sample_folder_config.yaml for usage examples and configuration structure. Enhance load_sas.py with a public schema compatibility check function for orchestrators.	2026-04-18 11:25:04 -05:00
michael-corey	6b12ab969b	adding file_viewer	2026-04-18 11:19:38 -05:00
David Peterson	5645ff5597	Update load_sas.py to support streaming data loads with iter_sas_chunks and copy_dataframes. Enhance documentation for schema inference and type detection, clarifying the use of read_sas_preview and the implications of sampling. Add __pycache__ to .gitignore.	2026-04-18 10:44:32 -05:00
David Peterson	3a0537270c	Implement type inference sampling in load_sas.py to improve performance on large SAS files. Introduce TYPE_INFERENCE_SAMPLE_ROWS to limit the number of rows scanned for type detection while ensuring nullability checks cover the entire column. Update documentation to reflect these changes.	2026-04-18 10:28:37 -05:00
David Peterson	4f7ded09c6	Enhance load_sas.py with detailed usage instructions, YAML config structure, and command-line interface documentation for loading SAS files.	2026-04-18 10:20:07 -05:00
michael-corey	f681f1012a	Adding generic loader	2026-04-18 09:34:48 -05:00

25 Commits
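
Note on commit 1265489276: the message describes SAS numeric dates as float64 days since 1960-01-01 and SAS datetimes as float64 seconds since 1960-01-01. The sketch below illustrates that arithmetic using only the Python standard library. It is not the repository's _prepare_for_copy implementation; the function names sas_date_to_date and sas_datetime_to_datetime are hypothetical, and treating NaN as a missing value is an assumption.

```python
import math
from datetime import date, datetime, timedelta

# SAS counts dates and datetimes from this epoch.
SAS_EPOCH = datetime(1960, 1, 1)


def sas_date_to_date(days):
    """Convert a SAS date (days since 1960-01-01) to a date.

    Hypothetical helper: returns None for None or NaN, which is assumed
    here to represent a missing value.
    """
    if days is None or math.isnan(days):
        return None
    return (SAS_EPOCH + timedelta(days=days)).date()


def sas_datetime_to_datetime(seconds):
    """Convert a SAS datetime (seconds since 1960-01-01) to a datetime.

    Hypothetical helper: returns None for None or NaN.
    """
    if seconds is None or math.isnan(seconds):
        return None
    return SAS_EPOCH + timedelta(seconds=seconds)
```

For example, sas_date_to_date(0.0) gives 1960-01-01 and sas_datetime_to_datetime(86400.0) gives 1960-01-02 00:00:00. Negative values map to dates before the epoch.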