advanced_analyzer #8

dp · 2026-04-21T22:23:01Z

dp commented

2026-04-21 22:23:01 +00:00

Adds a new analyzer and substantially improves load_sas and load_folder.

dp added 23 commits 2026-04-21 22:23:02 +00:00

Add standalone SAS profiling utility f1af1136dc

Introduced a new script `sas_profiler.py` that profiles local SAS files and generates an Excel report with recommendations for drops, partitions, and indexes, along with type-inference warnings. The utility supports command-line overrides for configuration and is compatible with Python 3.10+. This addition enhances the existing tools for SAS file management.

Add openpyxl dependency to requirements.txt for Excel file handling b3b968edf2

Refactor partition candidate logic in sas_profiler.py 5449a25b44

Updated the partition candidate selection process to restrict candidates to columns matching specific name patterns, improving accuracy and reducing noise. Removed outdated distinct value constraints and clarified documentation for partitioning behavior. Enhanced handling of pre-sharded columns and refined the classification logic for better performance.

Enhance SAS profiling performance in sas_profiler.py 4fc85081c8

Added a new constant for profiling chunk size to optimize memory usage during profiling operations. Refactored the update method in the _ColumnStats class to improve efficiency in handling missing values and calculating statistics for numeric and string data types. This update includes vectorized operations for better performance and clarity in the implementation.

Refine partition name patterns in sas_profiler.py a94ab68f4d

Updated the regular expression for partition name patterns to improve matching accuracy for state-related columns. The new pattern captures variations like `state`, `state_code`, and `statecode` while avoiding false positives from unrelated terms. This change enhances the precision of partition candidate selection.
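
A minimal sketch of the kind of name pattern this commit describes. The regular expression below is an assumption written for illustration, not the one in `sas_profiler.py`; `STATE_NAME_PATTERN` and `is_state_partition_candidate` are hypothetical names.

```python
import re

# Hypothetical pattern: "state" as a whole underscore-delimited token,
# optionally followed by "code" with or without an underscore.
STATE_NAME_PATTERN = re.compile(r"(?:^|_)state(?:_?code)?(?:$|_)", re.IGNORECASE)


def is_state_partition_candidate(column_name: str) -> bool:
    """Return True when the column name looks like a state column."""
    return STATE_NAME_PATTERN.search(column_name) is not None


for name in ("state", "state_code", "statecode", "statement", "real_estate"):
    print(name, is_state_partition_candidate(name))
```

Anchoring on token boundaries is what keeps names such as `statement` and `real_estate` out of the candidate list.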

Update type inference behavior in load_sas.py to scan entire files by default f84e127796

Changed the default setting for TYPE_INFERENCE_SAMPLE_ROWS to None, allowing type and nullability inference to consider all rows in a SAS file. This adjustment ensures accurate handling of null values and integer ranges, addressing issues observed in production with large datasets. Updated documentation to reflect the implications of this change and the risks associated with using an integer cap for sampling.
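
The risk of an integer cap can be illustrated with a small stand-alone sketch. The function `infer_integer_column` and its `sample_rows` parameter are hypothetical stand-ins for the behavior controlled by `TYPE_INFERENCE_SAMPLE_ROWS`; the actual inference code in `load_sas.py` is not shown in this PR description.

```python
from itertools import islice
from typing import Iterable, Optional, Tuple


def infer_integer_column(
    values: Iterable[Optional[int]], sample_rows: Optional[int] = None
) -> Tuple[str, bool]:
    """Return (postgres_type, nullable) from the first sample_rows values, or all rows if None."""
    scanned = values if sample_rows is None else islice(values, sample_rows)
    lo = hi = 0
    nullable = False
    for value in scanned:
        if value is None:
            nullable = True
            continue
        lo, hi = min(lo, value), max(hi, value)
    if -2**15 <= lo and hi < 2**15:
        return "SMALLINT", nullable
    if -2**31 <= lo and hi < 2**31:
        return "INTEGER", nullable
    return "BIGINT", nullable


# A null and a large value appear only after row 1000.
column = [1] * 1000 + [None, 5_000_000_000]
print(infer_integer_column(column, sample_rows=100))  # ('SMALLINT', False)
print(infer_integer_column(column))                   # ('BIGINT', True)
```

With the cap, the column is declared narrow and non-nullable, and the later rows would then fail on load. Scanning every row avoids that at the cost of a longer inference pass.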

Add widening compatibility checks in load_sas.py for type inference 5e347f50ef

Introduced a new set of widening compatible type pairs to allow for accepting narrower inferred types when they fit within wider target types during schema compatibility checks. This change enhances the type inference process by preventing unnecessary mismatches and improving handling of varying integer ranges in cluster loads. Updated warning messages to inform users of accepted type adjustments.
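
As a rough sketch of how such a check might look: the pairs listed in `WIDENING_COMPATIBLE` and the `check_column` helper are assumptions for illustration, and the set actually used in `load_sas.py` may differ.

```python
from typing import Optional, Tuple

# Hypothetical (inferred, target) pairs where every inferred value fits the target.
WIDENING_COMPATIBLE = {
    ("SMALLINT", "INTEGER"),
    ("SMALLINT", "BIGINT"),
    ("INTEGER", "BIGINT"),
    ("SMALLINT", "DOUBLE PRECISION"),
    ("INTEGER", "DOUBLE PRECISION"),
}


def check_column(name: str, inferred: str, target: str) -> Tuple[bool, Optional[str]]:
    """Return (compatible, message); message is a warning or an error description."""
    if inferred == target:
        return True, None
    if (inferred, target) in WIDENING_COMPATIBLE:
        return True, f"column {name!r}: accepting inferred {inferred} into wider target {target}"
    return False, f"column {name!r}: inferred {inferred} is incompatible with target {target}"


print(check_column("zip", "INTEGER", "BIGINT"))
print(check_column("zip", "BIGINT", "INTEGER"))
```

The pairs are directional: a file whose values happen to fit in `INTEGER` can be appended to a `BIGINT` column, but not the reverse.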

Add pyarrow dependency and optimize DataFrame serialization in load_sas.py 7beb44ac4d

Included pyarrow as a new dependency in requirements.txt for improved CSV serialization performance. Refactored the _prepare_for_copy function to utilize vectorized operations for date and timestamp conversions, reducing CPU overhead. Introduced a new _serialize_chunk_csv function leveraging pyarrow for faster CSV writing, enhancing efficiency during data copying to Postgres.

Update requirements and enhance SAS file processing with progress tracking 96f2d6fe79

Updated the pyarrow version in requirements.txt to improve compatibility. Enhanced the _infer_cluster_schema and _stream_file functions in load_folder.py and load_sas.py to return total row counts for better progress tracking during data streaming. Integrated tqdm for visual feedback on row processing, improving user experience during large data loads.

Enhance load_cluster function for parallel processing and progress tracking fe7dc4d5a1

Refactored the load_cluster function in load_folder.py to support parallel file loading using ProcessPoolExecutor, improving performance during the append phase. Added workers parameter for controlling parallelism and integrated a progress_queue for real-time progress updates. Introduced read_sas_metadata function in load_sas.py to efficiently read metadata from SAS files, optimizing the pre-scan process for global progress tracking.

Refactor pre-scan process in load_folder.py to utilize ThreadPoolExecutor for improved performance 052fb0e087

Updated the main function to replace sequential file processing with a threaded approach using ThreadPoolExecutor. This change enhances the efficiency of reading row counts from SAS files, particularly for large datasets, by allowing concurrent I/O operations. Added progress tracking with tqdm for better user feedback during the pre-scan phase.
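
The threaded pre-scan could be sketched roughly as follows. `prescan_row_counts` and `read_row_count` are hypothetical names; in the PR the per-file read is done by `read_sas_metadata`, whose signature is not shown here, so the reader is passed in as a callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def prescan_row_counts(
    paths: Iterable[str],
    read_row_count: Callable[[str], int],
    max_threads: int = 8,
) -> Dict[str, int]:
    """Read one row count per file concurrently and return {path: rows}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so counts line up with paths.
        counts = list(pool.map(read_row_count, paths))
    return dict(zip(paths, counts))
```

Threads fit this phase because reading file metadata is I/O-bound. The sum of the returned counts gives the total that a global progress bar needs.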

Add --no-prescan option to load_folder.py for skipping metadata scan 2dd247b067

Introduced a new command-line argument, --no-prescan, allowing users to bypass the per-file metadata scan during the loading process. This enhancement is particularly useful for large folders where the pre-scan may be time-consuming. The progress bar will still display rows loaded, rate, and elapsed time, but no estimated time of completion (ETA), because the total row count is unknown without the scan. Updated the main function to handle this new option and adjusted the progress tracking accordingly.

Enhance date and timestamp handling in _prepare_for_copy function in load_sas.py 1265489276

Added support for numeric date and datetime conversions from SAS formats. Implemented logic to handle float64 representations of dates (days since 1960-01-01) and datetimes (seconds since 1960-01-01), ensuring proper parsing and preventing errors during data copying to Postgres. This enhancement improves compatibility with various SAS date formats.
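
The epoch arithmetic described above can be shown with the standard library alone. This is a scalar sketch with hypothetical helper names; the commit itself applies the conversion to whole columns inside `_prepare_for_copy`.

```python
from datetime import date, datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)


def sas_days_to_date(days: float) -> date:
    """SAS dates count days since 1960-01-01."""
    return (SAS_EPOCH + timedelta(days=days)).date()


def sas_seconds_to_datetime(seconds: float) -> datetime:
    """SAS datetimes count seconds since 1960-01-01 00:00:00."""
    return SAS_EPOCH + timedelta(seconds=seconds)


print(sas_days_to_date(3653.0))          # 1970-01-01
print(sas_seconds_to_datetime(86400.0))  # 1960-01-02 00:00:00
```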

Refactor load_cluster function in load_folder.py for improved parallel file loading eac75cbb26

Updated the load_cluster function to enhance parallel processing by committing the table creation before dispatching all files to worker processes. This change allows for more efficient handling of large datasets by reducing the serial workload and ensuring schema compatibility checks can access the committed table. The logic for streaming files has been clarified, maintaining progress tracking throughout the loading process.

Add --chunk-rows option to load_folder.py for customizable memory management 9afb52aecb

Introduced a new command-line argument, --chunk-rows, allowing users to specify the number of rows per chunk for pyreadstat streaming and COPY operations. This option overrides the GENERIC_LOADER_CHUNK_ROWS environment variable and auto-scaling behavior when using multiple workers. Enhanced memory management by providing detailed information on peak memory usage based on the specified chunk size, improving performance and usability during large dataset processing.
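
One way the precedence might be resolved is sketched below. Only the `--chunk-rows` flag and the `GENERIC_LOADER_CHUNK_ROWS` variable come from the commit; the default of 100,000 rows, the divide-by-workers scaling rule, and the ordering of the environment variable ahead of auto-scaling are assumptions for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_CHUNK_ROWS = 100_000  # hypothetical default


def resolve_chunk_rows(
    cli_chunk_rows: Optional[int],
    workers: int,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Pick rows per chunk: CLI flag first, then env var, then auto-scaling."""
    if cli_chunk_rows is not None:
        return cli_chunk_rows
    env_value = env.get("GENERIC_LOADER_CHUNK_ROWS")
    if env_value:
        return int(env_value)
    # Assumed auto-scaling: shrink chunks as workers grow to bound peak memory.
    return max(1, DEFAULT_CHUNK_ROWS // max(1, workers))
```

Because each worker holds its own chunk in memory, peak usage scales roughly with chunk size times the number of workers, which is why an explicit override is useful.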

Enhance memory management in load_folder.py and load_sas.py for improved performance 0c5e6e31f0

Added memory management optimizations in the _worker_load_append_file function to release unused memory from pyarrow's pool and trigger Python's garbage collection. Implemented explicit memory trimming using glibc's malloc_trim to ensure efficient memory usage during long-running processes. Updated the copy_dataframes function in load_sas.py to release pyarrow's memory pool between chunks, preventing high memory usage in long-lived workers. These changes aim to reduce memory footprint and improve overall performance during large dataset processing.
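
The three release steps named above can be sketched as one best-effort helper. `release_memory` is a hypothetical name, and the sketch treats pyarrow as optional and guards the glibc call so it degrades quietly on other platforms. The actual code in `_worker_load_append_file` may be organized differently.

```python
import ctypes
import gc


def release_memory() -> bool:
    """Best-effort memory release; returns True if malloc_trim was called."""
    try:
        import pyarrow as pa  # optional in this sketch

        # Hand unused buffers in pyarrow's default pool back to the allocator.
        pa.default_memory_pool().release_unused()
    except ImportError:
        pass
    gc.collect()  # drop unreachable Python objects, including reference cycles
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)  # ask glibc to return free heap pages to the OS
        return True
    except (OSError, AttributeError):
        # Not glibc (for example macOS, Windows, or musl): nothing to trim.
        return False
```

Calling such a helper between files or chunks matters mainly for long-lived workers, where freed memory otherwise tends to stay mapped in the process.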

Add column type overrides in load_folder.py and load_sas.py for enhanced schema control ae65140390

Implemented a new feature allowing users to specify explicit column type mappings via a `column_types` configuration in both `load_folder.py` and `load_sas.py`. This addition enables users to bypass automatic type inference for specific columns, ensuring correct data types are used when loading datasets. Updated the YAML configuration files to include examples of the new `column_types` option, enhancing usability and flexibility in handling varying data formats across files.
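
A minimal sketch of applying such overrides on top of an inferred schema. The function name and the plain name-to-type dictionaries are assumptions; the YAML key `column_types` is the only name taken from the commit.

```python
from typing import Dict


def apply_column_type_overrides(
    inferred: Dict[str, str], column_types: Dict[str, str]
) -> Dict[str, str]:
    """Return the inferred schema with explicit overrides taking precedence.

    Overrides for columns absent from this file are ignored, on the assumption
    that one configuration is shared across files with differing columns.
    """
    return {col: column_types.get(col, pg_type) for col, pg_type in inferred.items()}


inferred = {"zip": "INTEGER", "amount": "DOUBLE PRECISION"}
overrides = {"zip": "TEXT", "not_in_this_file": "DATE"}
print(apply_column_type_overrides(inferred, overrides))
# {'zip': 'TEXT', 'amount': 'DOUBLE PRECISION'}
```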

Enhance error handling and abort functionality in load_folder.py for parallel file loading 212218fb67

Implemented an `--abort-on-first-failure` option in the `_load_remaining_files_parallel` function, allowing users to cancel all pending tasks immediately upon the first worker failure. This change improves user experience by providing real-time feedback on errors through stderr, ensuring that users are promptly informed of issues without waiting for all tasks to complete. Additionally, refined error reporting to maintain accurate summaries of successes and failures, even during interruptions.
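
The cancel-on-first-failure flow might look roughly like this. To keep the sketch self-contained it uses `ThreadPoolExecutor` in place of the `ProcessPoolExecutor` used by `load_cluster`, and `load_files` and `load_one` are hypothetical names rather than the signature of `_load_remaining_files_parallel`.

```python
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Sequence, Tuple


def load_files(
    paths: Sequence[str],
    load_one: Callable[[str], object],
    workers: int = 2,
    abort_on_first_failure: bool = False,
) -> Tuple[List[str], Dict[str, str], List[str]]:
    """Run load_one per path; return (succeeded, failed, cancelled)."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    cancelled: List[str] = []
    aborted = False
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(load_one, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            if fut.cancelled():
                cancelled.append(path)
                continue
            exc = fut.exception()
            if exc is None:
                succeeded.append(path)
                continue
            failed[path] = repr(exc)
            # Report immediately instead of waiting for the final summary.
            print(f"FAILED {path}: {exc!r}", file=sys.stderr)
            if abort_on_first_failure and not aborted:
                aborted = True
                for other in futures:
                    other.cancel()  # no effect on running or finished tasks
    return succeeded, failed, cancelled
```

Tasks already running when the failure arrives cannot be cancelled, so they still finish and are counted. That is why the summary tracks three buckets rather than two.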

Refactor numeric column type inference in load_sas.py for improved data handling 969a442775

Updated the logic for determining column types in the union_column_types function. Changed the default type from BIGINT to DOUBLE PRECISION for numeric columns without explicit format hints, ensuring better handling of both integer and float values. This adjustment prevents loading failures due to format discrepancies and maintains consistent data processing across various SAS formats.

Suppress PerformanceWarning in load_sas.py to reduce noise during processing of wide SAS files a46f0518f6

This change filters out warnings related to DataFrame fragmentation, which are irrelevant for our pipeline as we directly convert DataFrames to pyarrow tables.

Add safe numeric to datetime conversion in load_sas.py to handle edge cases c283b42876

Implemented the _safe_numeric_to_datetime function to convert numeric SAS-epoch series to datetime64[ns] while managing potential overflow and non-finite values. This enhancement improves error handling during data processing by masking invalid entries before conversion, ensuring robust handling of SAS date formats in the _prepare_for_copy function.
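
The masking idea can be sketched without pandas. This is a scalar standard-library stand-in with hypothetical names, not the vectorized `_safe_numeric_to_datetime` from the commit. The bounds assume the target is `datetime64[ns]`, which stores signed 64-bit nanoseconds relative to 1970-01-01.

```python
import math
from datetime import datetime, timedelta
from typing import List, Optional, Sequence

SAS_EPOCH = datetime(1960, 1, 1)
# 1960-01-01 is 3653 days before 1970-01-01; shift the int64-nanosecond
# limits of datetime64[ns] so they are expressed in SAS seconds.
_SAS_TO_UNIX_SECONDS = 3653 * 86400
_MAX_SAS_SECONDS = (2**63 - 1) / 1e9 + _SAS_TO_UNIX_SECONDS
_MIN_SAS_SECONDS = -(2**63 - 1) / 1e9 + _SAS_TO_UNIX_SECONDS


def safe_sas_seconds_to_datetime(
    values: Sequence[Optional[float]],
) -> List[Optional[datetime]]:
    """Convert SAS-epoch seconds, masking values that cannot be represented."""
    out: List[Optional[datetime]] = []
    for v in values:
        if v is None or not math.isfinite(v):
            out.append(None)  # NaN and +/-inf become nulls
        elif not (_MIN_SAS_SECONDS <= v <= _MAX_SAS_SECONDS):
            out.append(None)  # would overflow nanosecond precision
        else:
            out.append(SAS_EPOCH + timedelta(seconds=v))
    return out
```

Masking before the conversion means one bad cell becomes a null rather than an exception that aborts the whole chunk.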

Add all_nullable configuration option in load_folder.py and load_sas.py for flexible schema management eff82c73ce

Introduced an `all_nullable` boolean option in both `load_folder.py` and `load_sas.py`, allowing users to specify whether all columns should be treated as nullable during schema inference. This feature addresses scenarios where the data sampling may incorrectly suggest that columns are non-nullable, preventing potential errors during data loading. Updated YAML configuration files to include examples of this new option, enhancing usability and providing clearer documentation for users.
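
A small sketch of the effect on generated DDL. The `column_ddl` helper and the `(type, nullable)` schema shape are assumptions for illustration; only the `all_nullable` option name comes from the commit.

```python
from typing import Dict, Tuple


def column_ddl(schema: Dict[str, Tuple[str, bool]], all_nullable: bool = False) -> str:
    """Render column definitions from {name: (postgres_type, nullable)}."""
    parts = []
    for name, (pg_type, nullable) in schema.items():
        not_null = "" if (all_nullable or nullable) else " NOT NULL"
        parts.append(f'"{name}" {pg_type}{not_null}')
    return ", ".join(parts)


schema = {"id": ("BIGINT", False), "note": ("TEXT", True)}
print(column_ddl(schema))                     # "id" BIGINT NOT NULL, "note" TEXT
print(column_ddl(schema, all_nullable=True))  # "id" BIGINT, "note" TEXT
```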

Enhance error reporting in load_folder.py and load_sas.py for better debugging 64e7ff0b0a

Updated error handling in the _worker_load_append_file function to include full tracebacks in exception messages, improving context for failures during file loading. Additionally, modified the _safe_numeric_to_datetime function to provide detailed warnings when conversion errors occur, ensuring users are informed of potential data issues. These changes aim to facilitate easier debugging and enhance the robustness of the data loading process.
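
One common way to carry a full traceback out of a worker is to fold it into the exception message. This matters because the original traceback object is not always preserved once an error crosses a process boundary. The wrapper below is a hypothetical sketch of that idea, not the code in `_worker_load_append_file`.

```python
import traceback
from typing import Callable


def run_with_traceback(func: Callable[[str], object], path: str) -> object:
    """Call func(path); on failure, re-raise with the formatted traceback in the message."""
    try:
        return func(path)
    except Exception as exc:
        raise RuntimeError(
            f"failed loading {path}: {exc!r}\n{traceback.format_exc()}"
        ) from exc
```

The parent process can then print the message as-is and still show the failing line inside the worker.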

dp requested review from mc 2026-04-21 22:23:08 +00:00

mc approved these changes 2026-04-21 22:31:06 +00:00

dp merged commit 64e7ff0b0a into main

2026-04-21 22:32:18 +00:00

dp deleted branch advanced_analyzer

2026-04-21 22:32:25 +00:00

Reference: OFRA/foxtrot#8