Refine partition name patterns in sas_profiler.py

Replaced the two anchored patterns (`^state$` and `^state_?code$`) with a single pattern that matches `state`, `state_code`, or `statecode` as a full token anywhere in the column name, using underscores and the start or end of the name as token boundaries. Columns such as `HOME_STATE` and `BIRTH_STATE_CODE` now qualify as partition candidates, while `STATUS`, `ESTATE`, `INTERSTATE`, and `STATEWIDE` still do not match.
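
A minimal sketch of how the new pattern behaves on the column names listed in the code comment. The regular expression is copied from the diff; the helper name `is_state_column` is hypothetical, and the use of `search()` is an assumption, since the call site in `sas_profiler.py` is not part of this change.

```python
import re

# Pattern copied verbatim from the diff.
STATE_PATTERN = re.compile(r"(?:^|_)state(?:_?code)?(?:_|$)", re.IGNORECASE)


def is_state_column(name: str) -> bool:
    """Hypothetical helper: True if the column name contains a state token.

    Assumes the profiler applies the pattern with search(); with match() or
    fullmatch() names such as HOME_STATE would not be found.
    """
    return STATE_PATTERN.search(name) is not None


# Names the code comment says should match.
should_match = [
    "STATE",
    "state_code",
    "StateCode",
    "HOME_STATE",
    "ADDR_LINE3_STATE",
    "BIRTH_STATE_CODE",
]

# Names the code comment says should be rejected.
should_skip = ["STATUS", "ESTATE", "INTERSTATE", "STATEWIDE"]

for name in should_match + should_skip:
    print(f"{name:20} {is_state_column(name)}")
```

Under the old patterns only the first three names would have matched, because both expressions were anchored at the start and end of the column name.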
2026-04-20 19:27:01 -05:00 · a94ab68f4d
commit a94ab68f4d
parent 4fc85081c8
1 changed file with 6 additions and 2 deletions
--- a/utils/sas_profiler.py
+++ b/utils/sas_profiler.py
@@ -117,8 +117,12 @@ larger than the file, pyreadstat just hands back one chunk."""
 PARTITION_NAME_PATTERNS: Tuple[re.Pattern, ...] = (
-    re.compile(r"^state$", re.IGNORECASE),
+    # ``state`` or ``state_code`` / ``statecode`` appearing as a full token
-    re.compile(r"^state_?code$", re.IGNORECASE),
+    # anywhere in the column name. Uses underscore / start / end as token
+    # boundaries so we catch STATE, STATE_CODE, HOME_STATE,
+    # ADDR_LINE3_STATE, BIRTH_STATE_CODE, etc. without matching STATUS,
+    # ESTATE, INTERSTATE, or STATEWIDE.
+    re.compile(r"(?:^|_)state(?:_?code)?(?:_|$)", re.IGNORECASE),
 )
 """Only columns whose name matches one of these patterns are ever considered
 partition candidates. This deliberately ignores generic low-cardinality