Refine partition name patterns in sas_profiler.py

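Replaced the two anchored partition name patterns (`^state$` and `^state_?code$`) with a single token-boundary pattern. The new pattern matches `state`, `state_code`, and `statecode` as a full underscore-delimited token anywhere in a column name, so names such as `HOME_STATE` and `BIRTH_STATE_CODE` now qualify, while `STATUS`, `ESTATE`, `INTERSTATE`, and `STATEWIDE` are still rejected. Partition candidate selection therefore catches more state-related columns without admitting unrelated ones.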
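
A minimal sketch of how the new pattern behaves, using the regular expression copied from the diff in this commit. The helper name `is_state_column` and the use of `search()` (rather than `match()`) are assumptions made for illustration; how `sas_profiler.py` actually applies the patterns is not shown here.

```python
import re

# Pattern copied verbatim from the diff in this commit.
STATE_TOKEN = re.compile(r"(?:^|_)state(?:_?code)?(?:_|$)", re.IGNORECASE)


def is_state_column(name: str) -> bool:
    """Hypothetical helper: True if `name` contains `state` as a full token.

    Assumption: the pattern is applied with search(), so the token may sit
    anywhere in the name. With match() only a leading token would be found.
    """
    return STATE_TOKEN.search(name) is not None


# Names the code comment in the diff says should match.
expected_hits = [
    "STATE", "state_code", "StateCode",
    "HOME_STATE", "ADDR_LINE3_STATE", "BIRTH_STATE_CODE",
]
# Names the code comment in the diff says should not match.
expected_misses = ["STATUS", "ESTATE", "INTERSTATE", "STATEWIDE"]

for name in expected_hits + expected_misses:
    print(f"{name:<18} {is_state_column(name)}")
```

Because an underscore counts as a closing boundary, the pattern also accepts names where `state` is followed by another token, for example `STATE_NAME`. Whether that is desirable depends on the data being profiled.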
David Peterson 2026-04-20 19:27:01 -05:00
parent 4fc85081c8
commit a94ab68f4d

@@ -117,8 +117,12 @@ larger than the file, pyreadstat just hands back one chunk."""
 PARTITION_NAME_PATTERNS: Tuple[re.Pattern, ...] = (
-    re.compile(r"^state$", re.IGNORECASE),
-    re.compile(r"^state_?code$", re.IGNORECASE),
+    # ``state`` or ``state_code`` / ``statecode`` appearing as a full token
+    # anywhere in the column name. Uses underscore / start / end as token
+    # boundaries so we catch STATE, STATE_CODE, HOME_STATE,
+    # ADDR_LINE3_STATE, BIRTH_STATE_CODE, etc. without matching STATUS,
+    # ESTATE, INTERSTATE, or STATEWIDE.
+    re.compile(r"(?:^|_)state(?:_?code)?(?:_|$)", re.IGNORECASE),
 )
 """Only columns whose name matches one of these patterns are ever considered
 partition candidates. This deliberately ignores generic low-cardinality