Data Quality in SAS: Detecting and Fixing Dirty Data

Published: January 24, 2026 | Author: Editorial Team | Last Updated: January 24, 2026

The most sophisticated statistical model built on dirty data produces unreliable results. Data quality work — identifying missing values, duplicate records, impossible values, and formatting inconsistencies — is unglamorous but essential. SAS provides a rich set of tools for profiling, validating, and cleaning data before it enters any analytical pipeline. Investing time in data quality checks upstream prevents cascading errors in reports and models downstream.

Profiling Data with PROC FREQ and PROC MEANS

Before writing a single cleaning rule, profile your data to understand what you are working with. PROC FREQ on character variables reveals all distinct values and their frequencies, immediately exposing typos like "Femal" alongside "Female" or "TX" and "Texas" representing the same state. PROC MEANS or PROC UNIVARIATE on numeric variables shows the distribution, minimum, maximum, and percentiles. An age variable with a maximum of 247 is clearly wrong. A revenue column with negative values needs investigation. Saving these profiling results to ODS HTML lets you document the initial data state and compare it after cleaning.
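A minimal profiling pass might look like the following sketch. The dataset and variable names (work.customers, gender, state, age, revenue) are illustrative assumptions, not from the original article:

```sas
/* Profile character variables: distinct values expose typos and code variants */
proc freq data=work.customers;
  tables gender state / missing;
run;

/* Profile numeric variables: impossible minima/maxima stand out immediately */
proc means data=work.customers n nmiss min max mean p1 p99;
  var age revenue;
run;

/* Capture the initial data state to an HTML report for before/after comparison */
ods html file="profile_before.html";
proc univariate data=work.customers;
  var age revenue;
run;
ods html close;
```

Running the same profiling code after cleaning, with a different ODS HTML file name, gives a side-by-side record of what changed.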

Handling Missing Values Systematically

SAS represents missing numeric values as a period and missing character values as a blank string. The NMISS statistic in PROC MEANS counts missing observations per variable in one pass. In the DATA step, the MISSING function — or explicit tests such as IF variable = . for numerics and IF variable = "" for characters — identifies missing rows for imputation or exclusion. Simple imputation strategies — replacing missing values with the variable mean, median, or a sentinel value like -9 to flag them explicitly — should be documented in a data dictionary so downstream analysts know how gaps were handled. Never silently drop records with missing values without logging how many were removed and why.
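The flag-and-impute pattern above can be sketched as follows. The dataset, the age variable, and the -9 sentinel are illustrative assumptions:

```sas
/* Count missing values per variable in one pass */
proc means data=work.customers n nmiss;
  var age revenue;
run;

/* Impute with a documented sentinel and keep an explicit flag variable */
data work.customers_clean;
  set work.customers;
  if missing(age) then do;
    age_imputed = 1;
    age = -9;   /* sentinel value; document this choice in the data dictionary */
  end;
  else age_imputed = 0;
run;
```

Keeping the age_imputed flag alongside the imputed value lets downstream analysts exclude or re-impute those rows without re-deriving which ones were originally missing.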

Detecting and Resolving Duplicates

Duplicate records corrupt counts, inflate sums, and bias averages. PROC SORT with the NODUPKEY option plus a DUPOUT= dataset removes duplicated keys and captures every discarded record in one step. Alternatively, PROC SQL with COUNT(*) grouped by the key columns and a HAVING COUNT(*) > 1 clause identifies duplicated keys quickly. Once identified, choose your resolution strategy deliberately: keep the most recent record by date, keep the record with the fewest missing values, or route all duplicates to a review file for manual adjudication. Document the rule applied so auditors can trace every retained record back to a defensible decision.
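The keep-most-recent strategy takes two sorts: one to order records within each key, one to deduplicate. The key and date variables (customer_id, last_updated) are illustrative assumptions:

```sas
/* List duplicated keys with their counts */
proc sql;
  create table work.dup_keys as
  select customer_id, count(*) as n_records
  from work.customers
  group by customer_id
  having count(*) > 1;
quit;

/* Sort so the most recent record comes first within each key... */
proc sort data=work.customers out=work.ordered;
  by customer_id descending last_updated;
run;

/* ...then keep the first record per key; discards go to the review file */
proc sort data=work.ordered out=work.deduped
          nodupkey dupout=work.dup_review;
  by customer_id;
run;
```

The two-step sort matters: NODUPKEY deduplicates on the full BY list, so last_updated must be left out of the second sort's BY statement or every distinct date would survive.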

Standardizing and Validating with Pattern Matching

PROC FORMAT creates custom formats that map raw coded values to labeled categories, centralizing recoding logic that would otherwise be scattered across dozens of IF-THEN blocks. Regular expression matching via the PRXMATCH function and the CALL PRXSUBSTR routine validates and extracts patterns — confirm that phone numbers match a ten-digit pattern, that email addresses contain an @ symbol, or that ZIP codes are exactly five digits. Building a validation report that lists every record failing each rule, with a count and percentage, gives data owners the specific information they need to fix problems at the source rather than requiring perpetual downstream cleaning.
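Both ideas can be combined into a short validation pipeline. The format values, rule names, and variables (phone, zip, email) are illustrative assumptions:

```sas
/* Centralize recoding: raw code variants map to one clean label */
proc format;
  value $gender
    'F', 'Femal', 'Female' = 'Female'
    'M', 'Male'            = 'Male'
    other                  = 'Unknown';
run;

/* Flag records that fail each validation rule */
data work.rule_failures;
  set work.customers;
  length fail_rule $32;
  if prxmatch('/^\d{10}$/', strip(phone)) = 0 then do;
    fail_rule = 'PHONE_NOT_10_DIGITS'; output;
  end;
  if prxmatch('/^\d{5}$/', strip(zip)) = 0 then do;
    fail_rule = 'ZIP_NOT_5_DIGITS'; output;
  end;
  if index(email, '@') = 0 then do;
    fail_rule = 'EMAIL_MISSING_AT'; output;
  end;
run;

/* Summarize failures by rule: counts and percentages for data owners */
proc freq data=work.rule_failures;
  tables fail_rule;
run;
```

Because a record is output once per failed rule, the PROC FREQ summary counts rule violations rather than distinct bad records; add a deduplication step if the report needs record-level counts.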
