Mastering the SAS DATA Step: Reading, Transforming, and Combining Data
The DATA step is where SAS transforms raw data into analysis-ready datasets. Mastering it means understanding how SAS reads data row by row — the Program Data Vector (PDV) — and using that knowledge to write efficient, accurate transformations. This guide goes beyond the basics.
How SAS Reads Data: The PDV
SAS processes data through a compile phase (checking syntax, defining variables) and an execution phase (reading each observation). During execution, SAS uses the Program Data Vector — a temporary memory space holding one observation at a time. At the end of each iteration, the output observation is written to the dataset and the PDV is reset. Understanding this explains many SAS behaviors, including why you need RETAIN to preserve variable values across iterations.
Reading External Data
PROC IMPORT reads Excel, CSV, and delimited files with minimal coding: PROC IMPORT DATAFILE='/path/file.csv' OUT=mydata DBMS=CSV REPLACE; RUN;. For more control, use an INFILE/INPUT combination in a DATA step. INFILE specifies the file; INPUT specifies the variable names and formats. The DELIMITER= option handles non-comma delimiters. FIRSTOBS= and OBS= control which rows to read.
LENGTH name $ 50; before the INPUT statement, or use the LENGTH option in INPUT formatting.
Conditional Logic and Functions
IF-THEN-ELSE controls execution: IF salary > 100000 THEN tier = 'Senior'; ELSE IF salary > 60000 THEN tier = 'Mid'; ELSE tier = 'Junior';. SELECT-WHEN is more efficient for multiple conditions on the same variable. SAS functions are extensive — character functions (UPCASE, SUBSTR, COMPRESS, SCAN), numeric functions (ROUND, INT, ABS, MAX), date functions (TODAY(), MDY(), INTCK), and more.
Combining Datasets: Concatenation and Merging
Concatenation stacks datasets vertically: DATA combined; SET dataset1 dataset2 dataset3; RUN;. Merging joins datasets horizontally by a key variable. Datasets must be sorted by the key variable first, then: DATA merged; MERGE left right; BY id; RUN;. The IN= data set option identifies which dataset contributed each observation, enabling inner joins, left joins, and anti-joins through conditional logic after the MERGE statement.
Output Delivery and Subsetting
The OUTPUT statement explicitly writes the current PDV to the dataset at any point — enabling you to write multiple output observations from a single input row. The KEEP and DROP statements specify which variables appear in the output dataset. The WHERE statement filters input data before processing; IF without THEN in a DATA step acts as a subsetting IF, filtering observations.
See our guide on PROC SQL in SAS for an alternative approach to data manipulation, or visit our tutorials for practice exercises.