SAS on Netezza is a pretty common use case. However there are some things to consider to ensure one is getting the most out of both their SAS environment and their Netezza environment. SAS can be used for a variety of operations, including ETL & transformation of data, and reporting ranging from basic aggregations up through statistical modeling. The big things we’re looking for in a SAS on Netezza environment are keeping data in the database, and pushing work to the database.
SAS is a tool and a programming language. Most development methodology tends to be agnostic of where the data lies, although this has major performance impacts at run time. Programs tend to be written as if data is laying within flat files on local SAS server SAN disk, even though often the data is available in databases like Netezza. While there’s no getting around that data must be read in its entirety of a file for SAS to perform compute operations on, treating the database as if it is a file-on-demand server is very wasteful. It ignores the compute engine within a database, and in the case of large datasets as we often see with SAS on Netezza, it also results in paying a significant network penalty to move data off Netezza into the SAS server – at which point much of it is actually thrown away.
A lot of performance improvements on SAS come as the result of reducing waste and inefficiency; often this is in reducing wasteful I/O operations to SAN disk. This is no different in looking for otpimizations against Netezza – we seek to reduce wasteful row movement operations, and push computational work to be done inside the Netezza database. It’s almost always going to be much larger than the SAS server, and the data is already there. An example of a common, but sub-optimal, use case is below.
This example is written in the commonly used SAS SQL – it is combining a local dataset, work.local_in (a file) to a remote dataset on Netezza, via the LIBNAME nz_remote. The Netezza tables in this case are remote_one and remote_two. A date filter is applied, although it is on the local file. What does SAS actually generate in the case of this query?
SAS doesn’t have the remote data – it also doesn’t know what will join. Nor does it have any restricts. What SAS actually does is then grab every column it needs for a return, joined, or restricted on result – for every record in the database. Every table in this example will have every record downloaded. Depending on the width of the data set and the number of records, the results can be pretty dramatic – I have seen resultsets extracted from Netezza in the range of 200 GB. In reality SAS often needs just 1% of the extracted data – significant effort is made to move, and then throw away, all of this data. SAS does not execute the above steps in parallel, but sequentially; this makes the effort take even longer as the first statement does not start until the first statement is done both reading and writing.
There are several techniques that can be applied to the original query above to achieve more desirable results when running SAS on Netezza; one option is to separate the single SAS SQL statement into two parts. Part 1 can run a Netezza only query, and bring back a more limited set of data. In this example we’d reduce 2 separate queries to 1; we’d push the SAS join into becoming a Netezza join, and we could additionally add restricts – such as a filter on a date column. In this way we can reduce the 10 million records the original queries might have brought back down to something more manageable like 10,000 of interest. A second PROC SQL could then join the more limited Netezza set to the local set.
A second technique would be to push the local set of data into Netezza. The program can open a write session to write into a temporary area on the database, with a subsequent PROC SQL that looks like the original query – although now with all objects on the database, Netezza can do all of the work, and return only the exact results of interest. As a final step, the program cleans up after itself and drops this temporary table.
Both techniques will significantly reduce the inefficiencies commonly found in code running SAS on Netezza. Aside from these data transfer savings, we can achieve additional performance improvements by doing transforms and aggregates in the database as well. Often when SAS programs are ported to run on Netezza from a traditional database, developers see a performance improvement something along the lines of a 30 minute job hitting Oracle turning into a 10 minute job hitting Netezza and conclude the porting job is done. The techniques above generally work for all databases, but especially so in the case of Netezza – further optimization of the sort above is how we get to 15 second jobs and a SAS and Netezza environment that scale to handle both a lot more users and reports.