Thursday, November 19, 2009

Random Sample and Top Function in Teradata SQL

Pulling a random sample is Teradata SQL is very easy-- after your code, simply put "sample 1000" if you want to sample 1000 records. Alternatively, you can put a number less than 1 to pull a percentage sample. For example, "sample .5" will give you a 50% sample of the results.

If you are using this to query a large dataset and really just need a sample to see what the data looks like, you can use the top command. Top will return to you the top records in your query-- the advantage over sample is that it is much faster; the disadvantage is that it is not random. The syntax for top is "sel top 10 * from table_x". That will quickly return the top 10 rows from that table.

Using the PARTITION BY function in Teradata SQL

The PARTITION BY command in Teradata is a very powerful function. The best use that I have found is when you have to find the minimum or maximum value, it can retrieve that row in one step instead of two.

One way to retrieve the first/last row across a certain criteria is to first build a table (or subquery), group by the attribute, and then use the min/max function. Then take that dataset and join it back to the original dataset to retrieve the row. The PARTITION BY command will do this in one step.

The syntax of the PARTITION BY command can be daunting at first, but it really isn't that complicated:

QUALIFY ROW_NUMBER() OVER (PARTITION BY X ORDER BY Y DESC) = 1

You put this directly under the chunk of code that you want to partition. The only parts of this code that you will usually change are X, Y, and DESC.

X is the field that you want to break on. It is the field that you would group by in the first step of the two-step technique.

Y is the field that you want to sort by.

= 1 means that you will get the first row returned for each value X. So, if you set the order for Y to be ASC, you will get the first row (MIN). If you set the order for Y to be DESC, you will get the last row (MAX).

Hope this helps!