Deep dive into SQL window functions

SQL window functions perform calculations across a set of table rows, streamlining data analysis

James McNeill
Towards Data Science


Photo by Vidar Nordli-Mathisen on Unsplash

During data analysis we often face the challenge of creating aggregate calculations for groups of rows. A common pattern is to build multiple summary outputs across a number of temporary tables and then merge the results back onto the original dataset before continuing the analysis. This circular flow of creating and appending summary statistics can make the process longer than necessary. SQL window functions let a developer avoid producing large intermediate datasets, and consuming the associated memory, for relatively small gains.

Within this article, we will show some of the common problems that SQL window functions can help to solve. In addition, a few pieces of time series functionality will be introduced to show how data can be processed across time.

Dataset and software

As many people have an interest in property values, we will review a house price index that provides a macro-economic perspective: the Residential Property Price Index for the Republic of Ireland, taken from data.gov.ie¹. Note that some pre-processing was performed on the dataset to make it easier to work with. The character variable “Type of Residential Property” contained details for both the country region and the type of property, so it was split into two separate variables, “Area” and “Property_type”.

The SQL code shown in this article was created using an instance of Microsoft SQL Server Management Studio 18.

There are a number of different categories of SQL window functions available. We will review the three common categories: aggregate, analytic, and ranking functions. A final time series processing section shows the temporal analyses these functions enable.

Aggregate Functions

Aggregation provides summary statistics of the dataset being reviewed. Some of these basic statistics help us to understand the structure of numerical variables and, from this insight, the story of the data. Usually, these summary values are created within a separate query and then joined back to the original dataset. Creating a temporary table can work for ad-hoc analysis. However, if we want to change the range of values being assessed, multiple sections of code need to be updated, which can lead to errors. A more automated approach really helps to improve code usability.

This is where window functions provide greater flexibility. The code below shows the basic structure of a window function. The OVER keyword introduces the window clauses for the aggregate function. Within the parentheses, the first option is the PARTITION BY clause: the list of partition field(s) defines the window, or set of rows, on which the window function operates. Next, the ORDER BY option sorts the rows within each partition.

SQL code 1.1 range of aggregate window functions (Image by author)
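The article's queries target SQL Server, and the original code is shown as an image. As a minimal runnable sketch of the same aggregate window pattern, the following example uses Python's sqlite3 module (SQLite 3.25+ supports the same OVER syntax); the table name `rppi`, the column names, and the sample values are illustrative assumptions, not the article's dataset:

```python
import sqlite3

# Illustrative sample of a house price index; the schema is an assumption.
rows = [
    ("2020-01-01", "Dublin", 100.0),
    ("2020-02-01", "Dublin", 102.0),
    ("2020-03-01", "Dublin", 101.0),
    ("2020-01-01", "National", 90.0),
    ("2020-02-01", "National", 93.0),
]

con = sqlite3.connect(":memory:")  # SQLite >= 3.25 supports window functions
con.execute("CREATE TABLE rppi (Date TEXT, Area TEXT, Value REAL)")
con.executemany("INSERT INTO rppi VALUES (?, ?, ?)", rows)

query = """
SELECT Date, Area, Value,
       COUNT(Value) OVER (PARTITION BY Area)               AS cntValue,  -- rows per window
       AVG(Value)   OVER (PARTITION BY Area ORDER BY Date) AS avgValue,  -- running average
       MIN(Value)   OVER (PARTITION BY Area ORDER BY Date) AS minValue,  -- running minimum
       MAX(Value)   OVER (PARTITION BY Area ORDER BY Date) AS maxValue   -- running maximum
FROM rppi
ORDER BY Area, Date
"""
result = list(con.execute(query))
for row in result:
    print(row)
```

With ORDER BY inside OVER, the default frame runs from the start of the partition to the current row, so the average, minimum, and maximum update one row at a time, while the count without ORDER BY reports the full window size.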

The code above shows four statistical options. The COUNT function records the number of rows within the window, while the other three functions progress through the window one row at a time.

SQL output 1.1 aggregate function values changing by date (Image by author)

With the alias avgValue, the average function shows how the mean changes as more rows are added. Including a minimum and maximum value helps to review the boundary of values. As a new window begins at row 7, we can see the aggregate variable values reset.

Analytic Functions

More advanced operations are available through the analytic functions. Many of these functions bring data points from other rows of the window into the current row.

SQL code 1.2 working with the lag analytical function (Image by author)

The code above works through the different options available with the LAG function. By default, LAG takes the value from one row back. For row 2 in the output below, the lagValue column has taken the data from the value variable in the first row. To retrieve data from three months previous, an offset parameter of 3 can be passed to the LAG function.

The remaining two functions highlight how a value can be assigned when the lagged data is not available. First, COALESCE evaluates its inputs in order and returns the first non-null value: if the lagged value is available it is used; otherwise the second parameter, zero, is taken. The LAG function also provides this functionality directly: including zero as its third parameter completes the replacement in one step.
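The LAG options just described (the default offset, a three-month offset, COALESCE, and the third default parameter) can be sketched as a runnable example, again using Python's sqlite3 for a self-contained illustration; the data and names are assumptions:

```python
import sqlite3

# Six months of illustrative index values (the schema is an assumption).
rows = [("2020-%02d-01" % m, 100.0 + m) for m in range(1, 7)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rppi (Date TEXT, Value REAL)")
con.executemany("INSERT INTO rppi VALUES (?, ?)", rows)

query = """
SELECT Date, Value,
       LAG(Value)    OVER (ORDER BY Date) AS lagValue,      -- default offset of 1
       LAG(Value, 3) OVER (ORDER BY Date) AS lag3Value,     -- three months previous
       COALESCE(LAG(Value, 3) OVER (ORDER BY Date), 0)
                                          AS lag3Coalesce,  -- NULL replaced by zero
       LAG(Value, 3, 0) OVER (ORDER BY Date)
                                          AS lag3Default    -- the same, in one step
FROM rppi
ORDER BY Date
"""
result = list(con.execute(query))
for row in result:
    print(row)
```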

SQL output 1.2 different impacts of using the lag function settings (Image by author)

Having the ability to take lagged values helps when producing time-series comparisons of the dataset. Once again, the window function automates a number of these lagged value options instead of requiring multiple joins. Taking the three-month lagged values allows quarterly changes to be reviewed, while a six- or twelve-month difference would show half-yearly or annual movements.

SQL code 1.3 understanding the leading, first, and last value options (Image by author)

The LEAD function brings values from future rows back to the current row. As with LAG, a default offset of 1 is used, and a replacement for null values can be assigned as well. The FIRST_VALUE function returns the value at the beginning of the window, which can help when the range of values is being reviewed against a particular date; in this example, the beginning of the calendar year supplies the first value. When using the LAST_VALUE function, however, an additional clause is required to ensure that the entire window is reviewed: specifying a range that is unbounded at the beginning (preceding) and the end (following) completes this process.
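A runnable sketch of LEAD, FIRST_VALUE, and LAST_VALUE with the unbounded frame, again via sqlite3 with illustrative data:

```python
import sqlite3

# Six months of illustrative index values within one calendar year.
rows = [("2020-%02d-01" % m, 100.0 + m) for m in range(1, 7)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rppi (Date TEXT, Value REAL)")
con.executemany("INSERT INTO rppi VALUES (?, ?)", rows)

query = """
SELECT Date, Value,
       LEAD(Value) OVER (ORDER BY Date)        AS leadValue,   -- next row's value
       FIRST_VALUE(Value) OVER (ORDER BY Date) AS firstValue,  -- start of the window
       LAST_VALUE(Value) OVER (
           ORDER BY Date
           RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       )                                       AS lastValue    -- end of the window
FROM rppi
ORDER BY Date
"""
result = list(con.execute(query))
for row in result:
    print(row)
```

Without the RANGE clause, LAST_VALUE would use the default frame ending at the current row and simply echo the current value.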

SQL output 1.3 values at the beginning and end of the range can help define boundaries (Image by author)

The first and last values help to show how the intermediate values moved during the time period of the window. Should the data analysis require a relative comparison to these data points, the calculation can be completed quickly.

Ranking Functions

Incorporating ranking functions into the data analysis will rank the values of a specified field. Often we are interested in understanding the top N records by a certain value; at other times the bottom N records may provide more insight, depending on the question being asked.

SQL code 1.4 ranking options provide insight into the range of values by the partition grouping (Image by author)

The RANK function orders the values from highest to lowest, giving the top N values. As we can see in the code, the DESC keyword was added so the ORDER BY clause sorts the values in descending order. To maintain the original order of the input dataset, a final ORDER BY clause re-orders the results after the window function has been applied. Being able to re-order the results shows the flexibility available with window functions.

For the ROW_NUMBER function, the PARTITION BY clause ensures that the row count resets for each window. The output for this variable is similar to the earlier COUNT function.
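The ranking pattern can be sketched as follows, again via sqlite3 with illustrative data and an assumed schema; note the final ORDER BY restoring the temporal order:

```python
import sqlite3

# Illustrative index values for two areas (the schema is an assumption).
rows = [
    ("2020-01-01", "Dublin", 100.0),
    ("2020-02-01", "Dublin", 103.0),
    ("2020-03-01", "Dublin", 101.0),
    ("2020-01-01", "National", 90.0),
    ("2020-02-01", "National", 95.0),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rppi (Date TEXT, Area TEXT, Value REAL)")
con.executemany("INSERT INTO rppi VALUES (?, ?, ?)", rows)

query = """
SELECT Date, Area, Value,
       RANK() OVER (PARTITION BY Area ORDER BY Value DESC) AS valueRank,  -- top-N ranking
       ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Date) AS rowNum      -- resets per window
FROM rppi
ORDER BY Area, Date  -- re-order the results back into temporal order
"""
result = list(con.execute(query))
for row in result:
    print(row)
```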

SQL output 1.4 difference between ranks can help to understand the distribution of values (Image by author)

If the results had been kept in the window function's output order, the temporal nature of the data would have been lost: rows 1 and 2 would have switched positions. That ordering would be fine if the data were not a time series, but care must be taken to ensure that the output results align with expectations.

Time Series Processing

When reviewing a house price index, data analysis concentrates on the movement of values over time. Using the knowledge gained from earlier sections of this article, we can create a number of initial time series functions. Understanding trends across time provides more insight into the macro-economic environment in which the data is generated.

SQL code 1.5 using time series options to understand moving averages and percentage changes (Image by author)

An aggregate value is first generated to average the values across a time window. A simple moving average helps to reduce the volatility of time-series data points, producing a smoother visualization of how the values change over time. To create a three-month average, the ROWS option restricts the frame to a range between the current row and the two preceding rows.
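A runnable sketch of the three-month moving average frame, again via sqlite3 with deliberately volatile made-up values:

```python
import sqlite3

# Illustrative, deliberately volatile monthly values.
values = [100.0, 104.0, 99.0, 107.0, 102.0, 110.0]
rows = [("2020-%02d-01" % (i + 1), v) for i, v in enumerate(values)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rppi (Date TEXT, Value REAL)")
con.executemany("INSERT INTO rppi VALUES (?, ?)", rows)

query = """
SELECT Date, Value,
       AVG(Value) OVER (
           ORDER BY Date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- three-month frame
       ) AS movAvg3
FROM rppi
ORDER BY Date
"""
result = list(con.execute(query))
for row in result:
    print(row)
```

The first two rows average over fewer than three values because the frame is truncated at the start of the window.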

To create the percentage change across three months, we can reuse the LAG function. The current row is compared with the lagged value from three months earlier: dividing the current value by the lagged value and subtracting one gives a negative output for a decrease and a positive output for an increase.
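The percentage change calculation can be sketched like so, via sqlite3 with illustrative values; note that LAG is left without a default so NULLs propagate rather than triggering a division by zero:

```python
import sqlite3

# Six months of illustrative index values.
values = [100.0, 102.0, 101.0, 104.0, 106.0, 103.0]
rows = [("2020-%02d-01" % (i + 1), v) for i, v in enumerate(values)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rppi (Date TEXT, Value REAL)")
con.executemany("INSERT INTO rppi VALUES (?, ?)", rows)

query = """
SELECT Date, Value,
       Value / LAG(Value, 3) OVER (ORDER BY Date) - 1 AS pctChange3m
       -- no default for LAG: NULL propagates on the first three rows
       -- instead of causing a division-by-zero error
FROM rppi
ORDER BY Date
"""
result = list(con.execute(query))
for row in result:
    print(row)
```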

SQL output 1.5 using time series analysis to understand value movements (Image by author)

For the percentage change logic, the zero default value has been deliberately excluded. If it had been included, the SQL interpreter would have raised a division-by-zero error. As a result, three null values are present at the beginning of the calculation. As more data enters the calculation, a smoother trend can begin to emerge.

Conclusion

This article worked with SQL window functions to highlight what can be achieved within a single SQL query. The feature engineering processes showed how aggregate statistics can be produced for each window. Previously, we would have had to produce these statistics within a temporary table and then join it back to the original table; the window functions avoid this step in the process flow. For the analytic and ranking functions, we saw how time series metrics can be produced to help with data visualizations. Within machine learning models, these new features could aid with forecasting future values.

Leave your comments and thanks very much for reading!
