Strings in SQL, Part 2

Aggregating and segmenting string variables with SQL

James McNeill

Published in

Towards Data Science

5 min readJan 13, 2022

Introduction

The first article on strings helped provide some initial insights into data manipulation for this variable type. We were able to extract elements of the string to search for common cohorts. By making use of indexing methods, any number of characteristics within a string can be studied.

Strings in SQL part 1

Extracting meaningful insights from character columns using SQL string techniques

towardsdatascience.com

This article aims to show how data aggregations can be incorporated into data analysis.

Input data and software

For this piece of analysis, a sample of the IBM HR Analytics dataset has been taken from Kaggle¹. Within this dataset several character variables are available.

IBM HR Analytics Employee Attrition & Performance

Predict attrition of your valuable employees

www.kaggle.com

The SQL code shown in this article was created using an instance of Microsoft SQL Server Management Studio 18.

CONCAT & CONCAT_WS

With concatenation, the method aims to combine multiple string values into one overall string. A simple example would be combining the first and second names to create one overall name. Within SQL there are many methods available to use. The inclusion of three user-defined declare statements provides flexibility to create the desired message.

SQL code 1.1 Working with concatenation methods

From the SQL code above, lines 10 and 11 show the first concatenation method of using a plus symbol. With this simple method, variables can be combined quite easily. The addition of the blank space is required to ensure that a space is added between each separate string. If we forgot to include this element, then all of the text would be compressed together when one variable ends and the next begins. Also, note that we have shown how the JobSatisfaction variable could be converted to a different string data type using the cast method. Should the variable data types be anything other than a string data type then this cast method could be used to convert.

Taking the concat method still requires each space separator to be included between each string variable. Separating each element of the string using a comma makes things easier. However, there is still the chance for a space to go missing. Having the separator declared once is where the concat_ws method simplified the creation of the concatenated string. Providing this separator at the beginning of the method ensures that there is less coding required and reduces the chance of an error.

SQL output 1.1 Results from the first concatenation method

The SQL output displays how the concatenated string has been generated. Each of the methods will have produced the same result.

STRING_AGG

When working with larger datasets there can be times where string methods will aim to display too much information. For the examples displayed in this article, we are using a sample of 1,000 rows of tabular data. When running certain SQL string methods this can result in errors being displayed, as the query has exceeded a pre-defined memory capacity. However, one method to work around this issue is to create a Common Table Expression (CTE) which provides a temporary named result set.

SQL code 1.2 Creating the CTE reference dataset

The SQL code will store the summary query within the Job_Roles variable. In the query, we have summarised the data by the distinct job role to reduce the number of potential values for the string_agg method to work appropriately.

SQL output 1.2 Summary of the job role volumes

From the results of the CTE query, we can see that some roles have a higher proportion than others.

SQL code 1.3 Working with the string aggregation method of the job roles

To produce a list of the job roles, the string_agg method will take the string variable JobRole and create a concatenated list. Using the results below we could iterate through the list and filter the overall dataset for each of these cohorts.

SQL output 1.3 Displaying the list of job roles in one string

As the job roles were displayed in the order that they first appeared in the initial dataset, it can be tricky to understand any similar roles. Within the SQL code that follows, we see how string_agg group can be ordered alphabetically.

SQL code 1.4 Method to order the list of job roles

The ordering of the list could help to process the data in a more meaningful way.

SQL output 1.4 Job roles have been ordered alphabetically

STRING_SPLIT

So far we have worked with methods that combine string values. With this method, we will segment the string variable into its unique parts. At times we are required to understand the number of words that are contained within a string so using this method can help with the initial discovery.

SQL code 1.5 Using a splitting method to create the tags within the declared string

By declaring a string message we can experiment with the method. As the message is separated by spaces, this is the separator that is included within the method's second parameter.

SQL output 1.5 List of tags separated into each row of the output dataset

The query output shows each of the elements that have been tagged by the split method. A future task could be to count each tag to understand the term frequency.

Conclusion

In this article, we have shown how strings can be combined and split. By understanding the questions we want to ask the data, we can create summarised query results first before applying string methods. Using the concatenation methods allows the user to create messages or IDs that could be used to process the data further. Finally, we see how splitting a string can provide the list of tags that could be reviewed within Natural Language Processing (NLP) data analysis.

Thanks very much for reading

[1] : Kaggle dataset IBM HR analytics attrition dataset from https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset, with a license agreement of https://opendatacommons.org/licenses/dbcl/1-0/