Strings in SQL part 1

Extracting meaningful insights from character columns using SQL string techniques

James McNeill
Towards Data Science

--

Photo by Federico Burgalassi on Unsplash

Introduction

After reading this article we should have more confidence to extract the data insights contained within character variables. A character variable or string can be represented by text values such as a person’s; name, job, or address. Many of the ML models that are used require features to be in a numeric format to extract meaning. With string manipulation techniques data can be cleaned and converted into the necessary format needed.

The SQL code shown in this article was created using an instance of Microsoft SQL Server Management Studio 18. Each of the examples used aims to highlight the basic concepts of the different SQL string functions that are available.

Input data

For this piece of analysis, a sample of the IBM HR Analytics dataset has been taken from Kaggle¹. Within this dataset a number of character variables are available.

CHARINDEX

The character index method can be used to review the presence of a string value within a character column.

SQL code 1.1 Using the charindex method to locate expressions

In the SQL code shown above, we are searching for the value “sales” within the first application of the method. The string search shows that the value is not case-sensitive, as the actual value in the column searched shows a camel case format. As the filter clause is applied to the dataset then only values returning a match to the filter expression will be shown in the query results.

SQL output 1.1 Results from the charindex review of the JobRole column

As the output above shows, the two string values that have been searched for will return the index position that represents the first character which matches the overall string value being searched for. Should the string value not be present then a value of zero will be returned. We can see within the second row that the column value “Representative” will not match the “Exec” value that was searched for and so a zero is returned.

PATINDEX

When reviewing for a pattern contained within a column, the pattern index method is appropriate.

SQL code 1.2 Identifying patterns within the JobRole

By making use of the percentage symbol we are providing a wildcard to the method. It is this wildcard that allows for any number of other values to be present between the pattern that is being searched for.

SQL output 1.2 The range job titles that match the pattern

The results highlight that two values were present before the first letter “b” was found and then six values were present before the second letter “y”.

SQL code 1.3 Reviewing a list of values for the pattern index to search with

Within this review of the job role column, we are looking for a list of values that are contained anywhere within the string column.

SQL output 1.3 The letter x appears to be the most common from the pattern list

As can be seen from the SQL output only the letter “x” was identified within this search. However, should any or all of the letters from the list be present within the row, then these results would be shown.

LEFT & RIGHT

These two methods can be used to return a section of a string beginning from the start or the end of the string.

SQL code 1.4 Extracting elements of the string from the start and the end

By choosing the left method, the first three character values from the start of the string are returned.

SQL output 1.4 The elements of the string that have been returned

Whereas with the right method the final three character values have been returned. Each of these methods can help to return the results required if the beginning or end of the string is where your analysis is starting. However, if another starting position is required then the use of the substring method is much more practical.

SUBSTRING

Being able to identify the start and length positions of the string that is required is the core of the substring method. By having these two positional parameters a greater level of flexibility is available when reviewing string variables.

SQL code 1.5 Creating a string logic to test the method

With the sentence variable that has been declared, there are a number of items that can be extracted. Using the substring method with start and length values from the index allows the user to find the two fruit items. As the substring index begins at one and the length position parameter is inclusive, “Apples” can be returned with the position parameter values of one and six.

SQL output 1.5 Extracting the fruit items from the string that was created

The second fruit has been identified by extending the start position value to the appropriate position from the index. With this example, the length position refers to the length of the fruit that is required. If the length positional parameter was not included then all string values after the start position would be returned in the query result.

REPLACE

After identifying the string values that we are interested in, there may be times when this string value needs to be adjusted. For this task the replace method is available.

SQL code 1.6 Adjusting the symbol within the string

From the department column, the ampersand has been found but it is required to be changed to “and” instead of the “&” symbol. The parameters in the method are the old item for the second parameter and the new item for the third parameter.

SQL output 1.6 Returning an extended version of the string with the new_name column

As the results highlight the new department name has been created as expected. This is a small example of what can be achieved when identifying text that could be converted into another format.

Conclusion

A lot of additional information can be gained from data that is contained within text columns or features. Within this article, a number of different methods have been shown to help with this extraction process. Each method highlighted how understanding the index of the character string can provide opportunities to create new features.

In the next article, we will continue to explore the different SQL string methods that can be used.

Thanks very much for reading

[1] : Kaggle dataset IBM HR analytics attrition dataset from https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset, with a license agreement of https://opendatacommons.org/licenses/dbcl/1-0/

--

--