Weathering the SQL interpreter

Understanding how a SQL query is constructed by the interpreter can make a data professional’s life much easier

James McNeill

Introduction

When working with any dataset, you can expect to spend close to 80% of your time discovering and preparing the data. Picking the right tools can increase efficiency and reduce the time required to complete this process, and well-chosen pre-processing steps shorten the path to reliable data insights. You may also be faced with balancing different relationships in the data, and this is where SQL can be utilised.

Brief background to SQL

The Structured Query Language (SQL), commonly pronounced “sequel” in reference to the initial version’s name, was developed during the early 1970s. The language was designed to manage data in a Relational Database Management System (RDBMS). With this relational model in place, SQL makes the task of processing multiple tables much more efficient. If you are dealing with large tables spread across multiple Excel files, loading them into a database can improve query times. After all, a lookup such as VLOOKUP() or INDEX() will only work efficiently for so long; as the number of rows increases, the required processing time progressively increases. By harnessing the power of a relational model, the algorithms can work much more efficiently.

Now it is time to start working with an example dataset and begin to understand how the SQL interpreter executes a query.

Kaggle dataset

Kaggle is a great place to get started on your data engineering, data analysis or data science journey. There are lots of public datasets to clean, explore, visualise and build models with. With such a wide range of areas to dive into, it only makes sense to review data that impacts us all each day: the weather, of course. Who doesn’t love discussing it?

We will be performing some simple SQL processing steps to understand what is contained within the dataset.

SQL interpreter compilation steps

Firstly, let’s review the main keyword components used within SQL. The following list shows the key operational steps taken by a SQL interpreter when compiling a SQL query (a combined, annotated example follows the list):

1. FROM, including JOINs

Tables defined within the FROM clause are evaluated first. If a join is included, the data from the tables is merged using the ON clause of the JOIN. Should a subquery be present, a temporary table may be created. As a point of note, in many cases the database optimiser will review the WHERE clause first, in order to understand whether the tables in the query can be filtered in any way. With this step in place, the database optimiser works with the SQL interpreter to ensure that only the data the query actually needs to review is included.

2. WHERE

The second feature to be evaluated. We are able to reference all variables contained within the FROM tables; however, we cannot reference any aggregated columns. To filter on aggregated values we have to use the HAVING clause. By including an appropriate WHERE filter, we reduce how much of the dataset the search algorithm has to process when returning query outputs. Applying a filter to an indexed column is a particularly efficient approach.

3. GROUP BY

Aims to separate the dataset into chunks or buckets. Excluding this clause is similar to having all rows in one bucket. With the separate buckets in place we are able to apply aggregate functions, e.g. COUNT(), MIN(), MAX(), SUM(), etc. These aggregated values help to provide summary details of the dataset.

4. HAVING

In order to use a HAVING clause, there must be a GROUP BY. We can use the HAVING clause to filter certain buckets out of the analysis if it makes sense to do so. Whether invoking the aggregate function or its alias, this clause helps to prevent additional analysis having to be performed in a separate query.

5. WINDOW functions

To be reviewed in a later article.

6. SELECT

The SELECT statement returns the values from the chosen columns, after the previous clauses have been evaluated and any rows that are not relevant have been discarded.

The remainder of these operational steps will be reviewed in a future article.

7. DISTINCT

8. UNION

9. ORDER BY

10. LIMIT and OFFSET
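
To make the ordering concrete, the sketch below annotates a single query with the step numbers from the list above. The weather table and its columns are illustrative placeholders rather than real objects in any particular database.

SELECT county, AVG(temp) AS avg_temp   -- 6. SELECT
FROM weather                           -- 1. FROM
WHERE county = 'Galway'                -- 2. WHERE
GROUP BY county                        -- 3. GROUP BY
HAVING AVG(temp) > 10                  -- 4. HAVING
ORDER BY avg_temp DESC                 -- 9. ORDER BY
LIMIT 5;                               -- 10. LIMIT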

SQL data analysis

SQLite was used to complete the analysis performed below. Many other SQL implementations exist in the marketplace, a few examples being Oracle, MySQL, MS SQL Server and PostgreSQL. Along with various cloud computing options, the scale of SQL operations available continues to grow. However, the key guiding principles mentioned above apply to each.

With a weather-related dataset, there are a few different seasonal effects that can be reviewed. Creating a list of questions before querying the dataset can help to provide the structure required to perform data mining tasks. For this tutorial, the aim will be to understand which month(s) of the year have the highest average temperature. One area of the country will be picked and retained throughout the process to help with the explanations.

Before going any further, an initial guess for the highest average temperature would be a summer month. Let’s see how close we are.

Metadata

Code 1.1 Displaying the metadata
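
A minimal version of this query in SQLite reads the built-in sqlite_master table, which holds an entry for every object in the database:

SELECT type,
       name,
       tbl_name
FROM sqlite_master
WHERE type = 'table';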

By understanding the metadata of a database, you can maintain an overview of the files contained within it. In the output below, the initial table is the only one that exists; however, this list will grow over time as more analysis is performed.

SQL Output 1.1 Provide overview of the datasets contained within the SQL database

From

When first reviewing a dataset in SQL, you may be tempted to select all the records.

Code 1.2 Shows the difference between selecting all columns and specific columns
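
As a sketch, assume the data has been loaded into a table named weather with columns such as county, station, date and temp; these names are illustrative and may differ in your own database.

-- Query 1: return every column and every row (assumed table name: weather)
SELECT *
FROM weather;

-- Query 2: return only the columns of interest
SELECT county,
       date,
       temp
FROM weather;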

In the first query of code 1.2, we can see that by selecting all values, SQL output 1.2 inefficiently returns everything to the output view.

SQL Output 1.2 Select all columns from the dataset

A much more efficient method is shown in the second query of code 1.2. We can see in the output below that a shorter turnaround time is required when returning only the columns of interest. In future articles we will see how the LIMIT operator can help to select a small sample of a dataset; with this operation we are able to see a small summary of the columns, and a value of five is usually taken.

SQL Output 1.3 Select specific columns from the dataset

Where

Having a list of variables to query is one thing, but we need to start refining the search. Using the WHERE clause ensures that we can filter for specific sections of the dataset. A search algorithm reviews the filter and returns all of the matching values. If the column being filtered has an index applied to it, the search can proceed more efficiently. If an index is not available, sorting the column values can also help to reduce the filter overhead. However, most columns are left in random order, so the search algorithm has to review every value in the column before returning the final results.

The code and outputs below show a few different variations of the where clause being applied. In the first query we have used one filter to select all the values from county Galway.

Code 1.3 Two variations of the where clause to show differences in specificity
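
Against the illustrative weather table from earlier, the two variations might look like this:

-- Query 1: filter on a single column
SELECT county, date, temp
FROM weather
WHERE county = 'Galway';

-- Query 2: add a banded temperature range to the same filter
SELECT county, date, temp
FROM weather
WHERE county = 'Galway'
  AND temp BETWEEN 12 AND 15;
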
SQL Output 1.4 Review the dataset by filtering on one column

Whereas, in the second query, we have kept that filter and added a banded temperature range between 12 and 15. The additional clause shows how specifying a particular cohort can have a similar run time but return a much smaller dataset. It is this filtering dynamic that allows us to answer more specific analytical questions about the dataset.

SQL Output 1.5 Review the dataset by filtering on multiple columns

Group by

Beginning to incorporate aggregations into the data analysis helps to create summaries. The first query in code 1.4 provides the distribution of rows within the dataset by two columns.

Code 1.4 Simple and advanced group by aggregations
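
A sketch of both aggregations, again against the illustrative weather table:

-- Query 1: a simple row count by two columns
SELECT county,
       station,
       COUNT(*) AS row_count
FROM weather
GROUP BY county,
         station;

-- Query 2: a wider set of aggregations, refined with a where clause
SELECT county,
       MIN(temp) AS min_temp,
       MAX(temp) AS max_temp,
       AVG(temp) AS avg_temp
FROM weather
WHERE county = 'Galway'
GROUP BY county;
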
SQL Output 1.6 A simple row count operation with the group by columns

By incorporating the WHERE clause, we are able to add more insight to the previous overall summary. Including additional metrics helps to clarify the range of values available within each column. The alias assigned after each aggregation can be any free-text value; however, incorporating the aggregation name (e.g. min, max, avg) and a short name for the column really helps to provide context for the calculation.

SQL Output 1.7 An advanced set of data aggregations with the group by columns

From Join

Producing summary values in the previous section begins to tell the story of what is taking place in the Irish weather dataset. Within data analytics, the story can be taken a step further by joining datasets together. In code 1.5, we have incorporated the summary query from code 1.4 as a sub-query. A sub-query can be thought of as a temporary table that is created within the query.

Code 1.5 The aggregated columns of a group by can be added with an inner join of a subquery
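
One way to write this, keeping the illustrative table and column names from earlier:

SELECT w.county,
       w.date,
       w.temp,
       s.min_temp,
       s.max_temp
FROM weather AS w
INNER JOIN (
    -- the sub-query acts as a temporary summary table
    SELECT county,
           MIN(temp) AS min_temp,
           MAX(temp) AS max_temp
    FROM weather
    GROUP BY county
) AS s
    ON w.county = s.county;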

By joining the overall dataset to the sub-query, we can assign the max and min temperature values to every row. Further calculations can then be performed to understand how the individual temperature on each row compares to the summary statistics. Providing a reference point to the wider population in this way helps us to understand the relative temperature variance across time.

SQL Output 1.8 The current temperature can be compared to the range values for the column

Having

Similar to the WHERE clause, HAVING allows the query to make use of aggregated calculations to filter the output. In the example, we have calculated the average temperature for each month and filtered for months having an average of greater than 15.

Code 1.6 The having operator making use of the aggregated variable
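
A sketch of the query, assuming the date column is stored as 'YYYY-MM-DD' text so that substr(date, 6, 2) extracts the month:

SELECT substr(date, 6, 2) AS month,  -- characters 6-7 of 'YYYY-MM-DD'
       AVG(temp) AS avg_temp
FROM weather
WHERE county = 'Galway'
GROUP BY month
HAVING avg_temp > 15;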

As the date variable was imported into the database as a string value, we can use the string method substr() to extract the elements of the column value that relate to the required date information.

SQL Output 1.9 Reviewing the average temperature by month

Keeping the original question in mind, which months had the highest average temperature?

Code 1.7 Creating a temporary view to summarise the final output
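
A sketch using the same assumptions as before, with the WITH clause naming the monthly summary so the final SELECT can rank it:

WITH monthly_avg AS (
    SELECT substr(date, 6, 2) AS month,
           AVG(temp) AS avg_temp
    FROM weather
    WHERE county = 'Galway'
    GROUP BY month
)
SELECT month,
       ROUND(avg_temp, 1) AS avg_temp
FROM monthly_avg
ORDER BY avg_temp DESC;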

By using the WITH clause we are able to define a temporary table with which to summarise the data.

SQL Output 1.10 Summary output table

It was no surprise that the three summer months had the highest average temperatures. However, it is very interesting to see that a few warmer temperatures persisted into September.

Conclusion

There are many different options available when beginning to perform data analytics, from Excel to Python to R to Tableau. We have used SQL in this exercise to show how databases can be queried to create summary statistics. By understanding how the SQL interpreter processes the data, we can submit efficient queries to answer our questions. In future articles we will work to automate the processing steps and understand the other features of SQL.

For the time being, we should be glad to see that heading into the summer the warmer weather might last a bit longer.

Thanks a million for reading! We all learn better by sharing! Until next time!
