WTF Are … SQL Functions
A quick recap.
In our journey so far through Learning SQL For Data Analysis, we’ve explored how to get data back out of our database table using SELECT.
We’ve seen how to filter our results to specific criteria using the WHERE clause.
And we’ve powered up the scale of our SQL querying abilities by learning how to JOIN different tables together in a multitude of different ways.
What we’re going to do now is start to level up the power of our SELECT statement itself.
Having the ability to SELECT either all columns in our table by doing SELECT * is good. Being able to SELECT individual columns within our table by specifying the column name (or number) is even better.
But now we’re going to introduce a new set of keywords that let us really tap into the full power of the SQL SELECT statement:
What are SQL functions?
You are probably familiar with functions if you have ever used Excel. And who hasn’t ever used Excel?
Functions in SQL do a similar job as the ones you’ll have encountered before. The SQL functions we’ll look at below in more detail - COUNT, SUM, MIN, MAX and AVG - all let you basic mathematical operations on your data.
From counting the number of rows to finding the maximum value of a certain column in your data table, SQL functions will soon become the lifeblood of your Data Analysis SQL career.
Let’s take a look at some of the most popular and mosy useful SQL functions in action with a worked example.
Let’s say we have a data table (called PLAYERS) of top professional men’s footballers and their weekly wage:
The COUNT function allows you to count the number of rows in your table. Sounds pretty simple, doesn’t it?
And it is.
You can use the asterisk (*) wildcard to count everything:
SELECT COUNT(*) FROM Players
or specify a particular column (e.g. PlayerID) like this
SELECT COUNT(PlayerID) FROM Players
In a simple version of the COUNT function like this, both of these queries will return the same answer. It will be a single data point, just giving the number of rows. In our example, that answer would be 5.
(The COUNT function does NOT count NULL values however so do be careful if you are likely to have some NULLs in your table.)
The SUM function allows you to do a summary calculation on a particular column in your data table. No wildcards here, we MUST specify the individual column we want to summarise.
The easiest way to think about the SUM function is to use it to add up all of the values in a particluar column.
If we want to find the total combined wage bill for our five football players, our SQL query would look like this:
SELECT SUM(PlayerWage) FROM Players
And the returned Answer = 1900000
If we want to find out what the lowest value within a specified column is, we use the MIN function. This gives us the MINIMUM value e.g.
SELECT MIN(PlayerWage) FROM Players
If we want to find out what the highest value within a specified column is, we use the MAX function. This gives us the MAXIMUM value e.g.
SELECT MAX(PlayerWage) FROM Players
If we want to find out what the AVERAGE (or MEAN) value within a specified column is, we use the AVG function e.g.
SELECT AVG(PlayerWage) FROM Players
(For SUM, MIN, MAX and AVG always remember to specify your chosen COLUMN.)
Let’s step it up a little more.
Using your new-found SQL FUNCTION superpowers on the whole data table won’t happen as often as you might hope. It’s more useful to our business area if we are able to split our data up into smaller segments to allow greater comparison across these very, very useful aggregated functions.
So we need a way to group togther certain data points to allow us to perform our functions and get a more useable result. And we do that with the GROUP BY clause.
Let’s try an example.
In our Footballer example above, it was certainly useful to see how many players we had in our table and how much their total wages were each week. On a grand scale like that though, it doesn’t really tell us if certain clubs are higher payers overall or if the top earning players are clustered at certain clubs.
We will re-do our COUNT function next but bring in another variable (the column TEAM) to see if one club has more high earners than the others.
Our SQL looks like this:
SELECT TEAM, COUNT(PlayerID) as NumberOfPlayers FROM Players GROUP BY TEAM
And our SQL results would look like this:
A quick scan of our results shows that Barcelona have more high earning players than any other team in our example data table.
So how did we set that up?
From the SQL, you can see we asked for two elements in our SELECT statement:
- A COUNT of the PlayerID rows which we used with an alias for output to make it more meaningful (NumberOfPlayers)
We wanted to query the PLAYERS table so that went in the FROM line.
And we used the GROUP BY clause at the end to specify that we wanted to group our results by the TEAM column to segment our results.
We’ve seen how to supercharge our SELECT statement with the use of SQL Functions and Aggregation. Using simple functions like COUNT, SUM and AVG will really push the power, quality and utility of your SQL queries on to the next level.
Combining these with basic segmentation and grouping clauses like GROUP BY only multiplies that power.
As a Data Analyst you’ll be using these queries day-in and day-out until the syntax flows from your fingers without any thought whatsoever. That doesn’t mean you can’t make a god awful mess of them though so do be careful.
You wouldn’t be the first data analyst to make a wrong move on a GROUP BY and over-inflate your overall aggregate sums. Always be on the lookout for that which might trip you up and you’ll be just fine.
Check out the next lesson from my Learn SQL For Data Analysis series.