WTF Are … SQL JOINs

If you’ve been following the other posts in this Learning SQL For Data Analysis guide, you should be familiar with working with a table of data.

Levelling up your SQL skills means increasing the complexity of what we can do with SQL and the first step on that is learning how to join tables together.

Why join tables together?

Let’s say you have one table of COUNTRIES and another table of CITIES. Joining the two tables would let you see which cities are in which countries, e.g.

A table called COUNTRIES:

CountryID CountryName
1 USA
2 UK
3 CANADA
4 AUSTRALIA
5 FRANCE

A table called CITIES:

CityID CityName CountryID
1 NEW YORK 1
2 LONDON 2
3 MELBOURNE 4
4 PARIS 5
5 ROME

Or in a business setting, you might have a list of EMPLOYEES in one table and SALES records in another. Being able to link the sales records to the employee who made the sale could give you monthly totals per salesperson to judge whether they hit their targets or not.

How do we link two tables together?

By using fields in the tables known as “KEYS”.

Each table should have a field (or collection of fields) which operates as a unique identifier. This is known as a PRIMARY key.

It might be a single field (let’s call it “ID”). Or a collection of several fields (e.g. the location ID, account number and date the account was opened).

Whichever it is, the PRIMARY key should have these characteristics:

  • It’s UNIQUE in the table – you can’t have two rows with the exact same PRIMARY key.
  • It’s NOT NULL – it has to have a value.

If the PRIMARY key doesn’t meet these two criteria then it’s not a PRIMARY key.

Each table can only have one PRIMARY key.

What do we join the PRIMARY key to?

To join to another table we also need a FOREIGN key.

A FOREIGN key is a field or collection of fields which corresponds to a PRIMARY key in another table.

For example, CountryID is the PRIMARY key in our Countries table above. CountryID in the Cities table however is a FOREIGN key there. We can join CountryID in Countries to CountryID in Cities to join the tables.

When creating tables in SQL (with the CREATE TABLE command), we can specify which fields represent the PRIMARY and FOREIGN keys in the table. We’ll cover that in the CREATE TABLE section of this series.

Back to joins. So what different kinds of SQL joins can we use and what are the differences between them?

INNER JOIN

The INNER JOIN (also known simply as JOIN) lets us join together two tables and only bring back all records in both where the PRIMARY and FOREIGN keys match. We join ON the PRIMARY key in the parent table and the FOREIGN key in the child table.

The code looks like this:

SELECT Cities.CityID, Cities.CityName, Countries.CountryName 
FROM Countries 
INNER JOIN Cities 
ON Countries.CountryID = Cities.CountryID

And the output would look like this:

CityID CityName CountryName
1 NEW YORK USA
2 LONDON UK
3 MELBOURNE AUSTRALIA
4 PARIS FRANCE

As you can see above, any rows where there is no match between the keys in the two tables are discarded from our results.

LEFT JOIN

The LEFT JOIN (also known as LEFT OUTER JOIN) lets us join tables but insists on including all rows from the first table, even if there is no match in the second table.

We might do this if we had a CUSTOMERS table on the left side and an EMAILS table on the right side. We may not have emails for all of our customers but we still want to see all of the customers in our output.

If we had used an INNER JOIN we would only get back those customers who DO have an email record in the EMAILS table.

Using our COUNTRIES and CITIES tables, the code would look like this:

SELECT Countries.CountryID,
       Countries.CountryName,
       Cities.CityID,
       Cities.CityName
  FROM Countries
       LEFT JOIN
       Cities ON Countries.CountryID = Cities.CountryID

Which gives us the following results:

CountryID CountryName CityID CityName
1 USA 1 NEW YORK
2 UK 2 LONDON
3 CANADA
4 AUSTRALIA 3 MELBOURNE
5 FRANCE 4 PARIS

Note how Canada has no corresponding city against it. As you can see from the tables further up, we don’t have a Canadian city in our Cities table but the LEFT JOIN brings all Countries back regardless. Just as we wanted.

RIGHT JOIN

The RIGHT JOIN (or RIGHT OUTER JOIN) lets us do what we did with LEFT JOIN, only in the opposite direction. It keeps all records from the second table even if there is no match between our keys into the first table.

We write it like this:

SELECT Countries.CountryID,
       Countries.CountryName,
       Cities.CityID,
       Cities.CityName
  FROM Countries
       RIGHT JOIN
       Cities ON Countries.CountryID = Cities.CountryID

And it gives us the following results:

CountryID CountryName CityID CityName
1 USA 1 NEW YORK
2 UK 2 LONDON
4 AUSTRALIA 3 MELBOURNE
5 FRANCE 4 PARIS
5 ROME

This time we see it is ROME from our Cities table which does not have a corresponding Country to match to in the Countries table. But the RIGHT JOIN brings it into the results regardless. (You’ll also see that Canada drops out from our Countries again as in the INNER JOIN example due to having no City in the Cities table).

CROSS JOIN

Beware. The CROSS JOIN will join together all rows from both tables and doesn’t need a key to do it. Yes, I know I said above that we need PRIMARY and FOREIGN keys for SQL JOINs. The CROSS JOIN is an exception that proves the rule.

You might be wondering where we would use it so this is a good example. Let’s say you run an e-commerce clothing store and need every combination of t-shirt size and colour that could possibly be ordered. You have a SIZES table and a COLOURS table.

Sizes are: S M L XL

Colours are: Red Blue Black

The SQL would look like this:

SELECT Sizes.SizeTitle,
       Colours.ColourName
  FROM Sizes
       CROSS JOIN
       Colours

And the output table would look like this:

SizeTitle ColourName
S RED
S BLUE
S BLACK
M RED
M BLUE
M BLACK
L RED

Etc., etc. , etc. until all of our Sizes and all of our Colours had been matched up. For our example, that would mean 12 rows (4 Sizes x 3 Colours).

Really do be careful using the CROSS JOIN though. It has the propensity to deliver MASSIVE tables if you forget yourself and join big tables on either side of the union. If you had one table with 1,000 rows and another table with 1,000 rows, that would generate a CROSS JOINed table of 1 million rows. Don’t say I didn’t warn you…

UNION

While the other joins above were used to link two tables sideways, the UNION join allows us to combine them by setting one table on top of another.

We use our trusty old friend, the SQL statement, to pull two recordsets from the tables and use UNION to do the “setting on top of”.

The two tables need to follow these rules however:

  • Each SELECT statement MUST have the same number of output columns. You can’t SELECT a, b FROM TABLE 1 and only SELECT a FROM TABLE2.
  • The datatypes in the columns have to be the same. Can’t have a text string column from one table and a column of numbers from another.
  • The columns in both SELECT statements have to be in the same order. Don’t have SELECT colA, colB in the first SELECT then SELECT colB, colA in the second.

The UNION join will then stack both datasets together and remove any duplicates rows.

Let’s see an example.

First the data tables, our trusty Cities table from above:

CityID CityName CountryID
1 NEW YORK 1
2 LONDON 2
3 MELBOURNE 4
4 PARIS 5
5 ROME

And a table of cities that have held multiple Summer Olympic Games. We’ll call it Olympics:

CityName
LONDON
TOKYO
LOS ANGELES
PARIS
ATHENS

Our SQL code for this UNION join would be:

SELECT CityName from Cities 
UNION
SELECT CityName FROM Olympics

And the output:

CityName
NEW YORK
LONDON
MELBOURNE
PARIS
ROME
TOKYO
LOS ANGELES
ATHENS

You’ll notice that London and Paris appear in both tables but our UNION join removes the duplicate entries.

Another variation on the UNION join is to use UNION ALL.

While UNION removes duplicates from it’s output, UNION ALL doesn’t. So if the record exists in the first SELECT statement and the second, they will both appear in your results. Just be careful there if you need to watch out for de-duplication.

UNION ALL SQL code looks like this:

SELECT CityName from Cities 
UNION ALL
SELECT CityName FROM Olympics

And our example output (now also including our duplicates) looks like this:

CityName
NEW YORK
LONDON
MELBOURNE
PARIS
ROME
LONDON
TOKYO
LOS ANGELES
PARIS
ATHENS

London and Paris are back in there twice along with all of the other cities in both our Cities table and our Olympics table.

WHAT’S NEXT?

JOINs are incredibly powerful but are often a cause of confusion amongst data analysts. Even remembering which particular join returns which records from which side of the join is difficult. Add in more than two tables, multiple columns and a complex filter in the WHERE clause and the potential for making a mess of if it can be high.

Planning your SQL query is key.

Know what you want to get and from where. Build the query up piece by piece if necessary and check your results regularly. And get lots of practice writing them. That is the only way to truly master the art of the SQL JOIN.

For anyone coming from an Excel Jockey background, you might see some comparisons with the VLOOKUP function but hopefully these examples show just how much more powerful a little bit of SQL code can be. We will look at more advanced JOINs and bring in the WHERE clause to them in a later part of this Learning SQL For Data Analysis series.

Check out the next lesson from my Learn SQL For Data Analysis series.

Want articles like this in your inbox each week?

Straight talking, real world analytics stories, lessons and tips. Don't miss out.