NOTES | Combining Sets

NOTES | Combining Sets

Using Joins

Often, data that you need to query is spread across multiple tables. The more normalized the environment is, the more tables you usually have. The tables are usually related through keys, such as a foreign key in one side and a primary key in the other. Then you can use joins to query the data from the different tables and match the rows that need to be related.

Cross Joins

A cross join is the simplest type of join, though not the most commonly used one. This join performs what’s known as a Cartesian product of the two input tables. In other words, it performs a multiplication between the tables, yielding a row for each combination of rows from both sides.

A CROSS JOIN is a Cartesian product of TableA and TableB.
Every row from TableA is matched with every row from TableB.

According to logical query processing, the first step in the processing of the query is evaluating the FROM clause. The cross join operates in the FROM clause, performing a Cartesian product between the two instances.

Fortunately, SQL Server doesn’t have to follow logical query processing literally as long as it can return the correct result. That’s what optimization is all about—returning the result as fast as possible. SQL Server knows that with a cross join followed by a filter it can evaluate the filters first (which is especially efficient when there are indexes to support the filters), and then match the remaining rows.

Inner Joins

With an inner join, you can match rows from two tables based on a predicate—usually one that compares a primary key value in one side to a foreign key value in another side.

An INNER JOIN merges ONLY the matching rows in BOTH tables.
A JOIN without any other JOIN keywords (like INNER, OUTER, LEFT, etc) is an INNER JOIN.

EXAM TIP
Often, when joining tables, you join them based on a foreign key–unique key relationship. For example, there’s a foreign key defined on the supplierid column in the Production. Products table (the referencing table), referencing the primary key column supplierid in the Production.Suppliers table (the referenced table). It’s also important to note that when you define a primary key or unique constraint, SQL Server creates a unique index on the constraint columns to enforce the constraint’s uniqueness property. But when you define a foreign key, SQL Server doesn’t create any indexes on the foreign key columns. Such indexes could improve the performance of joins based on those relationships. Because SQL Server doesn’t create such indexes automatically, it’s your responsibility to identify the cases where they can be useful and create them. So when working on index tuning, one interesting area to examine is foreign key columns, and evaluating the benefits of creating indexes on those.

A very common question is, “What’s the difference between the ON and the WHERE clauses, and does it matter if you specify your predicate in one or the other?” The answer is that for inner joins it doesn’t matter. Both clauses perform the same filtering purpose. Both filter only rows for which the predicate evaluates to true and discard rows for which the predicate evaluates to false or unknown. In terms of logical query processing, the WHERE is evaluated right after the FROM, so conceptually it is equivalent to concatenating the predicates with an AND operator. SQL Server knows this, and therefore can internally rearrange the order in which it evaluates the predicates in practice, and it does so based on cost estimates.

For many people, though, it’s intuitive to specify the predicate that matches columns from both sides in the ON clause, and predicates that filter columns from only one side in the WHERE clause. But again, with inner joins it doesn’t matter. In the discussion of outer joins in the next section, you will see that, with those, ON and WHERE play different roles; you need to figure out, according to your needs, which is the appropriate clause for each of your predicates.

As with cross joins, both standard SQL and T-SQL support an older syntax for inner joins where you specify a comma between the table names, and then all predicates in the WHERE clause. But as mentioned, it is considered a best practice to stick to the newer syntax with the JOIN keyword. When using the older syntax, if you forget to indicate the join predicate, you end up with an unintentional cross join. When using the newer syntax, an inner join isn’t valid syntactically without an ON clause, so if you forget to indicate the join predicate, the parser will generate an error.

Because an inner join is the most commonly used type of join, the standard decided to make it the default in case you specify just the JOIN keyword.

Outer Joins

With outer joins, you can request to preserve all rows from one or both sides of the join, never mind if there are matching rows in the other side based on the ON predicate.
By using the keywords LEFT OUTER JOIN (or LEFT JOIN for short), you ask to preserve the left table. The join returns what an inner join normally would—that is, matches (call those inner rows). In addition, the join also returns rows from the left that had no matches in the right table (call those outer rows), with NULLs used as placeholders in the right side.

FULL OUTER JOIN returns matched and unmatched rows from both tables (it’s an union of both).
If there is no match, the missing side will contain null.

It is very important to understand that, with outer joins, the ON and WHERE clauses play very different roles, and therefore, they aren’t interchangeable. The WHERE clause still plays a simple filtering role—namely, it keeps true cases and discards false and unknown cases. In our query, the WHERE clause filters only suppliers from Japan, so suppliers that aren’t from Japan simply don’t show up in the output.

However, the ON clause doesn’t play a simple filtering role; rather, it’s more a matching role. In other words, a row in the preserved side will be returned whether the ON predicate finds a match for it or not. So the ON predicate only determines which rows from the nonpreserved side get matched to rows from the preserved side—not whether to return the rows from the preserved side. In our query, the ON clause matches rows from both sides by comparing their supplier ID values. Because it’s a matching predicate (as opposed to a filter), the join won’t discard suppliers; instead, it only determines which products get matched to each supplier. But even if a supplier has no matches based on the ON predicate, the supplier is still returned. In other words, ON is not final with respect to the preserved side of the join. WHERE is final. So when in doubt whether to specify the predicate in the ON or WHERE clauses, ask yourself: Is the predicate used to filter or match? Is it supposed to be final or nonfinal?

Multi-Join Queries

It’s important to remember that a join in T-SQL takes place conceptually between two tables at a time. A multi-join query evaluates the joins conceptually from left to right. So the result of one join is used as the left input to the next join. If you don’t understand this, you can end up with logical bugs, especially when outer joins are involved. (With inner and cross joins, the order cannot affect the meaning.)

Using Subqueries, Table Expressions, and the APPLY Operator

T-SQL supports nesting of queries. This is a convenient part of the language that you can use to refer to one query’s result from another. You do not need to store the result of one query in a variable in order to be able to refer to that result from another query.

Subqueries

Subqueries can be self-contained—namely, independent of the outer query; or they can be correlated—namely, having a reference to a column from the table in the outer query. In terms of the result of the subquery, it can be scalar, multi-valued, or table-valued.

Self-Contained Subqueries

Self-contained subqueries are subqueries that have no dependency on the outer query. If you want, you can highlight the inner query and run it independently. This makes the troubleshooting of problems with self-contained subqueries easier compared to correlated subqueries.

Subqueries that return a single value, or scalar subqueries, can be used where a single-valued expression is expected, like in one side of a comparison.

Note that if what’s supposed to be a scalar subquery returns in practice more than one value, the code fails at run time. If the scalar subquery returns an empty set, it is converted to a NULL.
A subquery can also return multiple values in the form of a single column. Such a subquery can be used where a multi-valued result is expected—for example, when using the IN predicate.

Correlated Subqueries

Correlated subqueries are subqueries where the inner query has a reference to a column from the table in the outer query. They are trickier to work with compared to self-contained subqueries because you can’t just highlight the inner portion and run it independently.

Table Expressions

Table expressions are named queries. You write an inner query that returns a relational result set, name it, and query it from an outer query. T-SQL supports four forms of table expressions:

  • Derived tables
  • Common table expressions (CTEs)
  • Views
  • Inline table-valued functions

The first two are visible only to the statement that defines them. As for the last two, you preserve the definition of the table expression in the database as an object; therefore, it’s reusable, and you can also control access to the object with permissions.
Note that because a table expression is supposed to represent a relation, the inner query defining it needs to be relational. This means that all columns returned by the inner query must have names (use aliases if the column is a result of an expression), and all column names must be unique. Also, the inner query is not allowed to have an ORDER BY clause.
(Remember, a set has no order.) There’s an exception to the last rule: If you use the TOP or OFFSET-FETCH option in the inner query, the ORDER BY serves a meaning that is not related to presentation ordering; rather, it’s part of the filter’s specification. So if the inner query uses the TOP or OFFSET-FETCH option, it’s allowed to have an ORDER BY clause as well. But then the outer query has no presentation ordering guarantees if it doesn’t have its own ORDER BY clause.

IMPORTANT Optimization of Table Expressions
It’s important to note that, from a performance standpoint, when SQL Server optimizes queries involving table expressions, it first unnests the table expression’s logic, and therefore interacts with the underlying tables directly. It does not somehow persist the table expression’s result in an internal work table and then interact with that work table. This means that table expressions don’t have a performance side to them—neither good nor
bad—just no side.

Derived Tables

A derived table is probably the form of table expression that most closely resembles a subquery—only a subquery that returns an entire table result. You define the derived table’s inner query in parentheses in the FROM clause of the outer query, and specify the name of the derived table after the parentheses.

CTEs

A common table expression (CTE) is a similar concept to a derived table in the sense that it’s a named table expression that is visible only to the statement that defines it. Like a query against a derived table, a query against a CTE involves three main parts:

  • The inner query
  • The name you assign to the query and its columns
  • The outer query

However, with CTEs, the arrangement of the three parts is different. Recall that with derived tables the inner query appears in the FROM clause of the outer query—kind of in the middle of things. With CTEs, you first name the CTE, then specify the inner query, and then the outer query—a much more modular approach.

You don’t nest CTEs like you do derived tables. If you need to define multiple CTEs, you simply separate them by commas. Each can refer to the previously defined CTEs, and the outer query can refer to all of them. After the outer query terminates, all CTEs defined in that WITH statement are gone. The fact that you don’t nest CTEs makes it easier to follow the logic and therefore reduces the chances for errors. For example, if you want to refer to one CTE from another, you can use the following general form.

Because the CTE name is assigned before the start of the outer query, you can refer to multiple instances of the same CTE name, unlike with derived tables.

CTEs also have a recursive form. The body of the recursive query has two or more queries, usually separated by a UNION ALL operator. At least one of the queries in the CTE body, known as the anchor member, is a query that returns a valid relational result. The anchor query is invoked only once. In addition, at least one of the queries in the CTE body, known as the recursive member, has a reference to the CTE name. This query is invoked repeatedly until it returns an empty result set. In each iteration, the reference to the CTE name from the recursive member represents the previous result set. Then the reference to the CTE name from the outer query represents the unified results of the invocation of the anchor member and all invocations of the recursive member.

Views and Inline Table-Valued Functions

Derived tables and CTEs are table expressions that are visible only in the scope of the statement that defines them. After that statement terminates, the table expression is gone. Hence, derived tables and CTEs are not reusable. For reusability, you need to store the definition of the table expression as an object in the database, and for this you can use either views or inline table-valued functions. Because these are objects in the database, you can control access by using permissions.
The main difference between views and inline table-valued functions is that the former doesn’t accept input parameters and the latter does. As an example, suppose you need to persist the definition of the query with the row number computation from the examples in the previous sections.

As for inline table-valued functions, they are very similar to views in concept; however, as mentioned, they do support input parameters. So if you want to define something like a view with parameters, the closest you have is an inline table-valued function.

APPLY

The APPLY operator is a powerful operator that you can use to apply a table expression given to it as the right input to each row from a table expression given to it as the left input. What’s interesting about the APPLY operator as compared to a join is that the right table expression can be correlated to the left table; in other words, the inner query in the right table expression can have a reference to an element from the left table. So conceptually, the right table expression is evaluated separately for each left row. This means that you can replace the use of cursors in some cases with the APPLY operator.

CROSS APPLY

The CROSS APPLY operator operates on left and right table expressions as inputs. The right table expression can have a correlation to elements from the left table. The right table expression is applied to each row from the left input. What’s special about the CROSS APPLY operator as compared to OUTER APPLY is that if the right table expression returns an empty set for a left row, the left row isn’t returned.

OUTER APPLY

The OUTER APPLY operator does what the CROSS APPLY operator does, but also includes in the result rows from the left side that get an empty set back from the right side. NULLs are used as placeholders for the result columns from the right side. In other words, the OUTER APPLY operator preserves the left side. In a sense, the difference between OUTER APPLY and CROSS APPLY is similar to the difference between a LEFT OUTER JOIN and an INNER JOIN.

Using Set Operators

Set operators operate on two result sets of queries, comparing complete rows between the results. Depending on the result of the comparison and the set operator used, the operator determines whether to return the row or not. T-SQL supports three set operators:

  • UNION
  • INTERSECT
  • EXCEPT
  • it also supports one multi-set operator: UNION ALL.

Working with set operators follows a number of guidelines:

  • Because complete rows are matched between the result sets, the number of columns in the queries has to be the same and the column types of corresponding columns need to be compatible (implicitly convertible).
  • Set operators consider two NULLs as equal for the purpose of comparison. This is quite unusual when compared to filtering clauses like WHERE and ON.
  • Because the operators are set operators and not cursor operators, the individual queries are not allowed to have ORDER BY clauses.
  • You can optionally add an ORDER BY clause that determines presentation ordering of the result of the set operator.
  • The column names of result columns are determined by the first query.

The function of SQL set operators is pretty simple; they allow us to combine results from different SQL queries into one result set.

The best way to understand how set operators work is to use visual methods like the Venn diagram. For those of you not familiar with Venn diagrams, they are two circles that represent items or collections of items. The area where the circles overlap shows what the two groups have in common.

UNION and UNION ALL

The UNION set operator unifies the results of the two input queries. As a set operator, UNION has an implied DISTINCT property, meaning that it does not return duplicate rows.

The UNION operator takes two sets and returns the union of those sets.

The important thing here is that if the same person is in both tables the UNION statement will display only one instance of that person.

UNION is closely related to UNION. The only difference is that  UNION ALL keeps the number of records as found in all tables.

IMPORTANT UNION vs. UNION ALL
If the sets you’re unifying are disjoint and there’s no potential for duplicates, UNION and UNION ALL will return the same result. However, it’s important to use UNION ALL in such a case from a performance standpoint because with UNION, SQL Server may try to eliminate duplicates, incurring unnecessary cost.

Note:
In terms of execution time, UNION ALL is much faster than  UNION. This is because UNION sorts the data to eliminate any duplicates and sorting takes time. UNION ALL does not sort data.

INTERSECT

The INTERSECT operator returns only distinct rows that are common to both sets. In other words, if a row appears at least once in the first set and at least once in the second set, it will appear once in the result of the INTERSECT operator.

The Venn diagram for this kind of question looks like this (the darker part is where the intersection occurs):

EXCEPT

The EXCEPT operator performs set difference. It returns distinct rows that appear in the first query but not the second. In other words, if a row appears at least once in the first query and zero times in the second, it’s returned once in the output.

The EXCEPT statement takes things away. It reduces the result set from the first table by taking away records found in the second table. EXCEPT tells us who or what is not in a table. For example, it would answer the question “Who is a celebrity but not a politician?”

The Venn diagram of the EXCEPT statement for our politicians (A) and celebrities (B) sets would look something like this:

With UNION and INTERSECT, the order of the input queries doesn’t matter. However, with EXCEPT, there’s different meaning to EXCEPT vs. EXCEPT .

Finally, set operators have precedence: INTERSECT precedes UNION and EXCEPT, and UNION and EXCEPT are considered equal.

Leave a Reply