NOTES | Filtering and Sorting Data

NOTES | Filtering and Sorting Data

While the traditional way to filter data in T-SQL is based on predicates, it also supports filtering data based on a specified number of rows and ordering. To do so, use TOP and OFFSET.

Filtering Data with Predicates

ON, WHERE, and HAVING are the three query clauses which enable your filtering of data based on predicates as supported by T-SQL.

Predicates, Three-Valued Logic, and Search Arguments

Not all filtering aspects are obvious, but you need to understand how to form a predicate to maximize query efficiency and how predicates interact with NULLs; for this, you will need to be familiar with the concept of search argument.

Recall from “Foundations of Querying,” that a predicate is a logical expression. When NULLs are not possible in the data, the predicate can evaluate to true or false. The type of logic used in such a case is known as two-valued logic. The WHERE filter returns only the rows for which the predicate evaluates to true.

However, when NULLs are possible in the data, things get trickier. For example, on some locations, three points of data are possible, whereas it may not be the case elsewhere in the world. Example:

Country: USA

Region: WA

City: Seattle

Country: UK

Region: NULL

City: London

When NULLs are possible in the data, a predicate can evaluate to true, false, and unknown. This type of logic is known as three-valued logic. When using an equality operator in the predicate like in the previous query, you get true when both operands are not NULL and equal; for example, WA and WA. You get false when both are not NULL and different; for example, OR and WA. So far, it’s straightforward. The tricky part is when NULL marks are involved. You get an unknown when at least one operand is NULL; for example, NULL and WA, or even NULL and NULL.

The WHERE filter returns rows for which the predicate evaluates to true,
meaning that it discards both false and unknown cases. Therefore, the query returns only employees where the region is not NULL and equal to WA.

As it turns out, all of the employees that aren’t from Washington State are from the UK; recall that the region for places in the UK is set to NULL to indicate that it’s inapplicable. Even though it may be clear to you that someone from the UK isn’t from Washington State, it’s not clear to T-SQL. To T-SQL, a NULL represents a missing value that could be applicable, and
could be WA just like it could be anything else. So it cannot conclude with certainty that the region is different from WA.

Query filters have an important performance side to them. For one thing, by filtering rows in the query (as opposed to in the client), you reduce network traffic. Also, based on the query filters that appear in the query, SQL Server can evaluate the option to use indexes to get to the data efficiently without requiring a full scan of the table. It’s important to note, though, that the predicate needs to be of a form known as a search argument (SARG) to allow efficient use of the index.

A predicate in the form column operator value or value operator column can be a search argument. Applying manipulation to the filtered column in most cases prevents the predicate from being a search argument.

Exam Tip

Understanding the impact of using COALESCE and ISNULL on performance is an important skill for the exam.

Combining Predicates

You can combine predicates in the WHERE clause by using the logical operators AND and OR. You can also negate predicates by using the NOT logical operator. This section starts by describing important aspects of negation and then discusses combining predicates.
Negation of true and false is straightforward—NOT true is false, and NOT false is true.
What can be surprising to some is what happens when you negate unknown—NOT unknown is still unknown.

As for combining predicates, there are several interesting things to note. Some precedence rules determine the logical evaluation order of the different predicates. The NOT operator precedes AND and OR, and AND precedes OR.

Filtering Character Data

In many respects, filtering character data is the same as filtering other types of data. This section covers a couple of items that are specific to character data: proper form of literals and the LIKE predicate.

If you write an expression that involves operands of different types, SQL Server will have to apply implicit conversion to align the types. Depending on the circumstances, implicit conversions can sometimes hurt performance. It is important to know the proper form of literals of different types and make sure you use the right ones. A classic example for using incorrect literal types is with Unicode character strings (NVARCHAR and NCHAR types). The right form for a Unicode character string literal is to prefix the literal with a capital N and delimit the literal with single quotation marks; for example, N’literal’. For a regular character string literal, you just delimit the literal with single quotation marks; for example, ‘literal’. It’s a very typical bad habit to specify a regular character string literal when the filtered column is of a Unicode type.

T-SQL provides the LIKE predicate, which you can use to filter character string data (regular and Unicode) based on pattern matching. The LIKE predicate supports wildcards that you can use in your patterns.

Filtering Date and Time Data

There are several important considerations when filtering date and time data that are related to both the correctness of your code and to its performance. You want to think of things like how to express literals, filter ranges, and use search arguments.

First, let’s start with literals. Date formatting from culture to culture can be different; the format 02/12/07 probably means to American’s February 12, 2007 whereas to Japanese this format probably means December 2, 2007. The question is, when SQL Server converts this character string to a date and time type to align it with the filtered column’s type, how does it interpret the value?

As it turns out, it depends on the language of the logon that runs the code. Each logon has a default language associated with it, and the default language sets various session options on the logon’s behalf, including one called DATEFORMAT.

One approach to deal with this issue is to use a form that is considered language-neutral.
For example, the form ‘20070212’ is always interpreted as ymd, regardless of your language.
Note that the form ‘2007-02-12’ is considered language-neutral only for the data types DATE, DATETIME2, and DATETIMEOFFSET. Unfortunately, due to historic reasons, this form is considered language-dependent for the types DATETIME and SMALLDATETIME. The advantage of the form without the separators is that it is language-neutral for all date and time types.

Note Storing Dates in a DATETIME Column
The filtered column orderdate is of a DATETIME data type representing both date and time.
Yet the literal specified in the filter contains only a date part. When SQL Server converts the literal to the filtered column’s type, it assumes midnight when a time part isn’t indicated.
If you want such a filter to return all rows from the specified date, you need to ensure that you store all values with midnight as the time.

Another approach is to use the CONVERT or PARSE functions, which you can use to indicate how you want SQL Server to interpret the literal that you specify. The CONVERT function supports a style number representing the conversion style, and the PARSE function supports indicating a culture name.

Understanding When Order Is Guaranteed

The only way to truly guarantee that the rows are returned from a query in a certain order is by adding an ORDER BY clause.

If you don’t indicate a direction for sorting, ascending order is assumed by default. You can be explicit and specify city ASC, but it means the same thing as not indicating the direction. For descending ordering, you need to explicitly specify DESC.

Just like a query without an ORDER BY clause doesn’t guarantee order among result rows in general, a query with ORDER BY city, when city isn’t unique, doesn’t guarantee order among rows with the same city. Fortunately, you can specify multiple expressions in the ORDER BY list, separated by commas. One use case of this capability is to apply a tiebreaker for ordering. For example, you could define empid as the secondary sort column, as follows.

SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth
FROM HR.Employees
WHERE country = N'USA' AND region = N'WA'
ORDER BY city, empid;

The ORDER BY list is now unique; hence, the ordering is deterministic. If the underlying data doesn’t change, the results are guaranteed to be repeatable, in addition to their presentation ordering. You can indicate the ordering direction on an expression-by-expression basis, as in ORDER BY col1 DESC, col2, col3 DESC (col1 descending, then col2 ascending, then col3 descending).

With T-SQL, you can sort by ordinal positions of columns in the SELECT list, but it is considered a bad practice.

So, remember, a query without an ORDER BY clause returns a relational result (at least from an ordering perspective), and hence doesn’t guarantee any order. The only way to guarantee order is with an ORDER BY clause. According to standard SQL, a query with an ORDER BY clause conceptually returns a cursor and not a relation.

Suffice it to say that creating the right indexes can help SQL Server avoid the need to sort the data to address an ORDER BY request. Without good indexes, SQL Server needs to sort the data, and sorting can be expensive, especially when a large set is involved. If you don’t need to return the data sorted, make sure you do not specify an ORDER BY clause, to avoid unnecessary costs.

Filtering Data with TOP and OFFSET-FETCH

T-SQL provides two options to handle mixing filtering and sorting data by a given order and a specified number of rows. One is the proprietary TOP option and the other is the standard SQL OFFSET-FETCH option.

Filtering Data with TOP

Using the TOP option, you can filter a requested number or percent of rows from the query result based on indicated ordering. By specifying the TOP option in the SELECT clause followed by the requested number of rows in parentheses (BIGINT data type). The ordering specification of the TOP filter is based on the same ORDER BY clause that is normally used for presentation ordering.

The following query returns the five most recent orders.

SELECT TOP (5) orderid, orderdate, custid, empid
 FROM Sales.Orders
 ORDER BY orderdate ASC;

Specifying 5 as the number of rows you want to filter, and the orderdate as ASC for the ordering specification, you get the five rows with the oldest order dates ascending to the newest.

Here’s the output of this query.

Note TOP and Parentheses
T-SQL supports specifying the number of rows to filter using the TOP option in SELECT queries without parentheses, but that’s only for backward-compatibility reasons. The correct syntax is with parentheses.

In addition, you can specify a percent of rows to filter instead of a number. To do so, specify a FLOAT value in the range 0 through 100 in the parentheses, and the keyword PERCENT after the parentheses, as follows. The PERCENT option puts a ceiling on the resulting number of rows if it’s not whole.

Filtering Data with OFFSET-FETCH

Like TOP, the OFFSET-FETCH option is a filtering option you can use to filter data based on a specified number of rows and ordering. However, unlike TOP, it is standard, and has a skipping capability, making it useful for ad-hoc paging purposes.

The OFFSET and FETCH clauses appear right after the ORDER BY clause, and in fact, in T-SQL, they require an ORDER BY clause to be present. You first specify the OFFSET clause indicating how many rows you want to skip (0 if you don’t want to skip any); you then optionally specify the FETCH clause indicating how many rows you want to filter.

The ORDER BY clause now plays two roles: one role is determining the presentation ordering in the query and the other is telling the OFFSET-FETCH option which rows it needs to filter.

Once again, in T-SQL, the OFFSET-FETCH option requires an ORDER BY clause to be present. Also, in T-SQL—contrary to standard SQL—a FETCH clause requires an OFFSET clause to be present. So if you do want to filter some rows but skip none, you still need to specify the OFFSET clause with 0 ROWS.

Because the OFFSET-FETCH option is standard and TOP isn’t, in cases where they are logically equivalent, it’s recommended to stick to the former. Remember that OFFSET-FETCH also has an advantage over TOP in the sense that it supports a skipping capability. However, for now, OFFSET-FETCH does not support options like TOP’s PERCENT and WITH TIES.

From a performance standpoint, you should evaluate indexing the ORDER BY columns to support the TOP and OFFSET-FETCH options. Such indexing serves a very similar purpose to indexing filtered columns and can help avoid scanning unnecessary data as well as sorting.

Leave a Reply