# NOTES | Foundations of Querying

## Foundations of Querying

T-SQL is the main language used to manage and manipulate data in Microsoft’s main relational database management system (RDBMS), SQL Server—whether on premises or in the cloud (Microsoft Windows Azure SQL Database). It all stems from the mathematical principals of set theory and predicate logic.

T-SQL is a dialect of standard SQL and therefore the core elements look the same. However, all leading database vendors decides which features to implement and which to not. Also, each vendor usually implements extensions to the standard in cases where the vendor feels an important feature is not covered by the standard.

Writing in a standard way is considered best practice.

Standard SQL is based on the relational model, which is a mathematical model for data management and manipulation. The relational model was initially created and proposed by Edgar F. Codd in 1969. Since then, it has been explained and developed by Chris Date, Hugh Darwen, and others.

A common misconception is that the name “relational” has to do with relationships between tables (that is, foreign keys). Actually, the true source for the model’s name is the mathematical concept relation. A relation in the relational model is what SQL calls a table. The two are not synonymous.

SQL attempts to represent with a table: a relation has a heading and a body.

• The heading is a set of attributes (what SQL attempts to represent with columns), each of a given type. An attribute is identified by name and type name.
• The body is a set of tuples (what SQL attempts to represent with rows). Each tuple’s heading is the heading of the relation. Each value of each tuple’s attribute is of its respective type.

What this translates to in T-SQL is that you’re supposed to write queries that interact with the tables as a whole.

Some of the most important principals to understand about T-SQL stem from the relational model’s core foundations—set theory and predicate logic. Remember the heading of a relation is a set of attributes, and the body a set of tuples.

So what is a set?

According to the creator of mathematical set theory, Georg Cantor, a set is described as follows:

By a “set” we mean any collection M into a whole of definite, distinct objects m (which are called the “elements” of M) of our perception or of our thought.

—George Cantor, in “Georg Cantor” by Joseph W. Dauben

(Princeton University Press, 1990)

A set should be considered as a whole. This means that you do not interact with the individual elements of the set, rather with the set as a whole. Notice the term distinct—a set has no duplicates. A set has no duplicates and the set {a, a, b} is considered to be the same as the set {a, b}. There is also no order to a set and {b, a, c} is considered to be the same as {a, b, c}.

Another fundamental aspect of a set is that there’s no relevance to the order of the elements. For this reason, rows in a table have no particular order, conceptually. So when you issue a query against a table and don’t indicate explicitly that you want to return the rows in particular presentation order, the result is supposed to be relational. Therefore, you shouldn’t assume any specific order to the rows in the result, never mind what you know about the physical representation of the data, for example, when the data is indexed.

The other branch of mathematics that the relational model is based on is called predicate logic. A predicate is an expression, that when attributed to some object, makes a proposition either true or false (two-valued logic) following the law of excluded middle in mathematical logic. For example, “salary greater than \$50,000” is a predicate. You can evaluate this predicate for a specific employee, in which case you have a proposition. For example, suppose that for a particular employee, the salary is \$60,000. When you evaluate the proposition for that employee, you get a true proposition.

The relational model uses predicates as one of its core elements. You can enforce data integrity by using predicates. You can filter data by using predicates. You can even use predicates to define the data model itself.

Then, there is the concept of NULLs in SQL. Codd, the creator of the relational model, believed in the idea of supporting missing values, and predicates that extend beyond the two-valued logic.

For detailed information about the differences between SQL and the relational model and how to use SQL in a relational way, see SQL and Relational Theory, Second Edition by C. J. Date (O’Reilly Media, 2011).

It’s an excellent book that all database practitioners should read.

### Using Correct Terminology

One last point of the chapter is the proper use of T-SQL terminology and incorrect terms.

As an example of incorrect terms in T-SQL, people often use the terms “field” and “record” to refer to what T-SQL calls “column” and “row,” respectively. Fields and records are physical. Fields are what you have in user interfaces in client applications, and records are what you have in files and cursors. Tables are logical, and they have logical rows and columns.

Another example of an incorrect term is referring to “NULL values.” A NULL is a mark for a missing value—not a value itself. Hence, the correct usage of the term is either “NULL mark” or just “NULL.”

#### Database Concepts

Databases

• a collection of objects

Tables

• Store data
• Core component to all databases

#### Database Objects

Objects work with table data

• Views are virtual tables
• simply structure
• Stored Procedures are commands that can be executed
• make changes
• return data
• Functions are commands that can be executed
• return data

## Understanding Logical Query Processing

### T-SQL as a Declarative English-Like Language

T-SQL, being based on standard SQL, is a declarative English-Like language meaning you define what you want, as opposed to imperative languages that define also how to achieve what you want.

### Logical Query Processing Phases

The main statement used to retrieve data in T-SQL is the SELECT statement. Following the main query clauses specified in the order that you are supposed to type them (known as “keyed-in order”):

1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY

But, the logical query processing order, which is conceptual interpretation order, is different. It starts with the FROM clause. Here is the logical query processing order of the six main query clauses:

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY

Each phase operates on one of more tables as inputs and returns a virtual table as an output. The output table of one phase is considered the input to the next phase. This is in accord with operations on relations that yield a relation. NOTE: If an ORDER BY is specified, the result is NOT relational.

#### Evaluate the FROM Clause

In this first phase, the FROM clause is evaluated. This is where you indicate the tables you want to query and the tables operators like joins if applicable. If you need to query just one table, you indicate the table name as the input table in this clause. Then, the output of this phase is a table result with all rows from the input table.

#### Filter Rows Based on the WHERE Clause

The second phase filters rows based on the predicate in the WHERE clause. Only rows for which the predicate evaluates to true are returned.

EXAM TIP

Rows for which the predicate evaluates to false, or evaluates to an unknown state, are not returned.

#### Group Rows Based on the GROUP BY clause

This phase defines a group for each distinct combination of values in the grouped elements from the input table. It then associates each input row to its respective group.

#### Filter Rows Based on the HAVING Clause

This phase is also responsible for filtering data based on a predicate, but it is evaluated after the data has been grouped; hence, it is evaluated per group and filters groups as a whole. As is usual in T-SQL, the filtering predicate can evaluate to true, false, or unknown.

#### Process the SELECT Clause

The fifth phase is the one responsible for processing the SELECT clause. What’s interesting about it is the point in logical query processing where it gets evaluated – almost last.

This phase includes two main steps. The first step is evaluating the epressions in the SELECT list and producing the result attributes. This includes assigning attributes with names if they are derived from expressions. Remember that if a query is a grouped query, each group is represented by a single row in the result. in the query, two gropus remain after the processing of the HAVING filter. Therefore, this step generates two rows.

The second step in this phase is applicable IF you indicate the DISTINCT clause, in which case this step removes duplicates. Remember: T-SQL is based on multi-set theory more than it is on set theory, and therefore, if duplicates are possible in the result, it’s your responsibility to remove those with the DISTINCT clause.

The fifth phase returns a relational result. Therefore, the order of the rows isn’t guaranteed. The ORDER BY clause guarantees the order in the result, but this will be discussed when the next phase is described. What’s important to note is that the outcome of the phase that processes the SELECT clause is still relational. Also, remember this phase assigns column aliases.

#### Handle Presentation Ordering

The sixth phase is applicable if the query has an ORDER BY clause. This phase is responsible for returning the result in a specific presentation order according to the expressions that appear in the ORDER BY list.

Unlike in previous phases where the result was relational, the output of this phase is NOT relational because it has a guaranteed order. This result of this phase is what standard SQL calls a cursor. Note that the use of the term cursor here is conceptual. T-SQL also supports an object called a cursor that is defined based on a result of a query, and that allows fetching rows one at a time in a specific order.

NOTE: Such processing is NOT relational. If you process the query result in a relational manner, the result will be relational. Also, sorting data can add cost to the query processing so avoid this unnecessary cost if you do not care about the order in which the result rows are returned.