# Advanced SQL Server masterclass for data analytics

I thought this course gave a good overview of topics that aren't typically covered in a basic SQL course. There were also plenty of practical examples and exercises.

Some of the topics covered:

* Window functions
* Correlated subqueries
* EXISTS and NOT EXISTS
* PIVOT
* Generating data series with Recursive CTEs
* Using CTEs and temporary tables to break processes into manageable steps
* Defining and manipulating tables with DDL and DML
* Lookup tables
* SQL optimization techniques
* Procedural programming techniques like variables and IF statements
* User-defined functions
* Stored procedures
* Dynamic SQL

## Notes

**Window functions** perform calculations across a set of rows that are related to the current row. They operate on a window of data, which is a set of rows defined by a *partition* and an *ordering*. This allows for things like running totals, rankings, and moving averages without the need for self-joins or subqueries.

**Scalar** subqueries return a single value, while **table-valued** subqueries return multiple rows and columns. (You can also have column subqueries and row subqueries.)

**Correlated** subqueries depend on the outer query for their values. They are executed once for each row processed by the outer query.

**Common table expressions (CTEs)** are named temporary result sets. They can improve the readability and maintainability of complex queries by breaking them down into simpler parts. **Recursive** CTEs are useful for generating sequences.

You can also store intermediate steps in **temporary tables**. These allow you to persist the data for the duration of the session or to apply an index. They also help with debugging as you can check the values directly, but they can't be used recursively. You would use temporary tables over CTEs for large datasets or in cases where the same intermediate result is used multiple times/across multiple queries.

An **index** is a data structure that allows the database engine to locate rows more easily. While indexes can significantly speed up SELECT queries, they can slow down INSERT, UPDATE, and DELETE operations.

A **clustered** index determines the order in which the records are physically stored. This makes them useful for [range queries](#user-content-fn-1)[^1], but means that each table can only have one clustered index. A **non-clustered** index is stored separately with pointers to the data. This means that you can have multiple such indices and that they have less of an impact on INSERT, UPDATE, and DELETE operations, though they require additional storage.

**Hash** joins use a hash table to find matching rows, while **merge** joins sort both tables and iterate through them simultaneously. This means that merge joins are more memory efficient, but they may be slower if the tables aren't already sorted.

**Concurrency control** is important to ensure data integrity and consistency. Some potential issues:

* **Lost update:** One transaction's changes are overwritten by another.
* **Dirty read:** Reading uncommitted data from another transaction.
* **Non-repeatable read:** Getting different results from repeated reads within a transaction.
* **Phantom read:** New rows appear in a repeated read.

**Two-phase locking** ensures that all locks are acquired before any are released, which guarantees **serializability** but can lead to **deadlocks**. **Optimistic** concurrency control assumes conflicts are rare and checks for conflicts before committing. If a conflict is detected, the transaction is rolled back. **Multiversion** concurrency control keeps multiple copies of the data.

**Views** are virtual tables based on the result set of an SQL statement. They **encapsulate** queries and provide a layer of abstraction over the underlying table structure. They can simplify data access, enforce security by restricting access to certain columns or rows, and provide a consistent interface to the data even if the underlying table structure changes. **Materialized** (or **indexed**) views store the results, which can be useful from a performance perspective. **Updatable** views allow INSERT, UPDATE, and DELETE statements.

An **execution plan** is a step-by-step breakdown of how the database engine will execute a given query. It shows the operations performed, their order, and the estimated cost of each step, which is useful for query optimization.

**Variables** can be used to store intermediate results, control program flow, or parameterize queries. They are particularly useful in stored procedures and functions for creating more dynamic and flexible code.

**User-defined functions (UDFs)** are custom functions that encapsulate commonly used logic or calculations. They promote code reuse, improve maintainability, and can simplify complex queries, but can sometimes impact query performance.

**Stored procedures** are precompiled collections of SQL statements with optional control-flow statements.

> Functions are more about "getting something", while stored procedures are more about "doing something".

**Dynamic** SQL statements are constructed at runtime, often by concatenating strings. This allows for more flexible queries that can adapt based on user input or changing conditions, but can make queries harder to debug and maintain. It is important to parameterize them or apply input validation in order to prevent SQL injection attacks. Dynamic SQL is often used in scenarios where the exact structure of a query isn't known until runtime.

## Additional comments from Claude

I asked Claude 3.5 Sonnet if I'd missed anything or got anything wrong. Its suggestions are below. I don't recall covering indexes being mentioned in the course, but it did discuss the other points.

1. Subqueries: You could add that scalar subqueries can be used in SELECT, WHERE, and HAVING clauses, while table-valued subqueries are often used in FROM clauses or with IN operators.
2. Indexes: It might be worth mentioning that covering indexes (which include all columns needed for a query) can be particularly beneficial for performance.
3. Concurrency Control: You could add that these issues are often managed through different isolation levels in database transactions.
4. Views: It's worth noting that not all views are updatable; it depends on the complexity of the view definition.
5. Variables, UDFs, and Stored Procedures: It's worth noting that the exact implementation and capabilities can vary between different database systems.

[^1]: These retrieve data within a specific range of values. Efficient indexing has a significant impact on their performance.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://www.raoulharris.com/technical-courses/advanced-sql-server-masterclass-for-data-analytics.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
