Wednesday, January 26, 2022

Common SQL query optimization interview questions and solutions

The five questions below cover some of the most common and impactful SQL query optimization scenarios you'll encounter in interviews. For each one, be ready to explain why the original query is problematic and how the optimized version addresses those issues.


Question 1: Optimize a SELECT * query for performance.

Scenario: You have a Customers table with millions of rows, and you're asked to retrieve customer information.

Original Query:

SQL
SELECT *
FROM Customers
WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01';

Optimization:

Optimized Query 1 (Specific Columns):

SQL
SELECT customer_id, first_name, last_name, email, registration_date
FROM Customers
WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01';

Optimized Query 2 (With Index):

SQL
-- First, ensure an index exists on registration_date
-- If not, you'd recommend adding it:
-- CREATE INDEX idx_customers_registration_date ON Customers (registration_date);

SELECT customer_id, first_name, last_name, email, registration_date
FROM Customers
WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01';

Explanation:

  • SELECT * vs. Specific Columns:
    • Problem with SELECT *: When you select all columns, the database has to read and return every column of each matching row. If there are many columns, especially large ones (e.g., VARCHAR(MAX), TEXT, BLOB), this increases disk I/O, network traffic between the database server and the application, and memory consumption on both ends. It also rules out index-only (covering) access, because no practical index contains every column of the table.
    • Solution: Select only the columns you actually need. This reduces the amount of data read, transferred, and processed.
  • Indexing registration_date:
    • Problem without Index: Without an index on registration_date, the database would likely perform a full table scan, reading every row in the Customers table to find those that match the WHERE clause. This is extremely inefficient on large tables.
    • Solution: Creating an index on registration_date allows the database to quickly jump to the relevant range of dates in the index, then retrieve only the necessary rows from the table, significantly speeding up the query.
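
The effect of the index can be checked rather than assumed by asking the engine for its plan. The following is a minimal sketch using Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN; the notes column and the query_plan helper are hypothetical, and other engines (MySQL, PostgreSQL, SQL Server) expose the same information through their own EXPLAIN output with different wording.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the readable plan steps SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # column 3 holds the "detail" text


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customers (
        customer_id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        email TEXT,
        registration_date TEXT,
        notes TEXT  -- hypothetical wide column the query does not need
    )
""")

query = """
    SELECT customer_id, first_name, last_name, email, registration_date
    FROM Customers
    WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01'
"""

# Plan before the index exists: the only option is to read the whole table.
plan_before = query_plan(conn, query)

conn.execute(
    "CREATE INDEX idx_customers_registration_date "
    "ON Customers (registration_date)"
)

# Plan after the index exists: the date range can be located through the index.
plan_after = query_plan(conn, query)

print("without index:", plan_before)
print("with index:   ", plan_after)
```

In an interview, describing this before-and-after comparison is usually more convincing than asserting that the index helps.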

Question 2: Optimize a query with OR conditions.

Scenario: You need to find orders placed by a specific customer OR orders with a high value.

Original Query:

SQL
SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE customer_id = 12345 OR total_amount > 1000.00;

Optimization:

Optimized Query (using UNION ALL with separate WHERE clauses):

SQL
-- Ensure indexes exist on customer_id and total_amount
-- CREATE INDEX idx_orders_customer_id ON Orders (customer_id);
-- CREATE INDEX idx_orders_total_amount ON Orders (total_amount);

SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE customer_id = 12345

UNION ALL

SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE total_amount > 1000.00
  AND (customer_id <> 12345 OR customer_id IS NULL); -- Exclude already fetched rows, keep NULL customer_id

Explanation:

  • Problem with OR: When the two sides of an OR filter on different columns, no single index serves both. Many optimizers can combine two indexes for this (an "index merge" in MySQL, a bitmap OR in PostgreSQL), but depending on the engine and its statistics the optimizer may instead fall back to a full table scan.
  • Solution with UNION ALL: By splitting the query into two separate SELECT statements joined by UNION ALL, each SELECT can potentially use its own dedicated index (customer_id for the first part, total_amount for the second).
    • UNION ALL is preferred over UNION because UNION also performs a distinct operation over the combined result, which adds overhead. The second WHERE clause excludes customer 12345 so that an order matching both conditions is not returned twice. A bare customer_id <> 12345 would also drop rows where customer_id is NULL, because NULL <> 12345 evaluates to unknown rather than true; the OR customer_id IS NULL keeps those rows so the result matches the original query. If customer_id is declared NOT NULL, the extra check can be dropped. If duplicate rows are acceptable, you can omit the exclusion entirely and just use UNION ALL.
  • Indexing: Separate indexes on customer_id and total_amount are what make this rewrite pay off. Without them, each branch still scans the table and the rewrite reads it twice.
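
Any rewrite of this kind should return exactly the same rows as the original, and NULL handling is where it most easily goes wrong. The sketch below, which uses Python's sqlite3 module and a handful of made-up rows, compares the OR query with two UNION ALL variants: one that excludes customer 12345 with a bare <> and one that also keeps rows whose customer_id is NULL. The sample data and the ids helper are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,   -- nullable on purpose for this demo
        order_date TEXT,
        total_amount REAL
    )
""")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 12345, "2023-01-05", 50.0),    # matches the customer condition only
        (2, 12345, "2023-01-06", 2000.0),  # matches both conditions
        (3, 777, "2023-01-07", 1500.0),    # matches the amount condition only
        (4, None, "2023-01-08", 3000.0),   # NULL customer, high amount
        (5, 777, "2023-01-09", 10.0),      # matches neither condition
    ],
)

COLS = "order_id, customer_id, order_date, total_amount"

or_query = f"""
    SELECT {COLS} FROM Orders
    WHERE customer_id = 12345 OR total_amount > 1000.00
"""

# Second branch excludes customer 12345 with a bare <> comparison.
naive_union = f"""
    SELECT {COLS} FROM Orders WHERE customer_id = 12345
    UNION ALL
    SELECT {COLS} FROM Orders
    WHERE total_amount > 1000.00 AND customer_id <> 12345
"""

# Second branch also keeps rows whose customer_id is NULL.
null_safe_union = f"""
    SELECT {COLS} FROM Orders WHERE customer_id = 12345
    UNION ALL
    SELECT {COLS} FROM Orders
    WHERE total_amount > 1000.00
      AND (customer_id <> 12345 OR customer_id IS NULL)
"""


def ids(sql):
    """Return the sorted order_ids a query produces (duplicates preserved)."""
    return sorted(row[0] for row in conn.execute(sql))


or_ids = ids(or_query)
naive_ids = ids(naive_union)
null_safe_ids = ids(null_safe_union)

print("OR query:        ", or_ids)         # [1, 2, 3, 4]
print("bare <> rewrite: ", naive_ids)      # [1, 2, 3]  (order 4 is lost)
print("NULL-safe rewrite:", null_safe_ids)  # [1, 2, 3, 4]
```

Order 2 matches both conditions yet appears only once in the NULL-safe result, which is the behavior the exclusion in the second branch is meant to guarantee.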

Question 3: Optimize a query with COUNT(DISTINCT ...).

Scenario: You need to count the number of unique customers who placed orders in a specific month.

Original Query:

SQL
SELECT COUNT(DISTINCT customer_id)
FROM Orders
WHERE order_date >= '2023-03-01' AND order_date < '2023-04-01';

Optimization:

Optimized Query (using a subquery or GROUP BY):

SQL
-- Ensure an index on order_date is present:
-- CREATE INDEX idx_orders_order_date ON Orders (order_date);

-- Option 1: Using a subquery for distinct values (gives some optimizers more freedom, e.g. a hash aggregate)
SELECT COUNT(*)
FROM (
    SELECT DISTINCT customer_id
    FROM Orders
    WHERE order_date >= '2023-03-01' AND order_date < '2023-04-01'
      AND customer_id IS NOT NULL -- COUNT(DISTINCT ...) ignores NULLs; keep the results identical
) AS DistinctCustomers;

-- Option 2: Using GROUP BY and then counting the groups (can be equally or less efficient depending on DB)
-- SELECT COUNT(customer_id)
-- FROM (
--     SELECT customer_id
--     FROM Orders
--     WHERE order_date >= '2023-03-01' AND order_date < '2023-04-01'
--     GROUP BY customer_id
-- ) AS GroupedCustomers;

Explanation:

  • Problem with COUNT(DISTINCT ...): COUNT(DISTINCT ...) can be expensive because the database needs to collect all customer_id values for the specified order_date range, then sort them or use a hash table to identify unique values before counting. This often involves creating temporary tables or significant memory usage.
  • Solution with Subquery/GROUP BY:
    • The WHERE clause can still use an index on order_date to efficiently filter the initial set of rows.
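    • By first selecting DISTINCT customer_id in a subquery, the database deduplicates the filtered set, and the outer COUNT(*) then runs over that much smaller result. Many optimizers produce the same plan for both forms, so the gain is not guaranteed; the rewrite helps mainly on engines that can only evaluate COUNT(DISTINCT ...) with a sort but can use a hash aggregate for the subquery. Compare the execution plans before committing to either form.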
    • The GROUP BY approach achieves a similar result by grouping by customer_id and then counting the resulting groups. The optimizer might choose a hash aggregate or a sort-based aggregate for the GROUP BY.
  • Indexing (order_date, customer_id) (Composite/Covering Index):
    • For even better performance, a composite index on (order_date, customer_id) could be highly beneficial. This would allow the database to directly retrieve both columns from the index, potentially making the entire operation a "covering index scan" if no other columns are needed, avoiding access to the main table entirely for the DISTINCT operation.
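
Two things are worth verifying for this rewrite: that both forms return the same count, and that the composite index really covers the query. The sketch below does both with Python's sqlite3 module. The sample rows and the index name idx_orders_date_customer are assumptions for illustration; the subquery form filters out NULL customer_id values because COUNT(DISTINCT ...) ignores NULLs while SELECT DISTINCT keeps one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,   -- nullable on purpose for this demo
        order_date TEXT,
        total_amount REAL
    )
""")
conn.executemany(
    "INSERT INTO Orders (customer_id, order_date, total_amount) VALUES (?, ?, ?)",
    [
        (1, "2023-03-02", 20.0),
        (1, "2023-03-15", 35.0),    # repeat customer, counted once
        (2, "2023-03-20", 50.0),
        (3, "2023-03-31", 10.0),
        (None, "2023-03-10", 5.0),  # NULL customer_id, ignored by COUNT(DISTINCT)
        (4, "2023-02-27", 80.0),    # outside the month
    ],
)
conn.execute(
    "CREATE INDEX idx_orders_date_customer ON Orders (order_date, customer_id)"
)

MARCH = "order_date >= '2023-03-01' AND order_date < '2023-04-01'"

count_distinct_sql = f"SELECT COUNT(DISTINCT customer_id) FROM Orders WHERE {MARCH}"
count_subquery_sql = f"""
    SELECT COUNT(*)
    FROM (
        SELECT DISTINCT customer_id
        FROM Orders
        WHERE {MARCH} AND customer_id IS NOT NULL
    ) AS DistinctCustomers
"""

count_distinct = conn.execute(count_distinct_sql).fetchone()[0]
count_subquery = conn.execute(count_subquery_sql).fetchone()[0]

# Ask SQLite how it reads the data for the COUNT(DISTINCT ...) form.
plan = [
    row[3]
    for row in conn.execute("EXPLAIN QUERY PLAN " + count_distinct_sql)
]

print(count_distinct, count_subquery)  # 3 3
print(plan)
```

Because the query touches only order_date and customer_id, SQLite can answer it from the index alone and reports a covering index in the plan.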

Question 4: Optimize a query with ORDER BY and LIMIT.

Scenario: You need to get the 10 most recent large orders.

Original Query:

SQL
SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE total_amount > 500.00
ORDER BY order_date DESC
LIMIT 10;

Optimization:

Optimized Query (with Composite Index):

SQL
-- Recommend adding a composite index that leads with the sort column:
-- CREATE INDEX idx_orders_date_amount ON Orders (order_date DESC, total_amount);
-- Or for better covering: CREATE INDEX idx_orders_date_amount_cov ON Orders (order_date DESC, total_amount, order_id, customer_id);

SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE total_amount > 500.00
ORDER BY order_date DESC
LIMIT 10;

Explanation:

  • Problem without Index: Without an appropriate index, the database first filters all orders where total_amount > 500.00. If this set is large, it then has to sort the entire intermediate result in memory or on disk (MySQL reports this as "Using filesort") by order_date DESC before finally picking the top 10. This sorting can be very expensive.
  • Solution with Composite Index:
    • A composite index on (order_date DESC, total_amount) fits this query when a reasonable share of orders exceed 500.00.
    • Why not (total_amount, order_date DESC)? An index is sorted by its first column, and by the second column only within equal values of the first. Because total_amount > 500.00 is a range, the matching entries span many different amounts, so their order_date values are not in order across the range and the database still has to sort them.
    • Why (order_date DESC, total_amount)? The database can walk the index from the newest order_date downward, check total_amount > 500.00 directly on each index entry, and stop as soon as it has found 10 matches. No separate sort step is needed.
    • Trade-off: If very few orders exceed the threshold, this walk may pass over many entries before finding 10. In that case an index on total_amount followed by a sort of the small matching set can be cheaper, so check the actual plan and data distribution.
    • Covering Index (Optional but better): If order_id and customer_id are also included in the index (e.g., (order_date DESC, total_amount, order_id, customer_id)), the query can become a "covering index scan," meaning all necessary data can be retrieved directly from the index without accessing the main table, which is usually the cheapest access path.
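
Which column order lets the engine skip the sort can be checked directly from the plan. The sketch below uses Python's sqlite3 module to build the same Orders table twice, once with a covering index that leads with total_amount and once with one that leads with order_date, and prints the plan for each. SQLite flags a sort step with the text "USE TEMP B-TREE FOR ORDER BY". The index name idx_demo and the plan_with_index helper are hypothetical, and other engines word their plans differently.

```python
import sqlite3

QUERY = """
    SELECT order_id, customer_id, order_date, total_amount
    FROM Orders
    WHERE total_amount > 500.00
    ORDER BY order_date DESC
    LIMIT 10
"""


def plan_with_index(index_columns):
    """Create Orders with one index and return SQLite's plan for QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            order_date TEXT,
            total_amount REAL
        )
    """)
    conn.execute(f"CREATE INDEX idx_demo ON Orders ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    return [row[3] for row in rows]  # column 3 holds the "detail" text


# Index leading with the range-filtered column.
amount_first = plan_with_index(
    "total_amount, order_date DESC, order_id, customer_id"
)

# Index leading with the sort column.
date_first = plan_with_index(
    "order_date DESC, total_amount, order_id, customer_id"
)

print("amount first:", amount_first)
print("date first:  ", date_first)

# True when the plan contains an explicit sort step.
amount_first_sorts = any("ORDER BY" in step for step in amount_first)
date_first_sorts = any("ORDER BY" in step for step in date_first)
```

With the amount-first index the plan still contains a sort step, while the date-first index lets SQLite read rows already in the requested order.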

Question 5: Optimize a NOT IN subquery with a JOIN.

Scenario: You need to list all products that have never been ordered.

Original Query:

SQL
SELECT p.product_name
FROM Products p
WHERE p.product_id NOT IN (SELECT o.product_id FROM OrderItems o);

Optimization:

Optimized Query (using LEFT JOIN and IS NULL):

SQL
-- Ensure indexes exist on product_id in both tables:
-- CREATE INDEX idx_products_product_id ON Products (product_id); -- Assumed PK or similar
-- CREATE INDEX idx_orderitems_product_id ON OrderItems (product_id);

SELECT p.product_name
FROM Products p
LEFT JOIN OrderItems oi ON p.product_id = oi.product_id
WHERE oi.product_id IS NULL;

Explanation:

  • Problem with NOT IN (Subquery):
    • The subquery (SELECT o.product_id FROM OrderItems o) is uncorrelated, so most optimizers evaluate it once. Older or simpler engines may still re-evaluate it, or probe its result without an index, for each row in the Products table, which is very slow when both tables are large.
    • NULL handling is the bigger trap. If any OrderItems.product_id is NULL, p.product_id NOT IN (...) evaluates to unknown for every product without a match, so the query returns no rows at all. Accounting for that possibility also keeps some optimizers from using a simple anti-join plan.
  • Solution with LEFT JOIN and IS NULL:
    • Efficiency: This is a reliable, widely optimized pattern for finding "missing" or "unmatched" rows. NOT EXISTS with a correlated subquery expresses the same intent and is planned as an anti-join by most modern optimizers; which form is faster depends on the engine, so compare the plans.
    • The LEFT JOIN attempts to match every row from the Products table (p) with rows from OrderItems (oi).
    • If a product_id from Products does not have a corresponding match in OrderItems, then all columns from OrderItems in that joined row will be NULL.
    • The WHERE oi.product_id IS NULL clause then filters for exactly those rows that had no match, effectively giving you products that have never been ordered.
  • Indexing: The index on OrderItems.product_id matters most, because the join probes OrderItems once per product. Products.product_id is normally the primary key and therefore already indexed.
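
The NULL behavior of NOT IN is easy to demonstrate and is a common interview follow-up. The sketch below uses Python's sqlite3 module with a few made-up rows, one of which has a NULL product_id in OrderItems, and compares NOT IN, LEFT JOIN with IS NULL, and NOT EXISTS. The sample data and the names helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT
    );
    CREATE TABLE OrderItems (
        order_item_id INTEGER PRIMARY KEY,
        product_id INTEGER   -- nullable on purpose for this demo
    );
    CREATE INDEX idx_orderitems_product_id ON OrderItems (product_id);

    INSERT INTO Products VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
    INSERT INTO OrderItems VALUES (1, 1), (2, NULL);  -- only Widget was ordered
""")

not_in_sql = """
    SELECT p.product_name
    FROM Products p
    WHERE p.product_id NOT IN (SELECT o.product_id FROM OrderItems o)
"""

left_join_sql = """
    SELECT p.product_name
    FROM Products p
    LEFT JOIN OrderItems oi ON p.product_id = oi.product_id
    WHERE oi.product_id IS NULL
"""

not_exists_sql = """
    SELECT p.product_name
    FROM Products p
    WHERE NOT EXISTS (
        SELECT 1 FROM OrderItems oi WHERE oi.product_id = p.product_id
    )
"""


def names(sql):
    """Return the sorted product names a query produces."""
    return sorted(row[0] for row in conn.execute(sql))


not_in_result = names(not_in_sql)
left_join_result = names(left_join_sql)
not_exists_result = names(not_exists_sql)

print("NOT IN:    ", not_in_result)      # []
print("LEFT JOIN: ", left_join_result)   # ['Gadget', 'Gizmo']
print("NOT EXISTS:", not_exists_result)  # ['Gadget', 'Gizmo']
```

A single NULL in the subquery is enough to make NOT IN return nothing, so the LEFT JOIN and NOT EXISTS forms are safer as well as often faster.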

