Here are some common SQL query optimization interview questions.
They cover scenarios you are likely to encounter in interviews. For each one,
explain why the original query is problematic and how the optimized version addresses those issues.
Question 1: Optimize a SELECT * query for performance.
Scenario: You have a Customers table with millions of rows, and you're asked to retrieve customer information.
Original Query:
SELECT *
FROM Customers
WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01';
Optimization:
Optimized Query 1 (Specific Columns):
SELECT customer_id, first_name, last_name, email, registration_date
FROM Customers
WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01';
Optimized Query 2 (With Index):
-- First, ensure an index exists on registration_date
-- If not, you'd recommend adding it:
-- CREATE INDEX idx_customers_registration_date ON Customers (registration_date);
SELECT customer_id, first_name, last_name, email, registration_date
FROM Customers
WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01';
Explanation:
- SELECT * vs. specific columns:
  - Problem with SELECT *: When you select all columns, the database has to retrieve every piece of data for each matching row. If there are many columns, especially large ones (e.g., VARCHAR(MAX), TEXT, BLOB), this increases disk I/O, network traffic between the database server and the application, and memory consumption on both ends.
  - Solution: Always select only the columns you actually need. This reduces the amount of data transferred and processed.
- Indexing registration_date:
  - Problem without index: Without an index on registration_date, the database would likely perform a full table scan, reading every row in the Customers table to find those that match the WHERE clause. This is extremely inefficient on large tables.
  - Solution: Creating an index on registration_date allows the database to quickly jump to the relevant range of dates in the index, then retrieve only the necessary rows from the table, significantly speeding up the query.
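In an interview it helps to show how you would verify the effect of the index rather than assert it. The sketch below is one way to do that, assuming SQLite through Python's built-in sqlite3 module (the post does not name a database engine) and a small invented data set with a hypothetical notes column standing in for a large field. It captures the query plan before and after creating the index; the exact plan wording differs between engines and versions.

```python
import sqlite3

# In-memory SQLite database; the engine and the sample data are assumptions
# made for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Customers (
        customer_id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        email TEXT,
        registration_date TEXT,
        notes TEXT  -- hypothetical wide column that SELECT * would drag along
    )
    """
)
conn.executemany(
    "INSERT INTO Customers (first_name, last_name, email, registration_date, notes) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "B", f"user{i}@example.com", f"{2022 + i % 3}-06-15", "x" * 100)
        for i in range(300)
    ],
)

QUERY = """
    SELECT customer_id, first_name, last_name, email, registration_date
    FROM Customers
    WHERE registration_date >= '2023-01-01' AND registration_date < '2024-01-01'
"""


def query_plan(connection, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = query_plan(conn, QUERY)
conn.execute(
    "CREATE INDEX idx_customers_registration_date ON Customers (registration_date)"
)
plan_after = query_plan(conn, QUERY)
rows = conn.execute(QUERY).fetchall()

print(plan_before)  # expected to describe a scan of the whole table
print(plan_after)   # expected to mention idx_customers_registration_date
print(len(rows))    # number of customers registered in 2023
```

On other engines the equivalent check is EXPLAIN (MySQL, PostgreSQL) or the execution plan viewer (SQL Server); the point is to compare plans, not to rely on the rule of thumb.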
Question 2: Optimize a query with OR conditions.
Scenario: You need to find orders placed by a specific customer OR orders with a high value.
Original Query:
SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE customer_id = 12345 OR total_amount > 1000.00;
Optimization:
Optimized Query (using UNION ALL with separate WHERE clauses):
-- Ensure indexes exist on customer_id and total_amount
-- CREATE INDEX idx_orders_customer_id ON Orders (customer_id);
-- CREATE INDEX idx_orders_total_amount ON Orders (total_amount);
SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE customer_id = 12345
UNION ALL
SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE total_amount > 1000.00 AND customer_id <> 12345; -- Exclude rows already returned by the first branch (assumes customer_id is NOT NULL)
Explanation:
- Problem with OR: When the OR spans different columns, no single index covers both branches. Depending on the engine and its statistics, the optimizer may fall back to a full table scan or combine two index scans (an "index merge"), which can cost more than two targeted lookups. Many modern optimizers handle this case well, so compare the execution plans of both forms before committing to the rewrite.
- Solution with UNION ALL: By splitting the query into two separate SELECT statements joined by UNION ALL, each SELECT can potentially use its own dedicated index (customer_id for the first part, total_amount for the second). UNION ALL is preferred over UNION because UNION also performs a distinct operation, which adds overhead. We use customer_id <> 12345 in the second WHERE clause to avoid duplicate rows that would be returned by both conditions if the same order had customer_id = 12345 AND total_amount > 1000.00. This filter also drops rows whose customer_id is NULL, so the rewrite matches the original only when customer_id cannot be NULL. If duplicate rows are acceptable, you can omit the AND customer_id <> 12345 and just use UNION ALL.
- Indexing: Having separate indexes on customer_id and total_amount is crucial for this optimization.
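A rewrite like this is only an optimization if it returns the same rows. The following sketch checks that on a tiny invented data set, again assuming SQLite via Python's sqlite3 module and a customer_id column declared NOT NULL. The four sample orders are chosen so that one matches only the customer condition, one matches only the amount condition, one matches both, and one matches neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,  -- assumption: the column is NOT NULL
        order_date TEXT NOT NULL,
        total_amount REAL NOT NULL
    )
    """
)
conn.execute("CREATE INDEX idx_orders_customer_id ON Orders (customer_id)")
conn.execute("CREATE INDEX idx_orders_total_amount ON Orders (total_amount)")
conn.executemany(
    "INSERT INTO Orders (order_id, customer_id, order_date, total_amount) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 12345, "2023-01-05", 50.00),    # matches the customer condition only
        (2, 12345, "2023-02-10", 2500.00),  # matches both conditions
        (3, 777, "2023-03-15", 1500.00),    # matches the amount condition only
        (4, 777, "2023-04-20", 20.00),      # matches neither condition
    ],
)

OR_QUERY = """
    SELECT order_id, customer_id, order_date, total_amount
    FROM Orders
    WHERE customer_id = 12345 OR total_amount > 1000.00
"""

UNION_QUERY = """
    SELECT order_id, customer_id, order_date, total_amount
    FROM Orders
    WHERE customer_id = 12345
    UNION ALL
    SELECT order_id, customer_id, order_date, total_amount
    FROM Orders
    WHERE total_amount > 1000.00 AND customer_id <> 12345
"""

# Sort the ids because neither query guarantees an output order.
or_ids = sorted(row[0] for row in conn.execute(OR_QUERY))
union_ids = sorted(row[0] for row in conn.execute(UNION_QUERY))

print(or_ids)     # [1, 2, 3]
print(union_ids)  # [1, 2, 3]
```

Order 2 appears once in the UNION ALL result because the second branch excludes customer 12345. If customer_id were nullable, a row with a NULL customer_id and a large total_amount would be returned by the OR query but not by the rewrite.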
Question 3: Optimize a query with COUNT(DISTINCT ...).
Scenario: You need to count the number of unique customers who placed orders in a specific month.
Original Query:
SELECT COUNT(DISTINCT customer_id)
FROM Orders
WHERE order_date >= '2023-03-01' AND order_date < '2023-04-01';
Optimization:
Optimized Query (using a subquery or GROUP BY):
-- Ensure an index on order_date is present:
-- CREATE INDEX idx_orders_order_date ON Orders (order_date);
-- Option 1: Using a subquery for distinct values (some engines plan this better than COUNT(DISTINCT ...); verify with EXPLAIN)
SELECT COUNT(*)
FROM (
    SELECT DISTINCT customer_id
    FROM Orders
    WHERE order_date >= '2023-03-01' AND order_date < '2023-04-01'
) AS DistinctCustomers;
-- Option 2: Using GROUP BY and then counting the groups (can be equally or less efficient depending on DB)
-- SELECT COUNT(customer_id)
-- FROM (
--     SELECT customer_id
--     FROM Orders
--     WHERE order_date >= '2023-03-01' AND order_date < '2023-04-01'
--     GROUP BY customer_id
-- ) AS GroupedCustomers;
Explanation:
- Problem with COUNT(DISTINCT ...): COUNT(DISTINCT ...) can be expensive because the database needs to collect all customer_id values for the specified order_date range, then sort them or use a hash table to identify unique values before counting. This often involves creating temporary tables or significant memory usage.
- Solution with subquery/GROUP BY:
  - The WHERE clause can still use an index on order_date to efficiently filter the initial set of rows.
  - By first selecting DISTINCT customer_id in a subquery, the database deduplicates the filtered set, and COUNT(*) then counts the result. The deduplication work is the same in principle, so any gain comes from the plan the optimizer picks (for example, a hash aggregate instead of a sort). On some engines both forms produce identical plans.
  - The GROUP BY approach achieves a similar result by grouping by customer_id and then counting the resulting groups. The optimizer might choose a hash aggregate or a sort-based aggregate for the GROUP BY.
  - The forms are interchangeable only if customer_id is never NULL: COUNT(DISTINCT customer_id) ignores NULLs, while COUNT(*) over SELECT DISTINCT counts NULL as one value.
- Indexing (order_date, customer_id) (composite/covering index):
  - For even better performance, a composite index on (order_date, customer_id) could be highly beneficial. This would allow the database to directly retrieve both columns from the index, potentially making the entire operation a "covering index scan" if no other columns are needed, avoiding access to the main table entirely for the DISTINCT operation.
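To back up the claim that the three forms are interchangeable, you can run them side by side. This is a minimal sketch assuming SQLite via Python's sqlite3 module, a NOT NULL customer_id, and five invented orders of which three fall in March 2023. It also prints the plan for the COUNT(DISTINCT ...) form so you can see whether the hypothetical composite index idx_orders_date_customer is picked up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,  -- assumption: no NULL customer ids
        order_date TEXT NOT NULL,
        total_amount REAL NOT NULL
    )
    """
)
# Composite index: filter column first, counted column second.
conn.execute(
    "CREATE INDEX idx_orders_date_customer ON Orders (order_date, customer_id)"
)
conn.executemany(
    "INSERT INTO Orders (customer_id, order_date, total_amount) VALUES (?, ?, ?)",
    [
        (1, "2023-03-01", 10.0),  # March, customer 1
        (1, "2023-03-15", 20.0),  # March, customer 1 again
        (2, "2023-03-20", 30.0),  # March, customer 2
        (2, "2023-02-28", 40.0),  # February, outside the range
        (3, "2023-04-01", 50.0),  # April 1st, excluded by the < bound
    ],
)

DATE_FILTER = "order_date >= '2023-03-01' AND order_date < '2023-04-01'"

FORMS = {
    "count_distinct": f"SELECT COUNT(DISTINCT customer_id) FROM Orders WHERE {DATE_FILTER}",
    "distinct_subquery": f"""
        SELECT COUNT(*)
        FROM (SELECT DISTINCT customer_id FROM Orders WHERE {DATE_FILTER}) AS DistinctCustomers
    """,
    "group_by_subquery": f"""
        SELECT COUNT(customer_id)
        FROM (SELECT customer_id FROM Orders WHERE {DATE_FILTER} GROUP BY customer_id) AS GroupedCustomers
    """,
}

counts = {name: conn.execute(sql).fetchone()[0] for name, sql in FORMS.items()}
plan = [
    row[3]
    for row in conn.execute("EXPLAIN QUERY PLAN " + FORMS["count_distinct"])
]

print(counts)  # every form should report 2 unique customers
print(plan)    # expected to reference idx_orders_date_customer
```

Since all three return the same number here, the choice between them should come down to the plan and timing you measure on your own engine and data.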
Question 4: Optimize a query with ORDER BY and LIMIT.
Scenario: You need to get the 10 most recent large orders.
Original Query:
SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE total_amount > 500.00
ORDER BY order_date DESC
LIMIT 10;
Optimization:
Optimized Query (with Composite Index):
-- Recommend adding a composite index that leads with the ORDER BY column:
-- CREATE INDEX idx_orders_date_amount ON Orders (order_date DESC, total_amount);
-- Or for better covering: CREATE INDEX idx_orders_date_amount_cover ON Orders (order_date DESC, total_amount, order_id, customer_id);
SELECT order_id, customer_id, order_date, total_amount
FROM Orders
WHERE total_amount > 500.00
ORDER BY order_date DESC
LIMIT 10;
Explanation:
- Problem without index: Without an appropriate index, the database would first filter all orders where total_amount > 500.00. If this set is large, it then has to sort the entire intermediate result in memory or on disk (a "filesort" in MySQL terms) by order_date DESC before finally picking the top 10. This sorting can be very expensive.
- Solution with composite index:
  - A composite index on (order_date DESC, total_amount) fits this query well.
  - Why not (total_amount, order_date DESC)? That index is ordered by total_amount first, and order_date is sorted only among rows that share the same total_amount. Because total_amount > 500.00 is a range predicate, the matching entries span many total_amount values and are not in order_date order, so the database still has to sort them.
  - Why (order_date DESC, total_amount)? The database can walk the index from the most recent order_date, test total_amount > 500.00 against each index entry, and stop as soon as it has 10 matches, without a separate sort step. This works best when a reasonable share of orders exceed 500.00. If very few do, the scan may read many entries before finding 10, and an index on total_amount followed by a sort of the small result can be cheaper.
  - Covering index (optional but better): If order_id and customer_id are also included in the index (e.g., (order_date DESC, total_amount, order_id, customer_id)), the query can become a "covering index scan," meaning all necessary data can be retrieved directly from the index without accessing the main table.
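Column order in a composite index decides whether the sort disappears, and that is easy to get wrong from memory. One way to check is to compare plans for the two candidate orders. The sketch below assumes SQLite via Python's sqlite3 module and a hypothetical index name idx_demo. SQLite reports a remaining sort as "USE TEMP B-TREE FOR ORDER BY"; other engines word this differently (for example "Using filesort" in MySQL or a Sort node in PostgreSQL).

```python
import sqlite3

QUERY = """
    SELECT order_id, customer_id, order_date, total_amount
    FROM Orders
    WHERE total_amount > 500.00
    ORDER BY order_date DESC
    LIMIT 10
"""


def plan_with_index(index_columns):
    """Create a fresh Orders table with a single index and return the plan for QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE Orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            order_date TEXT NOT NULL,
            total_amount REAL NOT NULL
        )
        """
    )
    # idx_demo is a hypothetical name; only the column order changes between runs.
    conn.execute(f"CREATE INDEX idx_demo ON Orders ({index_columns})")
    details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    conn.close()
    return details


def needs_sort(details):
    """True if SQLite plans an explicit sort step for the ORDER BY."""
    return any("TEMP B-TREE" in line for line in details)


# Range column first: the index cannot deliver rows in order_date order.
amount_first = plan_with_index("total_amount, order_date DESC")
# ORDER BY column first, remaining selected columns appended to make it covering.
date_first = plan_with_index("order_date DESC, total_amount, order_id, customer_id")

print(amount_first, needs_sort(amount_first))
print(date_first, needs_sort(date_first))
```

With the range column leading, the plan should still contain a sort step; with order_date leading, the index scan already yields rows in the requested order and the LIMIT can stop it early.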
Question 5: Optimize a query with JOIN and filtering.
Scenario: You need to list all products that have never been ordered.
Original Query:
SELECT p.product_name
FROM Products p
WHERE p.product_id NOT IN (SELECT o.product_id FROM OrderItems o);
Optimization:
Optimized Query (using LEFT JOIN and IS NULL):
-- Ensure indexes exist on product_id in both tables:
-- CREATE INDEX idx_products_product_id ON Products (product_id); -- Assumed PK or similar
-- CREATE INDEX idx_orderitems_product_id ON OrderItems (product_id);
SELECT p.product_name
FROM Products p
LEFT JOIN OrderItems oi ON p.product_id = oi.product_id
WHERE oi.product_id IS NULL;
Explanation:
- Problem with NOT IN (subquery):
  - On some engines, or with older optimizers, the subquery (SELECT o.product_id FROM OrderItems o) may be re-evaluated for each row in the Products table, leading to a nested loop operation that can be very slow if Products is large.
  - Even if the subquery is materialized (executed once), comparing each product_id from Products against a potentially large list of product_ids from OrderItems can be inefficient.
  - NOT IN is also a correctness risk: if OrderItems.product_id contains even one NULL, the NOT IN predicate can never evaluate to true, and the query returns no rows at all.
- Solution with LEFT JOIN and IS NULL:
  - Efficiency: This is a widely used and generally efficient way to find "missing" or "unmatched" rows. NOT EXISTS with a correlated subquery is an equivalent alternative that many optimizers execute as an anti-join.
  - The LEFT JOIN attempts to match every row from the Products table (p) with rows from OrderItems (oi).
  - If a product_id from Products does not have a corresponding match in OrderItems, then all columns from OrderItems in that joined row will be NULL.
  - The WHERE oi.product_id IS NULL clause then filters for exactly those rows that had no match, effectively giving you products that have never been ordered.
- Indexing: Indexes on product_id in both Products and OrderItems are critical for the LEFT JOIN to perform efficiently, as they allow the database to quickly find matching product_id values.
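The NULL behavior of NOT IN is worth demonstrating, because it changes results and not just speed. The sketch below assumes SQLite via Python's sqlite3 module, three invented products, and an OrderItems table with a hypothetical order_item_id key and one row whose product_id is NULL. It runs the NOT IN query, the LEFT JOIN rewrite, and a NOT EXISTS variant for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Products (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL
    );
    CREATE TABLE OrderItems (
        order_item_id INTEGER PRIMARY KEY,  -- hypothetical key column
        product_id INTEGER                   -- nullable on purpose
    );
    CREATE INDEX idx_orderitems_product_id ON OrderItems (product_id);

    INSERT INTO Products VALUES (1, 'Keyboard'), (2, 'Mouse'), (3, 'Monitor');
    -- Only the keyboard was ordered; one item row has no product_id.
    INSERT INTO OrderItems VALUES (10, 1), (11, 1), (12, NULL);
    """
)

NOT_IN_QUERY = """
    SELECT p.product_name
    FROM Products p
    WHERE p.product_id NOT IN (SELECT o.product_id FROM OrderItems o)
"""

LEFT_JOIN_QUERY = """
    SELECT p.product_name
    FROM Products p
    LEFT JOIN OrderItems oi ON p.product_id = oi.product_id
    WHERE oi.product_id IS NULL
"""

NOT_EXISTS_QUERY = """
    SELECT p.product_name
    FROM Products p
    WHERE NOT EXISTS (
        SELECT 1 FROM OrderItems oi WHERE oi.product_id = p.product_id
    )
"""


def names(sql):
    """Run a query and return the product names in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))


not_in_result = names(NOT_IN_QUERY)
left_join_result = names(LEFT_JOIN_QUERY)
not_exists_result = names(NOT_EXISTS_QUERY)

print(not_in_result)      # [] because the NULL makes NOT IN unknown for every row
print(left_join_result)   # ['Monitor', 'Mouse']
print(not_exists_result)  # ['Monitor', 'Mouse']
```

If you must keep NOT IN, adding WHERE o.product_id IS NOT NULL inside the subquery restores the expected result; the join and NOT EXISTS forms do not need that guard.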