What should I do to ensure optimal query performance using transact-SQL in Azure SQL Data Warehouse while combining data sets from SQL and non-relational data in Azure Blob storage? Any inputs would be greatly appreciated.

2

There are 2 answers

1
Sahaj Saini - MSFT On

The best practice is to load data from Azure Blob Storage into SQL Data Warehouse instead of attempting interactive queries over that data.

The reason is that when you run a query against your data residing in Azure Blob Storage (via an external table), SQL Data Warehouse (under-the-covers) imports all the data from Azure Blob Storage into SQL Data Warehouse temporary tables to process the query. So even if you a run SELECT TOP 1 query on your external table, the entire dataset for that table will be imported temporarily to process the query.

As a result, if you know that you will querying the external data frequently, it is recommended that you explicitly load the data into SQL Data Warehouse permanently using a CREATE TABLE AS SELECT command as shown in the document: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-load-with-polybase/.

0
Nicolle Anger - MSFT On

As a best practice, break your Azure Storage data into no more than 1GB files when possible for parallel processing with SQL Data Warehouse. More information about how to configure Polybase in SQL Data Warehouse to load data from Azure Storage Blob is here: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-load-with-polybase/

Let me know if that helps!