Identifying needed statistics - Azure SQL Data Warehouse

1.4k views Asked by At

Is there any hint or directive that can be used with EXPLAIN of a query on Azure SQL Data Warehouse that would return recommended statistics that were not available for the optimizer? Alternatively is there a tool that can analyze a workload and make any recommendation.

2

There are 2 answers

3
Anthony Mattas On BEST ANSWER

Today, no. Right now the recommendation is to create statistics on every column as these are needed to create an optimal parallel query plan (I.e. how to move data around between nodes to return a result since it's a MPP architecture).

https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices#maintain-statistics

An example of how to script this out can be found here as well (example H).

https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics#examples-create-statistics

2
wBob On

As you know, statistics should be created (according to this article):

on columns involved in JOINs, GROUP BY, HAVING and WHERE clauses.

There are no tools to do this (yet), but if you have access to the EXPLAIN plans they give you certain information. For example the shuffle_columns element lists all columns involved in a SHUFFLE_MOVE:

<shuffle_columns>col;</shuffle_columns>

as well as myriad other information. Review the annotation I did of an Azure SQL Data Warehouse plan here.

Lastly, (and I haven't actually done this, I've only been thinking about doing it), you could set up a copy of your database on SQL Server 2016, bearing in mind the syntax differences (eg distribution, lack of unique indexes etc). this would give you access to certain useful resources like execution plans, including index suggestions, and certain trace flags which tell you what stats were used. I mean the database engines and indexing are really different so I don't know how worthwhile this might be. I'll post back if I progress my thinking on this. I do find the question "Why is this query going slow?" much harder to answer on this platform that ordinary "box product" SQL Server because the tools aren't as mature yet.