Validation for columns work very slow (SQL Server)

205 views Asked by At

I want to perform data profiling on the columns of a table. In this particular case - what percentage of data is date/integer/numeric/bit. The query that I am using:

SELECT 
CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentDate,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS FLOAT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentNumeric,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS BIGINT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentInteger,
    CAST(SUM(CASE WHEN LOWER(TRY_CAST([column1] AS VARCHAR(MAX))) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentBit
    FROM tbl

This query works really slow even if I choose only top 1 row. Actually I am not able to get any result, or at least I cannot wait such a long time. The column that I am checking is of type decimal if this is of any importance.

enter image description here

The number of records in the table is: 37,431,866. This is why I choose only top 1000 for example, but still does not load any result for more than 40 minutes

2

There are 2 answers

5
Gordon Linoff On BEST ANSWER

If you want this to run faster, then you don't want to limit the rows in the query you are using. After all, an aggregation query with no GROUP BY only returns one row.

Instead use a subquery:

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
     ) t

Note that this is not a random sample. And if you attempt ORDER BY newid() you will kill performance. One alternative to get an approximate n% sample is to use logic like this:

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
      WHERE RAND(CHECKSUM(NEWID())) < 0.001
     ) t

The 0.001 would be about a 0.1% sample.

0
Luuk On

Your question can be simpliefied. The part:

CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))

can alo be written as:

CAST(SUM(CASE WHEN TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' THEN 1 ELSE 0 END) AS NUMERIC(25,2))

The second one is quicker then the first one, with the same result. (AFAIK)

This probably can also be applied to the other parts in the query.