I want to perform data profiling on the columns of a table. In this particular case - what percentage of data is date/integer/numeric/bit. The query that I am using:
SELECT
CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentDate,
CAST(SUM(CASE WHEN TRY_CAST([column1] AS FLOAT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentNumeric,
CAST(SUM(CASE WHEN TRY_CAST([column1] AS BIGINT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentInteger,
CAST(SUM(CASE WHEN LOWER(TRY_CAST([column1] AS VARCHAR(MAX))) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentBit
FROM tbl
This query works really slow even if I choose only top 1 row. Actually I am not able to get any result, or at least I cannot wait such a long time. The column that I am checking is of type decimal if this is of any importance.
The number of records in the table is: 37,431,866. This is why I choose only top 1000 for example, but still does not load any result for more than 40 minutes
If you want this to run faster, then you don't want to limit the rows in the query you are using. After all, an aggregation query with no
GROUP BY
only returns one row.Instead use a subquery:
Note that this is not a random sample. And if you attempt
ORDER BY newid()
you will kill performance. One alternative to get an approximate n% sample is to use logic like this:The 0.001 would be about a 0.1% sample.