Deduce dataypes from table with only nvarchar fields?

79 views Asked by At

I have a table like this:

Create table landingzone.classes(
teacher nvarchar(255)
,moment nvarchar(255)
,noOfStudents nvarchar(255)
,scheduledYesNo nvarchar(255)
)

INSERT INTO [landingzone].[classes]
           ([teacher]
           ,[moment]
           ,[noOfStudents]
           ,[scheduledYesNo]) 
Select ' Henrov', '  01/07/2021' ,6 ,'True' 
union all 
Select ' Bill', '  01/05/2021' ,6 ,'False' 
union all
Select ' Henrov', '  31/07/2021' ,NULL ,NULL

What I need is a script that finds out that

  • [teacher] should be a nvarchar(255)
  • [moment] should be a timestamp
  • [noOfStudents] should be an int
  • [ScheduledYesNo] should be an bit (or an int with 1/0 as value)

and creates a table like this:

Create table staging.classes(
teacher nvarchar(255)
,moment datetime
,noOfStudents int
,scheduledYesNo bit
)

followed by a datatransfer from landingzone.classes to staging.classes.

However, it should do this by analyzing the table dbo.test, not by referring to some config tables containing the names of the columns with associated datatype. Due to the possible large numbers of records in dbo.test where a lot of fields can be empty, it should look past the first 200 (preferably this number should be configurable)

The tables in the landingzone are delivered by other processes and should explicitly keep existing in the current form (business rule).

The challenge I think is mostly in autodiscovering the proper datatypes.

Can this be done in BIML?

1

There are 1 answers

1
billinkc On BEST ANSWER

There's no Biml method that can help you detect data type.

I was on a long term project ingesting typeless data from a mainframe and we took a similar approach to what you're doing. We landed data as-is into a table with the widest allowable string type from that system*

We wrote a TSQL script that would unpivot the table and then generate N columns that do analysis on the data. A series of tests would look like (free hand coding so go with the spirit not the letter) :

SELECT
  MIN(LEN([teacher]))
, MAX(LEN([teacher]))
, COUNT_BIG(DISTINCT([teacher]))
, COUNT_BIG(WHEN NULLIF(LTRIM(RTRIM([teacher], ''))) IS NOT NULL THEN 1 END AS NotNullOrEmpty
, COUNT_BIG(WHEN NULLIF(LTRIM(RTRIM([teacher], ''))) IS NULL THEN 1 END AS NullOrEmpty
, COUNT_BIG(CASE WHEN TRY_CONVERT('bigint', [teacher]) IS NULL THEN 1 END) AS NotBigInt
, COUNT_BIG(CASE WHEN TRY_CONVERT('bigint', [teacher]) IS NOT NULL THEN 1 END) AS BigInt

The purpose of that was to generate descriptive statistics on what the column looked like - how often was it populated, what percentage of it could fit into the various data types, min/max lengths

Gotchas we ran into.

No data or sparsely populated data. That was easily the biggest bite in our backsides. If I recall correctly, NULLs would happily cast into whole number types so we had lots of correcting to do after the fact once we had sufficient data. Save yourself some grief, if you don't have much data, leave it as string ;)

Local data rules. We ran into things that canned scripts wouldn't handle. The mainframe used a caret ^ to signal end of time

Dates and times. Model204 could handle March 32. Clearly that's the same as April 1, right? Same with March 31 24:30 is clearly April 1 at 00:30

*"Oh, well yeah we do have these fields where we can store binary/very long strings. Didn't we tell you about those?"