I have a table like this:
Create table landingzone.classes(
teacher nvarchar(255)
,moment nvarchar(255)
,noOfStudents nvarchar(255)
,scheduledYesNo nvarchar(255)
)
INSERT INTO [landingzone].[classes]
([teacher]
,[moment]
,[noOfStudents]
,[scheduledYesNo])
Select ' Henrov', ' 01/07/2021' ,6 ,'True'
union all
Select ' Bill', ' 01/05/2021' ,6 ,'False'
union all
Select ' Henrov', ' 31/07/2021' ,NULL ,NULL
What I need is a script that finds out that [teacher] should be an nvarchar(255), [moment] should be a datetime, [noOfStudents] should be an int, and [scheduledYesNo] should be a bit (or an int with 1/0 as value),
and creates a table like this:
Create table staging.classes(
teacher nvarchar(255)
,moment datetime
,noOfStudents int
,scheduledYesNo bit
)
followed by a data transfer from landingzone.classes to staging.classes.
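Hand-written for the sample data, that transfer would be something like the sketch below (the casts, and the dd/mm/yyyy interpretation of moment, are assumptions drawn from the sample rows; TRY_CAST/TRY_CONVERT need SQL Server 2012 or later):

-- Hand-written version of the desired end result; the script should
-- derive these casts from the data rather than have them hard-coded.
INSERT INTO staging.classes (teacher, moment, noOfStudents, scheduledYesNo)
SELECT
    teacher
,   TRY_CONVERT(datetime, LTRIM(RTRIM(moment)), 103)   -- style 103 = dd/mm/yyyy
,   TRY_CAST(noOfStudents AS int)
,   TRY_CAST(scheduledYesNo AS bit)                     -- 'True'/'False' -> 1/0
FROM landingzone.classes;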
However, it should do this by analyzing the table dbo.test, not by referring to some config table containing the column names with their associated data types. Because of the possibly large number of records in dbo.test, where a lot of fields can be empty, it should look past the first 200 records (preferably this number should be configurable).
The tables in the landing zone are delivered by other processes and must explicitly keep existing in their current form (business rule).
The challenge, I think, is mostly in auto-discovering the proper data types.
Can this be done in BIML?
There's no Biml method that can help you detect data types.
I was on a long term project ingesting typeless data from a mainframe and we took a similar approach to what you're doing. We landed data as-is into a table with the widest allowable string type from that system*
We wrote a T-SQL script that would unpivot the table and then generate N columns that do analysis on the data. A series of tests would look like this (freehand coding, so go with the spirit, not the letter):
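A minimal sketch of the flavour of those tests, shown here for a single column rather than the unpivoted form, assuming SQL Server 2012+ for TRY_CAST/TRY_CONVERT; the table and column names come from the question:

-- Descriptive statistics for one column: how often it is populated,
-- its min/max trimmed lengths, and how many values would survive a
-- cast to each candidate type (TRY_CAST returns NULL instead of erroring).
SELECT
    COUNT_BIG(*)                                       AS total_rows
,   COUNT_BIG(NULLIF(LTRIM(RTRIM(noOfStudents)), ''))  AS populated_rows
,   MIN(LEN(LTRIM(RTRIM(noOfStudents))))               AS min_length
,   MAX(LEN(LTRIM(RTRIM(noOfStudents))))               AS max_length
,   SUM(CASE WHEN TRY_CAST(noOfStudents AS int)             IS NOT NULL THEN 1 ELSE 0 END) AS fits_int
,   SUM(CASE WHEN TRY_CAST(noOfStudents AS bigint)          IS NOT NULL THEN 1 ELSE 0 END) AS fits_bigint
,   SUM(CASE WHEN TRY_CAST(noOfStudents AS decimal(38, 10)) IS NOT NULL THEN 1 ELSE 0 END) AS fits_decimal
,   SUM(CASE WHEN TRY_CAST(noOfStudents AS bit)             IS NOT NULL THEN 1 ELSE 0 END) AS fits_bit
,   SUM(CASE WHEN TRY_CONVERT(datetime, noOfStudents, 103)  IS NOT NULL THEN 1 ELSE 0 END) AS fits_datetime_dmy
FROM landingzone.classes;

Generate that per column from sys.columns (or INFORMATION_SCHEMA.COLUMNS), compare the fits_* counts against populated_rows, and pick the narrowest type that covers everything; the configurable sample size from the question maps naturally onto a TOP (@SampleSize) derived table in the FROM clause.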
The purpose of that was to generate descriptive statistics on what each column looked like: how often it was populated, what percentage of its values could fit into the various data types, and its min/max lengths.
Gotchas we ran into.
No data or sparsely populated data. That was easily the biggest bite in our backsides. If I recall correctly, NULLs would happily cast into whole number types, so we had lots of correcting to do after the fact once we had sufficient data. Save yourself some grief: if you don't have much data, leave it as a string ;)
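For instance, in SQL Server a NULL or empty string sails straight through a numeric cast:

-- NULL casts to any type without error, and an empty string casts to 0,
-- so an empty or sparsely populated string column looks like a perfectly
-- good int until real data arrives.
SELECT
    TRY_CAST(NULL AS int)  AS null_to_int    -- NULL, no error
,   TRY_CAST('' AS int)    AS empty_to_int   -- 0
,   TRY_CAST(NULL AS bit)  AS null_to_bit;   -- NULL, no error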
Local data rules. We ran into things that canned scripts wouldn't handle. The mainframe used a caret ^ to signal end of time.
Dates and times. Model204 could handle March 32. Clearly that's the same as April 1, right? Same with March 31 24:30, which is clearly April 1 at 00:30.
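The moment values in the question carry the same kind of local rule: they look like dd/mm/yyyy, so under an mdy session setting a plain conversion rejects them and the tests need an explicit style:

-- '31/07/2021' fails when DATEFORMAT is mdy (the us_english default)
-- but parses with style 103 (dd/mm/yyyy).
SELECT
    TRY_CONVERT(datetime, '31/07/2021')       AS default_style  -- NULL under mdy
,   TRY_CONVERT(datetime, '31/07/2021', 103)  AS style_103;     -- 2021-07-31 00:00:00.000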
*"Oh, well yeah we do have these fields where we can store binary/very long strings. Didn't we tell you about those?"