I have a python program to extract data from Microsoft SQL Server and load them to another table in the same database. Using this extraction program, I am trying to do the following.
- Get the input arguments from an excel file
- Fetch one random record from the database for each row present in the excel
- Load the output to another table
Using the RAND() function, I'm seeing duplicate records being retrieved most of the time even though the combination has a sufficient number of entries in the database. I tried a couple of other approaches like NEWID() and calculating the total number of rows and then retrieving a random row using numpy. But these queries take hours to execute even for a single combination and does not seem feasible.
Note: The table is huge (~7 million records) and left joins are used to retrieve the required data.
Are there any other methods to fix this issue?
RANDOM() give a randomized value betwwen 0 and 1, but this value is always the same when used in a SELECT steement because without any argument, this will be a "constant".
As a proof :
If you want to pick one row randomly you need to use an UUID or GUID this way :
Of course you can choose MAX or MIN
Another soluton is to use TABLESAMPLE combined to this if the table is veru huge...