Duplicate Records with RAND() function

157 views Asked by At

I have a python program to extract data from Microsoft SQL Server and load them to another table in the same database. Using this extraction program, I am trying to do the following.

  1. Get the input arguments from an excel file
  2. Fetch one random record from the database for each row present in the excel
  3. Load the output to another table

Using the RAND() function, I'm seeing duplicate records being retrieved most of the time even though the combination has a sufficient number of entries in the database. I tried a couple of other approaches like NEWID() and calculating the total number of rows and then retrieving a random row using numpy. But these queries take hours to execute even for a single combination and does not seem feasible.

Note: The table is huge (~7 million records) and left joins are used to retrieve the required data.

Are there any other methods to fix this issue?

1

There are 1 answers

0
SQLpro On

RANDOM() give a randomized value betwwen 0 and 1, but this value is always the same when used in a SELECT steement because without any argument, this will be a "constant".

As a proof :

SELECT RAND() AS RANDOM_VALUE, *
FROM   sys.objects

If you want to pick one row randomly you need to use an UUID or GUID this way :

WITH
T AS (SELECT RANK() OVER(ORDER BY NEWID()) AS RANDOM_VALUE, *
      FROM   sys.objects)
SELECT *
FROM   T
WHERE  RANDOM_VALUE = (SELECT MAX(RANDOM_VALUE) FROM T)

Of course you can choose MAX or MIN

Another soluton is to use TABLESAMPLE combined to this if the table is veru huge...

WITH
T AS (SELECT RANK() OVER(ORDER BY NEWID()) AS RANDOM_VALUE, *
      FROM   a_table TABLESAMPLE (1 PERCENT))
SELECT *
FROM   T
WHERE  RANDOM_VALUE = (SELECT MAX(RANDOM_VALUE) FROM T)