Finding unique combination sets across three columns of data, where order isn't relevant

134 views Asked by At

I am seeking to generate a list of combinations without repetitions of data across three columns (Can be anywhere from two to four columns, but for the purposes of this specific example, sticking with three).

I am using MS Excel (latest version I believe)

Each Column has data in it (for purposes of explaining, the data is the alphabet, letter A through to N. So 14 rows).

The data in Column 1 to 3 is the exact same

Column 1 Column 2 Column 3
A A A
B B B
...down too
N N N

What I am trying to solve is what are the unique three-column data combinations (i.e. without repetitions) that can be generated from these three columns with 14 rows, all with the exact same data.

As an example, a potential row (of many) could be....

Column 1 Column 2 Column 3
A B c

Of this one row, possible permutations are CAB, CBA, ABC, ACB, BCA, BAC.

Although each are different, for my purposes, 5 of these 6 are considered duplications. The combination of A, B and C, in any order, would be considered one unique combination. The same letters in a different order is a duplication.

This problem could be broken down into two stages

Stage 1 - Generating a list of all the permutations across the three columns and 14 rows. The output must be keep all the columns separate

Column 1 Column 2 Column 3
A A A
A A B
A B A
B A A

etc...

This would include all permutations with repetitions.

Stage 2 - Removing duplicates across all these permutations to get the unique combinations (i.e. combinations without repetitions).

Of the c.2,744 rows, a number of them would be considered duplicates (as outlined earlier...CAB and CBA are the same)

This second stage should still maintain all the data in their separate columns. Concatenating is simple to do if required.

Ideally, I would love something that easily solves for both stages, to improve efficiency.

For Problem Stage 1 - I have found a few things in various forums / websites which help, and I believe I have been able to generate.

For Problem Stage 2 - Again I have searched around for solutions and tried various formula's etc. however they don't work

Update: I have taken the approach suggested by DjC, with the results producing the desired outcome alongside providing the flexibility for a novice like me to adapt as required. See picture below for my little proof check -> https://i.stack.imgur.com/Fzwl9.png

2

There are 2 answers

4
FunThomas On

If I understood you correctly, having 14 possible values (letter A..N) and 3 columns, you have 14 * 13 * 12 = 2184 possible combinations if no letter may occur more than once in one row.

As always 6 (3 * 2 *1) entries are simple variations, you end up with 2184/6 = 384 rows of data.

The following function CreateAllEntries will create those entries (using a global array variable result). You feed the routine with an array of possible values - in the example it's a list of characters from A..N, but you can fill the array with anything - and with the number of columns you want to fill (eg 3).

The routine will create an array with the data, and you can do what you want with that array - in the example (see routine FillSheet that is used as test routine) I have written it to a sheet.

As you can see, the function calls a routine createEntries which will call itself recursively, on level per column you want to fill. It iterates through all possible values that are not already taken.

Note that I declared the result array and the variable row as globals for pure lazyness.

Option Explicit
Dim result As Variant, row As Long

Sub FillSheet()
    Const colCount As Long = 3

    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets(1)
        
    ' Fill array with possible characters
    ' (Instead of characters, you could fill the array with anything,
    ' Eg Cities, Beveraged, Dates, Names...
    
    Dim possibleValues(1 To 14) As Variant, i As Long
    For i = 1 To 14
        possibleValues(i) = Chr(Asc("A") + i - 1)
    Next
    
    Dim rowCount As Long
    rowCount = CreateAllEntries(possibleValues, colCount)
    If rowCount < 0 Then Exit Sub
    
    ' Dump result
    ws.UsedRange.ClearContents
    For i = 1 To colCount
        ws.Cells(1, i) = i & ". column"
    Next
    ws.Cells(2, 1).Resize(rowCount, colCount) = result
    
    MsgBox rowCount & " rows written to sheet " & ws.Name
    
End Sub

Function CreateAllEntries(possibleValues() As Variant, colCount As Long) As Long
    
    ' Calculate result size
    Dim rowCount As Long, i As Long
    rowCount = 1
    For i = UBound(possibleValues) To UBound(possibleValues) - colCount + 1 Step -1
        rowCount = rowCount * i
    Next
    For i = 1 To colCount
        rowCount = rowCount / i
    Next
    
    ' Create an array that will hold the complete result
    If rowCount > Rows.Count Then
        MsgBox "Sorry, this will exceed the possible sheet size of " & Rows.Count & " rows."
        CreateAllEntries = -1
        Exit Function
    End If
    ReDim result(1 To rowCount, 1 To colCount)
    
    ' Create an array that temporarily stores the values for one row entry
    ReDim colValues(0 To colCount) As Variant
    
    ' fill array (recursive rountine)
    createEntries possibleValues, 1, colCount, colValues
    CreateAllEntries = rowCount

    
End Function


Sub createEntries(possibleValues() As Variant, col As Long, colCount As Long, colValueIndex() As Variant)
    Dim i As Long
    ' Loop over all "remaining" entries
    For i = colValueIndex(col - 1) + 1 To UBound(possibleValues)
        colValueIndex(col) = i
        If col = colCount Then
            ' We have reached the last column, dump current values into a new row
            Dim j As Long
            row = row + 1
            For j = 1 To colCount
                result(row, j) = possibleValues(colValueIndex(j))
            Next j
        Else
           ' Create entries for the next column
           createEntries possibleValues, col + 1, colCount, colValueIndex
        End If
    Next i
End Sub
2
DjC On

With Excel for MS365, you can use a custom recursive lambda function to generate all possible permutations. To do this, open Name Manager (Ctrl+F3) and define a new name called PERMA with the following formula:

=LAMBDA(array1,[array2],
    IF(
        ISOMITTED(array2),
        REDUCE(CHOOSECOLS(array1, 1), SEQUENCE(COLUMNS(array1)-1,, 2), LAMBDA(v,n, PERMA(v, CHOOSECOLS(array1, n)))),
        LET(
            a, ROWS(array1),
            b, ROWS(array2),
            HSTACK(
                CHOOSEROWS(array1, TOCOL(IF(SEQUENCE(, b), SEQUENCE(a)))),
                CHOOSEROWS(array2, TOCOL(IF(SEQUENCE(a), SEQUENCE(, b))))
            )
        )
    )
)

Once defined, you can then use the custom PERMA() function for Stage 1 to generate a list of all permutations with repetitions. For example:

=PERMA(A2:C15)

This will output the same number of results as =PERMUTATIONA(14,3) (2744 rows).

The formula for Stage 2 will depend on what you want to be returned...

To generate a list of all permutations without repetitions, equivalent to =PERMUT(14,3) (2184 rows), use the following:

=LET(
    data, PERMA(A2:C15),
    cols, COLUMNS(data),
    FILTER(data, BYROW(data, LAMBDA(r, COLUMNS(UNIQUE(r, 1))=cols)))
)

To generate a list of all combinations with repetitions, equivalent to =COMBINA(14,3) (560 rows), use the following:

=LET(
    data, PERMA(A2:C15),
    cols, COLUMNS(data),
    arr, TOCOL(data),
    rowId, TOCOL(IF(SEQUENCE(, cols), SEQUENCE(ROWS(data)))),
    UNIQUE(WRAPROWS(SORTBY(arr, rowId,, arr, ), cols))
)

To generate a list of all combinations without repetitions, equivalent to =COMBIN(14,3) (364 rows), use the following:

=LET(
    data, PERMA(A2:C15),
    cols, COLUMNS(data),
    arr, TOCOL(data),
    rowId, TOCOL(IF(SEQUENCE(, cols), SEQUENCE(ROWS(data)))),
    unq, UNIQUE(WRAPROWS(SORTBY(arr, rowId,, arr, ), cols)),
    FILTER(unq, BYROW(unq, LAMBDA(r, COLUMNS(UNIQUE(r, 1))=cols)))
)

Example Results:

perma_examples.png

Please note, the PERMA() function is limited to generating 1,048,576 possible permutations. As such, the maximum number of rows in your dataset cannot exceed 1024 with 2 columns, 101 with 3 columns, 32 with 4 columns, or 16 with 5 columns, and these limits are even less when using the last two COMBIN methods shown above (due to the row limitations of the TOCOL() function). Exceeding the limit will return the #NUM! error.