How can I correct my reconciliation of .csv files to remove dupes/nulls?


I have been using code from this answer to check for additions/changes to class rosters from MS Teams:

$set = [System.Collections.Generic.HashSet[string]]::new(
    [string[]] (Import-CSV -Path stundent.csv).UserPrincipalName,
    [System.StringComparer]::InvariantCultureIgnoreCase
)
Import-Csv ad.csv | Where-Object { $set.Add($_.UserPrincipalName) } |
    Export-Csv path\to\output.csv -NoTypeInformation

Ideally, I want to check for removals by comparing the old file against a new one, then swap the import file positions and check for additions. If my files look like Source1 and Source2 (below), the check for removals would return Export1, and the check for additions would return Export2. Since the same student can appear in multiple classes, I want to include TeamDesc in the filter so that only the specific student/class combination is returned.

Source1.csv

TeamDesc  UserPrincipalName  Name
Team 1    [email protected]  john smith
Team 1    [email protected]  nancy drew
Team 2    [email protected]  harvey dent
Team 3    [email protected]  john smith

Source2.csv

TeamDesc  UserPrincipalName  Name
Team 1    [email protected]  nancy drew
Team 2    [email protected]  harvey dent
Team 2    [email protected]  tim tams
Team 3    [email protected]  john smith

Export1.csv

TeamDesc  UserPrincipalName  Name
Team 1    [email protected]  john smith

Export2.csv

TeamDesc  UserPrincipalName  Name
Team 2    [email protected]  tim tams

There are 3 answers

mklement0 (best answer)

Try the following, which uses Compare-Object to compare the CSV files by two column values, simply by passing the property (column) names of interest to -Property. The output is then split into two collections with the intrinsic .Where() method, according to which input side each differing property combination is unique to:

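# '<=' marks rows unique to Source1.csv (removed). 'Split' puts the rows that
# match the filter first and all others ('=>', unique to Source2.csv) second.
$removed, $added = (
  Compare-Object (Import-Csv Source1.csv) (Import-Csv Source2.csv) -PassThru `
                 -Property TeamDesc, UserPrincipalName
).Where({ $_.SideIndicator -eq '<=' }, 'Split')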

# Passing * as -Property keeps -ExcludeProperty effective in Windows PowerShell 5.1 too.
$removed |
  Select-Object * -ExcludeProperty SideIndicator |
  Export-Csv -NoTypeInformation Export1.csv

$added |
  Select-Object * -ExcludeProperty SideIndicator |
  Export-Csv -NoTypeInformation Export2.csv
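
To try the comparison without touching any files, the sample rows from the question can be fed in with ConvertFrom-Csv. This is only a minimal sketch: the e-mail addresses are made-up placeholders (the real ones are redacted on this page), and the variable names $source1, $source2 and $diff are illustrative.

```powershell
# Sample rosters; the addresses are made-up placeholders (assumption).
$source1 = @'
TeamDesc,UserPrincipalName,Name
Team 1,jsmith@example.com,john smith
Team 1,ndrew@example.com,nancy drew
Team 2,hdent@example.com,harvey dent
Team 3,jsmith@example.com,john smith
'@ | ConvertFrom-Csv

$source2 = @'
TeamDesc,UserPrincipalName,Name
Team 1,ndrew@example.com,nancy drew
Team 2,hdent@example.com,harvey dent
Team 2,ttams@example.com,tim tams
Team 3,jsmith@example.com,john smith
'@ | ConvertFrom-Csv

# Rows differ when the (TeamDesc, UserPrincipalName) pair exists on one side only.
$diff = Compare-Object $source1 $source2 -Property TeamDesc, UserPrincipalName -PassThru

# '<=' marks rows found only in the reference input ($source1), i.e. removals;
# '=>' marks rows found only in the difference input ($source2), i.e. additions.
$removed = @($diff | Where-Object SideIndicator -eq '<=')
$added   = @($diff | Where-Object SideIndicator -eq '=>')

$removed | Select-Object TeamDesc, UserPrincipalName, Name
$added   | Select-Object TeamDesc, UserPrincipalName, Name
```

With this data, $removed should hold the Team 1 row for john smith and $added the Team 2 row for tim tams, matching Export1 and Export2 in the question.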

Santiago Squarzon

Assuming both CSVs are loaded in memory, with Source1.csv as $csv1 and Source2.csv as $csv2, you already have the logic for Export2.csv using the HashSet<T>:

$set = [System.Collections.Generic.HashSet[string]]::new(
    [string[]] $csv1.UserPrincipalName,
    [System.StringComparer]::InvariantCultureIgnoreCase
)
$csv2 | Where-Object { $set.Add($_.UserPrincipalName) }

Outputs:

TeamDesc UserPrincipalName   Name
-------- -----------------   ----
Team 2   [email protected] tim tams

For the first requirement, Export1.csv, the reference object would be $csv2, and instead of a HashSet<T> you could use a hash table. Group-Object -AsHashTable makes this really easy:

$map = $csv2 | Group-Object UserPrincipalName -AsHashTable -AsString

# if Csv2 has unique values for `UserPrincipalName`
$csv1 | Where-Object { $map[$_.UserPrincipalName].TeamDesc -ne $_.TeamDesc }

# if Csv2 has duplicated values for `UserPrincipalName`
$csv1 | Where-Object { $_.TeamDesc -notin $map[$_.UserPrincipalName].TeamDesc }

Outputs:

TeamDesc UserPrincipalName   Name
-------- -----------------   ----
Team 1   [email protected] john smith
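
Note that the HashSet<T> above is keyed on UserPrincipalName alone, so a student who joins an additional team would not be reported as an addition. One possible variation, sketched here under assumptions, keys the set on TeamDesc and UserPrincipalName together so that the same filter works in both directions. The helper name Get-RosterDifference, the '|' separator and the sample addresses are hypothetical and not part of the code above.

```powershell
# Hypothetical helper: returns the rows of $Candidate whose
# (TeamDesc, UserPrincipalName) pair does not occur in $Reference.
# Assumes the '|' separator never appears in either column.
function Get-RosterDifference {
    param(
        [object[]] $Reference,
        [object[]] $Candidate
    )
    $seen = [System.Collections.Generic.HashSet[string]]::new(
        [System.StringComparer]::OrdinalIgnoreCase
    )
    foreach ($row in $Reference) {
        $null = $seen.Add($row.TeamDesc + '|' + $row.UserPrincipalName)
    }
    # Contains() does not modify the set, so repeated candidate rows are all kept.
    $Candidate | Where-Object {
        -not $seen.Contains($_.TeamDesc + '|' + $_.UserPrincipalName)
    }
}

# Sample rosters; the addresses are made-up placeholders (assumption).
$csv1 = @'
TeamDesc,UserPrincipalName,Name
Team 1,jsmith@example.com,john smith
Team 1,ndrew@example.com,nancy drew
Team 2,hdent@example.com,harvey dent
Team 3,jsmith@example.com,john smith
'@ | ConvertFrom-Csv

$csv2 = @'
TeamDesc,UserPrincipalName,Name
Team 1,ndrew@example.com,nancy drew
Team 2,hdent@example.com,harvey dent
Team 2,ttams@example.com,tim tams
Team 3,jsmith@example.com,john smith
'@ | ConvertFrom-Csv

# Removals: rows of the old roster that are missing from the new one.
$removed = @(Get-RosterDifference -Reference $csv2 -Candidate $csv1)
# Additions: rows of the new roster that are missing from the old one.
$added   = @(Get-RosterDifference -Reference $csv1 -Candidate $csv2)
```

Either result can then be piped to Export-Csv -NoTypeInformation as in the question. Swapping the two arguments is all it takes to switch between checking for removals and checking for additions.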

iRon

Using this Join-Object script/Join-Object Module (see also: How to compare two CSV files and output the rows that are just in either of the file but not in both and In Powershell, what's the best way to join two tables into one?):

Loading your sample data:
(In your case you probably want to use Import-Csv to import your data)

Install-Script -Name Read-HtmlTable
$Csv1 = Read-HtmlTable https://stackoverflow.com/q/74452725 -Table 0 # Import-Csv .\Source1.csv
$Csv2 = Read-HtmlTable https://stackoverflow.com/q/74452725 -Table 1 # Import-Csv .\Source2.csv
Install-Module -Name JoinModule
$Csv1 |OuterJoin $Csv2 -On TeamDesc, UserPrincipalName -Name Out,In

TeamDesc UserPrincipalName   OutName    InName
-------- -----------------   -------    ------
Team 1   [email protected] john smith
Team 2   [email protected]            tim tams

You might use the (single) result file as is. If you really want to work with two different files, you might split the results as in the nice answer from mklement0.