I am trying to read the data from CSV file which has 2200000 records using PowerShell and storing each record in JSON file, but this takes almost 12 hours.
Sample CSV Data:
We will only concern about the 1st column value's.
Code:
function Read-IPData
{
$dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object{ $_.FullName }
Write-Host "file path - $dbFilePath"
Write-Host "Reading..."
$data = Get-Content -Path $dbFilePath | Select-Object -Skip 1
Write-Host "Reading data finished"
$count = $data.Count
Write-host "Total $count records found"
return $data
}
function Convert-NumbetToIP
{
param(
[Parameter(Mandatory=$true)][string]$number
)
try
{
$w = [int64]($number/16777216)%256
$x = [int64]($number/65536)%256
$y = [int64]($number/256)%256
$z = [int64]$number%256
$ipAddress = "$w.$x.$y.$z"
Write-Host "IP Address - $ipAddress"
return $ipAddress
}
catch
{
Write-Host "$_"
continue
}
}
Write-Host "Getting IP Addresses from $dbFileName"
$data = Read-IPData
Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"
if(!(Test-Path $outputFile))
{
Write-Host "$outputFile doestnot exist, creating..."
New-Item -Path $outputFile -type "file"
}
foreach($item in $data)
{
$row = $item -split ","
$ipNumber = $row[0].trim('"')
Write-Host "Converting $ipNumber to ipaddress"
$toIpAddress = Convert-NumbetToIP -number $ipNumber
Write-Host "Preparing document JSON"
$object = [PSCustomObject]@{
"ip-address" = $toIpAddress
"is-vpn" = "true"
"@timestamp" = (Get-Date).ToString("o")
}
$document = $object | ConvertTo-Json -Compress -Depth 100
Write-Host "Adding document - $document"
Add-Content -Path $outputFile $document
}
Could you please help optimize the code or is there a better way to do it. or is there a way like multi-threading.
Here is a possible optimization:
Remarks for improving performance:
Get-Content
, especially for line-by-line processing it tends to be slow. A much faster alternative is theFile.ReadLines
method. To skip the header line, use theLinq.Enumerable.Skip()
method.ReadLines
in aforeach
loop does lazy enumeration, i. e. it reads only one line per loop iteration. This works because it returns an enumerator instead of a collection of lines.try
andcatch
if exceptions occur often, because the "exceptional" code path is very slow. Instead useInt64.TryParse()
which returns aboolean
indicating successful conversion.IPAddress
class which has a constructor that takes an integer number. Use its method.GetAddressBytes()
to get an array of bytes in network (big-endian) order. Finally use the PowerShell-join
operator to create a string of the expected format.[pscustomobject]
for each row, which has some overhead. Create it once before the loop and inside the loop only assign the values.Write-Host
(or any output to the console) within inner loops.Unrelated to performance:
New-Item
call to create the output file, which isn't necessary becauseSet-Content
automatically creates the file if it doesn't exist.[ ]
and insert a comma,
between each row.Modified processing loop to write a regular JSON file instead of NDJSON file: