What is the good way to read data from CSV and converting them to JSON?

238 views Asked by At

I am trying to read the data from CSV file which has 2200000 records using PowerShell and storing each record in JSON file, but this takes almost 12 hours.

Sample CSV Data:

We will only concern about the 1st column value's.

enter image description here

Code:

function Read-IPData
{
    $dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object{ $_.FullName }
    Write-Host "file path - $dbFilePath"

    Write-Host "Reading..."
    $data = Get-Content -Path $dbFilePath | Select-Object -Skip 1
    Write-Host "Reading data finished"

    $count = $data.Count
    Write-host "Total $count records found"

    return $data
}

function Convert-NumbetToIP
{
    param(
    [Parameter(Mandatory=$true)][string]$number
    )

    try
    {    
        $w = [int64]($number/16777216)%256
        $x = [int64]($number/65536)%256
        $y = [int64]($number/256)%256
        $z = [int64]$number%256

        $ipAddress = "$w.$x.$y.$z"
        Write-Host "IP Address - $ipAddress"

        return $ipAddress
    }
    catch
    {
        Write-Host "$_"
        continue
    }
}

Write-Host "Getting IP Addresses from $dbFileName"
$data = Read-IPData

Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"

if(!(Test-Path $outputFile))
{
    Write-Host "$outputFile doestnot exist, creating..."
    New-Item -Path $outputFile -type "file"
}

foreach($item in $data)
{
    $row = $item -split ","
    $ipNumber = $row[0].trim('"')

    Write-Host "Converting $ipNumber to ipaddress"
    $toIpAddress = Convert-NumbetToIP -number $ipNumber

    Write-Host "Preparing document JSON"
    $object = [PSCustomObject]@{
        "ip-address" = $toIpAddress
        "is-vpn" = "true"
        "@timestamp" = (Get-Date).ToString("o")
    }

    $document = $object | ConvertTo-Json -Compress -Depth 100
    Write-Host "Adding document - $document"
    Add-Content -Path $outputFile $document
}

Could you please help optimize the code or is there a better way to do it. or is there a way like multi-threading.

2

There are 2 answers

5
zett42 On BEST ANSWER

Here is a possible optimization:

function Get-IPDataPath
{
    $dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object FullName | Select-Object -First 1
    Write-Host "file path - $dbFilePath"
    $dbFilePath  # implicit output
}

function Convert-NumberToIP
{
    param(
        [Parameter(Mandatory=$true)][string]$number
    )

    [Int64] $numberInt = 0
    if( [Int64]::TryParse( $number, [ref] $numberInt ) ) {
        if( ($numberInt -ge 0) -and ($numberInt -le 0xFFFFFFFFl) ) {
            # Convert to IP address like '192.168.23.42'
            ([IPAddress] $numberInt).ToString()
        }
    }
    # In case TryParse() returns $false or the number is out of range for an IPv4 address, 
    # the output of this function will be empty, which converts to $false in a boolean context.
}

$dbFilePath = Get-IPDataPath
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"

Write-Host "Converting CSV file $dbFilePath to $outputFile"

$object = [PSCustomObject]@{
    'ip-address' = ''
    'is-vpn' = 'true'
    '@timestamp' = ''
}

# Enclose foreach loop in a script block to be able to pipe its output to Set-Content
& {
    foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
    {
        $row = $item -split ','
        $ipNumber = $row[0].trim('"')

        if( $ip = Convert-NumberToIP -number $ipNumber ) 
        {
            $object.'ip-address' = $ip
            $object.'@timestamp' = (Get-Date).ToString('o')

            # Implicit output
            $object | ConvertTo-Json -Compress -Depth 100
        }

    }
} | Set-Content -Path $outputFile

Remarks for improving performance:

  • Avoid Get-Content, especially for line-by-line processing it tends to be slow. A much faster alternative is the File.ReadLines method. To skip the header line, use the Linq.Enumerable.Skip() method.
  • There is no need to read the whole CSV into memory first. Using ReadLines in a foreach loop does lazy enumeration, i. e. it reads only one line per loop iteration. This works because it returns an enumerator instead of a collection of lines.
  • Avoid try and catch if exceptions occur often, because the "exceptional" code path is very slow. Instead use Int64.TryParse() which returns a boolean indicating successful conversion.
  • Instead of "manually" converting the IP number to bytes, use the IPAddress class which has a constructor that takes an integer number. Use its method .GetAddressBytes() to get an array of bytes in network (big-endian) order. Finally use the PowerShell -join operator to create a string of the expected format.
  • Don't allocate a [pscustomobject] for each row, which has some overhead. Create it once before the loop and inside the loop only assign the values.
  • Avoid Write-Host (or any output to the console) within inner loops.

Unrelated to performance:

  • I've removed the New-Item call to create the output file, which isn't necessary because Set-Content automatically creates the file if it doesn't exist.
  • Note that the output is in NDJSON format, where each line is like a JSON file. In case you actually want this to be a regular JSON file, enclose the output in [ ] and insert a comma , between each row.

Modified processing loop to write a regular JSON file instead of NDJSON file:

& {
    '[' # begin array
    $first = $true

    foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
    {
        $row = $item -split ','
        $ipNumber = $row[0].trim('"')

        if( $ip = Convert-NumberToIP -number $ipNumber ) 
        {
            $object.'ip-address' = $ip
            $object.'@timestamp' = (Get-Date).ToString('o')
        
            $row = $object | ConvertTo-Json -Compress -Depth 100

            # write array element delimiter if necessary
            if( $first ) { $row; $first = $false } else { ",$row" } 
        }

    }
    ']'  # end array
} | Set-Content -Path $outputFile
0
Theo On

You can optimize the function Convert-NumberToIP like below:

function Convert-NumberToIP {
    param(
        [Parameter(Mandatory=$true)][uint32]$number
    )

    # either do the math yourself like this:

    # $w = ($number -shr 24) -band 255
    # $x = ($number -shr 16) -band 255
    # $y = ($number -shr 8) -band 255
    # $z = $number -band 255
    # '{0}.{1}.{2}.{3}' -f $w, $x, $y, $z  # output the dotted IP string

    # or use .Net:
    $n = ([IPAddress]$number).GetAddressBytes()
    [array]::Reverse($n)
    ([IPAddress]$n).IPAddressToString
}