This code works. I just want to see how much faster someone can make it work.
Back up your Windows 10 batch file in case something goes wrong. Find all instances of the string {LINE2 1-9999} and replace them with {LINE2 "line number the code is on"}. Overwrite the file, encoding as ASCII.
If _61.bat is:
TITLE %TIME% NO "%zmyapps1%\*.*" ARCHIVE ATTRIBUTE LINE2 1243
TITLE %TIME% DOC/SET YQJ8 LINE2 1887
SET ztitle=%TIME%: WINFOLD LINE2 2557
TITLE %TIME% _*.* IN WINFOLD LINE2 2597
TITLE %TIME% %%ZDATE1%% YQJ25 LINE2 3672
TITLE %TIME% FINISHED. PRESS ANY KEY TO SHUTDOWN ... LINE2 4922
Results:
TITLE %TIME% NO "%zmyapps1%\*.*" ARCHIVE ATTRIBUTE LINE2 1
TITLE %TIME% DOC/SET YQJ8 LINE2 2
SET ztitle=%TIME%: WINFOLD LINE2 3
TITLE %TIME% _*.* IN WINFOLD LINE2 4
TITLE %TIME% %%ZDATE1%% YQJ25 LINE2 5
TITLE %TIME% FINISHED. PRESS ANY KEY TO SHUTDOWN ... LINE2 6
Code:
Copy-Item $env:windir\_61.bat -d $env:temp\_61.bat
(gc $env:windir\_61.bat) | foreach -Begin {$lc = 1} -Process {
$_ -replace "LINE2 \d*", "LINE2 $lc";
$lc += 1
} | Out-File -Encoding Ascii $env:windir\_61.bat
I expect this to take less than 984 milliseconds. It takes 984 milliseconds. Can you think of anything to speed it up?
The key to better performance in PowerShell code (short of embedding C# code compiled on demand with Add-Type, which may or may not help) is to:

- avoid use of cmdlets and the pipeline in general, especially invocation of a script block ({ ... }) for each pipeline input object, such as with ForEach-Object and Where-Object. However, it isn't the pipeline per se that is to blame, it is the current inefficient implementation of these cmdlets - see GitHub issue #10982 - and there is a workaround that noticeably improves pipeline performance.
- avoiding the pipeline requires direct use of the .NET framework types as an alternative to cmdlets.
- if feasible, use switch statements for array or line-by-line file processing - switch statements generally outperform foreach loops (a small contrast is sketched after this list).

To be clear: The pipeline and cmdlets offer clear benefits, so avoiding them should only be done if optimizing performance is a must.
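For instance, here is a minimal contrast between the two styles, numbering the elements of a small stand-in array (variable names and data are purely illustrative):

$lines = 'alpha', 'beta', 'gamma'   # stand-in input

# Pipeline + ForEach-Object: a script block is invoked for every input object.
$i = 0
$viaPipeline = $lines | ForEach-Object { $i++; '{0}: {1}' -f $i, $_ }

# switch over the same array: no per-object script-block invocation overhead.
$i = 0
$viaSwitch = switch ($lines) { default { $i++; '{0}: {1}' -f $i, $_ } }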
In your case, the following code, which combines the switch statement with direct use of the .NET framework for file I/O, seems to offer the best performance. Note that the input file is read into memory as a whole, as an array of lines, and a copy of that array with the modified lines is created before it is written back to the input file.
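A sketch along those lines (the file path and the LINE2 pattern come from the question; the variable names, exact regex, and two-branch layout are illustrative rather than the code that was actually benchmarked):

$file = "$env:windir\_61.bat"

# Back up the original first, as in the question.
Copy-Item $file "$env:temp\_61.bat"

# Read the whole file into memory as an array of lines (direct .NET file I/O).
$lines = [IO.File]::ReadAllLines($file)

$lc = 0
# Wrapping the switch in & { ... } is the performance workaround discussed in the Note below.
$newLines = & { switch -Regex ($lines) {
    'LINE2 \d+' { ++$lc; $_ -replace 'LINE2 \d+', "LINE2 $lc" }  # lines carrying the marker
    default     { ++$lc; $_ }                                    # all other lines, unchanged
  } }

# Write the modified copy back to the input file in a single call, ASCII-encoded as before.
[IO.File]::WriteAllLines($file, $newLines, [Text.Encoding]::ASCII)

Incrementing $lc in both branches keeps the counter equal to the current line number even for lines that don't carry the marker.

Note: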
- Enclosing the switch statement in & { ... } is an obscure performance optimization explained in this answer.
- If case-sensitive matching is sufficient, as suggested by the sample input, you can improve performance a little more by adding the -CaseSensitive option to the switch command (see the variant after this note).

In my tests (see below), this provided a more than 4-fold performance improvement in Windows PowerShell relative to your command.
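With the sketch above, that would simply mean adding the option to the switch line (and, for consistency, using -creplace for the replacement itself):

$newLines = & { switch -Regex -CaseSensitive ($lines) {
    'LINE2 \d+' { ++$lc; $_ -creplace 'LINE2 \d+', "LINE2 $lc" }
    default     { ++$lc; $_ }
  } }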
Here's a performance comparison via the Time-Command function. The commands compared are:

- The switch command from above.
- A slightly streamlined version of your own command.
- A PowerShell Core v6.1+ alternative that uses the -replace operator with the array of lines as the LHS and a script block as the replacement expression (a sketch follows this list).

Instead of a 6-line sample file, a 6,000-line file is used. 100 runs are averaged. It's easy to adjust these parameters.
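A sketch of that third variant (illustrative names again; a [ref] counter is used so the increment survives regardless of what scope the replacement script block runs in, and note that the counter only advances on lines that actually contain the marker - which matches the sample input, where every line carries LINE2):

# PowerShell Core v6.1+ only: -replace accepts a script block as the replacement.
$file = "$env:windir\_61.bat"
$lc = [ref] 0
$newLines = (Get-Content $file) -replace 'LINE2 \d+', { 'LINE2 ' + (++$lc.Value) }
[IO.File]::WriteAllLines($file, $newLines, [Text.Encoding]::ASCII)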
Here are sample results from my Windows 10 machine (the absolute timings aren't important, but hopefully the relative performance shown in the Factor column is somewhat representative); the PowerShell Core version used is v6.2.0-preview.4.