There is a lot of discussion on the Internet about how to compare files in PowerShell. For example:
- Comparing folders and content with PowerShell, StackOverflow, 2011
- Easily Compare Two Folders by Using PowerShell, Doctor Scripto, 2011
- Compare Files with PowerShell: a faster way, Kees Bakker, 2013
- powershell binary file comparison, StackOverflow, 2013
- Can PowerShell Compare-Object do a binary compare?, StackOverflow, 2021
However, nothing that I've found discusses the speed differences of the different ways of doing a comparison.
(The article above by Kees Bakker and his answer to the 2013 SO question present the function FilesAreEqual. The article title claims it's faster, but doesn't say faster than what and doesn't offer any data to back up the claim. In my answer below, the function bFilesCompareBinary is adapted from his, and you'll see that my data agree with his claim.)
This question is a follow-up to my question, PowerShell: Why are these file comparison times so different?. In further research on that, I've compiled data on the speed of various methods of comparing binary files in PowerShell. I'm posting this question in order to provide those data in an answer. So the question is:
How fast are different methods of comparing binary files in PowerShell?
Data
The table below gives, for four files and their identical copies, data on the speed of comparison of the pairs by seven different methods. These four files were selected for convenience of being located together in a path on my computer and in a corresponding path on an external SSD. They all happen to be gvi video files, which should not have been relevant, but a quirk of their structure turned out to have an interesting effect on one of the methods. The table gives the speed, in Mb/sec, of the comparison process for each method and for each file. The speed was calculated by dividing the size of the file by the elapsed time of the process. Code is given further below for the scripts that performed and timed the comparisons.
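For reference, the speed figure is simply size divided by elapsed time; here is a minimal PowerShell sketch of that calculation (the paths are placeholders, and the actual measurement scripts, given further below, differ in detail):

```powershell
# Minimal sketch of the speed metric: file size in Mb divided by elapsed seconds.
# $sFile1/$sFile2 are placeholder paths; fc.exe is the Windows file-compare command.
$sFile1 = 'C:\videos\sample.gvi'
$sFile2 = 'E:\videos\sample.gvi'
$tElapsed = Measure-Command { fc.exe /b $sFile1 $sFile2 | Out-Null }
$nSpeed = (Get-Item $sFile1).Length / 1MB / $tElapsed.TotalSeconds
'{0:N1} Mb/sec' -f $nSpeed
```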
The columns of the table are:
- `comp`
- `FC`
- `compare-object` acting on a `get-content` of each file to be compared.
- `compare-object` in which the `get-content`s have the parameter `-raw`.
- `compare-object` in which the `get-content`s have just the parameter `-encoding byte` (PS 5) or `-AsByteStream` (PS 7), but this sat for over a half-hour in both PS 5 and 7, so either the process hung or it took so long that it might as well have hung.
- `compare-object` in which the `get-content`s have the parameters `-encoding byte` (PS 5) or `-AsByteStream` (PS 7) plus `-raw`.
- `compare-object` in which the `get-content`s have the parameters `-encoding byte` (PS 5) or `-AsByteStream` (PS 7) plus `-ReadCount 0`.
- `bFilesCompareBinary`, based on code written by Kees Bakker, which performs a buffered comparison (code included in the script below).

Since the pairs tested were all identical, all the measurements had to compare all bytes in the files. For pairs that are not necessarily identical, the Windows commands and the buffered method have the ability to abort after detecting a difference, and so could run even faster. The `compare-object` methods compare the entire files, even if the first bytes are different.

Table: Speed, in Mb/sec, of comparing four identical pairs of files (identified by their size in Mb) by seven methods running in Windows batch, in Windows PowerShell 5.1, and in PowerShell 7.
Note that with the method "Compare-object", the third and fourth files run much faster than the first two. This was the mystery that my original question asked about, and is explained in its answers.
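For concreteness, the `compare-object` variants listed above correspond roughly to invocations like these (an illustrative sketch with placeholder paths, not the timing script itself):

```powershell
# Illustrative invocations of the compare-object methods. PS 5.1 syntax is shown
# for the byte-wise reads; in PS 7, replace -encoding byte with -AsByteStream.
$f1 = 'C:\videos\sample.gvi'
$f2 = 'E:\videos\sample.gvi'

# "Compare-object": plain get-content, i.e. a line-by-line comparison
compare-object (get-content $f1) (get-content $f2)

# "Compare raw": each file read as a single string
compare-object (get-content $f1 -raw) (get-content $f2 -raw)

# With -encoding byte: one object per byte (the variant that appeared to hang)
compare-object (get-content $f1 -encoding byte) (get-content $f2 -encoding byte)

# With -encoding byte plus -raw
compare-object (get-content $f1 -encoding byte -raw) (get-content $f2 -encoding byte -raw)

# "Compare as byte read 0": the whole file returned in one read
compare-object (get-content $f1 -encoding byte -ReadCount 0) (get-content $f2 -encoding byte -ReadCount 0)
```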
Errors and mistakes
In the case indicated as "Error" (method "Compare as byte read 0" in PS 5 on the largest file), the process crashed PowerShell with the message, "get-content : Array dimensions exceeded supported range."
As I've pointed out elsewhere, the "compare raw" method crashed with an `OutOfMemoryException` when presented with a pair of files of 3.7 Gb.

Warning: In initial testing, results appeared to indicate that the Windows command `FC` was about seven times faster than the buffered method. I had already performed a comparison of a 1 Tb folder with its backup that took about 10 hours using the buffered method. I was excited that `FC` could work so much faster, so I rewrote my script to repeat that comparison using `FC` instead, and was then confused to find that it took 14 hours. Then I realized that the initial results had been skewed by Windows caching the files when I ran the comparisons with `comp`, so they ran much faster when done again with `FC`. In the results reported above, the measurements were made with an empty cache. I have not found a way to remove a file from cache, so each measurement was made immediately after rebooting the computer (and with nothing else running).

Conclusion

`compare-object` is essentially useless on binary files. It only gave any reasonable speed when called with `get-content ... -raw`, not "as byte", but when doing that it crashed on files over a few Gb.

`FC`,

Environment
The data above were collected on an AMD Ryzen 7 Pro 6850H Processor with 32 Gb of RAM, running Windows 11 Pro 64. The files in each pair are on an internal SSD and an external USB SSD.
Later, I repeated the tests of just methods "FC" and "buffered" with the external storage being a USB spinning hard drive instead of the SSD. I was surprised to see a dramatic speed improvement with that change:
Table: Speed, in Mb/sec, of comparing four identical pairs of files (identified by their size in Mb) by two methods running in PowerShell 7. Difference from previous table is that one file in each pair was on a spinning HD instead of an SSD.
I don't know whether this means my low-cost SSD has poor performance, or whether it's because I don't have the right cable for it. It's not a big problem for me because I don't run these comparisons often, but it does show the hardware dependence of such a process.
Size dependence of speed of buffered method
I used the buffered method to run a comparison of a 1.1 Tb folder with its backup. This took 10.1 hours, of which 9.8 hr was the sum of the elapsed times of the comparisons (i.e., overhead of 0.3 hr from scanning of folders). Thus the average speed of the comparisons was 116 Gb/hr or 33 Mb/sec. The size of the files ranged from 1 byte to 32 Gb.
To learn about the factors that affect the speed of the comparisons, I used Excel to rank, out of the 222,000 files, the 1,355 files with comparison times over 1 second by size and by comparison time, and the 2,009 files with times over 1/2 second by comparison speed.
There was a rough, but far from perfect, correlation between file size and comparison speed. The 25 largest files, ranging from 4 to 32 Gb, had speeds ranging from 34.8 to 36.8 Mb/sec. These were close to, but not quite, the fastest speeds: the top 25 speeds ranged from 36.7 to 36.9 Mb/sec, for files ranging from 61 Mb to 28 Gb.
At the lower end, the 25 smallest ranked files, ranging from 22 kb to 33 Mb, had speeds ranging from 13 kb/sec to 31 Mb/sec. The slowest 25 ranked speeds ranged from 2 kb/sec to 13 Mb/sec, with sizes ranging from 2 kb to 55 Mb.
It's helpful that in general, larger files compare faster. Definitely better than the other way around!
Code
I'm interested in feedback on improving these scripts, with two provisos. First, I know the batch script is pretty lame; it was just laid out quickly to get the job done. Second, more attention was paid to the design of the PowerShell script, but I know that my coding style in it is unconventional; I've developed it over many years, and I can only apologize if you don't like it. However, please do say something if you see ways to improve the functionality of the script.
It would also be interesting to hear if other people run the script and get results that are consistent with mine or different.
Windows batch scripts for `comp` and `FC`:

The console output was copy-pasted into Excel, which then subtracted the times to get the elapsed time of each process. The batch script for `FC` was the same, with `comp /m` replaced by `FC /b`.
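In outline, a timed `comp` run amounts to something like this (an illustrative sketch with placeholder paths):

```bat
@echo off
rem Illustrative sketch: print the time before and after the comparison so the
rem elapsed time can be obtained by subtraction (here, in Excel).
rem Paths are placeholders for one of the file pairs.
echo %time%
comp /m "C:\videos\sample.gvi" "E:\videos\sample.gvi"
echo %time%
```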
PowerShell script, including function `bFilesCompareBinary`:
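In outline, the buffered comparison works like this (a simplified sketch adapted from Kees Bakker's FilesAreEqual; the buffer size, naming, and details of the full script may differ):

```powershell
# Simplified sketch of a buffered binary comparison in the spirit of
# bFilesCompareBinary: read both files in fixed-size chunks and return $false
# at the first difference, so unequal files can be rejected early.
function bFilesCompareBinary {
    param(
        [string] $sPath1,
        [string] $sPath2,
        [int]    $nBufferSize = 1MB   # chunk size; an assumption, not necessarily the original value
    )

    $oFile1 = Get-Item -LiteralPath $sPath1
    $oFile2 = Get-Item -LiteralPath $sPath2

    # Files of different length cannot be identical.
    if ($oFile1.Length -ne $oFile2.Length) { return $false }

    $oStream1 = $oFile1.OpenRead()
    $oStream2 = $oFile2.OpenRead()
    try {
        $aBuffer1 = New-Object byte[] $nBufferSize
        $aBuffer2 = New-Object byte[] $nBufferSize
        while ($true) {
            # FileStream.Read returns a full chunk except at the end of the file.
            $nRead1 = $oStream1.Read($aBuffer1, 0, $nBufferSize)
            $nRead2 = $oStream2.Read($aBuffer2, 0, $nBufferSize)
            if ($nRead1 -ne $nRead2) { return $false }
            if ($nRead1 -eq 0)       { return $true }    # both files exhausted
            if ($nRead1 -eq $nBufferSize) {
                $bSame = [System.Linq.Enumerable]::SequenceEqual($aBuffer1, $aBuffer2)
            }
            else {
                # Final, partial chunk: compare only the bytes actually read.
                $bSame = [System.Linq.Enumerable]::SequenceEqual(
                    [byte[]] $aBuffer1[0 .. ($nRead1 - 1)],
                    [byte[]] $aBuffer2[0 .. ($nRead1 - 1)])
            }
            if (-not $bSame) { return $false }
        }
    }
    finally {
        $oStream1.Dispose()
        $oStream2.Dispose()
    }
}

# Usage: returns $true if the files are byte-for-byte identical.
# bFilesCompareBinary 'C:\videos\sample.gvi' 'E:\videos\sample.gvi'
```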