I am trying to remove identical lines from a file with 1.8 million records and write the result to a new file, using the following command:
sort tmp1.csv | uniq -c | sort -nr > tmp2.csv
Running the script creates a new file, sort.exe.stackdump, with the following information:
"Exception: STATUS_ACCESS_VIOLATION at rip=00180144805
..
..
program=C:\cygwin64\bin\sort.exe, pid 6136, thread main
cs=0033 ds=002B es=002B fs=0053 gs=002B ss=002B"
The script works for a small file with 10 lines, so it seems sort.exe cannot handle this many records. How do I deduplicate a file with more than 1.8 million records? We do not have any database other than Access, and I was trying to do this manually in Access.
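One thing that may help (a sketch, assuming the GNU coreutils sort that ships with Cygwin): since the goal is only to drop duplicates, sort -u does the deduplication in a single pass, and the -S and -T options cap the in-memory buffer and direct the temporary spill files to a directory with enough free space. The 512M buffer size and /tmp directory below are illustrative choices, not requirements:

# deduplicate in one pass; limit the sort buffer to 512 MB and
# write temporary merge files to /tmp
sort -u -S 512M -T /tmp tmp1.csv > tmp2.csv

Note that this drops the occurrence counts that uniq -c was adding; if those counts matter, the original pipeline is still needed.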
The following awk command seemed to be a much faster way to get rid of the duplicate lines:
awk '!v[$0]++' $FILE2 > tmp.csv
where $FILE2 is the name of the file containing the duplicate values.
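For context on why this works: v[$0]++ is 0 (false) the first time a given line is seen and non-zero afterwards, so !v[$0]++ is true exactly once per distinct line, and awk's default action prints it. If the occurrence counts from the original uniq -c | sort -nr pipeline are still wanted, a rough awk equivalent (the input file name is illustrative) is:

# count how often each line occurs, print "count line" pairs,
# then sort numerically in descending order of count
awk '{ c[$0]++ } END { for (line in c) print c[line], line }' input.csv | sort -nr > tmp2.csv

Like the dedup one-liner, this holds every distinct line in memory, so it stays fast as long as the unique lines fit in RAM.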