I couldn't figure out from the manual how to manage this problem with SyncSort (we found solutions for DFSORT, which didn't help). Due to a program error (which can't be fixed in time; you know: programmer, test, quality check, deployment...) we got duplicate records in a file (FB, LRECL 250) where
- a header line exists
- consecutive duplicate data lines follow, which have to be omitted except for one unique occurrence
- the data lines must not be sorted (due to obligatory logical relations between some records)
- the trailer includes the data line count.
The file cannot be edited manually because of its size (more than 2 million records).
Example infile:
HEADER xxxx
cccc
bbbb 123
bbbb 123
bbbb 123
dddd
aaaa 123
aaaa 123
aaaa
TRAILER COUNT: 8
Expected outfile:
HEADER xxxx
cccc
bbbb 123
dddd
aaaa 123
aaaa
TRAILER COUNT: 5
So the outfile is not sorted at all; the omitted records
bbbb 123 (omitted)
bbbb 123 (omitted)
aaaa 123 (omitted)
are not needed and may go straight into Nirvana.
(I would even be happy with a solution that omits the header/trailer, as I could easily concatenate manually generated lines in a subsequent job.)
Thanks for your help!
I was able to achieve your expected result using two SYNCSORT steps.
Step 1:
Using INREC, I've prepended a sequence number to each record (8 bytes is safe here; a 4-digit number would overflow with more than 2 million records), followed by the actual data. Then I've sorted the file on the data portion as the key and collapsed the duplicates with SUM FIELDS=NONE; with OPTION EQUALS, the record with the lowest sequence number, i.e. the first occurrence, is the one that survives. The header record is skipped using SKIPREC.
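A sketch of the Step 1 control statements, not my exact job: the field positions assume the 250-byte FB record with an 8-byte sequence number prepended (8 rather than 4 bytes, to be safe with more than 2 million records), the OMIT for the old trailer is my own assumption (it must not survive as a data record), and the sort compares the full data portion — narrow the key if only part of the record is significant.

```jcl
* Step 1: number the records, drop header/trailer, dedupe on the data.
* EQUALS keeps input order among equal keys, so the first occurrence
* of each duplicate set survives; SKIPREC=1 drops the HEADER line.
  OPTION EQUALS,SKIPREC=1
* Drop the old trailer record (assumed literal in position 1).
  OMIT COND=(1,7,CH,EQ,C'TRAILER')
* Prepend an 8-byte sequence number in front of the 250-byte record.
  INREC FIELDS=(SEQNUM,8,ZD,1,250)
* Sort on the data portion only, then collapse the duplicates.
  SORT FIELDS=(9,250,CH,A)
  SUM FIELDS=NONE
```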
Step 2:
In Step 2, the output file from Step 1 is read as input. Since the data lines must keep their original order, I've sorted on the sequence number as the key. Using OUTREC, I strip the sequence number so it does not appear in the final output file. I've used TRAILER1 to print the count of records at the end.
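A sketch of the Step 2 statements, assuming your SyncSort level supports DFSORT-style OUTFIL TRAILER1 (the trailer literal and count format are illustrative; add an edit mask to COUNT if you need a specific layout):

```jcl
* Step 2: restore the original order and strip the sequence number.
  SORT FIELDS=(1,8,ZD,A)
  OUTREC FIELDS=(9,250)
* Write a new trailer with the data record count; REMOVECC suppresses
* the carriage-control byte that TRAILER1 would otherwise add.
  OUTFIL REMOVECC,
    TRAILER1=('TRAILER COUNT: ',COUNT)
```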
Hope this helps. Please let me know if you have an alternative that works more efficiently.