Easy way to sort multiline paragraph where only first lines begin with non-space?


I have a text or log file which typically looks like this:

First line which is also a paragraph.
Another line that is its own paragraph.
etc. etc.

but occasionally it has some spill-over into a multi-line paragraph:

First line which is also a paragraph.
Another line that is its own paragraph.
Now, this paragraph encompasses more than a single line
    with its second line onwards being indented by spaces
    to distinguish it from the paragraph-opener, although
    it could just as well have been tabs etc.
And this is another paragraph.

I would like to sort these paragraphs lexicographically; and I don't mind if it's by the first line only or the entire paragraph. If these were one-liner paragraphs - then Bob's your uncle, we got sort. But what do I do otherwise?

I know that, in principle, I could:

  1. Define an escaping scheme
  2. Escape newlines which are followed by white space (and escape the escaping character itself)
  3. Sort the resulting one-line-per-paragraph file
  4. Un-escape

but this seems a bit cumbersome. Can I do better?
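
For reference, the four steps above could be sketched roughly as follows. This is only an illustration, assuming a POSIX shell with sed, awk and sort; the helper name sort_paragraphs_escaped, the choice of backslash as the escape character, and LC_ALL=C (for a byte-wise order) are all made up here rather than required by anything above.

```shell
# sort_paragraphs_escaped: hypothetical helper; reads stdin, writes stdout.
# Stage 1 (sed):  double every backslash, i.e. escape the escape character.
# Stage 2 (awk):  join each indented line onto the previous line, writing the
#                 two characters backslash + "n" in place of the newline.
# Stage 3 (sort): every paragraph is now exactly one line.
# Stage 4 (awk):  un-escape character by character, so that an escaped
#                 backslash followed by "n" is not mistaken for a newline.
sort_paragraphs_escaped() {
  sed 's/\\/\\\\/g' |
  awk '
    NR > 1 { printf "%s", ($0 ~ /^[ \t]/ ? "\\n" : "\n") }
    { printf "%s", $0 }
    END { if (NR) printf "\n" }
  ' |
  LC_ALL=C sort |
  awk '
    {
      out = ""
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "\\" && i < n) {
          c = substr($0, ++i, 1)      # character after the backslash
          if (c == "n") c = "\n"      # escaped newline
        }
        out = out c
      }
      print out
    }
  '
}

# Usage: sort_paragraphs_escaped < input.txt
```

It works, but it also shows why I call the approach cumbersome: most of the code is there only to get the escaping right.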

Notes:

  • I realize this is doable in a straightforward way using an awk or perl script, but the closer an answer is to a one-liner, the better.
  • You may make reasonable assumptions in your answer, such as the GNU variants of certain tools, or POSIX compatibility, or minimum versions of tools etc. But please state them explicitly.

There are 2 answers

Shawn

A few ways:

A one-liner pipeline that uses perl to read the entire file and insert a 0 byte in front of each paragraph (defined as a line that starts with a non-whitespace character), sort -z to sort the resulting 0-separated records, and finally tr to remove those 0 bytes again from the final output. Basically a simpler version of your idea.

perl -0777 -pe 's/^(?=\S)/\0/gm' input.txt | sort -z | tr -d '\0'

(This does require a version of sort that understands the -z option.)


Or a pure perl one-liner that's a bit more verbose but does it all in one process without a pipeline or needing non-POSIX sort options:

perl -ne 'if (/^\s/) { $lines[-1] .= $_ } else { push @lines, $_ }
          END { print for sort @lines }' input.txt

Similar approach using GNU awk instead:

gawk '/^[^[:space:]]/ { lineno++ }
      { lines[lineno] = lines[lineno] $0 "\n" }
      END {
        PROCINFO["sorted_in"] = "@val_str_asc"
        for (lineno in lines)
          printf("%s", lines[lineno])
      }' input.txt

Or, if you're okay with installing extra software, I found a nifty-looking program written in perl called ptp (install it through your OS package manager if available, or with cpan App::PTP, cpanm App::PTP, or your preferred CPAN client):

ptp --input-separator '\n(?=\S)' --sort input.txt

tripleee

Here's a solution using GNU sed. The null bytes make this unlikely to be portable to other sed variants (and the -z option to many commands to handle null-separated input is also a GNU extension).

sed -E 's/^(\S)/\x00\1/' |
sort -z |
tr -d '\000'

You will notice the similarities to the Perl solution.

The escapes \x00 and \S are not generally available in sed. Many sed implementations specifically choke on null bytes, even if you somehow managed to insert a null byte.

The -E option is also nonstandard, though rather widely supported. It enables a somewhat more convenient and familiar regex syntax than the crude default BRE. (In concrete terms, we don't have to backslash the capturing parentheses.)

Tested on Debian 11 (Bullseye); GNU sed 4.7.
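
Since this depends on the -z extension, a script could probe for it before relying on it. A minimal sketch, assuming a POSIX shell; has_sort_z is a hypothetical helper name, not part of any tool:

```shell
# has_sort_z: hypothetical helper. Succeeds only if the local sort accepts -z
# and really treats the null byte as the record separator.
has_sort_z() {
  # Two null-terminated records, "b" then "a"; tr turns the nulls into commas
  # so the result can be compared as an ordinary string.
  [ "$(printf 'b\000a\000' | sort -z 2>/dev/null | tr '\000' ',')" = 'a,b,' ]
}

if has_sort_z; then
  echo 'sort -z is available'
else
  echo 'sort -z is not available; use one of the other approaches'
fi
```

The probe checks the sorted result rather than just the exit status, so a sort that silently ignores or misreads the option would also be reported as unsuitable.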


Here's a variant which I managed to get to work on macOS. It assumes you have no literal escape characters in your input. (It also requires you to replace each ^[ with a literal escape character; in many shells you can do this with ctrl-V Esc; or, of course, write the script in an editor and then run it.)

sed -n -e 's/^\([[:space:]]\)/^[\1/' -e H -e '$!d' \
    -e 'x' -e 's/\n^[/^[/g' -ep |
sort | sed 's/^[/\n/g'

This collects the entire file into the hold space of sed, so it will probably not work very well for large files.

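In brief: if the first character of a line is a whitespace character, prepend an escape character. The H command appends the current line to the hold space. $!d says that if this is not the last input line, delete it and start over with the next one. If we fall through, we are at the end of the file: x swaps the entire collected file from the hold space back into the pattern space, the substitution replaces any newline before an escape character with just the escape character, and p prints it all out. Note that H always puts a newline in front of what it appends, so the collected text starts with an empty line, which then shows up as a blank first line in the sorted output.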

At this point, what you have are lines where the internal newlines have been replaced with an escape character; and so the regular sort works fine.

Finally, we put back the newlines where the escape characters were.

I believe the support for \n should be reasonably ubiquitous, but there might be really old sed versions which require further modifications around that. (If they are that old, they also might not support the POSIX character class notation [[:space:]]. I guess you can replace it with [ ] where the second character is a literal tab character).
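
If typing the escape character by hand is a nuisance, the shell can generate it. The following is a variation on the script above, offered as a sketch rather than a tested replacement (I have not run this form on macOS). It assumes a POSIX shell with printf; the names esc and sort_paragraphs_sed are made up here. It differs from the original in three ways: the escape byte comes from a variable, the final replacement uses a backslash followed by a real newline instead of \n, and one extra substitution drops the empty line which H leaves at the start of the hold space. LC_ALL=C is also an addition, for a byte-wise sort order.

```shell
# sort_paragraphs_sed: hypothetical helper; reads stdin, writes stdout.
# Assumes the input contains no literal escape (ESC, octal 033) characters.
sort_paragraphs_sed() {
  esc=$(printf '\033')   # literal ESC byte, no ctrl-V Esc needed

  # Mark indented lines with ESC, collect everything in the hold space,
  # then at the last line: swap it back, drop the leading newline added by
  # the first H, and glue each marked line onto the line before it.
  sed -n -e "s/^\([[:space:]]\)/${esc}\1/" -e H -e '$!d' \
      -e x -e 's/^\n//' -e "s/\n${esc}/${esc}/g" -e p |
  LC_ALL=C sort |
  sed "s/${esc}/\\
/g"
}

# Usage: sort_paragraphs_sed < input.txt
```

The backslash at the end of the second-to-last line of the function is part of the sed replacement, not a shell line continuation, so the line break after it must be kept as it is.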