How to merge specific lines from multiple text files

I have four files, each containing 153 data points. Each data point consists of 3 lines, i.e.:

File 1:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file1
datapoint_2_name
datapoint_2_info
datapoint_2_data_file1
datapoint_3_name
datapoint_3_info
datapoint_3_data_file1

File 2:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file2
datapoint_2_name
datapoint_2_info
datapoint_2_data_file2
datapoint_3_name
datapoint_3_info
datapoint_3_data_file2

File 3:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file3
datapoint_2_name
datapoint_2_info
datapoint_2_data_file3
datapoint_3_name
datapoint_3_info
datapoint_3_data_file3

File 4:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file4
datapoint_2_name
datapoint_2_info
datapoint_2_data_file4
datapoint_3_name
datapoint_3_info
datapoint_3_data_file4

and so on.

The data in all files is the same except for the third line of each data point. I am trying to merge these files so that the output contains datapoint_name and datapoint_info from just the first file, followed by the third line (datapoint_data) from each of the four files, like so:

Output:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file1
datapoint_1_data_file2
datapoint_1_data_file3
datapoint_1_data_file4
datapoint_2_name
datapoint_2_info
datapoint_2_data_file1
datapoint_2_data_file2
datapoint_2_data_file3
datapoint_2_data_file4
datapoint_3_name
datapoint_3_info
datapoint_3_data_file1
datapoint_3_data_file2
datapoint_3_data_file3
datapoint_3_data_file4

I've tried the script below in Python (I've replaced the actual patterns with 'some pattern' in these lines; the real patterns match the lines correctly, and I've verified that):

output_file = "combined_sequences_and_data2.txt"

with open(output_file, 'w') as output:
    combined_data = []

    with open('file1', 'r') as file:
        for line in file:
            line = line.strip()
            if line.startswith('some pattern'):
                combined_data.append(line)
            elif line.isalpha():
                combined_data.append(line)
            elif line.startswith('some pattern'):
                combined_data.append(line)
                with open('file2', 'r') as file:
                    for line in file:
                        line = line.strip()
                        if line.startswith('some pattern'):
                            combined_data.append(line)
                            with open('file3', 'r') as file:
                                for line in file:
                                    line = line.strip()
                                    if line.startswith('some pattern'):
                                        combined_data.append(line)
                                        with open('file4', 'r') as file:
                                            for line in file:
                                                line = line.strip()
                                                if line.startswith('some pattern'):
                                                    combined_data.append(line)



        # Write the combined data to the output file
        output.write('\n'.join(combined_data) + '\n')

This doesn't run to completion; it just freezes, and I can't work out where.

I also tried a bash script that calls awk:

#!/bin/bash

file1="filename"
file2="filename"
file3="filename"
file4="filename"

group_size=3
line_count=1

while read -r line; do
  if [ $line_count -le $group_size ]; then
    group_lines[$line_count]=$line
    line_count=$((line_count + 1))
  fi

  if [ $line_count -gt $group_size ]; then
    for i in "${group_lines[@]}"; do
      echo "$i"
    done

    awk 'NR == 3' "$file2"
    awk 'NR == 3' "$file3"
    awk 'NR == 3' "$file4"

    line_count=1
    unset group_lines
  fi
done < "$file1"

This one is closer to working, but it doesn't advance through the third lines of the remaining 3 files: it prints the same line (line 3 of each file, i.e. the data for datapoint 1) over and over for every data point in file 1.

There are 6 answers

0
SIGHUP On BEST ANSWER

You don't need to examine the file contents as you know that the values you're interested in are in groups of 3. Therefore:

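from contextlib import ExitStack

INFILES = "file1", "file2", "file3", "file4"
OUTFILE = "combined_sequences_and_data2.txt"

with ExitStack() as stack, open(OUTFILE, "w") as output:
    # mfd is the main (first) file, ofds are the others; the stack closes them all on exit
    mfd, *ofds = (stack.enter_context(open(file)) for file in INFILES)
    for i, line in enumerate(mfd, 1):
        output.write(line)
        if i % 3:
            # name or info line: skip the matching line in the other files
            for fd in ofds:
                next(fd)
        else:
            # data line: append the matching data line from each other file
            for fd in ofds:
                output.write(next(fd))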
0
Daweo On

The problem I spot in this piece of your code

with open('file1', 'r') as file:
    for line in file:
        line = line.strip()
        if line.startswith('some pattern'):
            combined_data.append(line)
        elif line.isalpha():
            combined_data.append(line)
        elif line.startswith('some pattern'):
            combined_data.append(line)
            with open('file2', 'r') as file:
                for line in file:
                    line = line.strip()
                    if line.startswith('some pattern'):
                        combined_data.append(line)
                        with open('file3', 'r') as file:
                            for line in file:
                                line = line.strip()
                                if line.startswith('some pattern'):
                                    combined_data.append(line)
                                    with open('file4', 'r') as file:
                                        for line in file:
                                            line = line.strip()
                                            if line.startswith('some pattern'):
                                                combined_data.append(line)

is that you are shadowing the file variable (and line as well): every nested with rebinds the same names. You should use a different variable name in each with when dealing with multiple files.
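
A minimal sketch of that advice, assuming the four files have the same number of lines and that every third line is the data line, as in the question. The function name merge_with_distinct_handles and its parameters are illustrative and not from the original post. Each file gets its own handle name, and the files are read in lockstep with zip instead of being reopened inside nested loops. In the original, that nesting re-reads file2 for every matching line of file1, file3 for every match in file2, and so on, which is a likely reason it appears to freeze.

```python
def merge_with_distinct_handles(path1, path2, path3, path4, out_path):
    """Write name/info from path1 plus the data line of every file to out_path."""
    # One distinct name per file handle, so nothing is shadowed.
    with open(path1) as f1, open(path2) as f2, \
         open(path3) as f3, open(path4) as f4, \
         open(out_path, "w") as out:
        # zip yields the n-th line of every file together and stops
        # at the end of the shortest file.
        for line_no, lines in enumerate(zip(f1, f2, f3, f4), 1):
            if line_no % 3:
                # name or info line: keep the copy from the first file only
                out.write(lines[0].rstrip("\n") + "\n")
            else:
                # data line: keep the copy from every file, in file order
                for line in lines:
                    out.write(line.rstrip("\n") + "\n")
```

Called as merge_with_distinct_handles('file1', 'file2', 'file3', 'file4', 'combined.txt'), this makes a single pass over each input file.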

1
karakfa On

Another awk solution, without caching data:

$ paste file{1..4} | 
  awk -F'\t' '{if (NR%3) print $1; else for(i=1;i<=NF;i++) print $i}'

This assumes the files are in the format shown (and that the data contains no tab characters, since paste joins the lines with tabs). It does no validation, although that would not be hard to add.
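
The kind of validation meant here might look like the following sketch, written in Python to match the code in the question rather than in awk. It assumes the 3-line layout from the question; the function name merge_validated and the error messages are made up for illustration. It checks that all files have the same number of lines, that the name and info lines agree across files, and that the last data point is complete.

```python
from contextlib import ExitStack
from itertools import zip_longest

def merge_validated(paths):
    """Return the merged lines, raising ValueError if the files disagree."""
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p)) for p in paths]
        merged = []
        count = 0
        # zip_longest pads shorter files with None, which exposes length mismatches.
        for count, row in enumerate(zip_longest(*handles), 1):
            if None in row:
                raise ValueError(f"files differ in length at line {count}")
            row = [line.rstrip("\n") for line in row]
            if count % 3:
                # name/info lines must be identical in every file
                if len(set(row)) != 1:
                    raise ValueError(f"name/info mismatch at line {count}")
                merged.append(row[0])
            else:
                # data line: one entry per file, in the order given
                merged.extend(row)
        if count % 3:
            raise ValueError("last data point is incomplete")
        return merged
```

Writing the result is then a matter of joining the returned list with newlines.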

0
markp-fuso On

Assumptions:

  • contents of all 4 files can fit in memory (via awk arrays)
  • contents of files are NOT the literal strings datapoint_#_name, datapoint_#_info and datapoint_#_data_file1 (OP should update the question to show examples of actual data)
  • data lines do not include embedded linefeeds

One awk idea (replaces OP's current while | for | awk{3} script):

file[1]="file1"                                                  # save actual filenames in bash file[] array
file[2]="file2"
file[3]="file3"
file[4]="file4"

awk '
FNR   == 1 { pt=0 }                                              # reset our index/counter at beginning of new file
FNR%3 == 1 { name[++pt] = $0 }                                   # increment index/counter, save "name" entry         
FNR%3 == 2 { info[pt]   = $0 }                                   # save "info" entry
FNR%3 == 0 { dfile[pt]  = dfile[pt] (dfile[pt] ? ORS : "") $0 }  # save "data file" entry by appending to previous entries
END        { for (i=1; i<=pt; i++)                               # loop through index/counter range
                 print name[i] ORS info[i] ORS dfile[i]          # print array entries
           }
' "${file[@]}"                                                   # obtain filenames from bash file[] array

This generates:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file1
datapoint_1_data_file2
datapoint_1_data_file3
datapoint_1_data_file4
datapoint_2_name
datapoint_2_info
datapoint_2_data_file1
datapoint_2_data_file2
datapoint_2_data_file3
datapoint_2_data_file4
datapoint_3_name
datapoint_3_info
datapoint_3_data_file1
datapoint_3_data_file2
datapoint_3_data_file3
datapoint_3_data_file4
0
Ed Morton On

Using any awk as long as the number of input files doesn't exceed the "too many open files" threshold (4 won't be a problem):

$ cat tst.awk
BEGIN {
    rslt = 1
    while ( rslt > 0 ) {
        ++lineNr
        for ( i=1; i<ARGC; i++ ) {
            rslt = (getline < ARGV[i])
            if ( (rslt > 0) && ((i == 1) || (lineNr%3 == 0)) ) {
                print
            }
        }
    }
}

$ awk -f tst.awk File1 File2 File3 File4
datapoint_1_name
datapoint_1_info
datapoint_1_data_file1
datapoint_1_data_file2
datapoint_1_data_file3
datapoint_1_data_file4
datapoint_2_name
datapoint_2_info
datapoint_2_data_file1
datapoint_2_data_file2
datapoint_2_data_file3
datapoint_2_data_file4
datapoint_3_name
datapoint_3_info
datapoint_3_data_file1
datapoint_3_data_file2
datapoint_3_data_file3
datapoint_3_data_file4

Alternatively, inspired by @karakfa's answer, here is a version that doesn't require the input to be free of tabs (but depends heavily on the number of input files being 4):

$ paste -d$'\n' File1 File2 File3 File4 |
    awk '{n=((NR-1)%12)+1} (NR%4 == 1) || ((9 <= n) && (n <= 12))'
datapoint_1_name
datapoint_1_info
datapoint_1_data_file1
datapoint_1_data_file2
datapoint_1_data_file3
datapoint_1_data_file4
datapoint_2_name
datapoint_2_info
datapoint_2_data_file1
datapoint_2_data_file2
datapoint_2_data_file3
datapoint_2_data_file4
datapoint_3_name
datapoint_3_info
datapoint_3_data_file1
datapoint_3_data_file2
datapoint_3_data_file3
datapoint_3_data_file4
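
The constants 4, 9 and 12 in that filter all follow from the number of input files. As a rough illustration (in Python, to match the question's code), the same selection rule can be written for any number of files. The names keep_line and n_files are hypothetical, and the sketch assumes the newline-interleaved order that paste -d$'\n' produces, with 3 lines per data point.

```python
def keep_line(nr, n_files=4):
    """Decide whether line number nr (1-based) of the interleaved stream is kept."""
    block = 3 * n_files                    # interleaved lines per data point (12 for 4 files)
    n = (nr - 1) % block + 1               # position inside the current data point
    from_first_file = (nr - 1) % n_files == 0
    is_data_line = n > 2 * n_files         # last n_files positions (9..12 for 4 files)
    return from_first_file or is_data_line

kept = [nr for nr in range(1, 13) if keep_line(nr)]
print(kept)  # [1, 5, 9, 10, 11, 12]
```

For 4 files this keeps the name and info lines from the first file (positions 1 and 5) and all four data lines (positions 9 to 12) of each data point.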
2
Sahil On

You can achieve your desired output by reading the files sequentially and merging the data point information from the first file with the third lines from all of the files:

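output_file = 'combined_output.txt'

files = ['file1', 'file2', 'file3', 'file4']

names_and_info = []  # (name, info) pairs, taken from the first file only
data_lines = []      # data_lines[p] collects the third line of data point p from every file

for file_index, file_name in enumerate(files):
    with open(file_name, 'r') as file:
        lines = [line.rstrip('\n') for line in file]
    # Each data point occupies 3 consecutive lines: name, info, data.
    # Assumes no file has more data points than the first one.
    for k in range(0, len(lines) - 2, 3):
        if file_index == 0:
            names_and_info.append((lines[k], lines[k + 1]))
            data_lines.append([])
        data_lines[k // 3].append(lines[k + 2])

with open(output_file, 'w') as output:
    for (name, info), data in zip(names_and_info, data_lines):
        output.write('\n'.join([name, info, *data]) + '\n')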