how do remove carriage returns in a txt file

1.1k views Asked by At

I recently received some data items 99 pipe delimited txt files, however in some of them and ill use dataaddress.txt as an example, where there is a return in the address eg

14 MakeUp Road

Hull

HU99 9HU

It goming out on 3 rows rather than one, bear in made there is data before and after this address separated by pipes. It just seems to be this addresss issue which is causing me issues in oading the txt file correcting using SSIS.

Rather than go back to source I wondered if there was a way we can manipulate the txt file to remove these carriage returns while not affected the row end returns if that makes sense.

2

There are 2 answers

0
Ivan Georgiev On BEST ANSWER

I would use sed or awk. I will show you how to do this with awk, because it more platform independent. If you do not have awk, you can download a mawk binary from http://invisible-island.net/mawk/mawk.html.

The idea is as follows - tell awk that your line separator is something different, not carriage return or line feed. I will use comma.

Than use a regular expression to replace the string that you do not like.

Here is a test file I created. Save it as test.txt:

1,Line before ...
2,Broken line ... 14 MakeUp Road

Hull

HU99 9HU
3,Line after

And call awk as follows:

    awk 'BEGIN { RS = ","; ORS=""; s=""; } $0 != "" {  gsub(/MakeUp Road[\n\r]+Hull[\n\r]+HU99 9HU/, "MakeUp Road Hull HU99 9HU"); print s $0; s="," }' test.txt

I suggest that you save the awk code into a file named cleanup.awk. Here is the better formatted code with explanations.

BEGIN {
  # This block is executed at the beginning of the file
  RS = ","; # Tell awk our records are separated by comma
  ORS="";   # Tell awk not to use record separator in the output
  s="";     # We will print this as record separator in the output
}

{
 # This block is executed for each line.
 # Remember, our "lines" are separated by commas.

 # For each line, use a regular expression to replace the bad text.
 gsub(/MakeUp Road[\n\r]+Hull[\n\r]+HU99 9HU/, "MakeUp Road Hull HU99 9HU"); 

 # Print the replaced text - $0 variable represents the line text.
 print s $0; s=","
}

Using the awk file, you can execute the replacement as follows:

awk -f cleanup.awk test.txt

To process multiple files, you can create a bash script:

for f in `ls *.txt`; do
    # Execute the cleanup.awk program for each file.
    # Save the cleaned output to a file in a directory ../clean
    awk -f cleanup.awk $f > ../clean/$f
done
0
Ivan Georgiev On

You can use sed to remove the line feed and carriage return characters:

sed ':a;N;$!ba;s/MakeUp Road[\n\r]\+/MakeUp Road /g' test.txt | sed ':a;N;$!ba;s/Hull[\n\r]\+/Hull /g'

Explanation:

  1. :a create a label 'a'
  2. N append the next line to the pattern space
  3. $! if not the last line, ba branch (go to) label 'a'
  4. s substitute command, \n represents new line, \r represents carriage return, [\n\r]+ - match new line or carriage return in a sequence as many times as they occur (at least one), /g global match (as many times as it can)

sed will loop through step 1 to 3 until it reach the last line, getting all lines fit in the pattern space where sed will substitute all \n characters