Omit words in a text

Question

136 views Asked by aDoN At 18 June 2015 at 13:19

Let's say I have this file (file.txt):

Hello my name is Giorgio,
I would like to go with you
to the cinema my friend

I want to exclude from the text the words: my, is and I (not the whole line).

The words are in a file (words.txt) like this:

my
is
I

So the output must be:

Hello name Giorgio,
would like to go with you
to the cinema friend

How can this be performed?

There are 3 answers

123 On 18 June 2015 at 13:47

Pretty scruffy version in awk. If the list of words contains regex metacharacters, this will break. It does take word boundaries into account, though, so it won't match in the middle of words.

awk 'FNR==NR{a[$1];next}
     {for(i in a)gsub("(^|[^[:alpha:]])"i"([^[:alpha:]]|$)"," ")}1' {words,file}.txt

Hello name Giorgio,
 would like to go with you
to the cinema friend

It saves the words from the first file as keys of array a. For each line of the second file, it loops over the saved words and replaces each occurrence with a single space. The pattern uses [^[:alpha:]] (any non-alphabetic character) together with the line start and end anchors to make sure only complete words match. The trailing 1 prints the line.
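If the metacharacter problem mentioned above is a concern, one alternative is to compare whole fields literally instead of building a regex from each word. The following is only a sketch under stated assumptions: the function name omit_words and the array name skip are made up for illustration, words are assumed to be separated by whitespace, and a word glued to punctuation (such as "my,") is left untouched.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): omit_words WORDS_FILE TEXT_FILE
omit_words() {
    awk '
        # First file: remember each non-empty word as an array key.
        FNR == NR { if ($1 != "") skip[$1]; next }
        # Second file: rebuild each line from the fields that are not listed.
        {
            out = ""
            for (i = 1; i <= NF; i++)
                if (!($i in skip))          # literal comparison, no regex involved
                    out = out (out == "" ? "" : " ") $i
            print out
        }
    ' "$1" "$2"
}
```

Called as omit_words words.txt file.txt. Because each line is rebuilt from its fields, runs of whitespace are collapsed to single spaces as a side effect.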

Jahid On 18 June 2015 at 13:30

This should do it:

#!/bin/bash
cp file.txt newfile.txt # we will change newfile.txt in place
while IFS= read -r line;do
[[ $line != "" ]] && sed -i "s/\b$line[[:space:]]*//g" newfile.txt
done <words.txt
cat newfile.txt

Or modifying @choroba's sed solution:

sed 's=^=s/\\b=;s=$=[[:space:]]*//g=' words.txt | sed -f- file.txt

Both of the above also strip any whitespace that follows a matched word.
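One thing to keep in mind with these patterns: they have a leading \b but no trailing one, so a listed word also matches at the start of a longer word. The sketch below illustrates this on a made-up input line and assumes GNU sed, where \b is a word boundary.

```shell
# Assumes GNU sed (\b is a GNU extension); "my island" is a made-up test line.
prefix_only=$(printf 'my island\n' | sed 's/\bis[[:space:]]*//g')
whole_word=$(printf 'my island\n' | sed 's/\bis\b[[:space:]]*//g')

printf '%s\n' "$prefix_only"   # my land   (the "is" of "island" was removed)
printf '%s\n' "$whole_word"    # my island (trailing \b keeps "island" intact)
```

Adding \b after the word, before [[:space:]]*, restricts the match to whole words while still removing the whitespace that follows.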

Output:

Hello name Giorgio,
would like to go with you
to the cinema friend #There's a space here (after friend)

choroba On 18 June 2015 at 13:40 (Accepted Answer)

You can use sed to turn words.txt into a sed script:

sed 's=^=s/=;s=$=//g=' words.txt | sed -f- file.txt

The output differs from the expected one only in whitespace: removing a word doesn't squeeze the surrounding spaces.
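If the leftover whitespace matters, one option is a second pass that squeezes it. This is a minimal sketch, assuming only plain spaces need handling; the function name squeeze_spaces is hypothetical.

```shell
# Hypothetical helper: collapse runs of spaces and trim spaces at line edges.
squeeze_spaces() {
    sed 's/  */ /g; s/^ //; s/ $//'
}
```

It can be appended to the pipeline, for example: ... | sed -f- file.txt | squeeze_spaces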

To match only whole words, add the word boundaries \b:

s=^=s/\\b=;s=$=\\b//g=
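To see what the second sed actually receives, it can help to print the generated script on its own. The sketch below recreates the sample words.txt from the question in a temporary directory (the workdir and script variable names are just for illustration).

```shell
# Recreate the sample word list from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'my\nis\nI\n' > "$workdir/words.txt"

# Turn every word into a substitution command with word boundaries.
script=$(sed 's=^=s/\\b=;s=$=\\b//g=' "$workdir/words.txt")
printf '%s\n' "$script"
# Prints:
# s/\bmy\b//g
# s/\bis\b//g
# s/\bI\b//g
```

Each line of words.txt becomes one s/\bWORD\b//g command, which is why a word containing regex metacharacters or a slash would change the meaning of the generated script.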

A Perl solution that also squeezes the spaces (and isn't affected by metacharacters, because it looks each word up in a hash instead of building a regex from it):

#!/usr/bin/perl
use warnings;
use strict;

open my $WORDS, '<', 'words.txt' or die $!;
my %words;
chomp, $words{$_} = q() while <$WORDS>;

open my $TEXT, '<', 'file.txt' or die $!;
while (<$TEXT>) {
    s=( ?\b(\S+)\b ?)=$words{$2} // $1=ge;
    print;
}
