How to grep folder that contains UTF-16 or UTF-32 encoded files?

Question

Asked by Peter on 17 July 2020 at 10:02

There are three identical .txt files in one folder, each containing a single word: "hello". The first file is encoded in UTF-8, the second in UTF-16, and the third in UTF-32 (all created on Linux). However, running grep

grep -i "hello" *.txt

returns only one result, the UTF-8 file. grep does not find the other two files.

How can I grep a folder in which some of the files are encoded in UTF-16 or UTF-32?

There are 2 answers

**Shawn** · Answer 1 · 2020-07-17T10:24:59+00:00

One way uses perl instead of grep:

$ perl -CO -Mopen="IN,:encoding(UTF-16)" -ne 'print if /hello/i' utf16_file.txt

For UTF-32 files, change UTF-16 to UTF-32 in the -Mopen argument.

This tells perl to write its output as UTF-8 (-CO), to treat files opened for reading as UTF-16 (-Mopen), and to print only the lines that match the regular expression between the slashes, case-insensitively (the /i flag).

Or use iconv to convert the file first:

$ iconv -f UTF-16 -t UTF-8 utf16_file.txt | grep -i hello
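To see the iconv approach end to end, here is a small sketch that recreates the three files from the question in a scratch directory and compares plain grep with iconv + grep. The file names (utf8.txt, utf16.txt, utf32.txt) and variable names are made up for the demo, and it assumes mktemp, iconv and grep are available:

```shell
#!/bin/sh
# Recreate the three files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/utf8.txt"
printf 'hello\n' | iconv -f UTF-8 -t UTF-16 > "$tmpdir/utf16.txt"
printf 'hello\n' | iconv -f UTF-8 -t UTF-32 > "$tmpdir/utf32.txt"

# Count the files plain grep reports (-l lists matching file names).
plain_matches=$(grep -il hello "$tmpdir"/*.txt | wc -l | tr -d ' ')

# Count the files that match once each one is converted to UTF-8 first.
converted_matches=0
for pair in utf8.txt:UTF-8 utf16.txt:UTF-16 utf32.txt:UTF-32; do
    name=${pair%%:*}      # part before the colon: file name
    charset=${pair##*:}   # part after the colon: source encoding
    if iconv -f "$charset" -t UTF-8 "$tmpdir/$name" | grep -qi hello; then
        converted_matches=$((converted_matches + 1))
    fi
done

echo "plain grep: $plain_matches file(s), iconv + grep: $converted_matches file(s)"
rm -rf "$tmpdir"
```

The UTF-16 and UTF-32 files store each letter with extra zero bytes in between, so the byte sequence "hello" never appears in them; that is why grep only matches after conversion.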

If you can't easily tell the encoding from the filename, something like the following script may help. It uses file to guess the encoding, then iconv to convert to UTF-8, and feeds the result to GNU grep:

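#!/bin/sh

# Usage: smartgrep REGEXP FILE...
# This assumes we're running in a UTF-8 locale
to_charset=UTF-8

regexp="$1"
shift 1

for file in "$@"; do
    # -b omits the filename from file's output, so a name that happens
    # to contain e.g. "UTF-16" can't be mistaken for the encoding.
    case "$(file -b "$file")" in
        *UTF-16*) charset=UTF-16;;
        *UTF-32*) charset=UTF-32;;
        *UTF-8*) charset=UTF-8;;
        *ASCII*) charset=ASCII;;
        *) echo "$file has an unknown encoding." >&2
           charset=ASCII;;
    esac
    #echo "Using $charset for $file"
    # -e keeps a pattern that starts with "-" from being read as an option.
    iconv -f "$charset" -t "$to_charset" "$file" | \
        grep -i -H --label "$file" -e "$regexp"
done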

Usage: smartgrep hello *.txt
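If file doesn't recognize an encoding on your system, one possible fallback is to look at the byte order mark (BOM) yourself. This is not what the script above does; it's a rough sketch of a different detection step, and detect_bom_charset is a made-up helper name. It only works for files that actually start with a BOM, and a UTF-16LE file whose first character is U+0000 would be misread as UTF-32:

```shell
#!/bin/sh
# Hypothetical helper: map a file's leading bytes (its BOM) to an iconv
# charset name. Prints nothing when no BOM is recognized.
detect_bom_charset() {
    # od prints the first four bytes as hex; tr strips the whitespace.
    bom=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
    case "$bom" in
        # UTF-32 is tested before UTF-16 because the little-endian
        # UTF-32 BOM (ff fe 00 00) begins with the UTF-16 one (ff fe).
        fffe0000)    echo UTF-32 ;;
        0000feff)    echo UTF-32 ;;
        efbbbf*)     echo UTF-8 ;;
        fffe*|feff*) echo UTF-16 ;;
    esac
}
```

The printed name could then be passed to iconv -f in place of the charset guessed from file's output; iconv picks the byte order from the BOM.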

**Cyril Chaboisseau** · Answer 2 · 2022-10-11T08:02:36+00:00

ripgrep is a very good alternative to many grep programs (including GNU grep). It can search files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. UTF-16 is detected automatically from the byte order mark; other encodings have to be named with the -E/--encoding flag.

You can use it the same way as grep:

$ rg -i pattern *.txt
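When a file has no byte order mark, ripgrep has nothing to detect, so the encoding can be given explicitly with -E/--encoding. A small sketch, assuming rg and iconv are installed; nobom.txt is a made-up file name, and the script simply skips the search if rg is missing:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)

# iconv writes UTF-16LE without a BOM, so there is nothing to auto-detect.
printf 'hello\n' | iconv -f UTF-8 -t UTF-16LE > "$tmpdir/nobom.txt"

rg_output=""
if command -v rg >/dev/null 2>&1; then
    # -E names the encoding; ripgrep transcodes to UTF-8 before searching.
    rg_output=$(rg -i -E utf-16le hello "$tmpdir/nobom.txt")
    echo "$rg_output"
else
    echo "rg is not installed; skipping" >&2
fi

rm -rf "$tmpdir"
```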

This chart compares the different greplike tools: https://beyondgrep.com/feature-comparison/
