How to match Chinese characters with grep?

I have verified that [\u4e00-\u9fff] matches Chinese characters in vim.

:%g/[\u4e00-\u9fff]/d

The command above deletes every line that contains Chinese characters.

ls  /tmp/test
ktop 1_001.png.bak
fonts.dir.bak
New
Screenshot from 2016-09-12 16:50:29.png.bak
你好

Now I want to list the files whose names contain Chinese characters.

ls  /tmp/test |grep -P  '[\x4e\x00-\x9f\xff]'  

This command does not return the files whose names contain Chinese characters.
How can I fix it?

ls /tmp/test | grep -v '[a-z]' does return it, but that is not what I want.

There are 2 answers

sideshowbarker On BEST ANSWER

To match just lines (filenames) that have Han (Chinese) characters, you can use [\p{Han}] :

ls  /tmp/test | grep -P '[\p{Han}]'

\p{Han} is one of the Unicode script properties usable in any PCRE-supporting engine. The supported script properties include:

\p{Common} \p{Arabic} \p{Armenian} \p{Bengali} \p{Bopomofo}
\p{Braille} \p{Buhid} \p{Canadian_Aboriginal} \p{Cherokee}
\p{Cyrillic} \p{Devanagari} \p{Ethiopic} \p{Georgian} \p{Greek}
\p{Gujarati} \p{Gurmukhi} \p{Han} \p{Hangul} \p{Hanunoo} \p{Hebrew}
\p{Hiragana} \p{Inherited} \p{Kannada} \p{Katakana} \p{Khmer} \p{Lao}
\p{Latin} \p{Limbu} \p{Malayalam} \p{Mongolian} \p{Myanmar} \p{Ogham}
\p{Oriya} \p{Runic} \p{Sinhala} \p{Syriac} \p{Tagalog} \p{Tagbanwa}
\p{TaiLe} \p{Tamil} \p{Telugu} \p{Thaana} \p{Thai} \p{Tibetan}
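
As a quick check of the \p{Han} command, here is a minimal sketch that recreates a few of the file names from the question in a scratch directory and filters them. It assumes GNU grep built with PCRE support and that a UTF-8 locale named C.UTF-8 exists on the system. The variable names demo_dir and han_names are made up for the illustration.

```shell
# Sketch only: assumes GNU grep with -P (PCRE) support and a C.UTF-8 locale.
demo_dir=$(mktemp -d)                      # hypothetical scratch directory
touch "$demo_dir/fonts.dir.bak" "$demo_dir/New" "$demo_dir/你好"

# \p{Han} is only meaningful when grep decodes its input as UTF-8,
# so the locale is set explicitly for this one command.
han_names=$(ls "$demo_dir" | LC_ALL=C.UTF-8 grep -P '\p{Han}')

printf '%s\n' "$han_names"
rm -rf "$demo_dir"
```

If those assumptions hold, only 你好 should be printed. Setting the locale on the grep command alone keeps the rest of the shell session unchanged.
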
Des Nerger On

Neither the grep -P '[\p{Han}]' approach nor the grep -P "[一-鿿]" approach worked on my old version of grep (2.10). However, if the character encoding is guaranteed to be UTF-8, we can always expand the \u4e00-\u9fff range down to the byte level:

ls  /tmp/test |grep -P  '[\xE5-\xE9][\x80-\xBF][\x80-\xBF]|\xE4[\xB8-\xBF][\x80-\xBF]'

This worked fine with my version.
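
The byte ranges in that pattern follow from how UTF-8 encodes the two endpoints of the range. The sketch below dumps the bytes of U+4E00 (一) and U+9FFF (鿿) with od, then runs the pattern against two sample names. Forcing LC_ALL=C so that grep treats \xE4 and the other escapes as raw bytes is my assumption, not something stated in the answer. The names utf8_hex, low_bytes, high_bytes and byte_matches are hypothetical.

```shell
# Sketch only: assumes a POSIX shell, od, tr, and GNU grep with -P support.

# Print the UTF-8 bytes of a string as lowercase hex, e.g. "e4b880".
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

low_bytes=$(utf8_hex '一')     # U+4E00, lower end of the range
high_bytes=$(utf8_hex '鿿')    # U+9FFF, upper end of the range
printf 'U+4E00 -> %s\nU+9FFF -> %s\n' "$low_bytes" "$high_bytes"

# Lead byte E4 needs a second byte of B8..BF; lead bytes E5..E9 accept any
# continuation byte (80..BF). LC_ALL=C makes grep match byte by byte.
byte_matches=$(printf '%s\n' 'New' '你好' |
    LC_ALL=C grep -P '[\xE5-\xE9][\x80-\xBF][\x80-\xBF]|\xE4[\xB8-\xBF][\x80-\xBF]')
printf '%s\n' "$byte_matches"
```

The endpoints come out as e4b880 and e9bfbf, which is why the E4 lead byte gets its own, narrower alternative while E5 through E9 share one. In a UTF-8 locale, grep -P may read \xE5 as the code point U+00E5 rather than a byte, so the pattern may not match there.
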