How to check if a string contains accented Latin characters like é in Ruby?


Given:

str1 = "é"   # Latin accent
str2 = "囧"  # Chinese character
str3 = "ジ"  # Japanese character
str4 = "e"   # English character

How can I differentiate str1 (accented Latin characters) from the rest of the strings?

Update:

Given

str1 = "\xE9" # accented Latin é, stored as the single byte \xE9 when read from a file

How would the answer be different?

3

There are 3 answers

4
Matt Brictson On BEST ANSWER

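I would first strip out all plain ASCII characters with gsub, and then check with a regex whether any Latin-script characters remain. Anything matching \p{Latin} that survives the strip must be a non-ASCII Latin character, which covers the accented Latin characters (along with other non-ASCII Latin letters such as ß or æ).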

def latin_accented?(str)
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

latin_accented?("é")  #=> 0 (truthy)
latin_accented?("囧") #=> nil (falsy)
latin_accented?("ジ") #=> nil (falsy)
latin_accented?("e")  #=> nil (falsy)
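
For the update, where the string arrives as the raw byte \xE9, one option is to reinterpret the bytes as Latin-1 and transcode them to UTF-8 before running the same check. This is only a minimal sketch: it assumes the file really is ISO-8859-1, and the helper name latin1_to_utf8 is hypothetical, not part of Ruby.

```ruby
# Hypothetical helper: treat the raw bytes as ISO-8859-1 and transcode to UTF-8.
# Assumption: the source file is Latin-1; the bytes alone cannot confirm that.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Same check as above, repeated so the sketch is self-contained.
def latin_accented?(str)
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

str1 = latin1_to_utf8("\xE9")
str1                   #=> "é"
latin_accented?(str1)  #=> 0 (truthy)
```

Without the conversion, "\xE9" is not a valid UTF-8 string (valid_encoding? returns false), so the regex check cannot be applied to it as it stands.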
0
Wally Altman On

I'd use a two-stage approach:

  1. Rule out strings containing non-Latin characters by attempting to encode the string as Latin-1 (ISO-8859-1).
  2. Test for accented characters with a regular expression.

Example:

def is_accented_latin?(test_string)
  test_string.encode("ISO-8859-1")   # just to see if it raises an exception

  test_string.match(/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ]/)
rescue Encoding::UndefinedConversionError
  false
end

I strongly suggest you select for yourself the accented characters you're attempting to screen for, rather than just copying what I've written; I certainly may have missed some. Also note that this will always return false for strings containing any character that cannot be encoded as Latin-1 (such as Chinese or Japanese characters), even if the string also contains an accented Latin character.
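
If maintaining a hand-written list is a concern, a different technique (swapped in here, not the approach above) is Unicode decomposition: NFD normalization splits "é" into "e" plus a combining mark, so an accent shows up as a Latin letter followed by \p{Mn}. A rough sketch, assuming UTF-8 input and Ruby 2.4 or later; the method name decomposable_accent? is made up for illustration:

```ruby
# Sketch only: detects Latin letters that decompose into base letter + combining mark.
# Letters with no decomposition (ß, æ, ø, ...) are NOT detected by this check.
def decomposable_accent?(str)
  # NFD turns "é" into "e" followed by U+0301 (combining acute accent).
  str.unicode_normalize(:nfd).match?(/\p{Latin}\p{Mn}/)
end

decomposable_accent?("é")  #=> true
decomposable_accent?("囧") #=> false
decomposable_accent?("ジ") #=> false
decomposable_accent?("e")  #=> false
```

Unlike the Latin-1 approach, this does not reject mixed strings: a string with both Chinese characters and an accented Latin letter returns true.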

0
codevolution On

Try using /\p{Latin}/.match(strX), or /[\p{Latin}&&[^a-zA-Z]]/ if you want to detect only special Latin characters. Note that the && intersection only works inside a character class, hence the outer brackets.

By the way, "e" (str4) is also a Latin character.
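
Ruby's && set intersection only takes effect inside a character class, so the "special Latin only" pattern needs enclosing brackets. A small sketch run against the four strings from the question (the constant name SPECIAL_LATIN is just for illustration):

```ruby
# Latin-script characters, minus the plain ASCII letters a-z and A-Z.
SPECIAL_LATIN = /[\p{Latin}&&[^a-zA-Z]]/

%w[é 囧 ジ e].each do |s|
  puts "#{s}: #{s.match?(SPECIAL_LATIN)}"
end
# Prints:
#   é: true
#   囧: false
#   ジ: false
#   e: false
```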

Hope it helps.