How to Replace Accented Latin Characters in Ruby

How do I replace accented Latin characters in Ruby?

Rails already has a built-in for normalizing. Normalize your string to form KD, then remove the remaining non-ASCII characters (i.e. the accent marks), like this:

>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"
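
Outside Rails, roughly the same idea can be sketched in plain Ruby (2.2 or later) with String#unicode_normalize; as far as I know, newer Rails versions also deprecate mb_chars.normalize in favor of it. The helper name strip_accents is hypothetical. Note one deliberate difference: this sketch removes only combining marks (\p{Mn}) rather than every non-ASCII character, and it does not downcase.

```ruby
# Hypothetical helper, not part of Rails or of the answer above.
# Requires Ruby 2.2+ for String#unicode_normalize.
def strip_accents(str)
  # NFKD splits "à" into "a" followed by a combining grave accent.
  decomposed = str.unicode_normalize(:nfkd)
  # Drop the nonspacing (combining) marks and keep everything else.
  decomposed.gsub(/\p{Mn}/, "")
end

puts strip_accents("àáâãäå")       # prints aaaaaa
puts strip_accents("Hé les mecs!") # prints He les mecs!
```

Letters that have no decomposition (ø or ł, for example) pass through unchanged with this approach, whereas the non-ASCII filter above would delete them.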

Replace accented character in Ruby

The é in your name is actually two separate Unicode codepoints: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

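# "e\u0301" spells out the decomposed é explicitly, so the result does not depend on how this page was encoded
p "e\u0301".each_codepoint.map { |c| format("U+%04X", c) }.join(" ") # => "U+0065 U+0301"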

However, the é in your regex is a single codepoint: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Wikipedia has an article about Unicode equivalence, and the official Unicode FAQ also explains this topic.

How to normalize Unicode strings in Ruby depends on the Ruby version. Ruby has had built-in Unicode normalization support since 2.2, so you no longer have to require a library or install a gem as in previous versions. To normalize name, simply call String#unicode_normalize with :nfc or :nfkc as the argument to compose e (U+0065) and the combining accent (U+0301) into é (U+00E9):

name = File.basename(Dir.getwd)
name.unicode_normalize! # thankfully :nfc is the default
name.downcase!

Of course, you could also use decomposed characters in your regular expressions, but that probably won't work on other file systems, and you would still have to normalize, this time with :nfd or :nfkd to decompose.
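
To make the composed/decomposed distinction concrete, here is a small sketch. The helper same_text? is a hypothetical name of mine; it only shows that the two spellings of é compare as different until both are normalized to the same form.

```ruby
# Hypothetical helper: compares two strings after composing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

composed   = "\u00E9"  # é as one codepoint
decomposed = "e\u0301" # e followed by COMBINING ACUTE ACCENT

p composed == decomposed           # false: different codepoint sequences
p same_text?(composed, decomposed) # true: equal once both are NFC

# :nfd goes the other way and splits the composed form again.
p composed.unicode_normalize(:nfd).codepoints.map { |c| format("U+%04X", c) }
# ["U+0065", "U+0301"]
```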

I should also point out that converting é to e or ü to u causes information loss. For example, the German word Müll (trash) would be converted to Mull (mull / forest humus).
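
If that loss matters, one option is a language-specific replacement table instead of simply dropping the accent. The following is only a sketch for German under my own assumptions: the constant GERMAN_TRANSLITERATIONS and the method transliterate_german are hypothetical names, and the ae/oe/ue/ss convention suits German only.

```ruby
# Hypothetical table for German; other languages need different rules.
GERMAN_TRANSLITERATIONS = {
  "\u00E4" => "ae", # ä
  "\u00F6" => "oe", # ö
  "\u00FC" => "ue", # ü
  "\u00C4" => "Ae", # Ä
  "\u00D6" => "Oe", # Ö
  "\u00DC" => "Ue", # Ü
  "\u00DF" => "ss"  # ß
}.freeze

def transliterate_german(str)
  pattern = Regexp.union(GERMAN_TRANSLITERATIONS.keys)
  # Compose first so that "u" + combining diaeresis is also recognized as ü.
  str.unicode_normalize(:nfc).gsub(pattern, GERMAN_TRANSLITERATIONS)
end

puts transliterate_german("Müll") # prints Muell
```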

Ruby method to remove accents from UTF-8 international characters

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
=> true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
=> "He les mecs!"

How to check if a string contains accented Latin characters like é in Ruby?

I would first strip out all plain ASCII characters with gsub, and then check with a regex whether any Latin characters remain. This should detect accented Latin characters.

def latin_accented?(str)
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

latin_accented?("é") #=> 0 (truthy)
latin_accented?("囧") #=> nil (falsy)
latin_accented?("ジ") #=> nil (falsy)
latin_accented?("e") #=> nil (falsy)
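
One caveat worth sketching: if the é arrives in decomposed form (e followed by U+0301), stripping ASCII leaves only the combining mark, which is not in the Latin script, so the check above returns nil. A variant that composes to NFC first handles that case. The name latin_accented_nfc? is hypothetical, and match? assumes Ruby 2.4 or later.

```ruby
# Hypothetical variant of the check above; composes to NFC before testing.
def latin_accented_nfc?(str)
  # After NFC, "e" + U+0301 becomes the single non-ASCII Latin letter é.
  non_ascii = str.unicode_normalize(:nfc).gsub(/\p{ASCII}/, "")
  non_ascii.match?(/\p{Latin}/)
end

p latin_accented_nfc?("e\u0301") # true (decomposed é)
p latin_accented_nfc?("\u00E9")  # true (composed é)
p latin_accented_nfc?("e")       # false
```

This still misses combinations that have no composed form in Unicode (an x with an acute accent, say), so treat it as a partial improvement only.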

How to match latin and not latin characters by normalised version of string?

It may be prohibitively hard to normalize the thing you match against, so I recommend changing the regex.

I don't know whether Ruby supports the POSIX equivalence-class syntax [=o=] (which matches o and all its accented versions), but there is another way.

Replace every letter that has an accented variant with a character class listing the plain and accented forms. For example:

/Bart[lł]omiej [ZŻ][oó][lł][cć]/g
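
Writing those classes by hand gets tedious, so here is a sketch of generating the pattern in Ruby from the unaccented spelling. The table ACCENT_VARIANTS and the method accent_tolerant_regex are hypothetical, and the table covers only the letters from the example above; a real one would need every variant you care about.

```ruby
# Hypothetical table: each plain letter maps to itself plus its variants.
ACCENT_VARIANTS = {
  "c" => "c\u0107", # c, ć
  "l" => "l\u0142", # l, ł
  "o" => "o\u00F3", # o, ó
  "z" => "z\u017C", # z, ż
  "C" => "C\u0106", # C, Ć
  "L" => "L\u0141", # L, Ł
  "O" => "O\u00D3", # O, Ó
  "Z" => "Z\u017B"  # Z, Ż
}.freeze

def accent_tolerant_regex(plain)
  source = plain.each_char.map do |ch|
    variants = ACCENT_VARIANTS[ch]
    # The variants are letters only, so they are safe inside [...].
    variants ? "[#{variants}]" : Regexp.escape(ch)
  end.join
  Regexp.new(source)
end

regex = accent_tolerant_regex("Bartlomiej Zolc")
p regex.match?("Bart\u0142omiej \u017B\u00F3\u0142\u0107") # true
p regex.match?("Bartlomiej Zolc")                          # true
```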

Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv)

it also removes spaces, dots, dashes, and who knows what else.

It shouldn't.

string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s

You've mistyped it: there should be a backslash before the x00, so that it refers to the NUL character.

/[^\-x00-\x7F]/n # So it would leave the dash alone

You've put the ‘-’ between the ‘\’ and the ‘x’, which will break the reference to the null character, and thus break the range.
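
To see why the typo eats spaces, dots, and dashes: without the backslash, the class means "anything that is not x, not 0, and not in the range from 0 to \x7F", and those punctuation characters all sort below the digit 0. A quick sketch on an ASCII-only sample (the constant names are mine, and I left off the /n flag for simplicity):

```ruby
BROKEN      = /[^x00-\x7F]/   # keeps only "x" and the range "0".."\x7F"
BROKEN_DASH = /[^\-x00-\x7F]/ # same, but additionally keeps "-"
FIXED       = /[^\x00-\x7F]/  # keeps the whole ASCII range

sample = "a - b. c"

p sample.gsub(BROKEN, "")      # "abc"
p sample.gsub(BROKEN_DASH, "") # "a-bc"
p sample.gsub(FIXED, "")       # "a - b. c"
```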


