Java by experience: replace diacritic characters

Introduction:

We had a request to remove all diacritics from a String. As usual, Java has a nice solution for this.

Solution:

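import java.text.Normalizer;
import java.util.regex.Pattern;

String str = "ÚÝâ";

// NFD splits each accented letter into a base letter plus combining marks
String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
// Matches runs of characters from the Combining Diacritical Marks block
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

System.out.println(pattern.matcher(nfdNormalizedString).replaceAll(""));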

result: UYa
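
If you need this in more than one place, the snippet above can be wrapped in a small helper. This is only a sketch: the class name DiacriticStripper and the method name stripDiacritics are made up for this post, and returning null for a null input is our own choice. The input uses Unicode escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are ours, not part of the JDK.
class DiacriticStripper {

    // Compiled once and reused: Pattern instances are immutable and thread-safe.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Returns the input without combining diacritical marks (assumption: null in, null out).
    static String stripDiacritics(String input) {
        if (input == null) {
            return null;
        }
        // Step 1: split accented letters into base letter + combining marks
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: drop the combining marks, keeping the base letters
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // U with acute, Y with acute, a with circumflex
        System.out.println(stripDiacritics("\u00DA\u00DD\u00E2")); // UYa
    }
}
```

Keeping the Pattern in a static final field avoids recompiling the regular expression on every call.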

The Normalizer class, according to the Javadoc:

This class provides the method normalize which transforms Unicode
text into an equivalent composed or decomposed form, allowing for easier
sorting and searching of text.

Normalizer.Form.NFD

This is an enum that tells the Normalizer class which kind of normalisation you want. There are four forms:

/**
 * Canonical decomposition.
 */
NFD,

/**
 * Canonical decomposition, followed by canonical composition.
 */
NFC,

/**
 * Compatibility decomposition.
 */
NFKD,

/**
 * Compatibility decomposition, followed by canonical composition.
 */
NFKC
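
To get a feel for the difference between the four forms, here is a small sketch that prints the code points each form produces. The class name NormalizationForms and the helper codePoints are made up for this post. The second sample, the "fi" ligature (U+FB01), is our own pick to show what compatibility decomposition adds on top of canonical decomposition.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are ours.
class NormalizationForms {

    // Renders each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U with acute (U+00DA) and the "fi" ligature (U+FB01)
        String[] samples = { "\u00DA", "\uFB01" };
        for (String sample : samples) {
            System.out.println("input " + codePoints(sample));
            for (Normalizer.Form form : Normalizer.Form.values()) {
                String normalized = Normalizer.normalize(sample, form);
                System.out.println("  " + form + ": " + codePoints(normalized));
            }
        }
    }
}
```

For U+00DA, NFD and NFKD give U+0055 U+0301, while NFC and NFKC give back U+00DA. For U+FB01, NFD and NFC leave the ligature alone, while NFKD and NFKC turn it into U+0066 U+0069, the plain letters f and i.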

The NFD variant decomposes Ú into the base letter U and a separate combining acute accent (U+0301). This makes it possible to remove the accent with a regular expression like:

Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
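
The two steps can be followed one at a time in a small sketch. The class name DecompositionDemo and the helper decompose are made up for this post. The last line shows a limitation worth knowing: a letter like the o with stroke (U+00F8) has no canonical decomposition, so this approach leaves it untouched.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are ours.
class DecompositionDemo {

    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Canonical decomposition (NFD) of the given string.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u00DA";              // U with acute as a single char
        String decomposed = decompose(precomposed); // 'U' followed by U+0301

        System.out.println(precomposed.length());                     // 1
        System.out.println(decomposed.length());                      // 2
        System.out.println(decomposed.charAt(0));                     // U
        System.out.println(decomposed.charAt(1) == '\u0301');         // true
        System.out.println(MARKS.matcher(decomposed).replaceAll("")); // U

        // o with stroke (U+00F8) does not decompose, so nothing is stripped
        System.out.println(decompose("\u00F8").equals("\u00F8"));     // true
    }
}
```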

Last but not least, the question we had was: what in heaven's name is \\p{InCombiningDiacriticalMarks}?

We found the answer on Stack Overflow:

\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
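
Both notations can be tried side by side. The sketch below assumes Java 7 or later; the class name BlockPropertyDemo and the helper strip are made up for this post. The Pattern Javadoc writes the keyword in lowercase (block= or blk=), so that is the spelling used here.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are ours.
class BlockPropertyDemo {

    // Decomposes the input (NFD) and removes everything the given regex matches.
    static String strip(String input, String regex) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return Pattern.compile(regex).matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // The combining acute accent (U+0301) lives in the block the pattern names
        System.out.println(Character.UnicodeBlock.of(0x0301)
                == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS); // true

        String input = "\u00DA\u00DD\u00E2"; // U with acute, Y with acute, a with circumflex

        // "In" prefix notation, as used in the solution
        System.out.println(strip(input, "\\p{InCombiningDiacriticalMarks}+"));     // UYa
        // Two-part notation, available since Java 7
        System.out.println(strip(input, "\\p{block=CombiningDiacriticalMarks}+")); // UYa
    }
}
```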

Conclusion:

The code we built today is just the tip of the iceberg. The Normalizer can do much more than this, and I hope to explore more of it in the future.

Have fun!

Friday, 8 August 2014
