Friday, 8 August 2014

replace diacritic characters

Introduction:

We had a request to remove all the diacritics from a String. As usual, Java has a nice solution for this.

Solution:


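import java.text.Normalizer;
import java.util.regex.Pattern;

String str = "ÚÝâ";

// NFD splits each accented letter into a base letter plus combining marks
String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
// Matches one or more characters from the Combining Diacritical Marks block
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

System.out.println(pattern.matcher(nfdNormalizedString).replaceAll(""));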

result: UYa
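
If you need this in more than one place, the snippet above could be wrapped in a small helper. This is only a sketch: the class name DiacriticStripper and the method name strip are made up for this post, not part of the JDK. The input is written with Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the JDK.
final class DiacriticStripper {

    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    private DiacriticStripper() { }

    /** Decomposes the input with NFD and removes the combining marks that result. */
    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // U-acute, Y-acute and a-circumflex, the same input as in the post
        System.out.println(strip("\u00DA\u00DD\u00E2")); // prints UYa
    }
}
```

Keep in mind that this only removes marks that NFD actually splits off. A letter such as ø has no canonical decomposition, so it would pass through unchanged.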

The Normalizer class according to the javadoc:

 This class provides the method normalize which transforms Unicode
 text into an equivalent composed or decomposed form, allowing for easier
 sorting and searching of text.


Normalizer.Form.NFD
This is an enum constant that tells the Normalizer class which kind of normalisation you want. The available forms are:

        /**
         * Canonical decomposition.
         */
        NFD,

        /**
         * Canonical decomposition, followed by canonical composition.
         */
        NFC,

        /**
         * Compatibility decomposition.
         */
        NFKD,

        /**
         * Compatibility decomposition, followed by canonical composition.
         */
        NFKC
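
To get a feeling for the difference between the four forms, a small experiment like the one below may help. It is a sketch, and the class name NormalizerForms and the method lengthIn are invented for this example. It normalizes two characters, a precomposed é (U+00E9) and the ﬁ ligature (U+FB01), and prints how many chars each one becomes.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
final class NormalizerForms {

    /** Returns the length of the string after normalizing it to the given form. */
    static int lengthIn(String text, Normalizer.Form form) {
        return Normalizer.normalize(text, form).length();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";     // e with acute accent as one precomposed code point
        String fiLigature = "\uFB01"; // the "fi" ligature, a compatibility character

        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form
                    + ": e-acute -> " + lengthIn(eAcute, form) + " char(s)"
                    + ", fi ligature -> " + lengthIn(fiLigature, form) + " char(s)");
        }
    }
}
```

The é becomes two chars under NFD and NFKD (base letter plus combining accent) and stays one char under NFC and NFKC. The ligature is only split into "f" and "i" by the compatibility forms NFKD and NFKC; the canonical forms leave it alone.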

The NFD variant decomposes the Ú into a plain U followed by a separate combining acute accent (U+0301). This makes it possible to strip the accent with a regular expression like:
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
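
You can make the decomposition visible by printing the code points of the normalized string. The sketch below does that for Ú; the class DecompositionDemo and its codePoints helper are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
final class DecompositionDemo {

    /** Formats every code point of the string as U+XXXX, separated by spaces. */
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U with acute accent as a single precomposed code point (U+00DA)
        String decomposed = Normalizer.normalize("\u00DA", Normalizer.Form.NFD);

        System.out.println(codePoints(decomposed)); // prints U+0055 U+0301

        // The second char is the combining accent; check which Unicode block it lives in.
        char mark = decomposed.charAt(1);
        System.out.println(Character.UnicodeBlock.of(mark)
                == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS); // prints true
    }
}
```

The accent ends up in the Combining Diacritical Marks block, which is exactly the block the regular expression removes.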

Last but not least, the question we had was: what in heaven's name is \\p{InCombiningDiacriticalMarks}?
We found the answer on Stack Overflow:

\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
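
To check that the two notations behave the same, you could compile both and run them on the same decomposed input. This is a sketch assuming Java 7 or later; the class name BlockNotationDemo and its two helper methods are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class BlockNotationDemo {

    // The "In" prefix notation used earlier in this post
    private static final Pattern IN_NOTATION =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // The two-part notation from the quote (assumes Java 7 or later)
    private static final Pattern BLOCK_NOTATION =
            Pattern.compile("\\p{Block=CombiningDiacriticalMarks}+");

    static String stripWithIn(String decomposed) {
        return IN_NOTATION.matcher(decomposed).replaceAll("");
    }

    static String stripWithBlock(String decomposed) {
        return BLOCK_NOTATION.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // Already decomposed input: U + acute, Y + acute, a + circumflex
        String decomposed = "U\u0301Y\u0301a\u0302";

        System.out.println(stripWithIn(decomposed));    // prints UYa
        System.out.println(stripWithBlock(decomposed)); // prints UYa
    }
}
```

Which one you pick is mostly a matter of taste; the two-part form just spells out that the property is a block.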

Conclusion:

The code we built today is just the tip of the iceberg. The Normalizer can do much more than this, and I hope to do some other things with it in the future.

Have fun! 
