We had a request to remove all the diacritics from a String. As usual, Java has a nice solution for this.
Solution:
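import java.text.Normalizer;
import java.util.regex.Pattern;
String str = "ÚÝâ";
// NFD splits each accented letter into a base letter plus combining marks
String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
// Matches runs of characters from the Combining Diacritical Marks block
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
System.out.println(pattern.matcher(nfdNormalizedString).replaceAll(""));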
result: UYa
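If you need this in more than one place, the snippet can be wrapped in a small helper. This is only a sketch: the class name DiacriticStripper and the method name stripDiacritics are our own invention, not part of the JDK. The pattern is compiled once and reused, and the input is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
public class DiacriticStripper {

    // Compiled once, because Pattern instances are immutable and reusable.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decomposes the input (NFD) and then removes the combining marks.
    public static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u00DA\u00DD\u00E2" is the string ÚÝâ from the example above.
        System.out.println(stripDiacritics("\u00DA\u00DD\u00E2")); // UYa
        // "Crème brûlée" written with escapes.
        System.out.println(stripDiacritics("Cr\u00E8me br\u00FBl\u00E9e")); // Creme brulee
    }
}
```

Keep in mind that this only removes marks that NFD actually splits off. A letter such as ø has no canonical decomposition, so it comes out unchanged.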
The Normalizer class according to the javadoc:
This class provides the method normalize which transforms Unicode
text into an equivalent composed or decomposed form, allowing for easier
sorting and searching of text.
Normalizer.Form.NFD
Normalizer.Form is an enum that tells the Normalizer class which kind of normalisation you want; NFD is one of its constants. There are four:
/**
* Canonical decomposition.
*/
NFD,
/**
* Canonical decomposition, followed by canonical composition.
*/
NFC,
/**
* Compatibility decomposition.
*/
NFKD,
/**
* Compatibility decomposition, followed by canonical composition.
*/
NFKC
The NFD variant decomposes the Ú into a plain U followed by a separate combining acute accent (U+0301). This makes it possible to match the accents with a regular expression like
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
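To see the decomposition for yourself, you can print the code points before and after normalising. The sketch below is our own illustration; the class NormalizationForms and its codePoints helper are made-up names, and it assumes Java 8 or later for String.codePoints().

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
public class NormalizationForms {

    // Formats every code point of s as "U+XXXX", separated by spaces.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00DA"; // Ú as one precomposed character

        // NFD: canonical decomposition into base letter + combining mark.
        String nfd = Normalizer.normalize(composed, Normalizer.Form.NFD);
        // NFC: decomposition followed by canonical composition.
        String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);

        System.out.println(codePoints(nfd)); // U+0055 U+0301
        System.out.println(codePoints(nfc)); // U+00DA
    }
}
```

After NFD the string is two characters long, and the second one (U+0301) sits in the Combining Diacritical Marks block, which is exactly what the regular expression removes.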
Last but not least, the question we had was: what in heaven's name is \\p{InCombiningDiacriticalMarks}?
We found this answer on Stack Overflow:
\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
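As a quick check of that answer, the sketch below runs the same input through both notations. It assumes Java 7 or later, where java.util.regex.Pattern accepts the block keyword; the class BlockNotation and its strip method are hypothetical names of our own.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
public class BlockNotation {

    // Classic notation: the block name prefixed with "In".
    public static final Pattern IN_PREFIX =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Two-part notation: property name, '=', block name.
    public static final Pattern BLOCK_KEYWORD =
            Pattern.compile("\\p{Block=CombiningDiacriticalMarks}+");

    // Decomposes the input (NFD) and removes whatever the given pattern matches.
    public static String strip(Pattern marks, String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return marks.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "\u00DA\u00DD\u00E2"; // ÚÝâ
        System.out.println(strip(IN_PREFIX, input));     // UYa
        System.out.println(strip(BLOCK_KEYWORD, input)); // UYa
    }
}
```

Both patterns refer to the same block (U+0300 to U+036F), so which one you use is mostly a matter of readability.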
Conclusion:
The code we built today is just the tip of the iceberg. The Normalizer can do much more than this, and I hope to explore more of it in the future.
Have fun!