Search for Characters in Unicode *range*

Dear all,
In order to make sure that we have caught all special characters in an externally transcribed TEI/XML file, I would like to search for all characters above Unicode code point 0x00FF. Can this be done in the Regular Expression Find box? (I found the search for single Unicode code points with \u, \x etc., but can't figure out whether this can be used to search for characters (not) in code point ranges.)
Thanks for any suggestion,
Andreas
-- Dr. Andreas Wagner twitter: @anwagnerdreas Project "The School of Salamanca" web: http://salamanca.adwmainz.de Academy of Sciences and Literature, Mainz fon: +49 (0)69/798-32774 and Institute of Philosophy fax: +49 (0)69/798-32794 Goethe University Frankfurt IGF HP 25 / R 2.455 Norbert-Wollheim-Platz 1 60629 Frankfurt am Main

Hi Andreas,
Sure, this can be done with a basic regex query: [\u00D8-\u00F6]
And for your example: [\u0100-\u1F9FF]
Unfortunately, oXygen 18 seems to have a bug with this query (precisely: with 5-digit hex codes), as it also matches characters below \u0100 (the code point that follows \u00FF). However, you can also work with negation: [^\u0000-\u00FF]
And this seems to work fine :)
Regards, Tobias
Tobias Fischer XML and E-Book Development Phone: +49 (0)7071 9876-44 · Fax: -22 Mail: tobias.fischer@pagina-tuebingen.de pagina GmbH - Publikationstechnologien Herrenberger Straße 51 | D-72070 Tübingen www.pagina-online.de | www.parsx.de Commercial Register Stuttgart - HRB 380249 Managing Director: Tobias Ott
In reply to Andreas Wagner's message of 24.06.2016, 09:50.
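For anyone who wants to run the same check outside the editor, here is a minimal sketch of the negated-class search. It uses Python's `re` module, not the Oxygen Find box, so treat it as an illustration of the pattern only; the function name `find_above_latin1` and the sample text are made up for this example.

```python
import re

# Negated class: any character that is NOT in U+0000..U+00FF.
ABOVE_LATIN1 = re.compile(r"[^\u0000-\u00FF]")

def find_above_latin1(text):
    """Return (line, column, character, 'U+XXXX') per match; positions are 1-based."""
    hits = []
    # Split on "\n" only, so that unusual separators (e.g. U+2028) are still reported.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for m in ABOVE_LATIN1.finditer(line):
            ch = m.group()
            hits.append((line_no, m.start() + 1, ch, f"U+{ord(ch):04X}"))
    return hits

# Hypothetical sample: ß and ñ are at or below U+00FF, ſ and œ are above it.
sample = "<p>Straße, señor</p>\n<p>ſalamanca œuvre</p>"
for hit in find_above_latin1(sample):
    print(hit)
# (2, 4, 'ſ', 'U+017F')
# (2, 14, 'œ', 'U+0153')
```

Reporting the code point next to the position makes it easier to decide which of the hits are expected in the transcription and which are not.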

Hello Tobias,
Note that the Java/Oxygen regex engine supports only 4-digit hex codes with the \u Unicode code point escape. If you use 5 digits, the 5th digit is interpreted independently as a literal, which creates undesired side effects. For example, [\u0100-\u1F9FF] is interpreted as [\u0100-\u1F9F]|[F], so you are inadvertently also matching "F".
Regards, Adrian
Adrian Buza oXygen XML Editor and Author Support Tel: +1-650-352-1250 ext.2020 Fax: +40-251-461482
In reply to Tobias Fischer's message of 24.06.2016, 11:17.
_______________________________________________ oXygen-user mailing list oXygen-user@oxygenxml.com https://www.oxygenxml.com/mailman/listinfo/oxygen-user
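The four-digit rule for \u is easy to reproduce. The sketch below uses Python's `re` module, which applies the same rule to `\u` (exactly four hex digits), so the five-digit pattern misbehaves there in the same way. The eight-digit `\U` escape is Python's own spelling and is not claimed to work in Oxygen; Java's `java.util.regex` documents a `\x{h...h}` form for code points above U+FFFF, but whether the Find box accepts it was not tested here.

```python
import re

# Intended: U+0100..U+1F9FF.
# Actual: the range U+0100..U+1F9F plus a literal "F",
# because \u consumes exactly four hex digits.
five_digit = re.compile(r"[\u0100-\u1F9FF]")

# Python writes code points beyond U+FFFF as \U plus eight hex digits.
eight_digit = re.compile(r"[\u0100-\U0001F9FF]")

text = "TEI File \u0153 \U0001F600"   # contains "F", U+0153 and U+1F600

print(five_digit.findall(text))    # ['F', 'œ']  ("F" matched by accident, U+1F600 missed)
print(eight_digit.findall(text))   # ['œ', '😀']
```

Note that the five-digit pattern fails in both directions: it matches a plain "F" and it misses the character above U+FFFF that it was meant to catch.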

Either positively, [\u0100-\uffff] (it doesn’t seem to stretch above 4 hex digits yet), or negatively, [^\u0000-\u00ff].
In reply to Andreas Wagner's message of 24.06.2016, 09:50.
-- Gerrit Imsieke Geschäftsführer / Managing Director le-tex publishing services GmbH Weissenfelser Str. 84, 04229 Leipzig, Germany Phone +49 341 355356 110, Fax +49 341 355356 510 gerrit.imsieke@le-tex.de, http://www.le-tex.de Registergericht / Commercial Register: Amtsgericht Leipzig Registernummer / Registration Number: HRB 24930 Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt, Dr. Reinhard Vöckler
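The two forms are not quite equivalent once a file contains characters above U+FFFF. A small sketch of the difference, again in Python's `re` rather than in Oxygen (the helper `code_points` and the sample string are made up for this example; an engine that works on UTF-16 units may treat surrogate pairs differently):

```python
import re

bmp_range = re.compile(r"[\u0100-\uFFFF]")     # positive form, stops at U+FFFF
not_latin1 = re.compile(r"[^\u0000-\u00FF]")   # negated form, no upper bound

def code_points(pattern, text):
    """List the matched characters as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in pattern.findall(text)]

# U+00E9 (below the limit), U+0175 (above it), U+1D54A (beyond U+FFFF)
text = "caf\u00E9 \u0175 \U0001D54A"

print(code_points(bmp_range, text))    # ['U+0175']
print(code_points(not_latin1, text))   # ['U+0175', 'U+1D54A']
```

If the goal is to catch every unexpected character, the negated form is the safer default, since it has no upper bound to get wrong.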

Duh. I bet you must have heard my head banging on the desk even in Leipzig. Thanks a lot, I don't know whether I would have thought of this otherwise.
Cheers, Andreas
In reply to Gerrit Imsieke's message of 2016-06-24, 10:24.
participants (4)
- Andreas Wagner
- Imsieke, Gerrit, le-tex
- Oxygen XML Editor Support (Adrian Buza)
- Tobias Fischer | pagina GmbH