XPath and Saxon return tabs as text

Dear Oxygen-Users,

I am having a problem with an indented XML file. The file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.i-d-e.de/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>MS Einsiedeln</title>
      </titleStmt>
      <publicationStmt>
        <p>publicationsStmt empty</p>
      </publicationStmt><sourceDesc>
        <p>sourceDesc empty</p>
      </sourceDesc></fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <div>
          <div>
            <p><c>D</c>ie gotheit iſt beſloſſen<lb/>in dem vater n<ex>atur</ex>elich dar<lb/>vmbe iſt er alvermvgende<lb/>vnd enpfat niht von ite<lb />des<gap reason=""/> er ſelber nit en iſt an<lb/>ſiner go<unclear >tl</unclear>icher macht wan<lb/>ers weſelich i<ex>n</ex> ime vnd an<lb/>ime ſelben beſloſſen hat<space unit="letters" quantity="1" /></p>
          </div>
        </div>
      </div>
    </body>
  </text>
</TEI>

Now, using the XPath 2.0 expression //text(), the tabs are returned as text nodes, for example the first tab before the <teiHeader> tag. In fact, my DTD does not allow #PCDATA inside <TEI>, yet the document validates without any problems. To me this seems contradictory, or am I mistaken?

By the way: for the same file and the same XPath expression, neither XMLSpy with its built-in XSLT engine nor the MS XML parser returns the tabs as text nodes.

Any ideas?

Philipp

PS: I am using Oxygen 9.3

--
Philipp Steinkrüger M.A.
Philosophisches Seminar der Universität zu Köln
Thomas-Institut
Universitätsstraße 22
50923 Köln
+49 221 4702394
philipp.steinkrueger@uni-koeln.de
http://www.thomasinstitut.uni-koeln.de
http://www.philosophie.uni-koeln.de
http://www.ide.de
UNIVERSITÄT ZU KÖLN
GUTE IDEEN. SEIT 1388.

Hello,

Saxon 9 has an option for stripping whitespace-only text nodes, but Oxygen lets you set it only for transformations (Preferences -> XML -> XSLT-FO-XQuery -> XSLT -> Saxon -> Saxon-B/SA). If you set that option to strip whitespace nodes and run an XSLT transformation that uses the expression //text(), the resulting list of nodes no longer contains such nodes. In the next version we will add this Saxon 9 option for XPath expressions too.

Regards,
Sorin

Hello Philipp,

Until we add the option for stripping whitespace-only nodes in the evaluation of expressions on the XPath toolbar, you can use the following XPath 2.0 expression, which does the stripping:

//text()[string-length(translate(., ' ', '')) > 0]

Regards,
Sorin
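For readers who want to reproduce the behaviour outside oXygen, here is a minimal Python sketch that uses only the standard library's xml.dom.minidom. It is not Saxon and it does not evaluate XPath: the helper text_nodes is a hypothetical stand-in for //text(), and the two list comprehensions merely imitate the translate() workaround and a filter that treats all XML whitespace alike. The sample document is an assumed, cut-down, tab-indented version of the file from the first message. One thing it makes visible: translate(., ' ', ''), as printed in this archive with only a space in its second argument, removes spaces but not tabs or line breaks, so tab-only nodes would still pass that filter.

```python
from xml.dom import minidom

# Cut-down, tab-indented sample in the spirit of the file from the first message.
XML = "<TEI>\n\t<teiHeader>\n\t\t<title>MS Einsiedeln</title>\n\t</teiHeader>\n</TEI>"


def text_nodes(node):
    """Yield all descendant text nodes in document order (a stand-in for //text())."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


doc = minidom.parseString(XML)
all_text = [t.data for t in text_nodes(doc)]

# Imitates string-length(translate(., ' ', '')) > 0 when only a space is removed.
space_filter = [s for s in all_text if len(s.replace(" ", "")) > 0]

# Treats space, tab, CR and LF alike, which is what a normalize-space() test does.
whitespace_filter = [s for s in all_text if s.strip(" \t\r\n")]

print(len(all_text))             # text nodes found, indentation included
print(space_filter == all_text)  # does the space-only filter remove anything?
print(whitespace_filter)         # what survives the all-whitespace filter
```

In this sample the indentation accounts for four of the five text nodes, the space-only filter removes none of them, and only the title text survives the all-whitespace filter.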

Hi again,

At 05:48 AM 9/5/2008, you wrote:
> Until we add the option for stripping whitespace-only nodes in the evaluation of expressions on the XPath toolbar you can use the following XPath 2.0 expression which does the stripping:
> //text()[string-length(translate(., ' ', '')) > 0]

Or (more succinctly, to similar effect):

//text()[normalize-space()]

The problem is that either of these expressions will fail to retrieve all the text here

<p>Here's a <b>big</b> <i>bad</i> paragraph.</p>

since the text node between the 'b' and 'i' elements contains only whitespace.

Cheers,
Wendell

======================================================================
Wendell Piez                            mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD 20850                                  Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
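Wendell's caveat about mixed content can be checked with a small Python sketch. Again this is only an approximation built on the standard library's xml.dom.minidom, with a hypothetical text_nodes helper standing in for //text() and a strip() test standing in for the [normalize-space()] predicate; it is not how Saxon or oXygen evaluate the expression.

```python
from xml.dom import minidom


def text_nodes(node):
    """Yield all descendant text nodes in document order (a stand-in for //text())."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


doc = minidom.parseString("<p>Here's a <b>big</b> <i>bad</i> paragraph.</p>")
all_text = [t.data for t in text_nodes(doc)]

# Imitates //text()[normalize-space()]: keep nodes with a non-whitespace character.
kept = [s for s in all_text if s.strip(" \t\r\n")]

print("".join(all_text))  # the paragraph as written
print("".join(kept))      # the paragraph after filtering
```

The single space between the 'b' and 'i' elements is a text node of its own, so the filter drops it and "big" and "bad" run together in the second line of output.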

Thanks, Wendell, for the useful feedback!

Given that such a simple expression gets the same result, I would lean toward not implementing this as an option in oXygen. Philipp, do you need it as an oXygen option, or is using the //text()[normalize-space()] expression OK for you?

To better match the relevant text nodes, you may try something like the expression below, which also matches whitespace-only text nodes that have sibling text nodes containing non-whitespace text:

//text()[normalize-space() or preceding-sibling::text()[normalize-space()] or following-sibling::text()[normalize-space()]]

Best Regards,
George
--
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com

George,

At 11:33 AM 9/9/2008, you wrote:
> Thanks Wendell for the useful feedback!

Always happy to help. :-)

> I would go for not implementing that as an option in oXygen taking into account the simple expression that gets the same result. Philipp, do you think that you need that as an oXygen option or using the //text()[normalize-space()] expression is ok for you?
> To match better the relevant text nodes you may try something like below, matching on the text nodes that although contain only whitespaces have siblings that contain some non whitespace text:
> //text()[normalize-space() or preceding-sibling::text()[normalize-space()] or following-sibling::text()[normalize-space()]]

Or likewise:

//text()[../text()[normalize-space()]]

Note that even this isn't the same as getting only the text appearing where a schema says it's permitted (that is, only "significant" whitespace along with non-whitespace text). But it's frequently good enough.

Cheers,
Wendell
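The sibling-based predicate, and the limit Wendell points out, can be sketched in Python as follows. This is a hedged approximation using the standard library's xml.dom.minidom: the helpers text_nodes, parent_has_real_text and visible_text are hypothetical names that imitate //text()[../text()[normalize-space()]], and both sample documents are invented for illustration.

```python
from xml.dom import minidom

WS = " \t\r\n"  # the four XML whitespace characters


def text_nodes(node):
    """Yield all descendant text nodes in document order (a stand-in for //text())."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def parent_has_real_text(text_node):
    """Imitate [../text()[normalize-space()]]: some text child of the parent is non-blank."""
    return any(
        sib.nodeType == sib.TEXT_NODE and sib.data.strip(WS)
        for sib in text_node.parentNode.childNodes
    )


def visible_text(xml):
    """Concatenate the text nodes that pass the sibling-based filter."""
    doc = minidom.parseString(xml)
    return "".join(t.data for t in text_nodes(doc) if parent_has_real_text(t))


# Indentation inside <div> is dropped, the space inside <p> is kept.
mixed = "<div>\n\t<p>Here's a <b>big</b> <i>bad</i> paragraph.</p>\n</div>"
# Here <p> has no non-blank text child of its own, so its space is dropped too.
edge = "<div>\n\t<p><b>big</b> <i>bad</i></p>\n</div>"

print(visible_text(mixed))
print(visible_text(edge))
```

The first document comes out intact, while in the second the space between "big" and "bad" is lost because the only text node directly inside the 'p' element is that space. This is one concrete case of the filter differing from what a schema would treat as significant whitespace.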

Hi,

In the meantime, Philipp should be aware that there is generally only a loose binding between a schema (or DTD) and a document, such that (other things being equal) processors will not automatically strip whitespace-only text nodes from documents without explicit instruction to do so.

This is by design, since schemas are not always available to processors, and indeed some operations can and should be able to run without schemas. Whitespace stripping without a schema is dangerous and can frequently result in corrupt data where whitespace was stripped improperly.

Accordingly, although the XPath 2.0/XQuery family of technologies provides this feature, Philipp may have to get used to its not always being available, for example when using XPath 1.0. In general, it's something to watch out for; automatic whitespace stripping can easily fall into the category of "be careful what you wish for".

Cheers,
Wendell
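To illustrate why schema knowledge matters for safe stripping, here is a minimal Python sketch under stated assumptions. The set ELEMENT_ONLY is a hypothetical, hand-written stand-in for what a DTD or schema would declare as element-only content; nothing here reads a real DTD, and the sample document is invented. The sketch drops a whitespace-only text node only when its parent is known to have element-only content.

```python
from xml.dom import minidom

WS = " \t\r\n"  # the four XML whitespace characters

# Hypothetical stand-in for schema knowledge: elements declared with
# element-only content, where whitespace is mere formatting.
ELEMENT_ONLY = {"div"}


def text_nodes(node):
    """Yield all descendant text nodes in document order (a stand-in for //text())."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def is_insignificant(text_node):
    """True for whitespace-only text directly inside an element-only element."""
    parent = text_node.parentNode
    return (
        not text_node.data.strip(WS)
        and parent.nodeType == parent.ELEMENT_NODE
        and parent.tagName in ELEMENT_ONLY
    )


doc = minidom.parseString("<div>\n\t<p><b>big</b> <i>bad</i></p>\n</div>")
kept = [t.data for t in text_nodes(doc) if not is_insignificant(t)]

print("".join(kept))
```

With the content model supplied, the indentation inside 'div' disappears while the space inside 'p' survives. Without that information a processor cannot tell the two kinds of whitespace apart, which is the risk described in the message above.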
participants (4)
- George Cristian Bina
- Philipp Steinkrüger
- Sorin Ristache
- Wendell Piez