Xerces command line parsing?

Hello, I am looking for the right syntax and method to be able to batch-parse XML files from the command line using Xerces. I need to use Xerces as I am attempting to replicate parsing using oXygen (which has Xerces as its default parser). If anyone can send along the syntax for doing this or can point me to a resource that can help, I'd very much appreciate it.

I previously used xmllint/LIBXML to do command line parsing of my TEI files, which worked well for files calling on the TEI xlite DTD. I am now dealing with files that use the full TEI and must rely on the xml catalog, i.e.:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
    <!ENTITY % TEI.XML 'INCLUDE'>
    <!ENTITY % TEI.mixed 'INCLUDE'>
    <!ENTITY % TEI.drama 'INCLUDE'>
    <!ENTITY % TEI.corpus 'INCLUDE'>
    <!ENTITY % TEI.prose 'INCLUDE'>
    <!ENTITY % TEI.figures 'INCLUDE'>
    <!ENTITY % TEI.linking 'INCLUDE'>
    <!ENTITY % TEI.transcr 'INCLUDE'>
    <!ENTITY % TEI.names.dates 'INCLUDE'>
    <!ENTITY % TEI.spoken 'INCLUDE'>
    <!ENTITY % TEI.header 'INCLUDE'>
    <!ENTITY % ISOlat1 SYSTEM 'http://www.tei-c.org/Entity_Sets/Unicode/iso-lat1.ent'>
    %ISOlat1;
    <!ENTITY % ISOlat2 SYSTEM 'http://www.tei-c.org/Entity_Sets/Unicode/iso-lat2.ent'>
    %ISOlat2;
    <!ENTITY % ISOnum SYSTEM 'http://www.tei-c.org/Entity_Sets/Unicode/iso-num.ent'>
    %ISOnum;
    <!ENTITY % ISOpub SYSTEM 'http://www.tei-c.org/Entity_Sets/Unicode/iso-pub.ent'>
    %ISOpub;
    ]>

I need to use Xerces, because I find that the default parser in oXygen (which is Xerces) can successfully parse these files (and LIBXML does not work for files using the full TEI due to problems with the DTD).

My best understanding (which may be completely off) is that to use Xerces as an XML parser from the command line, what I am essentially doing is using the syntax to run an XML file through an XSL stylesheet (on the assumption that the source file has to validate to run successfully).

I have modified a previous stylesheet that processes all TEI elements found in these documents, and I use this syntax:

    java com.icl.saxon.StyleSheet -x org.apache.xerces.parsers.SAXParser source_file.xml stylesheet.xsl > /dev/null

I am using Xerces as it comes with oXygen (and have not downloaded it separately). Since I am only really interested in parsing and not the output, I redirect it to /dev/null. I have the following in my bash profile for the CLASSPATH:

    CLASSPATH=$CLASSPATH:/Applications/oxygen/lib/saxon.jar:/Applications/oxygen/frameworks/docbook/xsl/extensions/saxon653.jar.ext:/Applications/oxygen/lib/xercesImpl.jar
    export CLASSPATH

The above command WORKS, and will pick up SOME errors, but is clearly missing others. Does anyone have any more straightforward syntax for just PARSING with Xerces, or have any ideas why some errors (I have tested) are not being reported through this process? (One possibility is that it's just checking well-formedness, not validity, which I need to test further.)

Thanks in advance for any help/suggestions.

Andrew

Andrew Rouner
Digital Library Services
Washington University Libraries
St. Louis, MO
EMAIL: arouner@wustl.edu

From: Oxygen XML Editor support <support@oxygenxml.com>
Date: Tue, 25 Jul 2006 12:47:23 +0300
To: Andrew Rouner <arouner@wustl.edu>
Subject: Re: Differences in validators/ dtd problems?

Dear Andrew Rouner,

Thank you for contacting us. The default parser used by oXygen is Xerces 2.8.0 (that is the latest Xerces version). At first glance this looks like a problem/bug in XMLLINT. If you want to invoke Xerces to parse a document from the command line, you can do that through one of its sample applications: http://xerces.apache.org/xerces2-j/samples.html

Best Regards,
George
---------------------------------------------------------------------
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com

Dear Andrew,

You guessed correctly: specifying the parser class when you perform the transformation makes the XSLT engine use that parser, but this does not turn on validation, so you only get a well-formedness check.

Xerces does not have a command line utility. However, the Xerces samples contain a number of example classes that can be invoked from the command line. See http://xerces.apache.org/xerces2-j/samples.html

For instance, you can use the sax.Counter sample: http://xerces.apache.org/xerces2-j/samples-sax.html#Counter

Note that you need to download a Xerces distribution to also get the samples jar, which needs to be in the classpath together with xercesImpl.jar and xml-apis.jar. A caveat here is that you cannot enable catalog support from the available command line options.

Best Regards,
George
---------------------------------------------------------------------
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
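
To illustrate the point about validation, here is a minimal Java sketch (it is not one of the Xerces samples) that parses each file named on the command line with DTD validation switched on through the standard JAXP API. The class name ValidateFiles and its validate and describe helpers are made up for this example. It assumes JAXP will select Xerces when xercesImpl.jar and xml-apis.jar are on the classpath, and that it otherwise uses the parser bundled with the JDK. It does not configure an XML catalog, so the DTD and entity files must be reachable through their system identifiers.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is an assumption made for this example.
class ValidateFiles {

    // Parses one file with DTD validation enabled and returns the problems found.
    static List<String> validate(File xml) throws ParserConfigurationException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // without this, only well-formedness is checked

        final List<String> problems = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add(describe("warning", e));
            }

            @Override
            public void error(SAXParseException e) { // validity errors are reported here
                problems.add(describe("error", e));
            }
        };

        try {
            factory.newSAXParser().parse(xml, handler);
        } catch (SAXParseException e) { // a fatal (well-formedness) error ends the parse
            problems.add(describe("fatal", e));
        } catch (SAXException e) {
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    // Formats a parse problem as "level: systemId:line:column: message".
    private static String describe(String level, SAXParseException e) {
        return level + ": " + e.getSystemId() + ":" + e.getLineNumber()
                + ":" + e.getColumnNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) throws Exception {
        boolean failed = false;
        for (String name : args) {
            for (String problem : validate(new File(name))) {
                System.out.println(problem);
                failed = true;
            }
        }
        System.exit(failed ? 1 : 0); // non-zero status if any file had a problem
    }
}
```

Once compiled, something like "java ValidateFiles *.xml" should print one line per reported problem and exit with status 1 if any file had one, which may be enough for a batch check in a shell script.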
_______________________________________________ oXygen-user mailing list oXygen-user@oxygenxml.com http://www.oxygenxml.com/mailman/listinfo/oxygen-user