Why can't Oxygen format and indent a 200MB file in 30 minutes whereas SAXON can do it in 15 seconds?

Hi Folks,

I have (by today's standards) a medium sized XML file that is 200MB in size. It is unformatted (no indentation). I opened the file in Oxygen and clicked on the format and indent button. After 30 minutes of processing Oxygen gave up with an error message. So I wrote a simple 1-line XSLT program (below) to do the indentation, it took about 15 seconds and was done. Why is it that Oxygen can't indent the file in 30 minutes whereas an XSLT processor (Saxon) can do it in 15 seconds?

/Roger

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

    <xsl:output method="xml" indent="yes" />

    <xsl:template match="/">
        <xsl:copy-of select="/" />
    </xsl:template>

</xsl:stylesheet>

Hello,

Like the saying goes, it depends...

> I opened the file in Oxygen and clicked on the format and indent button. After 30 minutes of processing Oxygen gave up with an error message.

You haven't mentioned what the error message was, but I'm pretty sure Oxygen ran out of memory. Basically, it ran out of memory in the first few seconds of formatting, and then the Java VM struggled to admit this fact for the rest of the 30 minutes. So, first, it depends on how much memory Oxygen had available (Help > About, JVM Memory ... Total).
To keep a long story short, if you want Oxygen to format and indent a large file as fast as Saxon without risking running out of memory in the process, it should do this either without opening the document or at document opening time.

A. Without opening the document

Use Tools > "Format and Indent Files", or right click on the document in the Project view and choose "Format and Indent Files".

B. At document opening time

1. Set Options > Preferences > Editor / Format, [x] "Format and indent the document on open".
2. Close the document.
3. Reopen the document (File > Reopen last closed editor / Ctrl+Alt+T).
4. When you are done, clear the box for [ ] "Format and indent the document on open", because it will apply to all opened documents.

Read on for the juicy details...

There is actually a huge difference between how Oxygen (an IDE) and Saxon (a CLI tool) achieve this and what their requirements are, even though the result may be the same.

I can't really speak for Saxon's inner workings, but it might not even build an XML model in memory, depending on Saxon optimizations and whether Saxon streaming is used. In theory, if you use an input stream that reads and parses the XML one chunk at a time, and an output stream that writes the XML as the first one reads it, you don't actually have to load the entire thing into memory for the purpose of formatting it. Using Saxon streaming would probably be faster than your result and could work for a file of any size, but I digress.

By Oxygen's standards 200MB is a large file (> 30MB). That means some optimizations are enforced to accommodate a file of this size. [1] For 300MB or more, Oxygen has a "huge files" mode that no longer loads the entire document in memory and has some more severe limitations. [2] So this is closer to Oxygen's "huge" limit rather than the "large" limit.
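To make the "read one chunk, write it out immediately" idea mentioned above more concrete, here is a minimal Python sketch built on the standard library's SAX parser. It is only an illustration of streaming indentation, not how Saxon or Oxygen work internally. The names `IndentingHandler` and `indent_stream` are made up for this example. As simplifying assumptions, it drops comments, processing instructions and the DOCTYPE, and it trims whitespace around text, so it is not suitable for whitespace-sensitive content.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class IndentingHandler(xml.sax.ContentHandler):
    """Re-serializes SAX events with indentation as they arrive.

    Only the current depth and the text of the element being read are kept
    in memory, never the whole document.
    """

    def __init__(self, out, indent="  "):
        super().__init__()
        self.out = out          # text stream that receives the indented XML
        self.indent = indent
        self.depth = 0
        self.text = []          # character data seen since the last tag
        self.leaf = False       # True while the open element has no child yet

    def _take_text(self):
        # Return the pending text (trimmed and escaped) and reset the buffer.
        text = "".join(self.text).strip()
        self.text = []
        return escape(text)

    def startDocument(self):
        self.out.write('<?xml version="1.0" encoding="UTF-8"?>')

    def startElement(self, name, attrs):
        text = self._take_text()
        if text:  # mixed content: put the text on its own line
            self.out.write("\n" + self.indent * self.depth + text)
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        self.out.write("\n" + self.indent * self.depth + f"<{name}{attr_text}>")
        self.depth += 1
        self.leaf = True

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self.depth -= 1
        text = self._take_text()
        if self.leaf:  # text-only or empty element: close on the same line
            self.out.write(f"{text}</{name}>")
        else:
            if text:
                self.out.write("\n" + self.indent * (self.depth + 1) + text)
            self.out.write("\n" + self.indent * self.depth + f"</{name}>")
        self.leaf = False

    def endDocument(self):
        self.out.write("\n")


def indent_stream(source, out, indent="  "):
    """Parse `source` (a path or binary file object) and write indented XML to `out`."""
    xml.sax.parse(source, IndentingHandler(out, indent))
```

To use it, open an output file in text mode with `encoding="utf-8"` (to match the XML declaration the sketch writes) and call `indent_stream("big.xml", out)`, where `big.xml` is a placeholder file name. Memory use then depends on the nesting depth and the largest single text node rather than on the file size, which is the property the paragraph above is getting at.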
Because Oxygen is an IDE, it loads the document in memory as text (with the exceptions/optimizations mentioned above) and then builds various specialized models from the document so that you have all those editing helpers (Outline, Attributes, Model), or a much more complex model if you switch to Author mode.

When you format the document while it is already opened in Text mode, Oxygen parses the XML and serializes it with the configured formatting options. Due to the way the model of a text editor is updated, it is not feasible to make this into a stream and repeatedly update parts of the file (e.g. line by line), so the entire document content is replaced when the formatting ends. This causes a duplication of the entire document in memory. Oxygen also provides Undo for that formatting in case you don't like it or have triggered it accidentally, so it also has to keep the old document.

All of this comes at a high price with regard to memory, which is what Oxygen usually stumbles upon (running out of memory) when working with large files. So, as much as we would want to make it work with large files, the amount of memory required to achieve this within an IDE is several times larger than Saxon's (assuming Saxon would actually build the entire XML model of that document).

The solution is to try to serialize some of the pieces of the puzzle to disk in order to free memory. This is actually what some of the large/huge mode optimizations do, but with limited success.

Regards,
Adrian

[1] https://www.oxygenxml.com/doc/versions/23.1/ug-editor/topics/large-file-edit...
[2] https://www.oxygenxml.com/doc/versions/23.1/ug-editor/topics/huge-file-edito...

Adrian Buza
oXygen XML Editor and Author Support

On 18.08.2021 17:28, Roger L Costello wrote:
> Hi Folks,
>
> I have (by today's standards) a medium sized XML file that is 200MB in size. It is unformatted (no indentation). I opened the file in Oxygen and clicked on the format and indent button. After 30 minutes of processing Oxygen gave up with an error message. So I wrote a simple 1-line XSLT program (below) to do the indentation, it took about 15 seconds and was done. Why is it that Oxygen can't indent the file in 30 minutes whereas an XSLT processor (Saxon) can do it in 15 seconds?
>
> /Roger
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 xmlns:xs="http://www.w3.org/2001/XMLSchema"
>                 exclude-result-prefixes="xs"
>                 version="2.0">
>
>     <xsl:output method="xml" indent="yes" />
>
>     <xsl:template match="/">
>         <xsl:copy-of select="/" />
>     </xsl:template>
>
> </xsl:stylesheet>

_______________________________________________
oXygen-user mailing list
oXygen-user@oxygenxml.com
https://www.oxygenxml.com/mailman/listinfo/oxygen-user