What is eXist: Please refer to the eXist-db homepage.
My personal definition is:
A highly efficient NoSQL database for storing and retrieving XML documents. And as a bonus, the XML documents can be queried using XQuery.
(Please note that this definition does not take the application platform aspect into account.)
We at Bison Schweiz AG (https://github.com/BisonSchweizAG, https://www.bison-group.com/) have been using eXist 1.4.2 in production for about 10 years. By not upgrading for so long, we accumulated a good amount of technical debt. And the people who initially took care of the integration left our team. Some of them had even claimed, after a deep analysis, that upgrading was not possible …
So it was high time to jump to a future-proof version of eXist and, not least, one that runs on a modern version of Java.
Here is the list of the brave souls who started the upgrade project in February 2019:
- Wolfgang Meier: https://github.com/wolfgangmm (eXist Solutions http://existsolutions.com/index.html)
- Magdalena Turska: https://github.com/tuurma (eXist Solutions)
- Patrick Reinhart: https://github.com/reinhapa
- Christoph Erni
- Otmar Humbel: https://github.com/ohumbel (the author)
Magdalena and Wolfgang joined us on site for a few days, to help ignite the project.
In order to understand some of the decisions we made and problems we faced, let’s have a look at our development and production setup. We have 4 development (master) instances of eXist. The XML documents in there are called “component configurations” (more on that later). The 4 teams working with these instances all use the same tools and APIs to interact with eXist. Those tools are developed by the framework team, which the people listed above are members of. The 4 teams are not directly involved with the upgrade; they simply expect that the brave souls will do it for them.
Each night, the following steps are executed for all 4 development lines, without user interaction:
- Take a copy of a customer system (relational database, eXist, deployed application)
- Upgrade it to the tip of the development (this is a full release process, like the one that happens in production: update the relational database, update eXist, update the application)
- Run all integration and end to end tests
- Report any successful and failed steps
The four development lines – let’s call them B, F, L and S – have a different number of customer installations:
- B: around 12 customers
- F: 2 customers
- L: between 150 and 200 customers
- S: 1 customer
All these customers run on Windows and use the Windows Service to start and stop eXist.
Now I will try to tell the story in mostly chronological order.
Goal and Plan
The goal was very easy to formulate: Around 200 customer installations, each having its own running instance of eXist, need to be updated without any user interaction and in a reasonable amount of time. B, F and S usually update exactly one customer overnight. L updates batches of around 25 to 40 customers in one evening, all updates running in parallel.
The plan was quite unsurprising: First adjust the framework code, then pass all unit tests, and afterwards pass all integration tests. Then tackle the (izpack based, http://izpack.org/) release process. And as soon as this ran unattended, tackle the full nightly upgrade with all end-to-end tests. All this would happen on a separate feature branch and on a separate ‘nightly lane’ where we could iterate as fast and as often as machine time allowed, without disturbing the 4 development teams. However, it is worth noting that the feedback time would increase with every goal we achieved: from seconds (unit tests) to minutes (the fastest integration tests) to one hour (all integration tests), and in the end several hours for the whole round trip.
Magdalena and Wolfgang at Our Office
It was clear from the beginning that the source and the target version of eXist are binary incompatible. We knew that this was not a trivial task, and that we needed some help from experts. So we made an appointment with Wolfgang. The goal was to ignite the project, and to get a feeling for the really difficult parts ahead. Wolfgang had helped us a few times in the past, and would probably remember some of our integrations with eXist. In the weeks before, we started the feature branch and tried to get the framework compiled with the new version of eXist. This worked out quite well. Some internal eXist classes we used had been replaced in the new version, and we needed Wolfgang to help us find a workaround. The functionality in question was the Java update of a single node during an XQuery. Wolfgang told us that this is completely internal and not guaranteed to be supported in future versions. Therefore we planned to replace those updates with pure XQueries, but for the time being it was a big relief to have a workaround. Magdalena kindly converted the first handful of those XQueries for us, so that we could later use them as example templates to convert the rest.
A funny anecdote I cannot resist telling is the exchange of files. Wolfgang used to build snapshot versions of eXist 4.7.x and to “mavenize” them for us, suitable for our internal JFrog artifactory (https://jfrog.com/artifactory/). The transfer of those .jar files was possible with external USB drives. But in the other direction, Wolfgang needed a full backup of our eXist instance. This .zip file did not fit onto the drives we had at hand. The administrative process of giving Wolfgang access to our internal network would have taken at least 2 weeks, if successful at all… Luckily, in the end I was able to:
- Make the backup on our Windows development server
- Copy it to a shared network drive
- Copy it to an external SSD drive, NTFS formatted
- Using Paragon (https://www.paragon-software.com/home/ntfs-mac/), copy it to my private MacBook
- Using AirDrop, transfer it to Wolfgang’s MacBook
- Brave new world!
Of course the collaboration on code level was much easier: We used repositories on GitHub (https://github.com/).
For unit and integration tests, we often spin up a short-lived embedded instance of eXist in a temporary folder. We used a configuration file with almost the same settings as our old one. Especially for the embedded case, we used to turn off recovery. But eXist 4.7.1 did not start with recovery disabled, which led to one of our first pull requests (2539, see Appendix). To be able to continue immediately, we of course turned recovery on in the meantime.
Another configuration setting we used was (XQuery) timeout -1. In the old version this meant “no timeout at all”. A short investigation together with Wolfgang made clear that this option did not survive. The solution was to use a timeout of two hours both in the configuration and in all places where we set it on XQuery level. It turned out this was more than long enough. Maybe we had very mean XQueries in the past that needed an infinite timeout.
The Most Feared Features – and Good Bye
While Magdalena and Wolfgang were still in our office, we wanted to tackle two of our most feared features:
- The resolve function
- Backup / restore
According to Wolfgang, backup and restore should be no technical problem. The internal backup format is compatible between otherwise binary incompatible versions of eXist. We had a look at our old installation code, and decided to write an XQuery for this task. Magdalena started, and with help from Wolfgang was able to prove that the XQuery indeed did a conversion from a 1.4.2 to a 4.7.1 instance! So we were very confident that in a later stage, we only had to replace the old upgrade logic with this XQuery.
The resolve function is an org.exist.xquery.BasicFunction, a possibility to call Java code from XQuery. You can find a simple example of an echo function here: https://exist-db.org/exist/apps/doc/xquery#calling-java. Our resolve function crawls the inheritance tree of our component configurations. Simply put, it has to merge multiple XML documents into one. To make it a bit more complex, there is a second dimension: some documents can refer to others with an “inline” directive. The resolve function itself uses a combination of Java code and XQuery. Wolfgang and I knew that this had led to some nasty technical problems in the past. Luckily the resolve function is quite well covered by unit and integration tests. After rewriting some central parts to the new APIs, we were able to prove that at least those tests pass again. We could also verify that some of the GUI components were rendered correctly.
Just in time for the end of Magdalena’s and Wolfgang’s visit, we seemed to have tamed the beasts.
Bye bye Magdalena and Wolfgang, and many thanks for your support for a great open source product!
Building Our Own Version of eXist
We felt that we might need one or two more snapshots of eXist to experiment with, so I semi-automated the build:
- Build artefacts including installer (call the provided
- Adjust/create the .pom files and create the
- Upload everything to our internal JFrog artifactory (by hand – please forgive me)
When starting our local application container (WildFly: https://wildfly.org/), we noticed that logging was partly corrupt. By pulling in eXist’s dependencies we had deployed an unhealthy combination of log4j 1.x and log4j 2.x libraries. This was clearly our fault; we had taken some shortcuts instead of keeping a clear separation. It was not obvious how to get this right, so we decided to upgrade the whole application to log4j 2 (which was planned sooner or later anyway). In theory this was no problem; in practice we had used some internal classes of log4j 1 in our framework code. This had to be rewritten and adapted to the slf4j bridges. And last but not least, it took some trials to really get the configuration right. Hail to the universe of Java logging frameworks!
The Legacy Installer
Every rule you break,
every shortcut you take,
the code will be hitting you…
Our existing installer performed roughly the following tasks:
1. Install the new eXist binary version
2. Upgrade the customer’s data (component configuration)
3. Adjust the Windows service
The upgrade of the customer data was done by firing up embedded eXist instances and transferring documents between them.
The New Installer – Document Based Transfer
So (1) was easy, just calling the eXist installer. For (3) we had some code dealing with Windows services. But wait: 1.4.2 used the tanuki wrapper, and 4.7.1 yajsw! Time to write some upgrade code parsing the old configuration into the new one while preserving the service name (a requirement from our DevOps team). This sounds easy in theory, but turned out to be quite a challenge when performed unattended on real world Windows servers.
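The core of such a migration can be sketched in a few lines of Java. This is not our actual upgrade code, only a minimal illustration under assumptions: that both wrapper configurations can be read as plain key/value properties, and that the service name lives under the key wrapper.ntservice.name in both (please verify this against your own files). The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: carries the Windows service name from an old
// wrapper configuration over into a new one.
final class ServiceConfigMigration {

    // Assumed key name; check your own wrapper configuration files.
    static final String SERVICE_NAME_KEY = "wrapper.ntservice.name";

    /** Parses key=value text. Note: Properties treats '\' as an escape character,
     *  so Windows paths in other keys would need extra care. */
    static Properties parse(String confText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(confText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns a copy of the new defaults with the old service name carried over. */
    static Properties migrate(Properties oldConf, Properties newDefaults) {
        String name = oldConf.getProperty(SERVICE_NAME_KEY);
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalStateException("old configuration has no " + SERVICE_NAME_KEY);
        }
        Properties result = new Properties();
        result.putAll(newDefaults);
        // Only the service name is taken from the old file; everything else stays new.
        result.setProperty(SERVICE_NAME_KEY, name.trim());
        return result;
    }
}
```

Failing loudly when the old name is missing is deliberate in this sketch: in an unattended installation, silently registering a service under a default name would be harder to detect than an aborted run.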
For (2) the first idea was to bring all data into the new format by doing a full backup and restore, and afterwards transferring document by document as we did before. But while this was quite fast with 1.4.2, the new version took way longer than expected. Issue https://github.com/eXist-db/exist/issues/2592 describes what was happening. Patrick was able to implement a faster version again, which resulted in pull request https://github.com/eXist-db/exist/pull/2621 (and some following ones fixing minor issues). But we were still not happy with the installation time this needed.
During all these iterations it became clear that the initial backup/restore implementation in XQuery was too heavyweight for us. Imagine this cascade:
- Start the installer process
- Start a Java process
- Call XQuery to start another Java process doing backup and restore.
This turned out to be almost impossible to debug.
Another small stumbling block we noticed: If you back up a 1.4.2 instance and restore it into an empty 4.7.1 instance, all the dashboard apps are missing.
The New Installer – Backupfilter Based Restore
Let’s summarize the observations from the first trial at the installation process (2):
- Out of the box backup/restore is an all or nothing deal, too coarse-grained
- Reading/writing single documents is too fine-grained
The eXist full backup format on disk is quite nice: It can be read and processed with a ZipInputStream. This led to the idea of a backup filter, a little tool that produces subsets of full backups. These subsets can be tailored exactly to our needs. And the best news is: eXist allows restoring multiple backups into one instance, one after the other. Since we produce non-overlapping subsets, there are no problems with overwriting documents.
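To give an idea of what such a filter looks like, here is a minimal sketch (not our production tool). It copies only those zip entries whose names match a predicate. The class name, the helper methods and the entry paths in the usage note are assumptions made for this example, and everything is kept in memory for brevity. Adjusting the backup's descriptor files to the subset is left out here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of a backup filter working on in-memory zip archives.
final class BackupFilter {

    /** Copies every entry whose name passes the predicate into a new zip archive. */
    static byte[] filter(byte[] fullBackupZip, Predicate<String> keep) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(fullBackupZip));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            byte[] buffer = new byte[8192];
            for (ZipEntry entry = zin.getNextEntry(); entry != null; entry = zin.getNextEntry()) {
                if (!keep.test(entry.getName())) {
                    continue; // getNextEntry() skips the rest of the current entry
                }
                // A fresh ZipEntry keeps the name only; timestamps are not preserved.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    zout.write(buffer, 0, n);
                }
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Lists the entry names of an archive, handy for checking a subset. */
    static List<String> names(byte[] zip) {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            for (ZipEntry entry = zin.getNextEntry(); entry != null; entry = zin.getNextEntry()) {
                result.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Demo helper: builds a tiny zip from alternating name/content arguments. */
    static byte[] zipOf(String... nameContentPairs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (int i = 0; i + 1 < nameContentPairs.length; i += 2) {
                zout.putNextEntry(new ZipEntry(nameContentPairs[i]));
                zout.write(nameContentPairs[i + 1].getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A call like BackupFilter.filter(full, name -> name.startsWith("db/apps/")) would then produce one subset, and a second call with a different, non-overlapping predicate another one. The path prefix is only an illustration. A real tool would stream from and to files instead of byte arrays.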
As a bonus, we wrote a diff tool comparing two backups. This helped a lot to produce a subset containing only the changed documents since X, and to speed up our patch installations.
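The idea behind such a diff can be sketched as follows. Again, this is a simplified illustration and not our real tool: it reads both archives fully into memory and compares the raw bytes of entries with the same name. All names are invented for the example. For large backups one would rather compare digests and stream the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch of a diff between two backup archives.
final class BackupDiff {

    /** Reads all file entries of a zip archive into a name -> content map. */
    static Map<String, byte[]> readEntries(byte[] zip) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            byte[] buffer = new byte[8192];
            for (ZipEntry entry = zin.getNextEntry(); entry != null; entry = zin.getNextEntry()) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                entries.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    /** Names that are new in 'newer' or whose bytes differ from 'older'.
     *  Entries that were deleted in 'newer' are not reported. */
    static Set<String> changedOrAdded(Map<String, byte[]> older, Map<String, byte[]> newer) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, byte[]> e : newer.entrySet()) {
            byte[] before = older.get(e.getKey());
            if (before == null || !Arrays.equals(before, e.getValue())) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }
}
```

The resulting set of names can serve as the selection criterion for a patch subset that contains only what has changed since the older backup.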
Pro tip: keep your __contents__.xml files up to date with the subset!
Now we had all the building blocks ready to orchestrate the update as we wanted:
- Inhouse, prepare a filtered full backup containing /apps and all of our immutable component configurations (called delivery)
- At the customer’s site, prepare a filtered full backup containing all the writable component configurations (called custom)
- Restore both delivery and custom into the fresh empty target instance
Towards the Inhouse Nightly Upgrade
Now it was time to put all the single, manually well tested steps together and test them out in our canary nightly build. And it failed. The freshly installed empty instance did not have the admin password we expected. The unattended installation was not able to set it the way the manual one did. Luckily we were able to fix this with https://github.com/eXist-db/exist/pull/2865.
This got us a step further. But in some runs, the backup did not terminate at all. The reason was a deadlock, addressed by https://github.com/eXist-db/exist/issues/2893.
Now the installation process ran through, without any user interaction. And the end-to-end tests could start. Most of them ran fine, but we got a lot of exceptions when closing the eXist remote collection. The reason was that the remote collection had no reliable way of telling if it was still open or already closed. And it did not allow multiple close() calls. Pull request https://github.com/eXist-db/exist/pull/2881 fixed that.
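The general pattern behind such a fix is an idempotent close() combined with an isOpen() query. The following sketch shows the pattern only. It is not the RemoteCollection code from the pull request, and the class name and the release action are invented for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical resource wrapper illustrating an idempotent close().
final class ClosableHandle implements AutoCloseable {

    private final AtomicBoolean open = new AtomicBoolean(true);
    private final Runnable releaseAction;

    ClosableHandle(Runnable releaseAction) {
        this.releaseAction = releaseAction;
    }

    /** Tells reliably whether close() has already been called. */
    boolean isOpen() {
        return open.get();
    }

    /** Releases the resource exactly once; further calls are harmless no-ops. */
    @Override
    public void close() {
        // compareAndSet succeeds only for the first caller, even with several threads.
        if (open.compareAndSet(true, false)) {
            releaseAction.run();
        }
    }
}
```

With this shape, callers in finally blocks or try-with-resources statements no longer need to know whether somebody else has already closed the handle.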
Another smaller homemade issue popped up: Some XQueries failed because they did not find a user defined function. It turned out that import statements now always need the fully qualified name of the function. Easy fix.
With this, we were ready to upgrade the first of the 4 inhouse development systems to the new eXist version. Because this was a quite unique process (a short development downtime, update of Jenkins jobs, upgrading all libraries to the required version), we decided to not automate it. This was a good decision.
One of the biggest mindset changes we had to convince people of: Backing up/restoring by simply zipping/unzipping the data folder was no longer an option. To be fair to Wolfgang, according to him this had never been an option anyway …
Surprises on Memory and Space Restricted Systems
The setup of the customers of development line L is very constrained: On a virtual Windows machine, there are at least 3 and up to 6 parallel installations. This means one Java process for the WildFly container and one for eXist per installation. And they update them all at the same time! Of course the productive services are shut down before the installations start, but nonetheless there are 3 to 6 not so small installation processes competing for resources.
The binaries for the installer itself are not always on the same drive as the installed binaries. But the .bat files assumed everybody is on the same drive. Let’s fix that with https://github.com/eXist-db/exist/pull/3137.
We also saw concurrent modification errors when shutting down multiple BrokerPool instances, probably more visible in restricted environments. Time for another pull request: https://github.com/eXist-db/exist/pull/3146.
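For readers who have never run into a concurrent modification: one common way it arises is a registry whose entries remove themselves while the registry is being iterated. The sketch below reproduces this failure mode and a simple remedy. It is an illustration only and does not show the actual BrokerPool code or the change made in the pull request. All names are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry of pools, each with a shutdown action that unregisters itself.
final class PoolRegistry {

    private final Map<String, Runnable> pools = new LinkedHashMap<>();
    private final List<String> stopped = new ArrayList<>();

    void register(String name) {
        pools.put(name, () -> {
            pools.remove(name); // the pool unregisters itself on shutdown
            stopped.add(name);
        });
    }

    /** Problematic: the map is modified while its values are being iterated,
     *  which fails with a ConcurrentModificationException for several pools. */
    void stopAllNaive() {
        for (Runnable shutdown : pools.values()) {
            shutdown.run();
        }
    }

    /** Safer: iterate over a snapshot, so removals do not disturb the loop. */
    void stopAll() {
        for (Runnable shutdown : new ArrayList<>(pools.values())) {
            shutdown.run();
        }
    }

    List<String> stopped() {
        return new ArrayList<>(stopped);
    }
}
```

Note that this example fails even with a single thread. With several threads shutting down at once, proper synchronization or a concurrent collection would be needed in addition to the snapshot.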
And, last but not least, the Windows pagefile was freaking out. The message in the Windows event log said “Out of virtual memory”. If you google a bit, people tell you to check your pagefile settings. We experimented with dynamic and fixed space allocation and with different sizes of the pagefile. As you can imagine, those frequent changes made us close friends with some system administrators. On some virtual machines it helped, on others there was still no successful installation possible. Having no idea what was going on, we decided to profile one of 4 installations running in parallel. (Side note: If you have ever tried to profile a Java process running in privileged mode on a virtual Windows system, you will feel for us.) We managed to attach jvisualvm. We monitored the memory, and we saw: Stairway to Heaven. Well, not exactly: some of the memory was freed from time to time, but the overall tendency was clearly upward. On some systems, Windows at one point said “not with me” and killed the process.
We went back and tried to simulate this on our internal machines: We did a loop of several backups/restores in the same JVM. We even used Java 11 with Java Flight Recorder. In combination with heap dumps we were able to spot the problem: A memory leak during shutdown. This does no harm if the JVM is shut down as well, but in our installation case we keep the JVM alive and do several startups/shutdowns in a row. Patrick was able to fix this with https://github.com/eXist-db/exist/pull/3169.
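A reproduction loop of this kind can be very small. The following sketch is not the code we used. It assumes that one startup/shutdown cycle can be wrapped into a Runnable, and it only takes a rough heap reading after each cycle. System.gc() is merely a hint to the JVM, so the numbers are indicative at best. Flight Recorder and heap dumps remain the tools for finding the actual cause.

```java
// Hypothetical probe: runs the same cycle repeatedly in one JVM and samples the heap.
final class LeakProbe {

    /** Runs the cycle 'cycles' times and records the used heap in bytes after each run. */
    static long[] usedHeapAfterEachCycle(Runnable startupAndShutdown, int cycles) {
        long[] samples = new long[cycles];
        Runtime runtime = Runtime.getRuntime();
        for (int i = 0; i < cycles; i++) {
            startupAndShutdown.run();
            System.gc(); // only a request; the JVM may ignore it
            samples[i] = runtime.totalMemory() - runtime.freeMemory();
        }
        return samples;
    }

    /** Counts how often a sample is larger than its predecessor. */
    static int risingSteps(long[] samples) {
        int rising = 0;
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] > samples[i - 1]) {
                rising++;
            }
        }
        return rising;
    }
}
```

If almost every step is a rising one over many cycles, that is a hint (not a proof) that something survives the shutdown that should not.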
By now (in the first half of 2020), we can say that all of our ~ 200 customers have been upgraded to the new version in a fully automated way. Well, to be completely honest: all with the exception of fewer than a handful of customers of development line B, who are waiting to upgrade the application for reasons not related to eXist.
During the development of the upgrade process, we were able to contribute fixes to eXist in form of pull requests, which were very well received. This way both the open source project and our application got better, and they both benefited from each other. I personally think that this is the true spirit of open source!
Many thanks again to all the awesome people participating in the
We plan to upgrade to a 5.x version of eXist soon, in order to avoid another such giant leap. We strongly believe that small steps are easier to handle, and need less time and energy overall.
A personal note at the end: I know that my memory is getting weaker as time passes on. So if you should find some inaccuracies in the story above, please do not hesitate to contact me for an improvement. I left out some internal (not directly related to eXist) difficulties we faced. Thanks a lot for reading this far – Oti.
Appendix: List of Pull Requests
Whenever we felt a change to the eXist code might be necessary, we implemented it on our fork, built the next snapshot and tested it out in our tool chain. Once certain the change had the desired effect, we filed a pull request. In extreme cases like memory leaks, the verification could only be done in production (or on an internal clone of a production system). To avoid future regressions, we always tried to improve both the develop-4.x.x and the develop branch, sometimes also the develop-5.0.0 branch.
2538 Start fine with recovery disabled
2539 (5.0.0) Start fine with recovery disabled
2621 (4.x.x) Implements VirtualTempPath as described in #2592
2630 (5.0.0) Implements VirtualTempPath as described in #2592
2639 (4.x.x) Remove illegal unicode character
2641 (5.0.0) Remove illegal unicode character
2746 (5.0.0) Fixes the default in memory size from 64M to 4M
2747 (4.x.x) Fixes the default in memory size from 64M to 4M
2761 (5.0.0) Fixes getBytes() method returning wrong data if switched to file
2766 (4.x.x) Fixes getBytes() method returning wrong data if switched to file
2865 (5.0.0) Fixes unattended installation with data directory & admin password
2869 (4.x.x) Fixes ignored data directory
2881 Implements isOpen() / close() methods on RemoteCollection
2893 Race condition when invoking org.exist.backup.ExportMain
2896 (4.x.x) Use the correct name for the endorsed Saxon-HE.jar on Windows, too.
3137 (4.x.x) Switch to script drive before calling subsequent commands
3145 Fix concurrent modification shutting down multiple broker pools
3146 (4.x.x) Fix concurrent modification shutting down multiple broker pools
3153 Fixes illegal characters in path
3154 (4.x.x) [bugfix] Fix illegal characters in directory name
3159 (4.x.x) Add a test to assert the integrity of installer/jobs.xml
3169 (4.x.x) [bugfix] Memory leak on shutdown
3170 [bugfix] Memory leak on shutdown