Our eXist Migration Journey from Version 1.4.2 to 4.7.1

What is eXist: Please refer to the eXist-db homepage.

My personal definition is:

A highly efficient NoSQL database for storing and retrieving XML documents. And as a bonus, the XML documents can be queried using XQuery.

(Please note that this definition does not take the application platform aspect into account.)

Background

We at Bison Schweiz AG (https://github.com/BisonSchweizAG, https://www.bison-group.com/) have been using eXist 1.4.2 in production for about 10 years. By not upgrading for so long, we accumulated a good amount of technical debt. And the people who initially took care of the integration left our team. Some of them had even claimed – after a deep analysis – that upgrading was not possible …
So it was high time to jump to a future-proof version of eXist, and – not least – one that runs on a modern version of Java.

Here is the list of the brave souls who started the upgrade project in February 2019:

Magdalena and Wolfgang joined us on site for a few days, to help ignite the project.

In order to understand some of the decisions we made and problems we faced, let’s have a look at our development and production setup. We have 4 development (master) instances of eXist. The XML documents in there are called “component configurations” (more on that later). The 4 teams working with these instances all use the same tools and APIs to interact with eXist. Those tools are developed by the framework team, which the people listed above are members of. The 4 teams are not directly involved with the upgrade; they simply expect that the brave souls will do it for them.
Each night, the following steps are executed for all 4 development lines, without user interaction:

  • Take a copy of a customer system (relational database, eXist, deployed application)
  • Upgrade it to the tip of the development (this is a full release process, like the one that happens in production: update the relational database, update eXist, update the application)
  • Run all integration and end-to-end tests
  • Report any successful and failed steps

The four development lines – let’s call them B, F, L and S – have a different number of customer installations:

  • B: around 12 customers
  • F: 2 customers
  • L: between 150 and 200 customers
  • S: 1 customer

All these customers run on Windows and use the Windows Service to start and stop eXist.

Now I will try to tell the story in – mostly – chronological order.

Goal and Plan

The goal was very easy to formulate: Around 200 customer installations, each having its own running instance of eXist, need to be updated without any user interaction and in a reasonable amount of time. B, F and S usually update exactly one customer overnight. L updates batches of around 25 to 40 customers in one evening, all updates running in parallel.

The plan was quite unsurprising: First adjust the framework code, then pass all unit tests, and afterwards pass all integration tests. Then tackle the (izpack-based, http://izpack.org/) release process. And as soon as this ran unattended, tackle the full nightly upgrade with all end-to-end tests. All this would happen on a separate feature branch and on a separate ‘nightly lane’ where we could reiterate as fast and as often as machine time allowed, and without disturbing the 4 development teams. However, it is worth noting that the feedback time would increase with every goal we achieved: from seconds (for unit tests) through minutes (the fastest integration tests) to one hour (all integration tests), and in the end several hours for the whole roundtrip.

Magdalena and Wolfgang at Our Office

It was clear from the beginning that the source and the target version of eXist are binary incompatible. We knew that the migration was not a trivial task, and that we needed some help from experts. So we made an appointment with Wolfgang. The goal was to ignite the project, and to get a feeling for the really difficult parts ahead. Wolfgang had helped us a few times in the past, and probably would remember some of our integrations with eXist. In the weeks before, we had started the feature branch and tried to get the framework compiled with the new version of eXist. This worked out quite well. Some internal eXist classes we used had been replaced in the new version, and we needed Wolfgang to help us find a workaround. The functionality in question was the Java update of a single node during an XQuery. Wolfgang told us that this is completely internal and not guaranteed to be supported in future versions. Therefore we planned to replace those updates with pure XQueries, but for the time being it was a big relief to have a workaround. Magdalena kindly converted the first handful of those XQueries for us, so that we could later use them as example templates to convert the rest.

A funny anecdote I cannot resist telling is the exchange of files. Wolfgang used to build snapshot versions of eXist 4.7.x and to “mavenize” them for us, suitable for our internal JFrog artifactory (https://jfrog.com/artifactory/). The transfer of those .jar files was possible with external USB drives. But in the other direction, Wolfgang needed a full backup of our eXist instance. This .zip file did not fit onto the drives we had at hand. The administrative process of giving Wolfgang access to our internal network would have taken at least 2 weeks, if successful at all… Luckily, in the end I was able to:

  • Make the backup on our Windows development server
  • Copy it to a shared network drive
  • Copy it to an external SSD drive, NTFS formatted
  • Using Paragon (https://www.paragon-software.com/home/ntfs-mac/), copy it to my private MacBook
  • Using AirDrop, transfer it to Wolfgang’s MacBook
  • Brave new world!

Of course the collaboration on code level was much easier: We used repositories on GitHub (https://github.com/).

For unit and integration tests, we often spin up a short-lived embedded instance of eXist in a temporary folder. We used a configuration file with almost the same settings as our old one. Especially for the embedded case, we used to turn off recovery. But eXist 4.7.1 did not start with recovery disabled, which led to our first pull requests (2538 and 2539, see Appendix). To be able to continue immediately, we of course turned recovery on first.
Another configuration setting we had used was an (XQuery) timeout of -1. In the old version this meant “no timeout at all”. A quick investigation together with Wolfgang made clear that this option had not survived. The solution was to use a timeout of two hours, both in the configuration and in all places where we set it on XQuery level. It turned out this was more than long enough. Maybe we had very mean XQueries in the past to need an infinite timeout.
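
If you have never embedded eXist yourself: the following is a minimal sketch of how such a short-lived embedded instance can be started and stopped through the XML:DB API, roughly along the lines of the eXist deployment documentation. It is not our actual test harness – the conf.xml path and the credentials are placeholders.

import org.exist.xmldb.DatabaseInstanceManager;
import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;

public class EmbeddedExistSmokeTest {

    public static void main(String[] args) throws Exception {
        // Register the embedded eXist driver with the XML:DB DatabaseManager
        Class<?> driverClass = Class.forName("org.exist.xmldb.DatabaseImpl");
        Database database = (Database) driverClass.getDeclaredConstructor().newInstance();
        // Point the instance at a test-specific conf.xml (placeholder path);
        // in our tests, the data and journal directories live in a temporary folder
        database.setProperty("configuration", "/tmp/exist-test/conf.xml");
        database.setProperty("create-database", "true");
        DatabaseManager.registerDatabase(database);

        // An empty host part in the URI means: embedded instance, no server needed
        Collection root = DatabaseManager.getCollection("xmldb:exist:///db", "admin", "");
        try {
            System.out.println("child collections: " + root.getChildCollectionCount());
        } finally {
            // Shut the embedded database down cleanly at the end of the test
            DatabaseInstanceManager manager =
                    (DatabaseInstanceManager) root.getService("DatabaseInstanceManager", "1.0");
            root.close();
            manager.shutdown();
        }
    }
}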

The Most Feared Features – and Goodbye

While Magdalena and Wolfgang were still in our office, we wanted to tackle two of our most feared features:

  • The resolve function
  • Backup / restore

According to Wolfgang, backup and restore should be no technical problem. The internal backup format is compatible between otherwise binary incompatible versions of eXist. We had a look at our old installation code, and decided to write an XQuery for this task. Magdalena started, and with Wolfgang’s help was able to prove that the XQuery indeed converted a 1.4.2 instance to a 4.7.1 instance! So we were very confident that at a later stage, we only had to replace the old upgrade logic with this XQuery.

The resolve function is an org.exist.xquery.BasicFunction, a way to call Java code from XQuery. You can find a simple example of an echo function here: https://exist-db.org/exist/apps/doc/xquery#calling-java. Our resolve function crawls the inheritance tree of our component configurations. Simply put, it has to merge multiple XML documents into one. To make it a bit more complex, there is a second dimension: some documents can refer to others with an “inline” directive. The resolve function itself uses a combination of Java code and XQuery. Wolfgang and I knew that this had led to some nasty technical problems in the past. Luckily, the resolve function is quite well covered by unit and integration tests. After rewriting some central parts to the new APIs, we were able to prove that at least those tests pass again. We could also verify that some of the GUI components were rendered correctly.
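
For readers who have never written one: a BasicFunction is a Java class that eXist exposes to XQuery. The following is a minimal sketch modelled on the echo example linked above – not our resolve function, whose merge logic is far more involved. The namespace is made up, and the module wiring that makes the function visible to XQuery is left out.

import org.exist.dom.QName;
import org.exist.xquery.BasicFunction;
import org.exist.xquery.Cardinality;
import org.exist.xquery.FunctionSignature;
import org.exist.xquery.XPathException;
import org.exist.xquery.XQueryContext;
import org.exist.xquery.value.FunctionParameterSequenceType;
import org.exist.xquery.value.Sequence;
import org.exist.xquery.value.SequenceType;
import org.exist.xquery.value.StringValue;
import org.exist.xquery.value.Type;

public class EchoFunction extends BasicFunction {

    public static final FunctionSignature signature = new FunctionSignature(
            new QName("echo", "http://example.com/ns", "ex"),
            "Returns the input string prefixed with 'echo: '.",
            new SequenceType[] {
                new FunctionParameterSequenceType("text", Type.STRING, Cardinality.EXACTLY_ONE,
                        "the text to echo")
            },
            new SequenceType(Type.STRING, Cardinality.EXACTLY_ONE));

    public EchoFunction(XQueryContext context, FunctionSignature signature) {
        super(context, signature);
    }

    @Override
    public Sequence eval(Sequence[] args, Sequence contextSequence) throws XPathException {
        // args[0] holds the single string argument declared in the signature above
        String text = args[0].itemAt(0).getStringValue();
        return new StringValue("echo: " + text);
    }
}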

Just in time for the end of Magdalena’s and Wolfgang’s visit, we seemed to have tamed the beasts.

Bye bye Magdalena and Wolfgang, and many thanks for your support for a great open source product!

Building Our Own Version of eXist

We felt that we might need one or two more snapshots of eXist to experiment with, so I semi-automated the build:

  • Build the artefacts including the installer (call the provided Ant build)
  • Adjust/create the .pom files and create the *-sources.jar (call a Python script)
  • Upload everything to our internal JFrog artifactory (by hand – please forgive me)

Log4j

When starting our local application container (WildFly: https://wildfly.org/), we noticed that logging was partly corrupt. By pulling in eXist’s dependencies we had deployed an unhealthy combination of log4j 1.x and log4j 2.x libraries. This was clearly our fault: we had taken some shortcuts instead of cleanly separating the dependencies. It was not obvious how to get this right, so we decided to upgrade the whole application to log4j 2 (which had been planned sooner or later anyway). In theory this was no problem; in practice we had used some internal classes of log4j 1 in our framework code. That code had to be rewritten and adapted to the slf4j bridges. And last but not least, it took some trials to really get the configuration right. Hail to the universe of Java logging frameworks!
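
The gist of the rewrite was small per call site: framework code now only talks to the slf4j API, and the bridge/binding jars decide where the output ends up (log4j 2 in our case). A trivial, made-up example class:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ComponentLoader {

    // slf4j only – no log4j 1.x internals anywhere in the framework code
    private static final Logger LOG = LoggerFactory.getLogger(ComponentLoader.class);

    public void load(String name) {
        LOG.debug("loading component configuration {}", name);
        // ... actual work ...
    }
}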

The Legacy Installer

Every rule you break,
every shortcut you take,
the code will be hitting you

Our existing installer performed roughly the following tasks:

  1. Install the new eXist binary version
  2. Upgrade the customer’s data (component configuration)
  3. Adjust the Windows service

The upgrade of the customer data was done by firing up embedded eXist instances and transferring documents between them.

The New Installer – Document Based Transfer

So (1) was easy: just call the eXist installer. For (3) we had some code dealing with Windows services. But wait: 1.4.2 used the Tanuki wrapper, and 4.7.1 uses yajsw! Time to write some upgrade code that parses the old configuration into the new one while preserving the service name (a requirement from our devops team). This sounds easy in theory, but turned out to be quite a challenge when performed unattended on real-world Windows servers.
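
To give you an idea of what “parsing the old configuration into the new one” means, here is a rough sketch. It is not our installer code: the paths are placeholders, and the wrapper.ntservice.name key is used here as a stand-in for whatever keys the two wrapper configurations actually use for the service name.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class WrapperConfMigration {

    public static void main(String[] args) throws IOException {
        Path oldConf = Path.of("exist-1.4.2/tools/wrapper/conf/wrapper.conf"); // placeholder
        Path newConf = Path.of("exist-4.7.1/tools/yajsw/conf/wrapper.conf");   // placeholder

        // Devops requirement: the Windows service keeps its old name.
        String serviceName = Files.readAllLines(oldConf).stream()
                .filter(line -> line.startsWith("wrapper.ntservice.name="))
                .map(line -> line.substring(line.indexOf('=') + 1).trim())
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no service name in old wrapper.conf"));

        // Rewrite the corresponding line in the new configuration, keep everything else as is
        List<String> migrated = Files.readAllLines(newConf).stream()
                .map(line -> line.startsWith("wrapper.ntservice.name=")
                        ? "wrapper.ntservice.name=" + serviceName
                        : line)
                .collect(Collectors.toList());
        Files.write(newConf, migrated);
    }
}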

For (2) the first idea was to bring all data into the new format by doing a full backup and restore, and afterwards transferring document by document as we did before. But while this was quite fast with 1.4.2, the new version took way longer than expected. Issue https://github.com/eXist-db/exist/issues/2592 describes what was happening. Patrick was able to implement a faster version again, which resulted in pull request https://github.com/eXist-db/exist/pull/2621 (and some following ones fixing minor issues). But we were still not happy with the installation time this needed.

During all these iterations it became clear that the initial backup/restore implementation in XQuery was too heavyweight for us. Imagine this cascade:

  • Start the installer process
  • Start a Java process
  • Call XQuery to start another Java process doing backup and restore.

This turned out to be almost impossible to debug.

Another small stumbling block we noticed: if you back up a 1.4.2 instance and restore it into an empty 4.7.1 instance, all the dashboard apps are missing.

The New Installer – Backup Filter Based Restore

Let’s summarize the observations from the first attempt at installation step (2):

  • Out-of-the-box backup/restore is an all-or-nothing deal, too coarse-grained
  • Reading/writing single documents is too fine-grained

The eXist full backup format on disk is quite nice: it can be read and processed with a ZipInputStream. This led to the idea of a backup filter, a little tool that produces subsets of full backups. These subsets can be tailored exactly to our needs. And the best news is: eXist allows restoring multiple backups into one instance, one after the other. Since we produce non-overlapping subsets, there are no problems with overwriting documents.
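
A condensed sketch of the filter idea, using nothing but the JDK: stream the full backup zip and copy only the entries below a given prefix into a new, smaller backup. Our real tool does more (for example it keeps the __contents__.xml files consistent, see the pro tip below), and the file names and prefix here are placeholders.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class BackupFilter {

    /** Copies only the entries below collectionPrefix from a full backup into a new zip. */
    public static void filter(Path fullBackup, Path subsetBackup, String collectionPrefix)
            throws IOException {
        byte[] buffer = new byte[8192];
        try (InputStream fin = Files.newInputStream(fullBackup);
             ZipInputStream zin = new ZipInputStream(fin);
             OutputStream fout = Files.newOutputStream(subsetBackup);
             ZipOutputStream zout = new ZipOutputStream(fout)) {

            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.getName().startsWith(collectionPrefix)) {
                    continue; // not part of the requested subset
                }
                zout.putNextEntry(new ZipEntry(entry.getName()));
                int read;
                while ((read = zin.read(buffer)) != -1) {
                    zout.write(buffer, 0, read);
                }
                zout.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. keep only the apps part of a full backup
        filter(Path.of("full-backup.zip"), Path.of("apps-only.zip"), "db/apps/");
    }
}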

As a bonus, we wrote a diff tool comparing two backups. This helped a lot in producing subsets containing only the documents changed since a given baseline, and in speeding up our patch installations.
Pro tip: keep your __contents__.xml files up to date with the subset!
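
The diff itself can stay simple because the zip format already stores a CRC per entry. A sketch of the core comparison (again, not our actual tool):

import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class BackupDiff {

    /** Names of entries that are new or changed in 'newer' compared to 'older'. */
    public static List<String> changedEntries(ZipFile older, ZipFile newer) {
        List<String> changed = new ArrayList<>();
        Enumeration<? extends ZipEntry> entries = newer.entries();
        while (entries.hasMoreElements()) {
            ZipEntry newEntry = entries.nextElement();
            if (newEntry.isDirectory()) {
                continue;
            }
            ZipEntry oldEntry = older.getEntry(newEntry.getName());
            // either not present in the older backup, or present with different content
            if (oldEntry == null || oldEntry.getCrc() != newEntry.getCrc()) {
                changed.add(newEntry.getName());
            }
        }
        return changed;
    }
}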

Now we had all the building blocks ready to orchestrate the update as we wanted:

  • Inhouse, prepare a filtered full backup containing /apps and all of our immutable component configurations (called delivery)
  • At the customer’s site, prepare a filtered full backup containing all the writable component configurations (called custom)
  • Restore both delivery and custom into the fresh empty target instance

Towards the Inhouse Nightly Upgrade

Now it was time to put all the single, manually well-tested steps together and try them out in our canary nightly build. And it failed. The freshly installed empty instance did not have the admin password we expected. The unattended installation was not able to set it the way the manual one did. Luckily we were able to fix this with https://github.com/eXist-db/exist/pull/2865.

This got us a step further. But in some runs, the backup did not terminate at all. The reason was a deadlock, addressed by https://github.com/eXist-db/exist/issues/2893.

Now the installation process ran through without any user interaction. And the end-to-end tests could start. Most of them ran fine, but we got a lot of exceptions when closing the eXist remote collection. The reason was that the remote collection had no reliable way of telling whether it was still open or already closed. And it did not allow multiple close() calls. Pull request https://github.com/eXist-db/exist/pull/2881 fixed that.
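
The kind of defensive close this forces on the caller looks roughly like this (a sketch, not our actual code): before the fix, a collection could neither be asked whether it was still open nor be closed twice without an exception, so the best one can do is swallow the problem.

import org.xmldb.api.base.Collection;
import org.xmldb.api.base.XMLDBException;

public final class XmldbCollections {

    private XmldbCollections() {
    }

    /** Close at most once and never let a close() problem break the caller. */
    public static void closeQuietly(Collection collection) {
        if (collection == null) {
            return;
        }
        try {
            collection.close();
        } catch (XMLDBException e) {
            // before the fix, a second close() could end up here – we only log it
            System.err.println("ignoring exception on close: " + e.getMessage());
        }
    }
}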

Another smaller homemade issue popped up: some XQueries failed because they did not find a user-defined function. It turned out that imported functions now always need to be referenced with their fully qualified name. Easy fix.

With this, we were ready to upgrade the first of the 4 inhouse development systems to the new eXist version. Because this was quite a unique process (a short development downtime, updating Jenkins jobs, upgrading all libraries to the required version), we decided not to automate it. This was a good decision.
One of the biggest mindset changes we had to convince people of: backing up/restoring by simply zipping/unzipping the data folder was no longer an option. To be fair to Wolfgang, according to him this had never been an option anyway …

Surprises on Memory and Space Restricted Systems

The setup of the customers of development line L is very constrained: on a virtual Windows machine, there are at least 3 and up to 6 parallel installations. This means one Java process for the WildFly container and one for eXist per installation. And they update them all at the same time! Of course the productive services are shut down before the installations start, but nonetheless there are 3 to 6 not-so-small installation processes competing for resources.

The binaries for the installer itself are not always on the same drive as the installed binaries. But the yajsw .bat files assumed everything is on the same drive. Let’s fix that with https://github.com/eXist-db/exist/pull/3137.

We also saw ConcurrentModificationExceptions in BrokerPool, probably more visible in restricted environments. Time for another pull request: https://github.com/eXist-db/exist/pull/3146.

And – last but not least – the Windows pagefile was freaking out. The message in the Windows event log said “Out of virtual memory”. If you google a bit, people point you to your pagefile settings. We experimented with dynamic and fixed space allocation and with different sizes of the pagefile. As you can imagine, those frequent changes made us close friends with some system administrators. On some virtual machines it helped, on others no successful installation was possible. Having no idea what was going on, we decided to profile one of 4 installations running in parallel. (Side note: if you have ever tried to profile a Java process running in privileged mode on a virtual Windows system, you can feel our pain.) We managed to attach jvisualvm. We monitored the memory, and we saw: Stairway to Heaven. Well, not exactly – some of the memory was freed from time to time, but the overall tendency was clearly upwards all the time. On some systems, Windows at one point said “not with me”, and killed the process.

We went back and tried to simulate this on our internal machines: we ran a loop of several backups/restores in the same JVM. We even used Java 11 with Java Flight Recorder. In combination with heap dumps we were able to spot the problem: a memory leak during shutdown. This does no harm if the JVM is shut down as well, but in our installation case we keep the JVM alive and do several startups/shutdowns in a row. Patrick was able to fix this with https://github.com/eXist-db/exist/pull/3169.
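
The reproduction itself was not much more than a loop. In sketch form (the helpers are hypothetical stand-ins for our installer code, not real eXist API calls):

public class BackupRestoreLoop {

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10; i++) {
            startInstance();            // embedded eXist, data directory in a temp folder
            backup("backup-" + i);      // full backup
            restore("backup-" + i);     // and restore it right back
            shutdownInstance();         // before the fix, memory was retained across this shutdown
        }
        // at this point we took heap dumps / looked at the Flight Recorder data and
        // watched the retained size grow after every shutdown
    }

    // hypothetical stand-ins for our installer code
    private static void startInstance() { /* ... */ }
    private static void backup(String name) { /* ... */ }
    private static void restore(String name) { /* ... */ }
    private static void shutdownInstance() { /* ... */ }
}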

Summary

By now (in the first half of 2020), we can say that all of our ~200 customers have been upgraded to the new version in a fully automated way. Well, to be completely honest: all with the exception of fewer than a handful of customers of development line B – they are waiting to upgrade the application for reasons not related to eXist.
During the development of the upgrade process, we were able to contribute fixes to eXist in the form of pull requests, which were very well received. This way both the open source project and our application got better, each benefiting from the other. I personally think that this is the true spirit of open source!

Many thanks again to all the awesome people participating in the eXist-db project!

We plan to upgrade to a 5.x version of eXist soon, in order to avoid another giant leap like this one. We strongly believe that small steps are easier to handle, and need less time and energy overall.

A personal note at the end: I know that my memory is getting weaker as time passes. So if you find some inaccuracies in the story above, please do not hesitate to contact me so I can improve it. I left out some internal difficulties we faced (not directly related to eXist). Thanks a lot for reading this far – Oti.

Appendix: List of Pull Requests

Whenever we felt a change to the eXist code might be necessary, we implemented it on our fork, built the next snapshot and tested it in our tool chain. Once we were certain the change had the desired effect, we filed a pull request. In extreme cases like memory leaks, the verification could only be done in production (or on an internal clone of a production system). To avoid future regressions, we always tried to improve both the develop-4.x.x and the develop branch, and sometimes also the develop-5.0.0 branch.

2538 Start fine with recovery disabled
https://github.com/eXist-db/exist/pull/2538

2539 (5.0.0) Start fine with recovery disabled
https://github.com/eXist-db/exist/pull/2539

2621 (4.x.x) Implements VirtualTempPath as described in #2592
https://github.com/eXist-db/exist/pull/2621

2630 (5.0.0) Implements VirtualTempPath as described in #2592
https://github.com/eXist-db/exist/pull/2630

2639 (4.x.x) Remove illegal unicode character
https://github.com/eXist-db/exist/pull/2639

2641 (5.0.0) Remove illegal unicode character
https://github.com/eXist-db/exist/pull/2641

2746 (5.0.0) Fixes the default in memory size from 64M to 4M
https://github.com/eXist-db/exist/pull/2746

2747 (4.x.x) Fixes the default in memory size from 64M to 4M
https://github.com/eXist-db/exist/pull/2747

2761 (5.0.0) Fixes getBytes() method returning wrong data if switched to file
https://github.com/eXist-db/exist/pull/2761

2766 (4.x.x) Fixes getBytes() method returning wrong data if switched to file
https://github.com/eXist-db/exist/pull/2766

2865 (5.0.0) Fixes unattended installation with data directory & admin password
https://github.com/eXist-db/exist/pull/2865

2869 (4.x.x) Fixes ignored data directory
https://github.com/eXist-db/exist/pull/2869

2881 Implements isOpen() / close() methods on RemoteCollection
https://github.com/eXist-db/exist/pull/2881

2893 Race condition when invoking org.exist.backup.ExportMain
https://github.com/eXist-db/exist/issues/2893

2896 (4.x.x) Use the correct name for the endorsed Saxon-HE.jar on Windows, too.
https://github.com/eXist-db/exist/pull/2896

3137 (4.x.x) Switch to script drive before calling subsequent commands
https://github.com/eXist-db/exist/pull/3137

3145 Fix concurrent modification shutting down multiple broker pools
https://github.com/eXist-db/exist/pull/3145

3146 (4.x.x) Fix concurrent modification shutting down multiple broker pools
https://github.com/eXist-db/exist/pull/3146

3153 Fixes illegal characters in path
https://github.com/eXist-db/exist/pull/3153

3154 (4.x.x) [bugfix] Fix illegal characters in directory name
https://github.com/eXist-db/exist/pull/3154

3159 (4.x.x) Add a test to assert the integrity of installer/jobs.xml
https://github.com/eXist-db/exist/pull/3159

3169 (4.x.x) [bugfix] Memory leak on shutdown
https://github.com/eXist-db/exist/pull/3169

3170 [bugfix] Memory leak on shutdown
https://github.com/eXist-db/exist/pull/3170

My son uses Jython!

Yesterday my son came home from school and told me that he has to learn programming with Python. “We downloaded an app where learning is easy – you can start out of the box”, he said. That made me curious, and I asked him to show me what he had done so far.

They downloaded TigerJython (http://jython.tobiaskohn.ch/index.html), which uses Jython 2.7.0 under the hood. The exercises he did involved steering a turtle to draw lines. Pretty cool for beginners!

I offered to help him if he got stuck. He had 2 questions:

  • can I define a function with more than one argument?
  • can I share functions between two .py files?

So we wrote the first module to import from. This worked well, and here you have a somewhat proud father who helped build something his son can use today…


Be careful when starting vncserver on Ubuntu

At work I have an Ubuntu machine which is always up and running. Therefore I want to connect to it from anywhere. The idea is to have some sort of remote desktop / desktop sharing tool.

NoMachine stopped working for me a couple of years ago. And I admit I have not tried the newest version since then.

xrdp does work, but I never got the keyboard mapping right.

Today I tried vncserver, and wondered why the installation description(s) tell you to create a separate user. Now I know:
I called vncserver with my regular account. Besides not being able to connect from the remote client, I was no longer able to log in to the Ubuntu desktop natively! After a long period of despair, and with the help of my colleagues (Ctrl-Alt-F1 for console login…), I found the root cause in the logs: my ~/.Xauthority file was owned by root, and my regular user had no rights to change it during login.
My advice to you: create a separate user if you really feel you should try out vncserver.

So finally I gave X2Go a go, and this worked! The Windows client often crashes when you manage your sessions, but once all is set up it works as desired.

Please be aware: Your remote mileage may vary …

Concat two git repositories in fast-import format

Today I spent a few hours wrapping my head around the documentation of git fast-import, especially the part about restarting an incremental import from the current branch value. The documentation says it should be written as:

from refs/heads/branch^0

or, since the current branch usually is master, as

from refs/heads/master^0

There is also a short answer on nabble. But I could not find an example of how to actually edit the input file.

The solution is simple:
Add the from line to the first commit of the second input file, without adding a new blank line (the file is named ‘research-git-dump.dat’ in my case).

Before:

commit refs/heads/master
mark :1000000000
committer Bubba Gump <bubba.gump@shrimps.com> 1206025332 +0000
data 6
my commit comment

M 100644 :2430 MyProject/myfile1
M 100644 :2439 MyProject/myfile2
:

After:

commit refs/heads/master
mark :1000000000
committer Bubba Gump <bubba.gump@shrimps.com> 1206025332 +0000
data 6
my commit comment
from refs/heads/master^0
M 100644 :2430 MyProject/myfile1
M 100644 :2439 MyProject/myfile2
:

Doing this, the second import went like a charm.
I was able to concat two git repositories created by cvs2git:

$ git init --bare merged.git
$ cd merged.git/
$ cat ../custom-git-blob.dat ../custom-git-dump.dat | git fast-import
$ cat ../research-git-blob.dat ../research-git-dump.dat | git fast-import
$ gitk

… and gitk was happy!

Agile Architecture

Over the last 3 days I had the pleasure of attending an Architecture Workshop held by Stefan Toth (@st_toth). We were designing and building a Lego Mindstorms ball track in 3 iterations.


He challenged us with difficult requirements, time restrictions, risk-based priorities and much more. The team had great fun and learned a lot about daily architecture work along the way.

Thanks, Stefan!

P.S.
You can read more about Stefan here (in German)

Dear Reader,

to be honest, I only have a vague idea why I should – and what I could – write here.
Sometimes my thoughts drift to a special event, idea or finding from the last day(s). Maybe there is value in writing it down?