Setting Forth
In early 2017 we migrated a big Plone-based intranet to Quaive. The old system was a precursor of Quaive, so it was similar enough, but there were still some differences to straighten out. In particular, a bunch of classes had been moved to new packages. In the case of the workspace content type, the implementation of subtypes had also changed. The old system had one class for all workspaces, and an attribute (workspace_type) indicated the type of workspace. Quaive has a separate class for each workspace type.
Old (simplified):
from plone.directives import form
from zope import schema

# `_` is the package's i18n message factory (import omitted here)
class IWorkspace(form.Schema):
    workspace_type = schema.TextLine(
        title=_(u'label_workspace_type', u'Workspace type'),
        default=u'generic',
    )
New (simplified):
from plone.directives import form

class IWorkspaceFolder(form.Schema):
    """ """

class IGenericWorkspaceFolder(IWorkspaceFolder):
    """ """
As it was a huge system, a regular migration that creates new objects, moves over the data, reindexes etc. would have taken days. My colleague Alessandro gave a great presentation about this migration at the Plone Conference 2017, explaining what the problems were and how we decided to go about it in the end. In a nutshell, we ended up changing the __class__ attributes of the existing objects and did a little cleanup afterwards (but not enough, as it turned out).
Very much simplified:
if obj.workspace_type == 'generic':
    obj.__class__ = GenericWorkspaceFolder
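For readers who have never reassigned __class__, here is a self-contained sketch in plain Python. It is not the migration code we ran: the classes are bare stand-ins without any Plone or ZODB machinery, and the names NEW_CLASSES and migrate are made up for this illustration. It only shows that the instance keeps its data while its behaviour switches to the new class.

```python
class Workspace:
    """Stand-in for the single old workspace class (no Plone involved)."""

    def __init__(self, title, workspace_type="generic"):
        self.title = title
        self.workspace_type = workspace_type

    def describe(self):
        return "old workspace: " + self.title


class GenericWorkspaceFolder:
    """Stand-in for one of the new per-type classes."""

    def describe(self):
        return "generic workspace folder: " + self.title


# Hypothetical lookup table from the old attribute value to the new class.
NEW_CLASSES = {"generic": GenericWorkspaceFolder}


def migrate(obj):
    """Swap the class in place; the instance __dict__ is left untouched."""
    new_class = NEW_CLASSES.get(obj.workspace_type)
    if new_class is not None:
        obj.__class__ = new_class
    return obj


ws = migrate(Workspace("Team A"))
print(type(ws).__name__, "/", ws.describe())
```

Note that nothing is copied or recreated here, which is exactly why this approach was so much faster than a regular migration.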
Now, we did not take it lightly to mess with the __class__ attribute. We did have misgivings about touching such a fundamental property of an object. But we did quite some research and a lot of experimentation and in the end decided that we could get away with it. And for a while we did.
Getting Lost
Months later, when we had long ticked off the migration as a success, things started happening. The first symptom was that workspaces started changing colours. As every workspace type is shown in a specific colour, it was clear that something was off with the types - but we couldn't quite grasp the problem. Sometimes a workspace would be indexed as the wrong type, but if you viewed it directly it seemed fine. Then we realised that when reloading the workspace view it would sometimes throw an error. When no one else was accessing the system, there was some kind of pattern to it: One in seven page loads, or sometimes two in seven, threw an error, and the rest of the time it worked fine. This brought us to the conclusion that the issue was specific to single Zope instances: We have a load balancer that is served by seven Zope instances acting as ZEO clients. And indeed, when calling the workspace view on the instances directly, we got consistent results. It worked on certain instances, and others were always throwing an error.
So now it should have been easy. Grab a broken instance, slap in a debugger, and find the problem. Except that the problem was gone when you did this. Shutting down the Zope instance and starting it in debug mode made the error evaporate! After a lot of confusion and experimentation we concluded that the issue was located in the client cache of the Zope instances. As long as we had persistent cache files (.zec files) on the file system, the problem would carry over between restarts. If the cache files were not persisted, or we deleted them by hand, a restart would make the issue go away. Specifically, when starting an instance in debug mode the persistent cache files are not used and the issue would not manifest itself, making it tricky to debug.
So we had an idea where the problem was hiding, but what was actually causing the trouble? Our hypothesis at this point was that for some reason the classes of workspaces were sometimes reverting to their old counterparts, e.g. what had been migrated to a GenericWorkspaceFolder was turning back into an old Workspace object. Unfortunately, we were clueless about what triggered these reversions, and why it happened only in the separate instances.
For quite a while we stayed clueless. We didn't have a way to trigger the problem, and we couldn't get a debugger near it. We had a workaround for fixing it temporarily by resetting the __class__ attribute of a workspace on every single affected Zope instance, but that was it. The decisive idea finally came from taking a closer look at the workspace grouping feature.
Light On The Horizon
Workspaces can be grouped into a hierarchy by assigning them to a "superspace".
These assignments are implemented as relations between child workspace and parent superspace (using z3c.relationfield). There were two or three superspaces with a large number of child workspaces where the type change was reported very often. At first I dismissed this - a large number of workspaces of course meant a higher probability of the issue manifesting itself, and as the workspaces inside the superspaces all had the same type, a type change and therefore colour change of one of them was very noticeable. So the fact that problems were reported here more often did not necessarily imply that they were happening here more often. But then I found that when I started an instance with a fresh client cache and, as the first request, looked at a listing of all sibling workspaces inside a superspace, I could reliably trigger the type change issue!
Ghosts In The Machine
Finally we had something to set our debuggers on! We traced how this listing accessed the superspace and the relations to its children. z3c.relationfield uses zope.intid to identify the objects involved in a relation. zope.intid assigns integer IDs to all known objects. To keep track of what object has what ID, a mapping of ID to object is stored in an IOBTree. Now, this IOBTree does not contain the object itself. It is already stored elsewhere in the ZODB, and it's not stored twice. What the tree contains is a persistent reference to the object.
What's a persistent reference? Think of it as a kind of pointer to an object; a data structure that holds enough information to load the actual object from the database. As you may or may not know, objects that are loaded from the ZODB are initially in the "Ghost" state, which is one of the life cycle states of persistent objects. A ghost is mostly empty. No attributes are available - they are only loaded on demand, when some code accesses them. For example, when one object's data is loaded, say a Plone portal, and another object like a folder is stored inside it, then this subobject is loaded as a ghost first. We may not even be interested in it at this time, but manipulate some other part of the portal, and we can save some time and memory by not fully fetching the folder.
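To make the ghost idea more tangible, here is a toy model in plain Python. It is not how the persistent package implements ghosts; the class ToyPersistent, its attributes and the storage dict are all invented for this sketch. It only mimics the behaviour described above: the object starts out empty and fetches its state the first time some code asks for an attribute.

```python
class ToyPersistent:
    """Toy ghost: holds only an object id until an attribute is needed."""

    def __init__(self, oid, storage):
        self._oid = oid
        self._storage = storage  # dict standing in for the database
        self._ghost = True

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name.startswith("_") or not self._ghost:
            raise AttributeError(name)
        # First real access: load the state and stop being a ghost.
        self.__dict__.update(self._storage[self._oid])
        self._ghost = False
        return getattr(self, name)


storage = {7: {"title": "News folder"}}
folder = ToyPersistent(7, storage)

was_ghost = folder._ghost   # nothing has been loaded so far
title = folder.title        # this access triggers the load
is_ghost = folder._ghost
print(was_ghost, title, is_ghost)
```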
A persistent reference allows the system to load the referenced object as a ghost without accessing its proper database record. For this purpose it doesn't need much - a ghost being mostly empty - but it does need its class, to know what kind of object it's dealing with. And there's the rub. To know the object's class without fully loading it, the class is stored inside the persistent reference. That means some data duplication, but the performance improvement is definitely worth it, and it's not like the object's class is going to change, right? Right?!
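The ZODB serialises objects with pickle, and pickle has a hook for exactly this kind of reference: persistent_id on the pickler and persistent_load on the unpickler. The following sketch uses only the standard library to show the data duplication described above. It is a simplification under stated assumptions: Folder, the oid value and the helper names are made up, and the class is stored as a plain name string here rather than in whatever form the ZODB really uses.

```python
import io
import pickle


class Folder:
    """Stand-in for a persistent object; `oid` is a made-up object id."""

    def __init__(self, oid):
        self.oid = oid


class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Store a reference (oid plus class name) instead of the object.
        if isinstance(obj, Folder):
            return (obj.oid, type(obj).__name__)
        return None  # everything else is pickled as usual


class RefUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # A real system would build a ghost here; the toy just returns
        # the raw reference so we can look at it.
        return pid


def dump_record(state):
    buf = io.BytesIO()
    RefPickler(buf).dump(state)
    return buf.getvalue()


def load_record(data):
    return RefUnpickler(io.BytesIO(data)).load()


# The portal's record contains a reference to the folder, not the folder.
portal_record = dump_record({"title": "Portal", "news": Folder(oid=42)})
loaded = load_record(portal_record)
print(loaded["news"])
```

The class name ends up inside the portal's record. If the folder's class changes later, this copy does not change with it unless somebody rewrites the portal's record too.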
Wrong. We had messed with our classes and now our workspaces were known as different types of objects in different parts of the database. The main records had been updated to the new classes (GenericWorkspaceFolder etc.) but the persistent references all over the ZODB had been left untouched and still contained the old classes (Workspace). So, depending on how a ZEO client first encountered an object, either via its proper record or via a persistent reference, it would remember it like that. When the ghost was later filled with data, the class would not be looked up again, because it isn't supposed to be different. Consequently one client could have a different class for an object in memory than another client - for as long as the instance cache lasted.
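The effect can be reproduced with a toy model in plain Python. None of this is ZODB code: RECORDS, STALE_REFERENCE and the Client class are invented for the illustration, and the real cache is far more involved. The sketch assumes only what is described above, namely that a client creates the in-memory object from whatever it sees first and never re-checks the class when it fills in the state.

```python
class Workspace:
    """Old class, still named in untouched references."""


class GenericWorkspaceFolder:
    """New class, named in the migrated main record."""


# Toy database: oid -> (class in the main record, state).
RECORDS = {1: (GenericWorkspaceFolder, {"title": "Team A"})}

# A reference stored inside some other record that was never rewritten.
STALE_REFERENCE = (1, Workspace)


class Client:
    """Toy ZEO client with a per-client object cache."""

    def __init__(self):
        self.cache = {}

    def follow_reference(self, ref):
        oid, cls = ref
        if oid not in self.cache:
            # Create a ghost using the class named in the reference.
            self.cache[oid] = cls.__new__(cls)
        return self.cache[oid]

    def load(self, oid):
        cls, state = RECORDS[oid]
        if oid not in self.cache:
            self.cache[oid] = cls.__new__(cls)
        obj = self.cache[oid]
        # Fill in the state; the class of a cached object is not re-checked.
        obj.__dict__.update(state)
        return obj


# Client A meets the workspace via its own record first.
client_a = Client()
seen_by_a = client_a.load(1)

# Client B meets it via the stale reference first, then loads its state.
client_b = Client()
client_b.follow_reference(STALE_REFERENCE)
seen_by_b = client_b.load(1)

print(type(seen_by_a).__name__, type(seen_by_b).__name__)
```

Both clients see the same title, but they disagree about the class, which is the one-in-seven behaviour we observed behind the load balancer.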
There At Last
What saved us in the end was the brilliant zodbupdate script, which, as it turns out, was created for cases just like this one. You can provide it with a mapping from old to new classes, and it will crawl the database updating any objects and persistent references using the old classes. In our case we had to introduce a little tweak because we didn't have a one-to-one mapping of classes. As mentioned before, the old system had only one workspace class, but the new system had many. We solved this by simply passing in a list of objects with known new classes. We had already migrated the workspaces after all, so this list was reasonably easy to generate. A small modification of the zodbupdate script made it skip the regular process for these known objects.
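The decision logic behind that tweak can be sketched in a few lines. This is not the patch we applied and it does not call zodbupdate at all; the function target_class, the oid value and all module paths are placeholders. The "module Class" string format is modelled on the rename dictionaries zodbupdate works with, as far as I remember them, so treat that as an assumption too.

```python
# Regular one-to-one renames, keyed by "module ClassName" (paths made up).
RENAMES = {
    "old.package.workspace Workspace":
        "new.package.workspace WorkspaceFolder",
}

# Objects whose new class is already known from the earlier migration,
# keyed by object id (the value 0x2A is invented for this example).
KNOWN_CLASSES = {
    0x2A: "new.package.workspace GenericWorkspaceFolder",
}


def target_class(oid, stored_class, renames=RENAMES, known=KNOWN_CLASSES):
    """Return the class a record or reference should name after the update."""
    if oid in known:
        # Known object: bypass the regular rename lookup.
        return known[oid]
    # Everything else: apply the rename, or keep the class unchanged.
    return renames.get(stored_class, stored_class)


print(target_class(0x2A, "old.package.workspace Workspace"))
print(target_class(0x99, "old.package.workspace Workspace"))
print(target_class(0x99, "some.other Thing"))
```

The important part is that the same lookup is applied to persistent references as well as to the main records, so both end up naming the same class.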
What we're taking away from this is a deeper understanding of the ZODB. Usually we don't need to know much about it - it's just there, doing its job perfectly. But when descending to the lower levels, like for a custom migration, it's good to know your way around, as we've learned the hard way. Our ignorance about persistent references caused us to do an incomplete migration, and the behaviour of the instance cache made it even harder to figure out what had gone wrong. Persistence (no pun intended) in debugging and talking to people with more knowledge in the relevant area led us to the solution.
My advice, however, is not to leave the lower levels alone (which would of course have saved us a lot of trouble) but rather to go there more often, experiment more (in safe sandbox environments), learn more about the things you're using every day without even noticing, ideally way before you need that knowledge. Even the ZODB is not magic, just some pretty smart code. For my own part, I'm certainly motivated to look at this code more often than I have before.