In a previous post I presented some figures showing that in many cases the time to replicate a dataset from one docbase to another is governed by the size of the source dataset and not the amount of changes to be replicated. I want to provide a deeper understanding of what is happening to cause this. As this involves a fair amount of preliminary discussion of the workings of object replication I’ll make this a 2-part post. This post will discuss what Documentum will dump when you set up a replication job. The 2nd part will discuss how Oracle deals with the database queries and how this affects the throughput of the replication job. I would urge you to read these posts even if you don’t use Object Replication as the issues I discuss are applicable to wider design situations than just object replication.
For those who have not used the Documentum distributed architectures, or have not looked at them in detail, object replication is one of a number of options for improving access times when users are accessing objects in a docbase. Object replication is a multi-docbase architecture where objects in one docbase (the source) are ‘copied’ to a 2nd docbase (the target).
The motivation for this is:
- Users remote to the source, possibly with bandwidth, latency or connectivity problems, can have access to objects created in the source repository by accessing the copies (‘replicas’) in the target docbase
- Scarce peak-time bandwidth can be minimised by replicating objects from a remote repository to a local repository outside of peak hours
Object replication is a rich and complex package with a number of different options to choose from. However what I want to concentrate on here is the underlying implementation and how that determines the performance characteristics of the object replication setup.
Object replication consists of 3 processes: dump, transport and load.
The dump process is responsible for identifying objects to be replicated. The process is actually a specialised form of the ‘dump’ api, a low-level facility provided by the Content Server to enable an administrator to create a (full or partial) extract of the docbase. In essence the dump replication script constructs a dump object containing the ‘parameters’ for the operation and the Content Server initiates the dump when the object is saved to the repository. The parameters consist of database queries identifying the objects that need to be replicated. The output of the dump process is a dump file containing details of the objects to be replicated.
The dump file is then moved from the source repository to the target repository by the transport process and then the load process is invoked. The load process is a specialised form of the ‘load’ api, a facility to load a dump file into a repository. Again the external interface is very simple. A load object is created containing details of the dump file to be loaded into the repository and then saved to start the load process.
I am going to concentrate on the performance aspects of the dump process in this 2-part post.
INTERNAL OPERATION OF THE DUMP PROCESS
Internally the dump process receives the database queries specified in the construction of the dump object. The Content Server issues the queries to the Oracle database in turn. Each row returned represents an object that must be written to the dump file. The collection of objects identified by the queries specified in the dump object is the root dataset.
Each object from the root dataset that is written to the dump file will also have related objects that need to be dumped as well. The first time the dump process encounters the object it will dump the object itself together with all of its directly related objects. Directly related objects are:
1. objects identified by ID attributes of the dumped object
2. content (dmr_content) objects where parent_id = dumped object
3. containment (dmr_containment) objects where parent_id = dumped object (virtual docs)
4. acl (dm_acl) object for dumped object
5. relation (dm_relation) objects where parent_id = dumped object
A few points to note here. First, the dump process is recursive. When a directly related object is dumped its own directly related objects are dumped too, and those objects in turn have their directly related objects dumped, and so on. In some cases the dump of a single object can result in a huge graph of related objects being dumped as well.
Second, ID attributes include the i_chronicle_id and i_antecedent_id attributes. These attributes define the version tree for the object. With a recursive dump of the related objects this ensures that the full version tree of the root object is dumped.
Third, when a relation (dm_relation) object is dumped the object referenced by the child_id is recursively dumped since child_id is an ID attribute. If there are a number of objects linked by relation objects these will all be recursively dumped.
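To make the recursion concrete, here is a minimal Python sketch of the traversal. It is a toy model and not Documentum code: the object ids, the RELATED map and the dump_closure function are all hypothetical names invented for illustration, and the real Content Server finds related objects with database queries rather than a dictionary lookup.

```python
# Toy model of the recursive dump. NOT Documentum code: all ids are made up.
# Each object id maps to the ids of its directly related objects
# (ID attributes, content, containment, ACL and relation objects).
RELATED = {
    "doc_v2": ["doc_v1", "content_2", "acl_1", "rel_1"],
    "doc_v1": ["content_1", "acl_1"],
    "rel_1": ["child_doc"],          # child_id of the relation object
    "child_doc": ["content_3", "acl_1"],
    "content_1": [],
    "content_2": [],
    "content_3": [],
    "acl_1": [],
}


def dump_closure(root_ids, related):
    """Return every object reachable from the root dataset, in visit order."""
    dumped = []                      # order in which objects are written
    seen = set()                     # objects already dumped in this run
    stack = list(reversed(root_ids))
    while stack:
        obj_id = stack.pop()
        if obj_id in seen:
            continue                 # only dump an object the first time it is met
        seen.add(obj_id)
        dumped.append(obj_id)
        # Queue the directly related objects so they are dumped too, and so on.
        stack.extend(reversed(related.get(obj_id, [])))
    return dumped


result = dump_closure(["doc_v2"], RELATED)
print(len(result), result)
```

In this made-up graph a root dataset of one document pulls in eight objects: the previous version, three content objects, the shared ACL, the relation object and the document on the other end of the relation.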
Identifying all these related objects requires a number of extra queries that must be executed for each dumped object. Where the source dataset is large and each object has a large number of related objects the resulting number of queries required can grow dramatically.
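A rough back-of-the-envelope calculation shows how quickly this grows. The figures below are assumptions chosen purely for illustration (the number of lookups per object in particular is not a measured Documentum value), and estimated_queries is a hypothetical helper.

```python
def estimated_queries(root_objects, related_per_object, lookups_per_object):
    """Rough query count for one dump run.

    All three inputs are illustrative assumptions, not measured values.
    """
    objects_visited = root_objects * (1 + related_per_object)
    return objects_visited * lookups_per_object


# A small dataset with few related objects per document.
small = estimated_queries(1_000, 5, 4)
# A large dataset where each document drags in many related objects.
large = estimated_queries(500_000, 20, 4)
print(small, large)
```

Under these assumptions the small dataset needs 24,000 queries while the large one needs 42,000,000, even though the per-object work is identical.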
The replication dump process has an optimisation whereby objects dumped in a previous run are recorded in a database table (dm_replica_catalog) and are not subsequently dumped to the dump file. This ‘incremental’ option operates as long as the ‘Full Refresh’ flag is not set on the replication job object.
The incremental option reduces the amount of information that must be stored in the dump file but it should be appreciated that each query must still be run to check if the document or the relationship has changed. Where the source dataset is large and relatively static, the time spent checking the documents and relations of the source dataset far outweighs the time spent dumping new or changed documents.
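The effect can be sketched with a small simulation. This is a simplified model under stated assumptions, not the real mechanism: the catalog here is a plain Python set standing in for dm_replica_catalog, a single counter stands in for the per-object checking queries, and incremental_dump is a hypothetical name.

```python
def incremental_dump(source_ids, catalog, changed_ids):
    """Toy model of one incremental run.

    source_ids  - every object matched by the replication queries
    catalog     - ids recorded as dumped by a previous run
    changed_ids - ids modified since that run
    Returns (number of checks run, list of objects written).
    """
    checks = 0
    written = []
    for obj_id in source_ids:
        checks += 1                  # every object is still examined
        if obj_id not in catalog or obj_id in changed_ids:
            written.append(obj_id)   # only new or changed objects are written
            catalog.add(obj_id)
    return checks, written


source = [f"doc_{i}" for i in range(100_000)]
catalog = set(source[:99_990])       # the last 10 documents are new
changed = {"doc_5", "doc_17"}        # 2 existing documents were modified
checks, written = incremental_dump(source, catalog, changed)
print(checks, len(written))
```

Only 12 objects end up in the dump file, but 100,000 checks are still made, so in this model the run time tracks the size of the source dataset rather than the number of changes.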
In the next post I will describe the throughput limitations the database imposes on the replication design.
Update (11 July 2007): Don’t despair, it’s coming soon!