50,000 and counting

May 14, 2008 at 8:51 pm | Posted in Object Replication | 5 Comments

16 months on, the blog has finally clocked up 50,000 hits! My original motivation for blogging was to be able to write something less formal, which took less time to write and format than the articles I was putting out on the Xense website. At the time there were very few Documentum blogs (Johnny Gee’s was probably the only one I noticed regularly) and I was keen to do something deeply technical in a similar vein to Jonathan Lewis (Oracle) or Mark Russinovich (Windows).

One of the things that has surprised me is just how much effort it takes to keep finding the time and inspiration to write. Back in January 2007 I managed more than one article a week; these days I think I’m doing well if I can do a couple a month. Partly this is because back then I had a number of small pieces of research that were already done and simply needed to be turned into words. These days I still have loads of ideas but so little time to follow up and do the research.

The focus has changed a little bit too. When I started I had been spending a lot of time knee-deep in object replication. Frankly I don’t like the technology and would be very hesitant to recommend it on a project. Part of the problem is the obscurity of the implementation. The serious guts of the workings are embedded deep into the Content Server C/C++ code so it’s not easy to work out what is going on when it fails unless you want to dive into the assembly-level debugger (which I have resorted to on occasion).

The other problem is the dump and load process that underlies it. Dump and load is simply too flaky for a reliable replication solution. I managed to find various ways of crashing the Content Server (which is catastrophic on the thread-based Windows implementation) and wrote them up for a client in a document called ‘Killing the Content Server’. I sent it to Documentum support too.

Here’s to the next 50,000!

BTW I’ve added a blogroll entry for Andrew Binstock: an excellent blog covering all sorts of things related to Java development. Very honest, very open.

Object Replication Performance

June 4, 2007 at 9:36 pm | Posted in Documentum and Oracle, Object Replication, Performance | 4 Comments

In a previous post I presented some figures showing that in many cases the time to replicate a dataset from one docbase to another is governed by the size of the source dataset and not by the number of changes to be replicated. I want to provide a deeper understanding of what is happening to cause this. As this involves a fair amount of preliminary discussion of the workings of object replication I’ll make this a 2-part post. This post will discuss what Documentum will dump when you set up a replication job. The 2nd part will discuss how Oracle deals with the database queries and how this affects the throughput of the replication job. I would urge you to read these posts even if you don’t use object replication, as the issues I discuss apply to wider design situations than just object replication.

For those who have not used the Documentum distributed architectures, or looked at them in detail, object replication is one of a number of options that can be used to improve access times for users accessing objects in a docbase. Object replication is a multi-docbase architecture where objects in one docbase (the source) are ‘copied’ to a 2nd docbase (the target).

The motivation for this is:

  1. Users remote to the source, possibly with bandwidth, latency or connectivity problems, can have access to objects created in the source repository by accessing the copies (‘replicas’) in the target docbase
  2. Scarce peak-time bandwidth can be minimised by replicating objects from a remote repository to a local repository outside of peak hours

Object replication is a rich and complex package with a number of different options to choose from. However what I want to concentrate on here is the underlying implementation and how that determines the performance characteristics of the object replication setup.

Object replication consists of 3 processes:

  • dump
  • transport
  • load

The dump process is responsible for identifying objects to be replicated. The process is actually a specialised form of the ‘dump’ api, a low-level facility provided by the Content Server to enable an administrator to create a (full or partial) extract of the docbase. In essence the dump replication script constructs a dump object containing the ‘parameters’ for the operation and the Content Server initiates the dump when the object is saved to the repository. The parameters consist of database queries identifying the objects that need to be replicated. The output of the dump process is a dump file containing details of the objects to be replicated.

The dump file is then moved from the source repository to the target repository by the transport process and then the load process is invoked. The load process is a specialised form of the ‘load’ api, a facility to load a dump file into a repository. Again the external interface is very simple. A load object is created containing details of the dump file to be loaded into the repository and then saved to start the load process.

I am going to concentrate on the performance aspects of the dump process in this 2-part post.

INTERNAL OPERATION OF THE DUMP PROCESS

Internally the dump process receives the database queries specified in the construction of the dump object. The Content Server issues the queries to the Oracle database in turn. Each row returned represents an object that must be written to the dump file. The collection of objects identified by the queries specified in the dump object is the root dataset.

Each object from the root dataset that is written to the dump file will also have related objects that need to be dumped as well. The first time the dump process encounters the object it will dump the object itself together with all of its directly related objects. Directly related objects are:

1. objects identified by ID attributes of the dumped object
2. content (dmr_content) objects where parent_id = dumped object
3. containment (dmr_containment) objects where parent_id = dumped object (virtual docs)
4. acl (dm_acl) object for dumped object
5. relation (dm_relation) objects where parent_id = dumped object

A few points to note here. First, the dump process is recursive. When a directly related object is dumped it will also have its own directly related objects dumped. Those objects will in turn have their own directly related objects that must be dumped, and so on. In some cases the dump of a single object can result in a huge graph of related objects being dumped as well.

Second, ID attributes include the i_chronicle_id and i_antecedent_id attributes. These attributes define the version tree for the object. With a recursive dump of the related objects this ensures that the full version tree of the root object is dumped.

Third, when a relation (dm_relation) object is dumped the object referenced by the child_id is recursively dumped since child_id is an ID attribute. If there are a number of objects linked by relation objects these will all be recursively dumped.

Identifying all these related objects requires a number of extra queries that must be executed for each dumped object. Where the source dataset is large and each object has a large number of related objects, the number of queries required can grow dramatically.
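
To make the recursion concrete, here is a minimal sketch of how a single root object can pull in a whole graph of related objects. It is only an illustration: the `RELATED` dictionary, the object names and the `dump_closure` function are all hypothetical stand-ins, the real traversal lives inside the Content Server, and I use a simple worklist rather than true recursion.

```python
from collections import deque

# Hypothetical in-memory stand-in for a docbase: each object id maps to the ids
# of its directly related objects (ID attributes, content, ACL, relations).
RELATED = {
    "doc_v2": ["doc_v1", "content_2", "acl_1", "rel_1"],
    "doc_v1": ["content_1", "acl_1"],
    "rel_1": ["doc_v2", "child_doc"],  # parent_id and child_id of a relation
    "child_doc": ["content_3", "acl_1"],
}

def dump_closure(root_ids, related):
    """Return (objects to dump in visit order, number of related-object lookups)."""
    dumped = []
    seen = set()
    queue = deque(root_ids)
    lookups = 0
    while queue:
        obj_id = queue.popleft()
        if obj_id in seen:
            continue  # already dumped earlier in this run
        seen.add(obj_id)
        dumped.append(obj_id)
        lookups += 1  # stands in for the extra queries run per dumped object
        queue.extend(related.get(obj_id, []))
    return dumped, lookups

dumped, lookups = dump_closure(["doc_v2"], RELATED)
print(len(dumped), "objects dumped for 1 root object")
```

Even in this toy dataset one root document drags in its previous version, three content objects, an ACL, a relation and the child document on the other end of that relation.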

The replication dump process has an optimisation whereby objects dumped in a previous run are recorded in a database table (dm_replica_catalog) and are not subsequently dumped to the dump file. This ‘incremental’ option operates as long as the ‘Full Refresh’ flag is not set on the replication job object.

The incremental option reduces the amount of information that must be stored in the dump file, but it should be appreciated that each query must still be run to check whether the document or the relationship has changed. Where the source dataset is large and relatively static, the time spent checking the documents and relations of the source dataset far outweighs the time spent dumping new or changed documents.
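
The effect can be shown with a toy model. This is a sketch under my own assumptions, not a description of how dm_replica_catalog is actually structured: the `catalog` dictionary, the integer version stamps and the `incremental_run` function are hypothetical. The point is only that the number of checks tracks the size of the dataset while the number of objects written tracks the number of changes.

```python
def incremental_run(objects, catalog):
    """objects: id -> version stamp in the source.
    catalog: id -> stamp recorded by the previous run (updated in place).
    Returns (number of checks performed, ids written to the dump file)."""
    checks = 0
    written = []
    for obj_id, stamp in objects.items():
        checks += 1  # every object is checked, changed or not
        if catalog.get(obj_id) != stamp:
            written.append(obj_id)  # only new or changed objects are written
            catalog[obj_id] = stamp
    return checks, written

source = {f"doc_{i}": 1 for i in range(1000)}
catalog = {}

first_checks, first_written = incremental_run(source, catalog)
source["doc_7"] = 2  # a single change between runs
second_checks, second_written = incremental_run(source, catalog)

print(first_checks, len(first_written))
print(second_checks, second_written)
```

The second run writes one object but still performs a check for each of the 1,000 objects.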

In the next post I will describe the throughput limitations the database imposes on the replication design.

Update (11 July 2007): Don’t despair, it’s coming soon!

Object Replication Performance

January 8, 2007 at 5:48 pm | Posted in Object Replication, Performance | 3 Comments

In a recent post on the Documentum Support Forums I alluded to some testing I had done on Documentum Object Replication. I give the test results here to demonstrate some important design points when deciding how (or if) to deploy object replication.

The test involved timing a replication run from start to finish (source and target) using different sizes of data set. In each run no changes had been made to the replicated data so you might expect that:

  1. Each replication run is quite quick
  2. The time to replicate is related to the number of changes made to objects under the replication folder

The replication used the following options:

  • Full Refresh – false
  • Fast Replication – false

Here are the results:

Objects       Duration
-------       --------
  1,000           64s
 10,000          344s
 50,000         2032s
100,000         4213s

As you can see it takes a substantial amount of time to replicate no changes! In fact the time the replication takes is roughly proportional to the number of objects under the replication folder. (NB: don’t take these figures as some sort of guideline for how long a replication will take, as the duration will depend crucially on the number of relationships between objects and the speed of the processor.)

The problem is related to the way Documentum’s dump API checks each object and its related objects to see if they have changed. Even with a very simple dataset with no user-created relationships there are usually 6 or 7 queries to run for each object, and this is a substantial amount of processing when aggregated over a large number of objects. For more complicated datasets with multiple relationships between objects the processing times will increase markedly.
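
As a quick piece of arithmetic on the figures above, the following sketch works out the elapsed time per object for each run and the number of queries implied by ‘6 or 7 queries per object’. The variable names are mine; the durations are the ones from the results table and cover the whole run (source and target), so they are not pure query time.

```python
# Durations in seconds from the results table, keyed by object count.
results = {1_000: 64, 10_000: 344, 50_000: 2032, 100_000: 4213}

# Elapsed milliseconds per object for each run.
per_object_ms = {n: secs / n * 1000 for n, secs in results.items()}

# Implied number of dump queries at 6 and at 7 queries per object.
query_range = {n: (6 * n, 7 * n) for n in results}

for n in results:
    low, high = query_range[n]
    print(f"{n:>7,} objects: {per_object_ms[n]:5.1f} ms/object, "
          f"{low:,} to {high:,} queries")
```

The three larger runs come out at roughly 34 to 42 ms per object, which is what ‘proportional to the number of objects’ looks like in practice. The 1,000-object run is higher at 64 ms, which may reflect fixed job overhead. At 100,000 objects the dump is issuing somewhere between 600,000 and 700,000 queries to find nothing to do.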

If you think this type of behaviour is unsatisfactory for your particular requirement there are a couple of things you can look at.

  1. Rather than use a single replication folder, try to split the data across multiple folders and create a replication job for each one. Each replication job is a single-threaded process, so multiple CPUs do not benefit a single job; multiple replication jobs, however, can take advantage of multi-processor machines
  2. Consider using the Fast Replication option. Fast Replication reduces the amount of checking for related objects and, particularly for datasets with multiple relationships between objects, can be considerably faster for no-change replication runs. Make sure you test thoroughly though, particularly if you are relying on changes to related objects being automatically replicated
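
For the first option, a back-of-envelope estimate of the possible gain might look like the sketch below. It is a rough model, not a measurement: it assumes the per-object rate from the 100,000-object run carries over, that the data splits evenly across folders, and that neither the database nor the disks become the bottleneck when jobs run side by side. `estimated_wall_clock` and its parameters are hypothetical names of my own.

```python
import math

# Seconds per object, taken from the 100,000-object run in the results table.
SECONDS_PER_OBJECT = 4213 / 100_000

def estimated_wall_clock(total_objects, jobs, cpus):
    """Rough no-change run time when data is split evenly across `jobs` folders."""
    per_job_objects = math.ceil(total_objects / jobs)
    per_job_seconds = per_job_objects * SECONDS_PER_OBJECT
    waves = math.ceil(jobs / cpus)  # jobs beyond the CPU count wait their turn
    return per_job_seconds * waves

single = estimated_wall_clock(100_000, jobs=1, cpus=4)
split = estimated_wall_clock(100_000, jobs=4, cpus=4)
print(f"1 job: {single:.0f}s, 4 jobs on 4 CPUs: {split:.0f}s")
```

Treat the result as a best case; any contention between the jobs will eat into it, so measure on your own system before relying on it.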
