Installing D6 on Oracle RAC

June 12, 2008 at 5:13 pm | Posted in D6, Documentum and Oracle, Performance | 1 Comment

It’s a tricky business keeping Release Notes up-to-date. I’ve just been browsing through the latest D6 SP1 Release Notes and my attention was caught by a small section about installing on Oracle RAC:

Installing with Oracle Real Application Clusters (121442)
If you are installing Content Server with Oracle Real Application Clusters (RAC), set
the value of the Oracle parameter MAX_COMMIT_PROPAGATION_DELAY to 0 (zero).
This value is required to ensure that the data that Content Server uses is consistent across
all Oracle nodes. Values other than zero are not supported.

I presume this has been in the various Content Server release notes for a while, and it would have been important: using the default (or any other non-zero value) means Oracle uses a commit scheme that can delay other RAC nodes from seeing changes. Since a single Content Server session often uses more than one database session (if you use DF_EXEC_QUERY in a DFC query call you are asking Content Server to start a new Oracle session), and those sessions could be attached to two different Oracle RAC nodes, a delay in seeing recently changed values could cause havoc.

Now I know what you're thinking: since we would obviously prefer to have data available immediately, why doesn't Oracle use 0 as the default and why wouldn't everyone just set it to 0 anyway? The answer of course is that there is a performance cost to be paid; having to issue network calls to propagate information about every commit can be expensive (certain platforms seemed to have real problems with it), so in many cases the default setting was fine provided your Oracle application didn't have functional problems.

However since Oracle 10g Release 2 this parameter is deprecated – Oracle now uses the 'broadcast on commit' behaviour implied by MAX_COMMIT_PROPAGATION_DELAY=0 automatically. The best overview of this change is given in this paper by Dan Norris. Since the Content Server in D6 is only certified for Oracle on 10g Release 2, the entry shown above in the Release Notes is no longer needed. In fact you could argue it is positively harmful. As they say, forewarned is forearmed.
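If you want to confirm what a particular database is doing, the parameter can be checked from SQL*Plus. The query below uses the standard v$parameter view (on 10g Release 2 you should see 0); the alter system line is the sort of thing you would only run on an older release, and since this is a static parameter it needs scope=spfile and an instance restart – illustrative only, check with your DBA first:

select name, value, isdefault
from   v$parameter
where  name = 'max_commit_propagation_delay';

alter system set max_commit_propagation_delay = 0 scope=spfile sid='*';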

By the way, I stumbled on this bit of information whilst perusing the 'Support by Product' section on Powerlink. It is currently in beta and contains, amongst other things, a section for Documentum Content Server. It's basically a portal-type view of Content Server support information, bringing together support notes, whitepapers, documentation (there's a very nice Top manuals section) and so on. I think it's a brilliant idea and I urge everyone to have a look and provide feedback to the support team.

DQL Hints – Optimize Top

August 14, 2007 at 11:30 am | Posted in Documentum and Oracle, Performance | 5 Comments

DQL hints are a very useful feature introduced in Documentum 5 and incrementally improved from 5.1 through 5.3. No doubt we will get some more in Documentum 6. A DQL hint usually translates into a SQL-level hint when Content Server transforms the DQL into a SQL statement. For example, when the database is Oracle the SQL-level hint will look something like this:

select /*+ hint */ object_name from dm_sysobject_s

The hint is added after the ‘select’ keyword between the /*+ and */ tokens.

There is a DQL hint called optimize top which might produce a performance boost when returning the first few records from a potentially large result set. According to the DQL documentation this translates to an 'optimize top' hint in Oracle. This got me scratching my head as I didn't recall such a hint in Oracle, and a quick google confirmed that it is not a recognised Oracle hint. The quickest way of finding out what actually happens is to check the SQL produced by Documentum. As I've pointed out before, there are two ways to do this: either call 'execute get_last_sql' after you have run a DQL, or use the Documentum Administrator 'Show SQL' option (which does the same thing on your behalf). What I found was that 'optimize top' actually translates to a 'first_rows' hint in Oracle.

The idea of first_rows is that the Cost-based Optimizer (CBO) should choose a query execution plan that retrieves the first rows quickly even if the overall cost to retrieve all rows is greater than other query plans. This is potentially a good idea for many queries that are executed in ‘paged’ query scenarios. A paged query occurs for example when viewing a large folder in Webtop. The query to return all the documents is run but only the first 50 or 100 rows are initially retrieved to build the first few pages. If the user then selects the ‘next’ page the next 50 or 100 rows are retrieved.

first_rows is actually a deprecated hint in Oracle. Since Oracle 9 the preferred form is first_rows_n, where n is the number of rows to optimise for. So, for example, first_rows_10 will optimise for returning the first 10 rows quickly. If you want to use this form of the hint from DQL you have to use a 'pass-through' hint:

select * from dm_document enable(oracle('first_rows_10'))
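Checking with 'execute get_last_sql' (or 'Show SQL') you should then see the hint appear after the select keyword in the generated SQL. Abbreviated and illustrative only – the real statement selects many more columns and, for a non-superuser, carries the usual security predicates:

select /*+ first_rows_10 */ all dm_document.r_object_id
from   dm_document_sp dm_document
where  (    dm_document.i_has_folder = 1
        and dm_document.i_is_deleted = 0)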

Performance Tip: Avoid ORed Folder predicates (2)

July 26, 2007 at 12:03 pm | Posted in Documentum and Oracle, Performance | Leave a comment

A little while ago I posted some advice on avoiding performance problems with queries that look like this:

select ... from ... where folder('/ABC') or folder('/XYZ')

In addition to the advice I gave in that original post you can also write the above as:

select ... from ... where folder('/ABC','/XYZ')

This construct appears to perform well and should definitely be preferred to the version with ORed Folder predicates.

The general syntax for the folder predicate is:

folder(path1[, path2,...][,descend])

so you can use the descend clause. This means that the following:

select ... from ... where folder('/ABC',descend) or folder('/XYZ',descend)

could also be better written as:

select ... from ... where folder('/ABC','/XYZ',descend)

Of course if you need to mix folder predicates with and without descend then you're back to the original solution, along the lines of the sketch below.
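For example, a mixed query such as this (illustrative paths):

select r_object_id from dm_document
where  folder('/ABC',descend) or folder('/XYZ')

could be rewritten using the UNION ALL approach from the original post:

select r_object_id from dm_document where folder('/ABC',descend)
UNION ALL
select r_object_id from dm_document where folder('/XYZ')

Note that a document linked under both locations will appear twice with UNION ALL; use UNION if that matters.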

Object Replication Performance

June 4, 2007 at 9:36 pm | Posted in Documentum and Oracle, Object Replication, Performance | 4 Comments

In a previous post I presented some figures that showed that in many cases the time to replicate a dataset from one docbase to another is governed by the size of the source dataset and not the amount of changes to be replicated. I want to provide a deeper understanding of why this happens. As this involves a fair amount of preliminary discussion of the workings of object replication I'll make this a 2-part post. This post discusses what Documentum will dump when you set up a replication job. The second part will discuss how Oracle deals with the database queries and how this affects the throughput of the replication job. I would urge you to read these posts even if you don't use Object Replication, as the issues I discuss are applicable to wider design situations than just object replication.

For those that have not looked in detail at (or even used) the Documentum distributed architectures, object replication is one of a number of options that can be used to improve access times when users are accessing objects in a docbase. Object replication is a multi-docbase architecture where objects in one docbase (the source) are 'copied' to a second docbase (the target).

The motivation for this is:

  1. Users remote to the source, possibly with bandwidth, latency or connectivity problems, can have access to objects created in the source repository by accessing the copies (‘replicas’) in the target docbase
  2. Scarce peak-time bandwidth can be minimised by replicating objects from a remote repository to a local repository outside of peak hours

Object replication is a rich and complex package with a number of different options to choose from. However what I want to concentrate on here is the underlying implementation and how that determines the performance characteristics of the object replication setup.

Object replication consists of 3 processes:

  • dump
  • transport
  • load

The dump process is responsible for identifying objects to be replicated. The process is actually a specialised form of the ‘dump’ api, a low-level facility provided by the Content Server to enable an administrator to create a (full or partial) extract of the docbase. In essence the dump replication script constructs a dump object containing the ‘parameters’ for the operation and the Content Server initiates the dump when the object is saved to the repository. The parameters consist of database queries identifying the objects that need to be replicated. The output of the dump process is a dump file containing details of the objects to be replicated.

The dump file is then moved from the source repository to the target repository by the transport process and then the load process is invoked. The load process is a specialised form of the ‘load’ api, a facility to load a dump file into a repository. Again the external interface is very simple. A load object is created containing details of the dump file to be loaded into the repository and then saved to start the load process.

I am going to concentrate on the performance aspects of the dump process in this 2-part post.

INTERNAL OPERATION OF THE DUMP PROCESS

Internally the dump process receives the database queries specified in the construction of the dump object. The Content Server issues the queries to the Oracle database in turn; each row returned represents an object that must be written to the dump file. The collection of objects identified by these queries is the root dataset.

Each object from the root dataset that is written to the dump file will also have related objects that need to be dumped as well. The first time the dump process encounters an object it will dump the document itself together with all of its directly related objects (some illustrative lookups are sketched after the list below). Directly related objects are:

1. objects identified by ID attributes of the dumped object
2. content (dmr_content) objects where parent_id = dumped object
3. containment (dmr_containment) objects where parent_id = dumped object (virtual docs)
4. acl (dm_acl) object for dumped object
5. relation (dm_relation) objects where parent_id = dumped object
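To give a flavour of the database work involved, these are the kinds of lookups – expressed here as DQL for readability, with a placeholder id – that have to be resolved for every single dumped object (the actual SQL issued by the dump process will differ):

select r_object_id from dmr_content where any parent_id = '<dumped object id>'
select r_object_id from dmr_containment where parent_id = '<dumped object id>'
select r_object_id from dm_relation where parent_id = '<dumped object id>'
select r_object_id from dm_acl where owner_name = '<acl_domain>' and object_name = '<acl_name>'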

A few points to note here. First, the dump process is recursive: when a directly related object is dumped it will also have its own directly related objects dumped, those objects have their own directly related objects that must be dumped, and so on. In some cases the dump of a single object can result in a huge graph of related objects being dumped as well.

Second, ID attributes include the i_chronicle_id and i_antecedent_id attributes. These attributes define the version tree for the object. With a recursive dump of the related objects this ensures that the full version tree of the root object is dumped.

Third, when a relation (dm_relation) object is dumped the object referenced by the child_id is recursively dumped since child_id is an ID attribute. If there are a number of objects linked by relation objects these will all be recursively dumped.

Identifying all these related objects requires a number of extra queries that must be executed for each dumped object. Where the source dataset is large and each object has a large number of related objects, the resulting number of queries can grow dramatically.

The replication dump process has an optimisation whereby objects dumped in a previous run are recorded in a database table (dm_replica_catalog) and are not subsequently dumped to the dump file. This ‘incremental’ option operates as long as the ‘Full Refresh’ flag is not set on the replication job object.

The incremental option reduces the amount of information that must be stored in the dump file, but it should be appreciated that each query must still be run to check whether the document or the relationship has changed. Where the source dataset is large and relatively static, the time spent checking the documents and relations of the source dataset far outweighs the time spent dumping new or changed documents.
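Incidentally, since dm_replica_catalog is a plain database table rather than an object type, a quick way to get a feel for how much cataloguing a job has accumulated is to count the rows directly in SQL*Plus against the repository schema (illustrative only; the table layout itself is undocumented):

select count(*) from dm_replica_catalog;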

In the next post I will describe the throughput limitations the database imposes on the replication design.

Update (11 July 2007): Don’t despair, it’s coming soon!

Proving database connectivity

April 25, 2007 at 3:23 pm | Posted in Architecture, Documentum and Oracle, Troubleshooting | 2 Comments

I was faced with the problem of a Content Server not starting following a server reboot. I brought up Documentum Server Manager which had both the Start and Stop buttons greyed out. This usually means the service has been asked to start up but is still in the startup process. After 10 minutes with no CPU or IO activity I concluded that the connection to the database had hung. Killing the documentum.exe process and starting the content server service again had no effect so I started a command prompt and typed:

c:\>sqlplus /nolog
SQL>connect system/******

Connected

No problem with connecting to the database.

After scratching my head for a few minutes I thought I’d check the Oracle TNS Listener:

c:\>lsnrctl
LSNRCTL>status
Connecting to (ADDRESS=(PROTOCOL=tcp)(PORT=1521))

The command hung; usually there's a load of status information after the 'Connecting' line, so the problem seemed to lie with the listener. After restarting the Oracle TNS Listener service the docbase started up successfully.

But hold on! How come SQL Plus could connect? Doesn’t that require the listener to reach the database? In some cases it doesn’t. Using the simple connect system/****** connects you to the default database on the local machine and doesn’t require the listener. However if I had used the connect user/password@connect_string command format:

SQL>connect system/******@dctm3

I would have got the same hanging behaviour I got from the content server. In this case dctm3 is a net service name defined in the tnsnames.ora file.

It’s always useful to have simple troubleshooting tools to test out if all components in the system are up and running. Using SQLPlus to test connectivity to the database is a good idea but the test needs to be as close as possible to the calls the Content Server would have been making. In future my SQLPlus troubleshooting test will be:

  1. Open server.ini for the content server and retrieve the database connection string contained in the database_conn key
  2. Test the connection from SQL Plus using the connection string connect user/password@connect_string (see the example below)
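For example, if server.ini contained something like this (illustrative values – database_conn and database_owner are standard server.ini keys, but the names here are made up):

[SERVER_STARTUP]
docbase_name = dctm3
database_conn = DCTM3
database_owner = dctm3

then the test from the command prompt would be:

c:\>sqlplus /nolog
SQL>connect dctm3/********@DCTM3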

Assessing Query Performance

April 18, 2007 at 8:18 pm | Posted in Documentum and Oracle, Performance, Troubleshooting, Xense Profiler | Leave a comment

In yesterday's post I talked about the potential performance problems with a particular query. I noticed this problem whilst reviewing a trace file for a batch process I was testing. Since this was a log4j trace, the trace lines were output with detailed timings on each line. The trace file consistently had two lines that were separated by 8 or more seconds. I wondered what was happening in this period: if all of that time was spent running database queries, that's an awful lot of processing going on.

I captured a dmcl trace whilst running the process and put the output through the Xense Profiler (a DMCL trace profiling tool). The top 10 queries report showed a single query consuming most of the time (our friend the ORed Folder query from yesterday), so I derived the SQL and set to work in SQL Plus to look at the performance of the query.

SQL Plus has a very useful feature called 'autotrace'. When you execute a query with autotrace switched on, SQL Plus will output a query execution plan and some very useful execution statistics. The execution statistics from running the problem query were:

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     655992  consistent gets
          0  physical reads
          0  redo size
       3528  bytes sent via SQL*Net to client
        558  bytes received via SQL*Net from client
          7  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
        132  rows processed

The key statistic here is the 655,992 consistent gets. Each consistent get is a fetch of a database buffer from the database buffer cache (AKA a logical IO). This is a large amount of work to retrieve just 132 rows, and it explained why the query was taking so long (as a very rough rule of thumb, Oracle running on a reasonably fast processor might be able to perform between 50,000 and 100,000 logical IOs per second). Notice that there were no 'costly' physical reads; the time was all spent thrashing through cached data buffers.
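For reference, autotrace is switched on with a standard SQL*Plus command (you need the PLUSTRACE role, or equivalent access to the v$ views, for the statistics output):

SQL> set autotrace traceonly statistics

Use 'set autotrace on' if you also want the query results and the execution plan.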

Still this is only 8 seconds and this is only a batch process, why worry?

Three reasons:

  • There were no physical reads in the execution statistics because the only query I had been running on the test system was this one. In a real life system there are likely to be many other queries running all utilising the buffer cache. There is a good chance that running this query in a real life production system will mean many of the logical IOs will turn into physical IOs. Even if only 5% become physical reads (i.e. we have a ‘good’ buffer cache hit ratio for this query) that’s 30,000 physical IOs – our query suddenly starts taking a lot longer to execute.
  • The system I was testing on had a relatively small number of objects (about 150,000). Since a production system is likely to be much bigger and probably growing, the number of logical IOs required to complete the query will grow in proportion. This is a consequence of the execution plan for the query; with dm_sysobject_r or dm_sysobject_s as the first table in the join we are effectively scanning through a table that grows in proportion to the size of the docbase.
  • It is always a good idea to seek to reduce the workload placed on the database. This improves the scalability and general responsiveness of the system.

So what happened after restructuring the query? SQL Plus with autotrace gave:

Statistics
----------------------------------------------------------
         14  recursive calls
          0  db block gets
       1032  consistent gets
          0  physical reads
          0  redo size
       3528  bytes sent via SQL*Net to client
        558  bytes received via SQL*Net from client
          7  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
        132  rows processed

Only 1,032 logical IOs, and since the query can drive through an index on dm_folder_r the performance is largely unaffected by increases in the size of the docbase.

Performance Tip: Avoid ORed Folder predicates

April 17, 2007 at 8:48 pm | Posted in Documentum and Oracle, Performance | 1 Comment

Consider the following query that contains a ‘Folder’ predicate:

select 	r_object_id 
from 	dm_document 
where 	folder('/Temp/Folder1',descend) 

The query selects all documents that are in /Temp/Folder1 or any sub-folders underneath. This sort of query generally performs quite well. However if you want to select documents from two or more sets of folders it is best to avoid ORing the folder predicates. E.g. this query may scale poorly as the number of dm_document objects in the system increases:

select 	r_object_id 
from 	dm_document 
where 	folder('/Temp/Folder1',descend) 
OR 	folder('/Temp/Folder2',descend)

Why is this? The SQL generated by the first query looks like this:

select 	all 
	dm_document.r_object_id 
from 	dm_document_sp dm_document, 
	dm_folder_r dm_folder_r1, 
	dm_sysobject_r dm_sysobject_r2 
where 	(    dm_document.r_object_id = dm_sysobject_r2.r_object_id 
	 and dm_sysobject_r2.i_folder_id = dm_folder_r1.r_object_id 
	 and dm_folder_r1.i_ancestor_id = '0b0000038000e990'
	) 
and 	(    dm_document.i_has_folder = 1 
	 and dm_document.i_is_deleted = 0
	)

The folder predicate is implemented by adding the dm_folder_r table into the join and filtering on the i_ancestor_id field. The database optimiser can evaluate different join orders and work out the best one. If there are a relatively small number of objects under the folder /Temp/Folder1 then probably the best table to join first is dm_folder_r; the subsequent joins to dm_sysobject_r and dm_document_sp will then involve only a small number of rows.

On the other hand the SQL for the ORed predicate DQL looks like this:

select 	all 	
	dm_document.r_object_id 
from 	dm_document_sp dm_document 
where 	(    dm_document.r_object_id in (	
		select 	dm_sysobject_r2.r_object_id 
		from 	dm_sysobject_r dm_sysobject_r2, 
			dm_folder_r dm_folder_r1 
		where 	dm_sysobject_r2.i_folder_id 
				= dm_folder_r1.r_object_id 
		and 	dm_folder_r1.i_ancestor_id 
				= '0b0000038000e990')  
	 or  dm_document.r_object_id in (	
		select 	dm_sysobject_r2.r_object_id 
		from 	dm_sysobject_r dm_sysobject_r2, 
			dm_folder_r dm_folder_r1 
			where 	dm_sysobject_r2.i_folder_id 
					= dm_folder_r1.r_object_id 
			and 	dm_folder_r1.i_ancestor_id 
					= '0b0000038000e995') 
	) 
and 	(    dm_document.i_has_folder = 1 
	 and dm_document.i_is_deleted = 0
	)

The SQL is completely different. In this case the folder predicates are implemented using a correlated sub-query for each folder predicate. The database is unable to find a query plan that allows the optimal join order – the one where dm_folder_r is joined first. The only way the database can execute the query is to select every current dm_document object and attempt to join it to dm_folder_r. In a large database the work to select every dm_document object could be huge.

So what are the alternatives? I can think of a few but a UNION ALL would be top of my list in this particular case:

select 	r_object_id 
from 	dm_document 
where 	folder('/Temp/Folder1',descend) 
UNION ALL
select 	r_object_id 
from 	dm_document 
where 	folder('/Temp/Folder2',descend)

By the way, this performance tip applies whether or not the folder predicates contain the 'descend' clause.

RPC 116 and ORA-03113

March 22, 2007 at 4:33 pm | Posted in Documentum and Oracle, Troubleshooting | Leave a comment

You may have encountered this error:
[DM_SESSION_E_RPC_ERROR] RPC error 116.
It is returned by the DMCL to a Documentum client (like DFC, WDK, Webtop and so on) when the DMCL RPC client stub has encountered an error whilst waiting for a response from the server. Now this could be due to all sorts of things such as:

  • Network problems
  • Content Server session or server crash
  • Resource problems on the Content Server machine

Despite the text that usually accompanies the error, network problems are the least likely source of this problem in most environments. More often than not the Content Server process/thread handling the RPC has hit some sort of problem (possibly a bug) and had to 'bail out', with an RPC error message ending up at the client.

The first place to look for information is the docbase log file ($DOCUMENTUM/dba/log) and the session logs ($DOCUMENTUM/dba/log/[docbaseid]/[username]). Often these will contain further error messages that reveal the real source of the problem. A support call to Documentum is often in order at this point particularly if it is a recurring issue.

ORA-03113 is really the Oracle counterpart of the Documentum RPC 116 error, which is why I have put them together in this post. When an Oracle client (which for Documentum users means the Content Server) creates a session with Oracle it is connected to a process/thread in the Oracle server. Should this process encounter a problem that causes it to abort, the client process receives an ORA-03113 error. Again this could be a network problem but, particularly for Documentum Content Server setups where the Content Server is on the same machine as Oracle, this is less likely than other explanations.

I have encountered these problems in WebPublisher when running an 'apply presentation' command on a large number of files. In that particular case it turned out to be an Oracle bug which was resolved by upgrading from 9.2.0.1.0 to 9.2.0.6.0. The general approach is to look for process errors in the Oracle cdump log directory and if necessary raise a case with Oracle via Metalink.
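If you are not sure where those dump directories live on a particular installation, the locations can be read from the standard 9i/10g dump destination parameters:

select name, value
from   v$parameter
where  name in ('user_dump_dest', 'background_dump_dest', 'core_dump_dest');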

A last point: even though I have said network problems are the least likely source of problems, keep an open mind!

Magic SANs

February 7, 2007 at 10:31 am | Posted in Architecture, Documentum and Oracle, Performance | Leave a comment

There is an excellent piece here about some of the choices to be made in configuring SANs for your application. The key point is that if your application, or part of it, has performance that is constrained by disk I/O then you should care about how that magic box of tricks is configured. This consideration implies that you should know something about the level of disk I/O your application requires!

In a similar vein I recall reading a presentation (possibly by Anjo Kolk) about the typical Docbase Admin/DBA conversation with the SAN administrator, something like:

DBAdmin: We need to setup a new docbase and database
SAN admin: Ok how much space do you need?
DBAdmin: 50Gb for the database, 1TB for the docbase
SAN admin: OK you can access the space here …

Not a mention of how many I/Os per second the application will require. Don’t be afraid to ask these questions.

By the way, SAME stands for Stripe And Mirror Everything, an Oracle approach to fully utilising disk resources. See this document; many of the disk performance details are not specific to Oracle.

What SQL is being used?

February 2, 2007 at 10:04 am | Posted in Documentum and Oracle, Performance, Troubleshooting | Leave a comment

What SQL is generated by the following DQL query:

select object_name from dm_document

Well the answer depends on a number of things:

  • What user I am
  • What version of Content Server I’m using
  • Certain server.ini parameters
  • etc

The most important one is the first one, and particularly whether the user is a superuser or not. If I am a superuser then the query will look something like this:

select     all dm_document.object_name 
from       dm_document_sp dm_document 
where    (    dm_document.i_has_folder = 1 
             and dm_document.i_is_deleted = 0)

But if I don’t have superuser privileges it will look something like this:

select     all dm_document.object_name 
from       dm_document_sp dm_document 
where      (    dm_document.i_has_folder = 1 
            and dm_document.i_is_deleted = 0) 
and        (  (   dm_document.owner_name = 'Robin East' 
               or dm_document.owner_name in ('administrator')
              ) 
            or exists (select    * 
                       from      dm_acl_sp, 
                                 dm_acl_rp 
                       where     dm_acl_sp.r_object_id = dm_acl_rp.r_object_id 
                       and       dm_document.acl_domain = dm_acl_sp.owner_name 
                       and       dm_document.acl_name = dm_acl_sp.object_name 
                       and       dm_acl_rp.r_accessor_permit >= 2 
                       and       (    dm_acl_rp.r_accessor_name = 'dm_world' 
                                   or (    dm_document.owner_name = 'Robin East' 
                                       and dm_acl_rp.r_accessor_name = 'dm_owner'
                                      ) 
                                   or dm_acl_rp.r_accessor_name = 'Robin East' 
                                   or (    dm_acl_rp.r_is_group = 1 
                                       and dm_acl_rp.r_accessor_name in ('administrator')
                                      )
                                 )
                       )
             )

All that extra stuff results in joins or subqueries on the dm_acl tables to enforce permissions; superusers always have read access to everything so there is no need for the additional predicates.

The upshot is: be very careful when tuning DQL queries by turning them into SQL. If you have a DQL that is slow for a non-superuser and you log in to Administrator as dmadmin to identify the SQL being generated you will not be using the same SQL!!!

Here are some ways to make sure you are getting the right SQL:

  1. login to DA as the user – you’ll need her password! Now you can run the query and use ‘Show SQL’ to get the right SQL
  2. If you know how, run the API trace,c,10,,SQL_TRACE from her session. In WDK apps you can do this by accessing the API tester component. All SQL statements generated for her session will appear in the user’s session log on the Content Server
  3. The following API commands can be run by a superuser in IAPI:

    getlogin,c,user (this returns a ticket for the user)
    connect,docbase,user,ticket
    run query
    ?,c,execute get_last_sql

  4. execute get_last_sql will give you the SQL for the DQL command just run; this is the command that is executed when you click 'Show SQL' in the DA query interface.
