Installing D6 on Oracle RAC

June 12, 2008

It’s a tricky business keeping Release Notes up-to-date. I’ve just been browsing through the latest D6 SP1 Release Notes and my attention was caught by a small section about installing on Oracle RAC:

Installing with Oracle Real Application Clusters (121442)
If you are installing Content Server with Oracle Real Application Clusters (RAC), set
the value of the Oracle parameter MAX_COMMIT_PROPAGATION_DELAY to 0 (zero).
This value is required to ensure that the data that Content Server uses is consistent across
all Oracle nodes. Values other than zero are not supported.

I presume this has been in the various Content Server release notes for a while, and it would once have been important: using the default (or any other non-zero value) selects a commit scheme that can delay other Oracle nodes from seeing changes. A single Content Server session often uses more than one database session (if you use DF_EXEC_QUERY in a DFC query call you are asking Content Server to start a new Oracle session), and those sessions could be attached to 2 different Oracle RAC nodes, so the delay in seeing recently changed values could cause havoc.

Now I know what you’re thinking: since we would obviously prefer to have data available immediately, why would Oracle not use 0 as the default and why wouldn’t everyone just set it to 0 anyway? The answer of course is that there is a cost to be paid in performance; having to issue network calls to propagate information about a commit could be very costly (certain platforms seemed to have real problems with it), so in many cases the default setting was fine provided your Oracle application didn’t have functional problems.

However since Oracle 10g Release 2 this parameter is deprecated - Oracle now uses the ‘broadcast on commit’ behaviour implied by MAX_COMMIT_PROPAGATION_DELAY=0 automatically. The best overview of this change is given in this paper by Dan Norris. Since the Content Server in D6 is only certified for Oracle on 10g Release 2 the entry shown above in the Release Notes is no longer needed. In fact you could argue it is positively harmful. As they say, forewarned is forearmed.

By the way I stumbled on this bit of information whilst perusing the ‘Support by Product’ section on Powerlink. It is currently under beta and contains amongst other things a section for Documentum Content Server. It’s basically a portal type view of Content Server support information, bringing together support notes, whitepapers, documentation (there’s a very nice Top manuals section) and so on. I think it’s a brilliant idea and I urge everyone to have a look and provide feedback to the support team.


Documentum and Jython - part Deux

May 16, 2008

In my first post about the joys of Jython and Documentum I just showed some bare bones code to log in and perform an action. One of the things I like is the jython interpreter, where you can effectively run DFC line-by-line at the command prompt and see the effect. However all that typing can get a little tedious so it makes sense to wrap up some of the useful stuff into a module that can be imported into Jython.

First let’s see how we can wrap the DFC connection code into a function that can be called multiple times. Fire up the jython interpreter and enter the following:

>>> from com.documentum.com import DfClientX
>>> def connect(docbase, username, password):
...  cx=DfClientX()
...  c  =cx.getLocalClient()
...  li  =cx.getLoginInfo()
...  li.setUser(username)
...  li.setPassword(password)
...  s = c.newSession(docbase, li)
...  return s
...
>>>

The first line is the standard import of the DfClientX class from which we can dynamically create (directly or indirectly) most of the other DFC classes.

The second line is the way to create functions in Jython. The def keyword starts a function definition. It is followed by the function name and then the parameters in brackets - Jython is dynamically typed so we don’t need to specify types in the function definition. The line ends with a colon. This is the standard python/jython way of introducing a multi-line statement block; unlike java there are no curly brackets to delimit statement blocks. Following the function definition line each line of the function must be indented. Every indented line up to and including the return statement is part of the function (notice the interpreter changes the >>> to ...). In the interpreter the function definition is ended by a blank line - after this the interpreter returns to the >>>.

Now we can simply call the function like this:

>>> s1 = connect("mydocbase","dmadmin","dmadmin")
>>> s2 = connect("mydocbase","user1","user1pass")

to give us 2 sessions, one for dmadmin and one for user1.

Now we could use the session to create an object:

doc = s1.newObject("dm_document")
doc.setObjectName("my document")
doc.link("/cabinet1/folder2")
doc.save()

I think you get the idea!

But we don’t want to have to create the connect function every time, so copy all the code into a separate file:

from com.documentum.com import DfClientX

def connect(docbase, username, password):
    # create a DFC session for the given docbase and credentials
    cx = DfClientX()
    c = cx.getLocalClient()
    li = cx.getLoginInfo()
    li.setUser(username)
    li.setPassword(password)
    s = c.newSession(docbase, li)
    return s

Name it dctm.py (py is the standard python/jython script extension) and save it in the main Jython directory (where you installed jython, which in my case is c:\jython2.2.1). Now restart the jython interpreter and enter the following:

>>> import dctm
>>> s = dctm.connect("docbase2","dmadmin","dmadmin")

The import statement pulls in the code from the dctm.py script and makes the function available via the dctm namespace. Once we have the IDfSession object s we can again do some work:

doc2 = s.newObject("dm_document")
doc2.setObjectName("another document")
doc2.save()
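The object creation calls can be wrapped up in the same way as the connection code. What follows is only a sketch of a helper that could sit alongside connect in dctm.py; the function name create_document and its folder_path parameter are invented for illustration, and it relies on nothing beyond the newObject, setObjectName, link and save calls already shown:

```python
def create_document(session, name, folder_path=None):
    # 'session' is an IDfSession, e.g. the value returned by connect()
    doc = session.newObject("dm_document")
    doc.setObjectName(name)
    if folder_path is not None:
        # link() files the document in an existing cabinet or folder
        doc.link(folder_path)
    doc.save()
    return doc
```

After another import dctm, a call such as dctm.create_document(s, "yet another document", "/cabinet1/folder2") would then replace the three or four lines typed by hand.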

Enjoy!


50,000 and counting

May 14, 2008

16 months on, the blog has finally clocked up 50,000 hits! My original motivation for blogging was to be able to write something less formal and that took less time to write and format than the articles I was putting out on the Xense website. At the time there were very few Documentum blogs (Johnny Gee’s was probably the only one I noticed regularly) and I was keen to do something deeply technical in a similar vein to Jonathan Lewis (Oracle) or Mark Russinovich (Windows).

One of the things that has surprised me is just how much effort it is to keep finding the time and inspiration to write. Back in January 2007 I managed more than one article a week; these days I think I’m doing well if I can do a couple a month. Partly this is because at the start I had a number of small pieces of research that were already done and simply needed to be turned into words. These days I still have loads of ideas but so little time to follow up and do the research.

The focus has changed a little bit too. When I started I had been spending a lot of time knee-deep in object replication. Frankly I don’t like the technology and would be very hesitant to recommend it on a project. Part of the problem is the obscurity of the implementation. The serious guts of the workings are embedded deep into the Content Server C/C++ code so it’s not easy to work out what is going on when it fails unless you want to dive into the assembly-level debugger (which I have resorted to on occasion).

The other problem is the dump and load process that underlies it. Dump and load is simply too flaky for a reliable replication solution. I managed to find various ways of crashing the Content Server (which is catastrophic on the thread-based Windows implementation) which I wrote up for a client in a document called ‘Killing the Content Server’. I sent it to Documentum support too.

Here’s to the next 50,000!

BTW I’ve added in a blogroll entry for Andrew Binstock - an excellent blog covering all sorts of things Java Development related. Very honest, very open.


Documentum 5 Profiler

May 8, 2008

Xense Profiler v1.3, the Documentum 5 performance profiler, has now been officially released after an extended and successful beta period. During that time there have been no bugs or issues reported.

If you need to quickly analyse systems for performance problems Xense Profiler is the fastest and most convenient tool for the job. No need to import files into Excel for analysis; the Xense Profiler, dmclprof, analyses DMCL trace files and creates HTML-based reports that provide the information you need to diagnose system performance problems.

One of the new features of v1.3 is a Top Queries report. In previous versions you had to rely on scanning through the Query Summary report to find long-running queries; for large traces with lots of queries this could be inconvenient.

Top Queries reports the Top 20 longest-running queries. In this case longest-running means the queries taking the most time to complete. Since its inception Xense Profiler has calculated the true cost of a query. Most other approaches to performance analysis simply record the time taken to complete a DfQuery.execute() call; Xense Profiler aggregates the duration of the execute() call and all the corresponding next() calls as well. Only in this way can you be sure that you really have identified the long-running queries in your system. The documentation is now on-line with examples of the reports.
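To make the difference concrete, here is a rough sketch of the aggregation idea. It is not how dmclprof is implemented: the (query_id, call_name, milliseconds) tuples are a made-up stand-in for parsed trace lines, and the function names are hypothetical.

```python
def true_query_costs(calls):
    # calls: iterable of (query_id, call_name, milliseconds) tuples; this
    # layout is an assumption for illustration, not a real trace format
    totals = {}
    for query_id, call_name, millis in calls:
        if call_name in ("execute", "next"):
            totals[query_id] = totals.get(query_id, 0) + millis
    return totals

def top_queries(calls, limit=20):
    # longest-running first, ranked by the aggregated duration
    ranked = [(cost, query_id) for query_id, cost in true_query_costs(calls).items()]
    ranked.sort()
    ranked.reverse()
    return [(query_id, cost) for cost, query_id in ranked[:limit]]

sample = [
    ("q1", "execute", 500), ("q1", "next", 100),
    ("q2", "execute", 50), ("q2", "next", 1200), ("q2", "next", 1500),
]
```

In the sample data q2 has by far the quicker execute() call, yet once its next() calls are added in it costs 2750 ms against 600 ms for q1, so top_queries(sample) ranks it first. Looking at execute() alone would have pointed at the wrong query.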

If you are interested in using Xense Profiler for your systems you should check out the information on the Xense website.


Where is dmcl.ini in D6?

May 2, 2008

Always interesting to look at the search terms people are using when they reach my blog. One I noticed this morning is ‘documentum dfc 6.0 install dmcl.ini’. Looks like someone is installing D6 and wants to know where the dmcl.ini is.

The DMCL and the dmcl.ini have been part of Documentum since I started working with it 9 years ago, but D6 breaks all that. For the record there is no dmcl.ini in D6 - all the parameters that used to be in dmcl.ini now have equivalents in the dfc.properties.


Performance Profiling for D6

April 28, 2008

As many of you will be aware, the traditional approach to performance tuning Documentum prior to D6 was to generate and analyse a DMCL trace, but this option is no longer available in D6. As part of the effort to remove reliance on native code libraries (i.e. dmcl40.dll or equivalent Unix shared libraries) EMC has rewritten the DMCL layer in Java.

In effect the API no longer exists; there are no create,c,…/get,c,…/getfile,c,…/etc calls anymore. DFC code now calls directly into the RPC layer (now rewritten in Java as opposed to the native code Netwise implementation). API-aware utilities such as iapi and dmbasic still have support for API calls but in effect this is a layer on top of the DFC code (rather than underneath it as it used to be).

The standard way to analyse performance problems in D6 is to create a DFC trace and analyse that. Not only is this format very different from the DMCL trace format but there is also substantial change and improvement over the Documentum 5 DFC trace format. There are now around 34 different tracing parameters that can be set in the dfc.properties which give you a lot of control over what is output and in what format.

This week we have uploaded the beta version of the Xense Profiler for DFC. Xense Profiler for DFC is a port of the Xense Profiler, a Documentum 5 DMCL trace analyser. Xense Profiler for DFC is designed to work with D6 DFC trace files and produces performance profiling reports that allow you to identify the causes of D6 performance problems. Xense Profiler for DFC beta is available for download and is free for the duration of the beta period (until 31 July 2008). If you are already working on D6 products I would urge you to try it out; as an incentive all users who register for the download will receive a substantial discount off the list price when the production version is released.


Using Jython to create users

April 4, 2008

A while back Bex Huff posted about using Jython to script Documentum tasks. I referenced that post in one of the answers I gave on the forums; however, I had never used Jython at that point. I made a note to try it out and so this post represents the beginning of my investigation. Consider it a journey that, once started, possibly never finishes….

First up you should be able to install Jython on any platform that runs Documentum (well, any platform that runs DFC, which is more or less the same thing). You can get the lowdown on the Jython site but the quick install steps are:

1) Download the installer, typically a file like jython_installer-2.2.1.jar
2) Execute the jar to install (there are command line invocations if you need them; check on the web for details).
3) For convenience add the installation path (e.g. c:\jython2.2.1) to your PATH environment variable

All the instructions are based on a Windows platform for convenience. They should translate to *nix without too much bother; I may try this out on a handy copy of Red Hat to confirm (but don’t hold your breath for that post!).

You could now run the Jython command interpreter in the following way:

1) start a command prompt
2) type ‘jython’

The following code will create a user called inline_user (it assumes you have a JRE of 1.4.2 or greater and DFC installed):


from com.documentum.com import DfClientX

clientx = DfClientX()
client  = clientx.getLocalClient()
li      = clientx.getLoginInfo()
li.setUser("dmadmin")            # login as installation owner
li.setPassword("dmadmin")         # insert install owner password here
s0      = client.newSession("fnet1",li)

print "Connected as dmadmin"

print "now creating user ..."

u = s0.newObject("dm_user")
u.setUserName("inline_user")
u.setUserLoginName("inline_user")
# because we have to
u.setUserAddress("inline_user@somewhere.com")
u.setUserState(0,0)
u.setUserSourceAsString("inline password")
# IDfUser.setPassword seems to be missing from DFC5.3!!!
u.setString("user_password","password")
u.save()

print "user inline_user has been created"

This code can either be typed directly into the Jython command line interface (a great way to test out dfc code snippets) or can be copied into a script file e.g. create_inline.py and called from the command line like this:


jython create_inline.py

Hopefully the user has been created successfully and you should now be able to test the login from the command line:


li2 = clientx.getLoginInfo()
li2.setUser("inline_user")
li2.setPassword("password")
s1 = client.newSession("fnet1",li2)
print "connected as " + s1.connectionConfig.getLoginUserName()

OK so this is a very simple example. There is no use of command line parameters, no user input, no functions and so on. Hopefully I’ll get round to further posts that expand and show examples of this. In the meantime enjoy.
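As a taste of where that could go, the hard-coded docbase and credentials could be taken from the command line instead. The following is a rough sketch only: the argument order and the parse_args name are assumptions made for illustration and are not part of the script above.

```python
import sys

USAGE = "usage: jython create_inline.py <docbase> <user> <password>"

def parse_args(argv):
    # argv[0] is the script name, followed by docbase, user name and password
    if len(argv) != 4:
        raise ValueError(USAGE)
    return {"docbase": argv[1], "user": argv[2], "password": argv[3]}

# In the script itself this would be used along the lines of:
#   args = parse_args(sys.argv)
#   li.setUser(args["user"])
#   li.setPassword(args["password"])
#   s0 = client.newSession(args["docbase"], li)
```

Raising an error on the wrong number of arguments means a mistyped command line fails before any session is opened.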


Some thoughts about availability

February 29, 2008

Well so far 2008 has been incredibly busy. I’ve been involved with putting a major project live whilst also completing work on the new Xense Profiler for DFC. What with trying to have a life outside work that has left me with very little time to look at the support forums or to post here. Hopefully I’ll have a bit more time from March onwards.

Anyway February was looking a little empty so here’s a small thought. I’ve been spending time recently thinking about system availability. Now traditionally when you ask systems people to start thinking about these sorts of issues they immediately start thinking of resilience, load-balancing and clustering solutions. Now these definitely have their place but it’s useful to step back a bit and think about what we are trying to achieve.

What is availability?

If you look for a definition in a text book or somewhere on-line you are likely to find something similar to the following:

Availability is a measure of the continuous time a system is servicing user requests without failure. It is often measured in terms of a percentage uptime, with 100% being continuously available without failure.

In reality 100% isn’t possible and you will see requirements quoted in terms of 99%, 99.9%, 99.99% and so on. The mythical uptime requirement is usually the ‘5 nines’, 99.999%, which works out at around 5 minutes of downtime per year. If you find yourself being asked for this then the requesting department had better have deep pockets.
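The arithmetic behind those figures is easy to check. A minimal sketch, assuming a 365-day year and counting any outage at all (planned or not) against the target:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def downtime_minutes_per_year(availability_percent):
    # permitted downtime is the unavailable fraction of the year
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR

for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print("%s%% allows %.1f minutes of downtime per year" % (target, minutes))
```

So 99% leaves room for about 3.65 days of downtime a year, whereas each extra nine cuts the allowance by a factor of ten, down to a little over 5 minutes at 99.999%.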

Causes of unavailability

Behind this apparently clear and simple definition are a load of questions. If you think about it there are all sorts of reasons why a service could be unavailable:

  • Failure of a hard disk
  • Network failure
  • Operating system crash
  • Software bug
  • Operator error (I usually call this the ‘del *.*’ problem)

Which of these could be protected by resilience? The first 3 could probably be solved by:

  • hardware resilience (RAID)
  • load-balancing and resilient network infrastructure
  • clustering

But what about the last 2? Most of the schemes mentioned above operate below the application layer. So problems like software bugs are not likely to be solved by load-balancing or clustering. In general these sorts of problem need to be addressed by a monitoring and alerting system.

Operator errors, of course, are rather more difficult to cope with. Hopefully you have an adequate Disaster Recovery procedure that minimises the damage although in certain situations you are likely to lose some data. Even so typical Disaster Recovery procedures usually start in terms of hours possibly extending for days in the worst case scenario. As usual the more money you spend the lower the impact on the availability target.

So which of these is the most common? I would bet that the last one (the one it is most difficult to successfully protect against) is far more common than you might think. This is certainly the view of Hennessy and Patterson in their book Computer Architecture: A Quantitative Approach.

Parting words

Well it’s nearly my dinner time so I have promised my long suffering family to finish here. When presented with those bland ‘the system must be available for xx.xxx%’ requirements make sure you (and more importantly your customer/business) realise the implications of what they are asking for.


Troubleshooting agent_exec garbage collection

January 17, 2008

There seem to be more and more posts on the forums about jobs ‘stuck’ in the Running state and I have been investigating this problem for a client recently so I thought I would summarise some of the troubleshooting techniques I use. This posting expands on the article I wrote a few years ago about agent_exec.

The problem is usually expressed in the form of ‘DA shows my job is running but I know it’s not’. First of all DA shows a job as ‘Running’ whenever it finds a job whose a_special_app attribute is set to ‘agentexec’. Since agent_exec sets this attribute when it starts and clears it when the job has finished, under normal circumstances this is a quite accurate reflection of whether a job is running or not.

However if the agent_exec processes are interrupted before clearing the attribute (if the box is rebooted or the content server hangs for instance) then the job object can be left with a_special_app = ‘agentexec’ and DA shows the job as running.

Of course the agent_exec attempts to deal with such a situation. Every time it wakes up to perform some processing it first runs a ‘garbage_collect_jobs’ routine. You won’t see much evidence of this in the logs unless you turn on agent_exec tracing (see my job scheduler article for details on how to do this). You will get the following lines when there is nothing to garbage collect:

Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] garbage_collect_jobs
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec:  execquery,s0,F,
     SELECT ALL   r_object_id, a_last_invocation, ...
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_get:  getlastcoll,s0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_next:  next,s0,q0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec:  close,s0,q0
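With tracing on the log grows quickly, so it can help to pull out just the garbage collection passes. Below is a small sketch that assumes every entry follows the layout of the lines above (timestamp, then [AGENTEXEC number], then the message) and that month and day names are in English; the function names are invented for illustration, and the number in brackets is treated simply as an opaque identifier.

```python
import re
import time

# matches lines such as:
# Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] garbage_collect_jobs
LINE = re.compile(r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) \[AGENTEXEC (\d+)\] (.*)$")

def parse_line(line):
    # returns (timestamp, ident, message), or None for continuation lines
    m = LINE.match(line)
    if m is None:
        return None
    stamp = time.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
    return stamp, int(m.group(2)), m.group(3).strip()

def garbage_collect_times(lines):
    # timestamps of every garbage_collect_jobs pass found in the log
    result = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and parsed[2] == "garbage_collect_jobs":
            result.append(parsed[0])
    return result
```

Feeding it the lines of agentexec.log gives one timestamp per wake-up, which makes it easy to see whether agent_exec is still cycling at all.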

Basically agent_exec runs the following query:

SELECT ALL
              r_object_id, a_last_invocation,
              a_last_completion, a_special_app
FROM          dm_job
WHERE ( (     (a_last_invocation IS NOT NULLDATE)
             AND (a_last_completion IS NULLDATE))
             OR  (a_special_app = 'agentexec'))
AND     (i_is_reference = 0 OR i_is_reference is NULL)
AND     (i_is_replica = 0 OR i_is_replica is NULL)

If jobs are returned from this query and agent_exec cannot match the job with an existing running job it will clean up the job object, unsetting a_special_app and setting a_last_invocation to the current time.

Here is some typical trace output in the agentexec.log file when I set the dm_LogPurge a_special_app attribute to agentexec.

This output shows that this is the source of the infamous message:
Detected while processing dead job dm_LogPurge: The job object indicated the job was in progress, but the job was not actually running. It is likely that the dm_agent_exec utility was stopped while the job was in progress.

DMCL tracing dm_agent_exec

Examining the agentexec trace is usually enough to figure out where the problem lies; however, in extreme cases it is useful to look at the dmcl trace for the agentexec process to further troubleshoot issues. In principle you can do this by setting the dmcl.ini trace_file parameter to an existing directory on the Content Server. However this has the disadvantage of turning on tracing for all dmcl processes on the Content Server, i.e. all jobs and methods.

What we really want to do is isolate the agentexec process from all others and in this section I tell you how. I present the steps along with explanations for a typical Windows server. The same principle applies to *nix servers usually with a suitable change of folder paths.

First force the agent exec to stop. You can do this by killing the main agent_exec process repeatedly. The Content Server will detect that the agent exec dies and try and restart it, however there is a limit to the number of times this will happen (seems to be 5 by default). Eventually you get the following message in the content server log and the dm_agent_exec stays dead:

Thu Jan 17 13:35:37 2008 984000 [DM_SESSION_W_AGENT_EXEC_FAILURE_EXCEED]warning: “The failure limit of the agent exec program has exceeded. It will not be restarted again. Please correct the problem and restart the server.”

Copy the agent_exec executable to a separate directory. Copy the program file %DM_HOME%\bin\dm_agent_exec.exe to a new directory e.g. c:\Documentum\agentexec.

Copy the dmcl.ini. Copy the main dmcl.ini file in c:\windows to c:\Documentum\agentexec. Now edit the file and add the following lines:


trace_level = 10
trace_file = c:\Documentum\agentexec

We are going to take advantage of the fact that the first place the dmcl looks for the dmcl.ini is in the current working directory.

Start the agent_exec from the command line. Use the following syntax:

dm_agent_exec -docbase_name docbase  -docbase_owner dmadmin -trace_level 1

Agent exec logging and trace output will continue to appear in the %DOCUMENTUM%\dba\log\agentexec\agentexec.log, however a number of dmcl trace files will also be created in the C:\Documentum\agentexec directory. One of these (probably the largest) will be the dmcl trace for the main agent_exec process; remember agent_exec works by forking off a new dm_agent_exec process to manage each running job - each of these processes will have its own dmcl trace file.
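If a lot of jobs have run, finding the main trace by eye gets tedious. A quick sketch that lists the files in a directory largest first; the function name is invented, and because the naming of the trace files is not covered here it makes no attempt to filter by name:

```python
import os

def files_by_size(directory):
    # returns (size_in_bytes, file_name) pairs, largest first
    sized = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            sized.append((os.path.getsize(path), name))
    sized.sort()
    sized.reverse()
    return sized
```

Calling files_by_size(r"c:\Documentum\agentexec") will also list the copied dm_agent_exec.exe and dmcl.ini, so skip over those when looking for the largest trace.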

When you have finished tracing the agentexec you will need to kill the command line process and restart the Content Server (if anyone knows how to force the content server to restart the agentexec after the failure limit has been reached I’d love to know).

Conclusion

With a clear understanding of how agent_exec works and with the trace output available it should be possible to troubleshoot and resolve just about any job scheduler related problem.


Xense Profiler v1.3 beta

December 19, 2007

Xense Profiler v1.3 beta is now on the Xense website for download by existing customers. Prospective customers can purchase Xense Profiler from the Xense website or via our reseller Dell ASAP.

There are a number of changes in 1.3 but the key ones are:

  • A new Top Queries section on the summary report. This lists the ten longest-running queries (using the true query duration).
  • Performance improvements especially for parsing dmcl 5.3 traces

The Top Queries report is a great addition to the summary report. The Query Summary is excellent for getting an understanding of all the queries being issued by a Documentum application; it allows you to spot patterns in the queries being run. However when your application is suffering from 1 or 2 slow-running queries the Top Queries report allows you to quickly home in on the culprit.

The performance improvements are much needed when processing 5.3 traces. If you examine 5.3 dmcl trace files you will find there is a lot of extra debug type information whenever a collection object is transferred from the Content Server to the client. A bug in Xense Profiler v1.2 caused excessive CPU usage whenever the new 5.3 trace lines were being processed.