Tags: all_rows, documentum, first_rows, optimizer mode, oracle
This post is inspired by a number of recent occasions where I have been asked for advice on what setting should be used for the Oracle optimizer_mode parameter. It is an expanded version of a posting I made to this thread in the EMC Documentum support forums.
In the beginning …
Once upon a time Oracle only had a RULES-based optimizer. Given a query it would work out what indexes to use and what order to join tables using a set of rules. The rules were based on features of the tables (such as what indexes they had) and the elements of the query (e.g. columns used in the WHERE clause). It didn’t make use of statistics about how many rows and data blocks each table and index consumed.
With version 8 (or maybe somewhere during the version 7 product cycle can’t remember) Oracle introduced the Cost-Based Optimizer (CBO). CBO uses statistics on tables, indexes and other objects and processes this information to model the expected resource cost of different ways of executing the query. For even moderately complex queries there are literally thousands of combinations (called query execution plans) of table join methods, table access methods and join orders. The CBO simply chooses the one with the lowest cost according to its model.
The CBO is the reason why up-to-date statistics are absolutely essential. Irrespective of whether you use Documentum’s own dm_UpdateStats job, Oracle’s stats gathering job or a custom script, you need to have statistics that reflect the volume and distribution of the data. Where row volumes are volatile, statistics will need to be updated with sufficient frequency to ensure the CBO has the information necessary to make the correct decision. If you are using Oracle 10 or 11 for your Documentum implementation you should be using the CBO.
In principle this is a brilliant and sophisticated piece of software however like all models it doesn’t always represent the ‘truth’ and can make ‘mistakes’ when your data isn’t consistent with the assumptions the CBO makes.
The CBO has a number of different optimizer modes as of 10G:
- FIRST_ROWS_n (where n can be 1, 10, 100)
ALL_ROWS is the default option and the CBO tries to find an execution path that minimizes the cost in terms of CPU and disk io to deliver all the rows.
FIRST_ROWS_1/FIRST_ROWS_10/FIRST_ROWS_100 try and find an execution path that returns the first 1/10/100 rows quickest even if the overall resource cost of returning all the rows is higher than ALL_ROWS cost. This sounds like it should be faster for things like WDK screens returning the first 50 objects in a large folder but the reality is more sobering.
The basic strategy for returning the first rows quickly involves avoiding blocking operations such as sorting or joining tables using hash joins. For example, suppose I have the following DQL:
select * from dm_sysobject where any i_folder_id = ‘0b00000180000111’ order by object_name
In Oracle dm_sysobject is implemented as a view joining the tables dm_sysobject_s and dm_sysobject_r. i_folder_id is on dm_sysobject_r and will have an index created during the repository setup. object_name is on dm_sysobject_s (also with an index). The quickest way to execute this query – assuming i_folder_id is a relatively selective predicate which would be the case for most Documentum datasets – is:
- Retrieve all the dm_sysobject_r rows that satisfy the predicate using the index on i_folder_id
- join to dm_sysobject_s
- sort the result
Clearly we would have to execute the whole query before we can sort it so in effect we have to wait almost the entire execution time to get the first row (ie sorting is a blocking operation). However all the other rows follow quickly afterwards.
What is the resource cost of this execution plan? Well there is a range scan on the i_folder_id index but assuming that i_folder_id is selective retrieving all the rows is relatively cheap. Each row is then matched to the corresponding row in dm_sysobject_s using r_object_id – each match is very quick as the table is ‘probed’ using a unique index on r_object_id. The end result is likely to be 4 or 5 blocks visited for each row returned. Returning 100s of rows in this way would typically take a fraction of a second.
The only way to avoiding the blocking sort operation is to drive the query through an index on the object_name (the ‘order by’ criteria). To do this we would:
- Scan the whole index on dm_sysobject_s.object_name to retrieve all the dm_sysobject_s rows in object_name order
- For every row retrieved a join is made to the corresponding row in dm_sysobject_r and checked to see if it meets the i_folder_id predicate
- If it does meet the predicate we can immediately return the row to the caller before moving on to find the next row
We do not have to execute the whole query before returning the first row(s); there is no blocking operation.
However intuitively this should seem much more work for anything other than the smallest system. dm_sysobject_s is usually 100,000s of rows for small systems and can grow to 100Ms in large systems. Whilst the size of the index on object_name is a fraction of the size of the table itself it will still be a significant fraction. Scanning the whole or a sizeable part of the index can be time consuming. Even worse we have to visit a block in dm_sysobject_r for every row returned by dm_sysobject_s!
This problem arises whenever the attribute in the order by clause is on a different table to the attribute in the selective predicate. For even moderately-sized repositories such an execution plan would seem sluggish to the end user. This is typical of the sorts of queries produced by WDK folder navigation components.
In such cases FIRST_ROWS_n typically comes up with identical execution plans compared to ALL_ROWS.
CHOOSE is an old optimizer mode that used rule-based optimizer if there were no statistics on any of the tables in the query, and ALL_ROWS if at least one table had statistics. Quite useful during the transition era between RBO and CBO but now deprecated.
FIRST_ROWS – this was an early attempt to provide the functionality now accessible via FIRST_ROWS_n. CBO would run it’s model as usual but only consider certain combinations in an attempt to provide execution paths with quick response time.
One of the heuristics it seems to use is the avoidance of blocking operations if possible. For example if you process the query discussed above using the FIRST_ROWS optimizer setting it will probably take the bad non-blocking path.
FIRST_ROWS is deprecated by Oracle in 10 and 11 and I would advise people to never, ever use this mode (despite the fact there is at least one posting on the contology Documentum forum that suggests a performance improvement by using FIRST_ROWS!).
- Use ALL_ROWS or FIRST_ROWS_10
- Make sure you are updating statistics on database objects frequently
- Never use FIRST_ROWS
- Don’t confuse FIRST_ROWS_10 with FIRST_ROWS, there’s a big difference
- Avoid using RULE unless you happen to be on a really old Documentum platform such as EDMS98 or 4i (and therefore using an old version of Oracle)
Tags: oracle, startup, troubleshooting
This is a very windows specific solution however similar things could occur on *nix
This is a problem I ran into while investigating an oracle problem. I had a VMWare image that I used to run Oracle and Content Server. Content Server had always run fine on it whenever I switched the VM on. Then this morning it didn’t. Coincidentally I had just installed a new Oracle home (ie installation) to test out some issues to do with a particular Oracle patch version. Too much of a coincidence? Of course.
The problem actually arose when I subsequently tried to create a new docbase for an entirely different investigation. I started the docbase config program entered all the information until I got to the bit where it takes the database password and tests the database connection – FAILURE.
I did a couple of tests, first of all start up SQL Plus and try and connect to the database, in this case the database connect identifier is DB1:
SQL> connect system/@DB1
ORA-12154: TNS: could not resolve the connect identifier specified
OK so SQL Plus seems to have a similar problem to the Content Server config program. So What happens when you type connect …/…@DB1? Well assuming you are using the standard TNSNAMES name resolution it looks for the file tnsnames.ora in the %ORACLE_HOME%\network\admin folder and uses that to convert DB1 into a request to the TNSListener.
The problem was I hadn’t defined ORACLE_HOME – never needed to – so presumably it was using a default value that had now changed after the new oracle home installation.
So I set an explicit oracle home environment variable:
and that fixed SQL Plus. By opening up the control panel System app I set the environment variable for the dmadmin user so now the configuration program , and all the docbases could start.
Update: every Documentum on Oracle installation I have seen has used TNSNAMES, using something like Oracle Internet Directory (OID) is not a good idea according to support note esg97155.
Tags: documentum, oracle, rac
It’s a tricky business keeping Release Notes up-to-date. I’ve just been browsing through the latest D6 SP1 Release Notes and my attention was caught by a small section about installing on Oracle RAC:
Installing with Oracle Real Application Clusters (121442)
If you are installing Content Server with Oracle Real Application Clusters (RAC), set
the value of the Oracle parameter MAX_COMMIT_PROPAGATION_DELAY to 0 (zero).
This value is required to ensure that the data that Content Server uses is consistent across
all Oracle nodes. Values other than zero are not supported.
I presume this has been in the various Content Server release notes for a while and it would have been important as using the default (or any other value here) uses a commit scheme that can delay other Oracle nodes from seeing changes. Since a single Content Server session often uses more than one database session (if you use DF_EXEC_QUERY in a DFC query call you are asking Content Server to start a new Oracle session) and those sessions could be attached to 2 different Oracle RAC nodes the delay in seeing recently changed values could cause havoc.
Now I know what your thinking, since we would obviously prefer to have data available immediately why would Oracle not use 0 as the default and why wouldn’t everyone just set it to 0 anyway? The answer of course is that there is a cost to be paid in performance; having to issue network calls to propagate information about a commit could be very costly (certain platforms seemed to have real problems with it), so in many cases the the default setting was fine provided your Oracle application didn’t have functional problems.
However since Oracle 10g Release 2 this parameter is deprecated – Oracle now uses the ‘broadcast on commit’ behaviour implied by MAX_COMMIT_PROPAGATION_DELAY=0 automatically. The best overview of this change is given in this paper by Dan Norris. Since the Content Server in D6 is only certified for Oracle on 10g Release 2 the entry shown above in the Release Notes is no longer needed. In fact it you could argue it is positively harmful. As they say, forewarned is forearmed.
By the way I stumbled on this bit of information whilst perusing the ‘Support by Product’ section on Powerlink. It is currently under beta and contains amongst other things a section for Documentum Content Server. It’s basically a portal type view of Content Server support information, bringing together support notes, whitepapers, documentation (there’s a very nice Top manuals section) and so on. I think it’s a brilliant idea and I urge everyone to have a look and provide feedback to the support team.