I recently worked with a client to resolve a significant performance issue related to painfully slow queries in an Oracle database. One of the queries that they wished to execute was taking over five hours to run.
This is an Oracle 10g database running on a Solaris platform. The server has 8 processors so there is plenty of processing power. The database is very large with a very high insertion rate. The database is about 10 Terabytes and most of the data gets inserted into one large table and is partitioned by date, with a few partitions per day. Therefore the query, which specifies a small date range should only need to look in one or two partitions, so queries should be relatively fast.
Before I became involved, the in-house dba looked at a number of the typical issues. He confirmed that there were appropriate indexes on the table, and that the query was expected to access only one partition. The cause of the significant execution time was not clear.
When I became involved, I was given limited access to the server. I was not given DBA access but I did have select access to the catalog tables so I was able to poke around a bit.
I also received execute privilege to dbms_workload_repository so I could create AWR snapshots and reports. Fortunately, the client had set up hourly snapshots so I was able to select snapshots before and after the query ran. The resulting report is where I found the clue to one cause of the problem.
I ran a report for an 9 hour period during which the query executed.
The top wait event in the AWR report for the day was log sync waits. These were very high, 1,104,069 with a total wait time of 72,781 seconds.
There are a number of causes of log file wait events. Some of these are described in this excellent post.
However, in this case, the cause was something else.
Because of a similar situation that I saw in the past with very high insertion rates, I thought that frequent log file switches may be the cause. I checked v$loghist and found that the logs were switching every five seconds! The optimal frequency of log switches is twenty minutes so this clearly a problem.
The log size was set at 131M which was fine in the past when the insertion rates were lower, but as the rates grew, the logs were never increased to handle the higher rates.
We first increased the logs to 1G and the log switches dropped to every 2-3 minutes. Much better but still far too high. We increased them again to 4G and now the logs switch every 30-40 minutes. This is higher than it needs to be and we will now reduce the size a bit in order to reduce the recovery time in the event of instance failure.
After this change, the log file sync rates disappeared from the top wait events and the system performance has improved significantly. The query that was taking 4-5 hrs is now taking about 3 minutes.
I cannot be certain that insufficient log file size was the only cause of the performance problems because other steps were taken simultaneously including refreshing statistics which can have a significant impact as well.
However, I am certain that the insufficient log size was the cause of the log file sync waits And with a total of 72,781 seconds of wait time over nine hour period (that is about 2.25 hours of wait per processor if it is distributed evenly), it very likely had a significant impact on overall system performance as well.