New Way

From DB Optimizer
Jump to: navigation, search

Database Tuning


DB Optimizer's methodology grew out of impossible predicament presented by the defacto method of database tuning. The standard method was trying to collect 100% of the statistics a 100% time. Trying to collect all the statistics as fast as possible ends up putting load on the monitored database and creating problems. Stories of problems created by database monitoring products abound in the industry. In order to avoid putting load on the target database, performance monitoring tools have to collect less often as a compromise. Oracle compromised in 10g with AWR (their automated performance data collector) only running it once an hour because of the the performance impact. Not only is the impact on the monitored target high, but the amount of data collected is staggering, but the worst problem of all though, is the impossibility of correlating statistics with the sessions and SQL that created the problems or suffered the consequences.

The solution to collecting performance data required letting go of the old stodgy problematic paradigm of trying to collect as many performance counters possible as often as we could and instead freeing ourselves with the simple approach of sampling session state. Session state includes who the session is, what their state is (active, waiting, and if waiting, what they are waiting on) and what SQL they are running. The session sampling method was used by me occasionally over the years and I even wrote up a method to access this information without SQL on Oracle by reading the live database memory via C code. The session state method was officially packaged by Oracle in 10g when they introduced "Active Session History" (ASH). ASH is an automated collection of session state sampling. The rich robust data from ASH in its raw form is difficult to read and interpret. The solution for this was "Average Active Sessions" (AAS). AAS is a single powerful metric which measures the load on the database based on the ASH data. AAS data provided the perfect road map for what data to drill into . The main drill downs are "top sql", "top session" , "top event" and "top objects".

Other aggregations are possible based on the different dimensions in the ASH data.


Here is an example screen shot of the same batch job being run 4 times. Between each run performance modifications are made based on what we see in the in the profiling load chart:

Tuning a database.PNG Run:

1. "log file sync" so we moved the log files to a faster device. (you can see the checkpoint activity just after run 1 where we moved the log files)
2 "buffer busy wait" so we moved the table from a normal tablespace to an Automatice Segmenet Space Managed tablepace.
3."log file switch (checkpoint incomplete)" so we increased the size of the log files. (you can see the IO time spent creating the new redo logs just after run 3 )
4. the run time is the shortest and all the time is spent on the CPU which was our goal - ie to take advanteage of all the processors and run the batch job as quickly as possible.

Conclusion: With the load chart we can quickly and easily identify the bottlenecks in the database, take corrective actions, and see the results. In the first run, almost all the time is spent waiting, in the second run were we eliminated on bottleneck we actually spent more time - ie the bottleneck was worse. Some time this happens as eliminating one bottleneck cause great contention on the next bottleneck.(you can see the width of the run, the time it ran, is wider in run 2). By the third run, we are using more CPU and the run time is faster and finally by the 4th run all the time spent is on CPU, no waiting, and the run is the fastest by far. What's happening on the databases? Is the database Idle, working, bottlenecked?

When a bottleneck happens how can you know whether

  • It's an application problem
  • Machine is undersized
  • SQL requires optimization
  • Database is mis-configured

All of this can be easily determined form DB Optimzer's performance profiling screen:


Let's look at the components of the screen


The screen has 6 important parts

1. Databases
2. Average Active Sessions (AAS) Load of selected database
3. Maximum CPU line
4. Top SQL
5. Top Bottlenecks
6. Top Sessions

First, on top left, is a list of our databases we have registered.

The most important part of the screen is the Average Active Sessions (AAS) graph. AAS shows the performance of the database measured in the single powerful unified metric AAS. AAS easily and quickly shows any performance bottlenecks on the database when compared to the Maximum CPU line. The Max CPU line is a yardstick for performance on the database. When AAS is larger than the Max CPU line there is a bottleneck on the database. Bottleneck identification is that easy.

AAS or the average number or sessions active, shows how many sessions are active on average (over a 5 second range in DB Optimizer) and what the breakdown of their activity was. If all the users were running on CPU then the AAS bar is all green. If some users were running on CPU and some were doing IO, represented by blue, then the AAS bar will be partly green and partly blue.

The line "Max CPU" represents the number of CPU processors on the machine. If we have one CPU then only one user can can be running on the CPU at a time. If we have two CPUs then only 2 users can be on CPU at any instant in time. Of course users can go on and off the CPU extremely rapidly. When we talk about sessions on the CPU we are talking about the average number of sessions on CPU. A load of one session on the CPU, thus would be an average which could represent one uses who is consistently on the CPU or many users who are on the CPU for short time slices. When CPU becomes a resource bottleneck on the database we will the average active sessions in CPU state go over the Max CPU line. The number of sessions above the max CPU line is the average number of sessions waiting for CPU.

The Max CPU is a yardstick for performance on the database.

In the above chart, do you think it's an application problem or a machine resource problem?

In order to tell we have to find out where is that demand coming from. To find out where the demand is coming from we can look at Top SQL and Top Session tables below the load chart. In our case shown here the load of well distributed over all sql in Top SQL and all sessions inTop Session. There is no outlier or resource hog. In this case it's the machine that's underpowered. What does a case look like where we should tune the application?


In this case, again the CPU demand is more than the machine can supply but if we look at "Top SQL" we can see that the first SQL statement (with the large green bar) uses up much more CPU than any of the rest, actually 60%! If we could get it down to 10% CPU then we'd save 50% of the CPU usage on the machine ! Thus in this case it's worth our while to spend a day or week or even a couple weeks trying to tune that one SQL statement instead of buying a bigger machine.

Finally, how do we know when the database configuration is a problem? We know it's a configuration problem when we are seeing something other that CPU as the bottleneck in Top Bottleneck section. Here's an example:


In this case we can see the load is higher than the Max CPU line but the load is coming from brown colored bars and the the green CPU colored bars. If we look at Top SQL we see that there is only one SQL taking up almost all the load, but it's not because of CPU which would be a green bar, but some other color. What does this other color represent? We can look at the Top Bottleneck section and see that it is "log file switch (incomplete)" which basically means the log files are too small, ie the database is not correctly configured. This bottleneck was resolved simply by increasing the log size.

Tuning Example in 3 parts

Part 1: The Database is hanging! AKA "the application has problems"

I wonder if you can imagine, or have had the experience of the application guys calling with anger and panic in their voices saying "the database is sooo slow, you've got to speed it up."

What's your first reaction? What tools do you use? How long does it take to figure out what's going on?

Let's take a look at how it would work with DB Optimizer. When I get a call like this I take a look at the database with DB Optimizer:

Tune app slow 2cpua.PNG

I can clearly see that the database is not bottlenecked and there must be a problem on the application.

Why do I think it's the application and not the database? The database is showing plenty of free CPU in the load chart, the largest chart, on the top, in the image above. In the load chart, there is a horizontal red line. The red line represents the number of CPU's on the system, which in this case is 2 CPUs. The CPU line is rarely crossed by bars which represent the load on the database, measured in average number of sessions. The session activity is averaged over 5 samples over 5 seconds, thus bars are 5 seconds wide. The bars above fall mostly about 1 average active session and the bars are rarely green. Green represents CPU load. Any other color bar indicates a sessions waiting. The main wait in this case is orange, which is log file sync, ie waits for commits. Why is the database more or less idle and why are most of the waits we do see for "commit"? I look at the code coming to the database and see something like this:

insert into foo values ('a');
insert into foo values ('a');
insert into foo values ('a');
insert into foo values ('a');
insert into foo values ('a');
insert into foo values ('a');
insert into foo values ('a');

Doing single row inserts and committing after each is very inefficient. There is a lot of time wasted on network communication which is why the database is mainly idle, when the application thinks it's running full speed ahead, it is actually waiting mainly on network communication and commits. If we commit less and batch the work we send to the database, reducing network communications, we will run much more efficiently. Changing the code to

for i in 1..1000 loop
insert into foo values ('a');
-- commit;
end loop;

improves the communication delay and now we get a fully loaded database but we run into database configuration issues.

Part 2: It *is* the database (ie DBA get to work)

Tune four steps 2cpu annotated.PNG

In the above DB Optimizer screen, the same workload was run 4 times. We can see that the time (width of the load) reduced, and the percent of activity on CPU increased.


1. "log file sync" , the orange color, is the biggest color area, which means uses are waiting on commits, still even though we are committing less in the code. In this case we moved the log files to a faster device. (you can see the checkpoint activity just after run 1 where we moved the log files)
2 "buffer busy wait" , the burnt red, is the biggest color area. We drilled down on the buffer busy wait event in the Top Event section and the details tells use to move the table from a normal tablespace to an Automatic Segment Space Managed tablespace.
3."log file switch (checkpoint incomplete)" , the dark brown, is the largest color area, so we increased the size of the log files. (you can see the IO time spent creating the new redo logs just after run 3 )
4. the run time is the shortest and all the time is spent on the CPU which was our goal - ie to take advantage of all the processors and run the batch job as quickly as possible.

Part 3: It's the machine (rock paper scissors)

Now that the application is tuned and the database is tuned let's run a bigger load:

Tune cpu maxed.PNG

We can see that the CPU load is constantly over the max CPU line. How can we have a bigger CPU load than there are actually CPUs on the machine? Because, this actually means that the demand for CPU is higher than the CPU available on the machine. In the image above there are 2 CPUs on the machine but and average of 3 users who think they are on the CPU, which means that on average 1 users is not really on the CPU but ready to run on the CPU and waiting for the CPU.

At this point we have two options - in this case we are only running one kind of load, ie the insert. For inserts we can actually go even further tuning this insert and use Oracle's bulk load commands:

for i in 1..8000 loop
end loop;
INSERT INTO foo ( dummy )
VALUES ( MY_IDX(indx) );

But if this was an application that had a lot of different SQL and the SQL load was well distributed across the system then we'd have a case for adding more hardware to the system. Making the decision to add more hardware can be a difficult decision because in general the information to make the decision is unknown, unclear or just plain confusing, but DB Optimizer makes it easy and clear, which can save weeks and months of wasteful meetings and debates. For example


If we look in the bottom left, there is no SQL that takes up a significant amount of load, ie there is no outlier SQL that we could tune and gain back a lot of wasted CPU. We'd have to tune many many SQL and make improvements on most of them to gain back enough CPU to get our load down below the max CPU line. In this case, adding CPUs to the machine might be the easiest and most cost affective solution.


With the load chart we can quickly and easily identify the bottlenecks in the database, take corrective actions, and see the results. IN part 1 we had an application problem, in part 2 we had 3 database configuration issues and in part 3 we had a hardware sizing issue. In all 3 chapters DB Optimizer provides a clear and easy presentation of the data and issues making solutions clear.

SQL Tuning

Sql tuning.PNG

Finding and Tuning Problem SQL

DB Optimizer is targeted at finding problem sql in a running load with the profiler and then tuning that (or those) specific queries with the tuner.

It’s not efficient just to dump a bunch of procedure code into the tuner and then try and see if any of the SQL in the package or procedure are tunable. Most queries should, by default, run optimally on a database, so the goal of DBO is to tune those queries that for one reason or another are not optimally tuned by the database by default. The easiest way to find those queries is to identify them on a running system . They can be identified on a running system because they take up a lot of resources. If we find a resource intensive query then it’s worth the time to generate cases and analyze it for missing indexes to see if there is a way to tune it.