One of my colleague asked me about how he can write better SQL. Despite he is a java developer, I appreciate his will to understand database. When I said, “I really can not teach you this in ten minutes”, he lost his will. Any way, Lets talk about 10053 trace file. Oracle dumbs its execution path of optimizer on runtime to disk. It is a human readable document, but not so “user-friendly”. No problem, I like “developer-friendly” programs. If I want to join two tables in SQL, what will oracle do? Well it has some abilities to join, compares these ways, and selects the best execution path. Once upon a time, in the ancient days when I was learning html-16 years old, Oracle used to look at tables and look at SQL and without making a comparison between its execution paths, picks up one of the execution plans based on a decision tree, called RuleBasedOptimizer. Now it is dead, CostBasedOptimizer is the king. 10053 has many text about behavior of CBO. There are 3 main ways to join tables.
First is nested loop. For each row in first table, go and find a matched pair in second table. How we reach each tables? Well, it really depends on table type, index type, data characteristic, clustering factor, using histogram or not, update or out of date statistical data about tables, server parameters like, having asynchronous I/O, single datablock read count, multiple datablock read count, current workload etc… Nested loop is commonly used way to join tables.
Hashing bases on regular hash functions. Hash function is an irreversible function, has O(1) complexity. If we compare hashing with nested loops, NL has O(m*n) complexity (m = row size of first dataset from first table, n = same for second table, in oracle terminology call it cardinality). Oracle reads first dataset creates a hash table, it may have collusion. And read second dataset, tries to hit hash table with second datasets values. When a hit occurs makes a exact comparision between values that have the same hash value. That is all, but the real problem is size of hash table. “OK, lets do it bigger without collusion”, no databases do not serves to one client, we need to it using least resource as we can, for happy end users. If given hash table size for a dataset does not enough to hold each dataset, oracle dumps partially hast table and datasets to disk and acts like an acrobat. One may say “Hashing has O(1) access, however NL has O(m*n), why do we need NL?”. Because hashing may use much more CPU and memory. O() thing is academic, forget about it when you graduate from collage.
Merge is a basically a NL with sorting. In fact it is all about sorting. We take the first dataset and sort it, and vice-versa. After sorting each dataset it is much more easier to join. It is not needed to do a sort when optimizer chooses to use index access to read a join column, because indexed leafs are already sorted on disk. Also sorting of each dataset do not have to be done in a serial fashion. Like hash table area of hashing, merging makes use of sort area. Of course same problems and similar solutions applied for size of sort area.
I will use an example that bases on an example by Jonathan Lewis (He gave me permission about that by email). Lets create two tables. (I use Oracle XE with very default parameters)
drop table build_tab purge;
drop table probe_tab purge;
create table probe_tab
as
select
10000 + rownum id,
trunc(dbms_random.value(0,5000)) n1,
rpad(rownum,20) probe_vc,
rpad(‘x’,500) padding
from dual connect by level <= 15000;
alter table probe_tab add constraint pt_pk primary key(id);
create table build_tab
as
select
rownum id,
10001 + trunc(dbms_random.value(0,5000)) id_probe,
rpad(rownum,20) build_vc,
rpad(‘x’,500) padding
from all_objects
where rownum ‘HR’, tabname => ‘probe_tab’);
begin
dbms_stats.gather_table_stats(ownname => ‘HR’, tabname => ‘build_tab’);
dbms_stats.gather_table_stats(ownname => ‘HR’, tabname => ‘build_tab’);
end;
/
I gathered table stats, oracle now have the chance of tasting the data. With the help of analysis of tables, CBO have updated knowledge about join operation.
SQL> select name, display_value from v$parameter where name like ‘%block%’;
NAME DISPLAY_VALUE
——————————————————————————– ——————————————————————————–
db_block_buffers 0
db_block_checksum TRUE
db_block_size 8192
db_file_multiblock_read_count 128
db_block_checking FALSE
SQL> select name, display_value from v$parameter where name like ‘%area_size%’;
NAME DISPLAY_VALUE
——————————————————————————– ——————————————————————————–
create_bitmap_area_size 8388608
bitmap_merge_area_size 1048576
hash_area_size 131072
sort_area_size 65536
workarea_size_policy AUTO
SQL> select name, display_value from v$parameter where name like ‘%pga%’;
NAME DISPLAY_VALUE
——————————————————————————– ——————————————————————————–
pga_aggregate_target 190M
These are some of my parameters. db_file_multiblock_read_count, db_block_size dictates oracle to choose multiblock read or single block read. Lower multiblock read cost means we will see more full table scans. Blocks are atoms of Oracles. When oracle want to do I/O it will be at least the size of a db_block_size. Any change in these parameter will change everything in system.
Once upon a time workarea_size_policy is not auto. If it is not auto, hash_area_size and sort_area_size will dictate hashing and sorting area sizes. No matter of how loaded is your system, oracle will use these specified area sizes. Does not this sound a bit hazardous. Without any dependency of connected client count of server, each client will have the same size of area, even their operations are not the same. In fact it is possible to override these paramters in session. It is 21th century, we have workarea_size_policy=auto parameter. This means “keep sum of pgas near pga_aggregate_target value”. It is an upper limit, an individual pga area can have %5 or %10 (I always confuse), a user may be using parallel execution, each slave is limited to this percentage, of course there is a limit for slave numbers. So, it is vital to know about these parameters before reading trace file.
alter session set events ‘10053 trace name context forever, level 2′;
alter session set tracefile_identifier = ‘compare’;
Here comes “dump the trace file” command. As you see it is session level, all execution will be dumped to %ORACLE_HOME%\app\oracle\admin\XE\udump directory.
– 543
select /*+ USE_NL (bt pt) */ *
from build_tab bt, probe_tab pt
where bt.id between 1 and 500
and bt.id_probe = pt.id;
– 367
select *
from build_tab bt, probe_tab pt
where bt.id between 1 and 500
and bt.id_probe = pt.id;
– 1293
select /*+ USE_MERGE (pt bt) */ *
from build_tab bt, probe_tab pt
where bt.id between 1 and 500
and bt.id_probe = pt.id;
these are three queries I mentioned above using NL, Hash and Merge join strategies. Lets have a look at some parts of trace output:
SQL:******* UNPARSED QUERY IS *******
SELECT /*+ USE_NL (“PT”) USE_NL (“BT”) */ “BT”.”ID” “ID”,”BT”.”ID_PROBE” “ID_PROBE”,”BT”.”BUILD_VC” “BUILD_VC”,”BT”.”PADDING” “PADDING”,”PT”.”ID” “ID”,”PT”.”N1″ “N1″,”PT”.”PROBE_VC” “PROBE_VC”,”PT”.”PADDING” “PADDING” FROM “HR”.”BUILD_TAB” “BT”,”HR”.”PROBE_TAB” “PT” WHERE “BT”.”ID”>=1 AND “BT”.”ID” best plan cost
***********************
(newjo-stop-1) k:0, spcnt:0, perm:2, maxperm:2000
*********************************
Number of join permutations tried: 2
*********************************
(newjo-save) [1 0 ]
Final – All Rows Plan: Best join order: 1
All these rows means, Oracle compared left to right join and right to left. Outer table means use this table as driver table. Iterate over its rows. So oracle makes a single access to this table. Here best outer table is BUILD_TAB with single table access cost of 42.11. Driven table will be accessed driver datasets row cardinality times. Without using index, accessing cost to PROBE_TAB + 42.11 is 161367.82, on the other hand if we use index making a unique index access cost is 1, accessing 500 separate rows is 500. So 500 + 42 = 542, far far far from 161367.82 and “Best join order: 1″.
============
Plan Table
============
————————————————-+———————————–+
| Id | Operation | Name | Rows | Bytes | Cost | Time |
————————————————-+———————————–+
| 0 | SELECT STATEMENT | | | | 543 | |
| 1 | NESTED LOOPS | | 500 | 518K | 543 | 00:00:07 |
| 2 | TABLE ACCESS BY INDEX ROWID | BUILD_TAB| 500 | 259K | 42 | 00:00:01 |
| 3 | INDEX RANGE SCAN | BT_PK | 500 | | 3 | 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID | PROBE_TAB| 1 | 530 | 1 | 00:00:01 |
| 5 | INDEX UNIQUE SCAN | PT_PK | 1 | | 0 | |
————————————————-+———————————–+
Predicate Information:
———————-
3 – access(“BT”.”ID”>=1 AND “BT”.”ID”2->5->4->1->0. Pruning occurs on index access times (3) and (5).
Current SQL statement for this session:
– 367
select *
from build_tab bt, probe_tab pt
where bt.id between 1 and 500
and bt.id_probe = pt.id
Here is the second SQL, I did not used any hint, because CBO selects hashing for this SQL. Same SINGLE TABLE ACCESS PATHs and BASE STATISTICAL INFORMATION about tables are printed.
HA Join
Outer table:
resc: 42.11 card 500.10 bytes: 530 deg: 1 resp: 42.11
Inner table: PROBE_TAB Alias: PT
resc: 324.52 card: 15000.00 bytes: 530 deg: 1 resp: 324.52
using dmeth: 2 #groups: 1
Cost per ptn: 0.84 #ptns: 1
hash_area: 0 (max=0) Hash join: Resc: 367.48 Resp: 367.48 [multiMatchCost=0.00]
HA cost: 367.48
resc: 367.48 resc_io: 364.00 resc_cpu: 15939091
resp: 367.48 resp_io: 364.00 resp_cpu: 15939091
Best:: JoinMethod: Hash
Cost: 367.48 Degree: 1 Resp: 367.48 Card: 500.10 Bytes: 1060
[..skipped..]
HA Join
Outer table:
resc: 324.52 card 15000.00 bytes: 530 deg: 1 resp: 324.52
Inner table: BUILD_TAB Alias: BT
resc: 42.11 card: 500.10 bytes: 530 deg: 1 resp: 42.11
using dmeth: 2 #groups: 1
Cost per ptn: 399.93 #ptns: 1
hash_area: 0 (max=0) Hash join: Resc: 766.57 Resp: 766.57 [multiMatchCost=0.00]
HA Join (swap)
Outer table:
resc: 42.11 card 500.10 bytes: 530 deg: 1 resp: 42.11
Inner table: PROBE_TAB Alias: PT
resc: 324.52 card: 15000.00 bytes: 530 deg: 1 resp: 324.52
using dmeth: 2 #groups: 1
Cost per ptn: 0.84 #ptns: 1
hash_area: 0 (max=0) Hash join: Resc: 367.48 Resp: 367.48 [multiMatchCost=0.00]
HA cost: 367.48
resc: 367.48 resc_io: 364.00 resc_cpu: 15939091
resp: 367.48 resp_io: 364.00 resp_cpu: 15939091
Again the same question “left or right”? Here we use SINGLE TABLE ACCESS costs for each table, because we do not need to sort or pickup rows indivudially. resc = cost of serial access. deg = degree of parallelism. resp = cost of parallel access, because of deg=1, resc=resp. total best cost is 367.48. Substract SINGLE TABLE ACCESS costs. 367.48-(42.11+324.52) = 0.85. Building a hash table for this operation is free I think. And the answer of “left or right”. In hashing left means builder of hash table, right means user of hash table. Small dataset is builder of hash table, BUILD_TABLE. Also I skipped many columns, I have given no hint so there was merge cost calculation in these rows.
Plan Table
============
————————————————-+———————————–+
| Id | Operation | Name | Rows | Bytes | Cost | Time |
————————————————-+———————————–+
| 0 | SELECT STATEMENT | | | | 367 | |
| 1 | HASH JOIN | | 500 | 518K | 367 | 00:00:05 |
| 2 | TABLE ACCESS BY INDEX ROWID | BUILD_TAB| 500 | 259K | 42 | 00:00:01 |
| 3 | INDEX RANGE SCAN | BT_PK | 500 | | 3 | 00:00:01 |
| 4 | TABLE ACCESS FULL | PROBE_TAB| 15K | 7764K | 325 | 00:00:04 |
————————————————-+———————————–+
Predicate Information:
———————-
1 – access(“BT”.”ID_PROBE”=”PT”.”ID”)
3 – access(“BT”.”ID”>=1 AND “BT”.”ID”=1 AND “BT”.”ID”<=500)
I did not dive into deeper. If you need more about NL/HASH/Merge, turkcellstaj grup having a two day seminar at this weekend at Turkcell Maltepe. Seminar’s topic will be CBO, our primary resource is Mr. Lewis’es CBO book. I will be talking about this topic at sunday afternoon. All seats are reserved
.