The Blog from the DBA Classroom

By: Joel Goodman

Mounting Failures With Your Failgroups?

Posted by Joel Goodman on 16/01/2012

Last year I wrote a post about mirroring in ASM in the context of Exadata which discussed aspects of ASM mirroring, including failgroups, and the ability of ASM to cope with different failure scenarios. The post dealt with cases such as:

  • Single disk failure in a single cell
  • Multiple disk failure in a single cell
  • Single cell failure
  • Overlapping disk failures in multiple cells

It also covered the management of free space in ASM.

But the assumption was that the diskgroup(s) were already mounted, when the failures occurred. If an ASM instance attempts to mount a diskgroup however, then things are very different.

I have been working closely with members of the Oracle Advanced Customer Support Services team over the past 18 months, to train EMEA staff in Grid Infrastructure, RAC, Exadata and Enterprise Linux, to help skill up for supporting the Oracle Database Machine.  Recently, this collaboration resulted in a joint project between Oracle University and ACS to triage some problems at an EMEA customer.

The customer’s configuration used Grid Infrastructure for a standalone server on a two node Solaris stretch cluster. This was done to utilise ASM on each node, to support a live environment on one node, and a test environment on the other node in a different location.  But the test node was also part of the HA strategy, in case of problems on the live node. Oracle Restart was not used, so essentially, it was two separate nodes each with Oracle and ASM.

The ASM disks, were provided by SAN LUNs, with arrays in both the live and test locations. Each array was connected to both the live and test servers.

The live system diskgroups each had two failgroups to support ASM normal redundancy

  • First failgroup on LUNs provided by the storage array at the live location
  • Second failgroup on LUNs provided by the storage array at the test location

Normally, the ASM instance on the live system would have two diskgroups mounted, each with two failgroups. But to handle live location failures, scripts were written, that would:

  • Shut the test database instances on the test server
  • Unmount the test database diskgroups in the ASM instance on the test server
  • Mount the live database diskgroups in the ASM instance on the test server
  • Start the live database instance on the test server

In effect, this was a cold failover of the database. But without Grid Infrastructure for a cluster and clustered ASM, the diskgroups could only be mounted by a single ASM instance at one time. Thus the live disk groups could be mounted on the test server, only if not currently mounted on the live one.

Test were done by the customer, demonstrating that the scripts worked properly when the live database node failed. Eventually after all testing was completed the system for which this database was created, became operational.

I was called in, partly to triage a failure of the same recovery scripts that had worked during testing, and also to do a health check on the database and ASM environments. What I discovered from the ASM and database logs, was that one of the failgroups, on the storage array at the live data centre, had gone offline. ASM continued to read mirror copies from the other failgroup,  as it is designed to do, whenever I/O for primary copies of ASM allocation units (AUs) on the offline failgroup were done. Eventually, the OS on the live server failed, possibly related to access to swap space or other system files on the array.

When the failover to the test system was invoked and the scripts executed:

  • The ASM instance failed to mount the live database diskgroups
  • Startup of the live database instance on the test server failed as it could not access the database files stored in the ASM diskgroups

An attempt was eventually made to fail back to the live server, once the live system OS was back up, but the same problem arose. When I examined the scripts, and the logs, it showed that the MOUNT command for the ASM diskgroups was missing the FORCE option. This is what caused the mount failures in both cases.

Mirrored ASM diskgroups, may survive the loss of ASM disks  or even a complete failgroup, when already mounted, but to mount a mirrored ASM disk group requires that each AU have at least two copies discovered for normal redundancy diskgroups. If one of more ASM disks, is not accessible, then the diskgroup will not mount, because the stated level of redundancy will not have been met. The FORCE option overcomes this problem, by requesting that the diskgroup be mounted, without providing the stated level of redundancy, should a subsequent failure of  an ASM disk occur.

Contributing to the problem, was the lack of testing for storage array failure. The tests had only been done on database node failure. No tests were done for storage array failure, storage network failure or for failure of any of the components upon which the storage depends. This was why the failure to mount the diskgroups in this situation, was not discovered during testing.

An alternative approach to the software architecture, would have been to use Grid Infrastructure for a cluster, rather than Grid Infrastructure for a standalone server. This would have meant the following:

  • The live database diskgroups could be mounted simultaneously from the ASM instances on both the live and test servers, because clustered ASM can do this.
  • No need to write custom scripts to mount the diskgroups on the test server would be needed.
  • The Grid Infrastructure could be used to control the “cold failover” of the non-RAC database instance from the live server to the test server
  • Since the diskgroups could be mounted already they would be accessible and the live database could start up normally on the test server

It was an interesting bit of troubleshooting, and served to re-enforce my classroom message about testing all failure scenarios when planning for High Availability. Instead of suffering mounting failures, aim for success.

Joel

1/2012

Posted in Oracle | Tagged: , , , , | 1 Comment »

Maintaining your Cell’s Image

Posted by Joel Goodman on 14/10/2011

I have been teaching the Oracle Exadata Database Machine Administration course for 2 years now and recently the one day seminar on Database Machine (DBM) monitoring. These two will be merged into a newly upgraded course soon that covers both topics and also new material on monitoring Infiniband switches, Cisco Switches, KVM hardware, patching and much more.

In the classroom, Oracle use a virtualised image of an X2-2 Quarter rack on each delegate desktop running under OVM. But the virtual machines for the Exadata storage servers also known as cells have some differences when compared to real storage servers:

  • The 12 Physical disks and associated LUNs are virtualised as files, rather than being device files as they are in a real cell
  • the 16 Flash FDOMs and associated LUNs are virtualised as files, rather than being device files as they are in a real cell
  • The sizes of these 28 files are smaller than the devices in a real cell
  • These is no virtualised Infiniband network. All the virtual machines in the virtual quarter rack use a virtual network under Oracle virtual machine (OVM)
  • There are no Infiniband switches, Cisco switches, KVM hardware, Power Distribution Units, or ILOMs to administer or monitor.
  • The root, boot and Oracle base file systems are not mirrored in different partitions on the first two hard disks of the cell. They are contained in directories on the virtual disk used by the virtual machine.

The classroom setup however is very good for the purpose of the course:

  • The Cell Software is the same as the software in real cells
  • The virtualised quarter rack permits delegates to lean how to administer for Exadata All the following:
    • Grid Infrastructure
    • ASM Disk Groups
    • Cell Discovery via the /etc/oracle/cell/network-config/cellip.ora and /etc/oracle/cell/network-config/cellinit.ora files
    • Cell administration using cellcli and dcli
    • Database I/O Resource management
    • Cell I/O Resource Management
    • Cell Security using Realms
    • Flash device configuration for Flash Cache
    • Flash device configuration for flash based disk groups
    • Flash device configuration for Smart Flash Logging
  • The Monitoring of the following is also available in the virtualised setup:
    • Cell Smart Scans
    • Use of Hybrid Columnar Compression (HCC)
    • The effect of Storage Indexes on I/O
    • The effect of Flash Cache on I/O
    • Interpreting cell oriented statistics
    • Interpreting execution plans and statement execution for Smart Storage operations
    • Monitoring the effect of I/O resource management

One of the common questions I get in the course concerns the configuration of the software on the cells. Cells contain “images” consisting of the following components:

  • Enterprise Linux OS kernel
  • The Root file system
  • The Oracle Software
  • Device Drivers
  • Firmware Golden Copies

But in the virtualised classroom setup it is not possible to demonstrate this, so I will use a real cell to show how this is configured.

The first two Hard Disks on each cell contain various partitions that are not present on the other ten Hard Disks which have no partition table.

Lets start by finding the device files associated with the 12 physical disks on the cell. We can use the cellcli utility for this:

CellCLI> list lun attributes diskType ,deviceName, lunSize,isSystemLun where disktype = HardDisk
         HardDisk        /dev/sda        557.861328125G   TRUE
         HardDisk        /dev/sdb        557.861328125G   TRUE
         HardDisk        /dev/sdc        557.861328125G   FALSE
         HardDisk        /dev/sdd        557.861328125G   FALSE
         HardDisk        /dev/sde        557.861328125G   FALSE
         HardDisk        /dev/sdf        557.861328125G   FALSE
         HardDisk        /dev/sdg        557.861328125G  FALSE
         HardDisk        /dev/sdh        557.861328125G  FALSE
         HardDisk        /dev/sdi        557.861328125G   FALSE
         HardDisk        /dev/sdj        557.861328125G   FALSE
         HardDisk        /dev/sdk        557.861328125G  FALSE
         HardDisk        /dev/sdl        557.861328125G   FALSE

We can see the 12 device files. /dev/sda and /dev/sdb correspond to disks 0 and 1 containing the “image” of  the cells. The “IsSystemLun” attribute shows that these two devices contain the image. This cell has 600G disks as the “lunSize” attribute indicates.

Here are the device files:

[root@dmorlcel05 ~]# ls -als /dev/sd[a-l]
0 brw-r—– 1 root disk 8,   0 Aug 27 02:22 /dev/sda
0 brw-r—– 1 root disk 8,  16 Aug 27 02:22 /dev/sdb
0 brw-r—– 1 root disk 8,  32 Oct 14 06:07 /dev/sdc
0 brw-r—– 1 root disk 8,  48 Oct 14 06:07 /dev/sdd
0 brw-r—– 1 root disk 8,  64 Oct 14 06:07 /dev/sde
0 brw-r—– 1 root disk 8,  80 Oct 14 06:05 /dev/sdf
0 brw-r—– 1 root disk 8,  96 Oct 14 06:07 /dev/sdg
0 brw-r—– 1 root disk 8, 112 Oct 14 06:07 /dev/sdh
0 brw-r—– 1 root disk 8, 128 Oct 14 06:07 /dev/sdi
0 brw-r—– 1 root disk 8, 144 Oct 14 06:07 /dev/sdj
0 brw-r—– 1 root disk 8, 160 Oct 14 06:07 /dev/sdk
0 brw-r—– 1 root disk 8, 176 Oct 14 06:07 /dev/sdl

Now lets look at some details of  these devices. First lets see the partition tables of the first two devices:

[root@dmorlcel05 grub]# fdisk -l /dev/sd[a-b]

Disk /dev/sda: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          15      120456   fd  Linux raid autodetect
/dev/sda2              16          16        8032+  83  Linux
/dev/sda3              17       69039   554427247+  83  Linux
/dev/sda4           69040       72824    30403012+   f  W95 Ext’d (LBA)
/dev/sda5           69040       70344    10482381   fd  Linux raid autodetect
/dev/sda6           70345       71649    10482381   fd  Linux raid autodetect
/dev/sda7           71650       71910     2096451   fd  Linux raid autodetect
/dev/sda8           71911       72171     2096451   fd  Linux raid autodetect
/dev/sda9           72172       72432     2096451   fd  Linux raid autodetect
/dev/sda10          72433       72521      714861   fd  Linux raid autodetect
/dev/sda11          72522       72824     2433816   fd  Linux raid autodetect

Disk /dev/sdb: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          15      120456   fd  Linux raid autodetect
/dev/sdb2              16          16        8032+  83  Linux
/dev/sdb3              17       69039   554427247+  83  Linux
/dev/sdb4           69040       72824    30403012+   f  W95 Ext’d (LBA)
/dev/sdb5           69040       70344    10482381   fd  Linux raid autodetect
/dev/sdb6           70345       71649    10482381   fd  Linux raid autodetect
/dev/sdb7           71650       71910     2096451   fd  Linux raid autodetect
/dev/sdb8           71911       72171     2096451   fd  Linux raid autodetect
/dev/sdb9           72172       72432     2096451   fd  Linux raid autodetect
/dev/sdb10          72433       72521      714861   fd  Linux raid autodetect
/dev/sdb11          72522       72824     2433816   fd  Linux raid autodetect

/dev/sda and /dev/sdb each contain the same partitions with the same sizes. But lets look at the other 10 disks and see what they look like:

[root@dmorlcel05 ~]# fdisk -l /dev/sd[c-l]

Disk /dev/sdc: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn’t contain a valid partition table

Disk /dev/sdd: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn’t contain a valid partition table

Disk /dev/sde: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sde doesn’t contain a valid partition table

Disk /dev/sdf: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdf doesn’t contain a valid partition table

Disk /dev/sdg: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdg doesn’t contain a valid partition table

Disk /dev/sdh: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdh doesn’t contain a valid partition table

Disk /dev/sdi: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdi doesn’t contain a valid partition table

Disk /dev/sdj: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdj doesn’t contain a valid partition table

Disk /dev/sdk: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdk doesn’t contain a valid partition table

Disk /dev/sdl: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdl doesn’t contain a valid partition table

We can clearly see that disks 0 and 1 contain 11 partitions but that disks 2 through 11 do not. These partitions are what makes the first two disks on each cell into “system luns“. The total size used for this set of system partitions is seen by looking at the celldisks:

CellCLI> list celldisk attributes name,deviceName,devicePartition,diskType,size where diskType = HardDisk
         CD_00_dmorlcel05        /dev/sda        /dev/sda3       HardDisk        528.734375G
         CD_01_dmorlcel05        /dev/sdb        /dev/sdb3       HardDisk        528.734375G
         CD_02_dmorlcel05        /dev/sdc        /dev/sdc        HardDisk        557.859375G
         CD_03_dmorlcel05        /dev/sdd        /dev/sdd        HardDisk        557.859375G
         CD_04_dmorlcel05        /dev/sde        /dev/sde        HardDisk        557.859375G
         CD_05_dmorlcel05        /dev/sdf        /dev/sdf        HardDisk        557.859375G
         CD_06_dmorlcel05        /dev/sdg        /dev/sdg        HardDisk        557.859375G
         CD_07_dmorlcel05        /dev/sdh        /dev/sdh        HardDisk        557.859375G
         CD_08_dmorlcel05        /dev/sdi        /dev/sdi        HardDisk        557.859375G
         CD_09_dmorlcel05        /dev/sdj        /dev/sdj        HardDisk        557.859375G
         CD_10_dmorlcel05        /dev/sdk        /dev/sdk        HardDisk        557.859375G
         CD_11_dmorlcel05        /dev/sdl        /dev/sdl        HardDisk        557.859375G

We can see that about 29 gig is used for the system area on the first two hard disks. We can also see that the “devicePartition” attribute for the first two celldisks is the third partition of  the device but that for the other 10 disks, it corresponds to the device name. But now lets see how these partitions are really used. First lets try to see what is mounted:

[root@dmorlcel05 ~]# mount
/dev/md6 on / type ext3 (rw,usrquota,grpquota)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/md8 on /opt/oracle type ext3 (rw,nodev)
/dev/md4 on /boot type ext3 (rw,nodev)
/dev/md11 on /var/log/oracle type ext3 (rw,nodev)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
//10.141.138.97/monitor on /home/monitor type cifs (rw,mand)
/dev/sdm1 on /bla type ext3 (rw)
/dev/sdm1 on /mnt/usb type ext3 (rw)

The mounts show no evidence of partitions on /dev/sda or /dev/sdb. Instead we see various references to /dev/md partitions. We can also see this in /etc/fstab:

[root@dmorlcel05 ~]# cat /etc/fstab
/dev/md6           /                       ext3    defaults,usrquota,grpquota        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
/dev/md2              swap                    swap    defaults        0 0
/dev/md8                /opt/oracle             ext3    defaults,nodev  1 1
/dev/md4                /boot                   ext3    defaults,nodev  1 1
/dev/md11               /var/log/oracle         ext3    defaults,nodev  1 1
//10.141.138.97/monitor     /home/monitor           cifs    user=monitor,password=monitor,uid=root,gid=root 0 0

We can see the following:

  • the swap partition is /dev/md2
  • the boot file system is mounted on /dev/md4
  • the root file system on /dev/md6
  • the oracle cell software base is mounted on /dev/md8
  • the oracle cell software log file system is mounted on /dev/md11

Also we already know the following:

  • /dev/sda3 and /dev/sdb3 are used for the celldisks on the first two luns. We saw this in the celldisk listing above. There is no corresponding metadevice.
  • /dev/sda4 and /dev/sdb4 are just extended partitions.
  • /dev/sda1 and /dev/sdb1 are bootable partitions from the fdisk output we saw earlier

To understand the use of the mounted devices, note that many of the partitions of the first two disks are managed with software raid mirroring using mdadm. Lets first see all the metadevices:

[root@dmorlcel05 grub]# ls -als /dev/md*
0 brw-r—– 1 root disk 9,  0 Aug 27 02:24 /dev/md0
0 brw-r—– 1 root disk 9,  1 Aug 27 02:22 /dev/md1
0 brw-r—– 1 root disk 9, 11 Aug 27 02:24 /dev/md11
0 brw-r—– 1 root disk 9,  2 Aug 27 02:22 /dev/md2
0 brw-r—– 1 root disk 9,  4 Aug 27 02:24 /dev/md4
0 brw-r—– 1 root disk 9,  5 Aug 27 02:27 /dev/md5
0 brw-r—– 1 root disk 9,  6 Aug 27 02:27 /dev/md6
0 brw-r—– 1 root disk 9,  7 Aug 27 02:27 /dev/md7
0 brw-r—– 1 root disk 9,  8 Aug 27 02:27 /dev/md8

We can see that there are metadevices for most of the partitions but not for:

  • /dev/sda3 which is used for the first celldisk
  • /dev/sdb3 which is used for the second celldisk

Lets see what the software raid setup is like using mdadm first with /dev/md2 which we saw earlier was used for the swap partition. First lets confirm this using swapon.

[root@dmorlcel05 grub]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/md2                                partition       2096376 0       -1

Now lets see what /dev/md2 is:

root@dmorlcel05 grub]# mdadm --misc -D /dev/md2
/dev/md2:
        Version : 0.90
  Creation Time : Thu Feb  4 06:03:17 2010
     Raid Level : raid1
     Array Size : 2096384 (2047.59 MiB 2146.70 MB)
  Used Dev Size : 2096384 (2047.59 MiB 2146.70 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Sun Oct  9 04:22:45 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 3db204f1:d7a6a84b:cd086034:be3fb671
         Events : 0.58

    Number   Major   Minor   RaidDevice State
       0       8        9        0      active sync   /dev/sda9
       1       8       25        1      active sync   /dev/sdb9

We can see that /dev/md2 is mirroring across partitions /dev/sda9 and /dev/sdb9.

Lets see about some other partitions, firstly the boot file system on /dev/md4:

[root@dmorlcel05 grub]# mdadm --misc -D /dev/md4
/dev/md4:
        Version : 0.90
  Creation Time : Thu Feb  4 06:03:28 2010
     Raid Level : raid1
     Array Size : 120384 (117.58 MiB 123.27 MB)
  Used Dev Size : 120384 (117.58 MiB 123.27 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 4
    Persistence : Superblock is persistent

    Update Time : Fri Oct 14 06:44:52 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : e325de40:410dd5d7:f477be17:f36a61aa
         Events : 0.64

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

We can see that the boot file system is also mirrored across /dev/sda1 and /dev/sdb1. Next is the root file system:

[root@dmorlcel05 grub]# mdadm --misc -D /dev/md6
 /dev/md6:
         Version : 0.90
   Creation Time : Thu Feb  4 06:03:41 2010
      Raid Level : raid1
      Array Size : 10482304 (10.00 GiB 10.73 GB)
   Used Dev Size : 10482304 (10.00 GiB 10.73 GB)
    Raid Devices : 2
   Total Devices : 2
 Preferred Minor : 6
     Persistence : Superblock is persistent
    Update Time : Fri Oct 14 07:15:20 2011
           State : clean
  Active Devices : 2
 Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0
           UUID : 73867382:a65ee3ec:8bc6831d:a67e037c
          Events : 0.6
    Number   Major   Minor   RaidDevice State
        0       8        6        0      active sync   /dev/sda6
        1       8       22        1      active sync   /dev/sdb6

Next the Oracle base file system on /dev/md8:

[root@dmorlcel05 grub]# mdadm --misc -D /dev/md8
/dev/md8:
        Version : 0.90
  Creation Time : Thu Feb  4 06:04:15 2010
     Raid Level : raid1
     Array Size : 2096384 (2047.59 MiB 2146.70 MB)
  Used Dev Size : 2096384 (2047.59 MiB 2146.70 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 8
    Persistence : Superblock is persistent

    Update Time : Fri Oct 14 07:16:41 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 353f0c30:88a291fa:36b88fb4:d274f195
         Events : 0.60

    Number   Major   Minor   RaidDevice State
       0       8        8        0      active sync   /dev/sda8
       1       8       24        1      active sync   /dev/sdb8

Finally we have the log filesystem:

[root@dmorlcel05 grub]# mdadm --misc -D /dev/md11
/dev/md11:
        Version : 0.90
  Creation Time : Tue Aug 24 15:55:27 2010
     Raid Level : raid1
     Array Size : 2433728 (2.32 GiB 2.49 GB)
  Used Dev Size : 2433728 (2.32 GiB 2.49 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 11
    Persistence : Superblock is persistent

    Update Time : Fri Oct 14 07:19:15 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : f04ecbf4:58fdc681:0e1668d1:9049b7b4
         Events : 0.66

    Number   Major   Minor   RaidDevice State
       0       8       11        0      active sync   /dev/sda11
       1       8       27        1      active sync   /dev/sdb11 

We can see that the root, boot, cell software, cell log and swap partitions are mirrored onto partitions of the first
two disks using mdadm. But this leaves some partitions on each of the disks unaccounted for. For example we see no evidence of  /dev/md5 or /dev/md7 and the partitions over which they might be created.

To understand this we must see how cell get imaged. A  cell contains both the current image and the previous image of root and software file systems as well as different kernels. When a new image is “patched” onto the cell, the “inactive” partitions on /dev/sda and /dev/sdb get updated with the new image and updates are made to /etc/fstab and to the grub.conf file for booting the new images kernel. When the cell is rebooted, the new image is the active image and the previous active image becomes the inactive one. We can explore this using some utilities used on the database machine.

[root@dmorlcel05 grub]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.4.el5 #1 SMP Sat Feb 19 03:38:37 EST 2011 x86_64
Cell version: OSS_11.2.0.3.0_LINUX.X64_110429.1
Cell rpm version: cell-11.2.2.3.1_LINUX.X64_110429.1-1

Active image version: 11.2.2.3.1.110429.1
Active image created: 2011-04-29 21:20:48 -0700
Active image activated: 2011-05-09 16:26:31 -0400
Active image type: production
Active image status: success
Active internal version:
Active image label: OSS_11.2.0.3.0_LINUX.X64_110429.1
Active node type: STORAGE
Active system partition on device: /dev/md6
Active software partition on device: /dev/md8

In partition rollback: Impossible

Cell boot usb partition: /dev/sdm1
Cell boot usb version: 11.2.2.3.1.110429.1

rmdir: /mnt/dev/md5: Directory not empty
Inactive image version: 11.2.1.2.6
Inactive image created: 2010-05-11 13:37:31 -0700
Inactive image activated: 2010-10-05 02:34:01 -0400
Inactive image type: production
Inactive image status: success
Inactive internal version: 999999
Inactive image label: OSS_11.2.1.2.6_LINUX.X64_100511
Inactive node type: STORAGE
Inactive system partition on device: /dev/md5
Inactive software partition on device: /dev/md7

Boot area has rollback archive for the version: 11.2.1.2.6
Rollback to the inactive partitions: Possible

Here we can see that /dev/md5 and /dev/md7 are “inactive. Each of these metadevices is mirrored across partitions on /dev/sda and /dev/sdb. Note that there is also a utility to show all the images that have been on the cell:

[root@dmorlcel05 grub]# imagehistory -a
Version                              : 11.2.1.2.1
Image activation date                : 2010-02-04 15:00:53 -0800
Imaging mode                         : fresh
Imaging status                       : success
Imaging type                         : image
Image creation date                  : 2010-02-01 02:52:14 -0800
Image type                           : production
Image node                           : STORAGE
Internal version                     : 100131
Internal label                       : OSS_11.2.1.2.1_LINUX.X64_100131
Internal cell rpm version            : undefined
Internal cell software version       : undefined
Image id                             : 1265021534

Version                              : 11.2.1.2.3
Image activation date                : 2010-04-30 12:12:04 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2010-03-05 11:35:15 -0800
Image type                           : production
Image node                           : STORAGE
Internal version                     : 999999
Internal label                       : OSS_11.2.1.2.3_LINUX.X64_100305
Internal cell rpm version            : undefined
Internal cell software version       : undefined
Image id                             : 1267817715

Version                              : 11.2.1.2.4
Image activation date                : 2010-04-30 12:21:54 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2010-04-13 21:53:04 -0700
Image type                           : production
Image node                           : STORAGE
Internal version                     : 999999
Internal label                       : OSS_11.2.1.2.4_LINUX.X64_100413
Internal cell rpm version            : undefined
Internal cell software version       : undefined
Image id                             : 1271220784

Version                              : 11.2.1.3.1
Image activation date                : 2010-08-24 18:13:07 -0400
Imaging mode                         : out of partition upgrade
Upgrade logs                         : /var/log/cellos/patchlogs/1286251665.ex_part_rollback.11.2.1.2.4_11.2.1.3.1/1286251665.ex_part_rollback.11.2.1.2.4_11.2.1.3.1.tar.gz
Imaging status                       : success
Imaging type                         : out of partition
Image creation date                  : 2010-08-18 21:52:15 -0700
Image type                           : production
Image node                           : STORAGE
Internal version                     : 100818.1
Internal label                       : OSS_11.2.1.3.1_LINUX.X64_100818.1
Internal cell rpm version            : undefined
Internal cell software version       : undefined
Image id                             : 1282193535

Version                              : 11.2.1.2.4
Image activation date                : 2010-10-05 00:24:08 -0400
Imaging mode                         : out of partition rollback
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2010-04-13 21:53:04 -0700
Image type                           : production
Image node                           : STORAGE
Internal version                     : 999999
Internal label                       : OSS_11.2.1.2.4_LINUX.X64_100413
Internal cell rpm version            : undefined
Internal cell software version       : undefined
Image id                             : 1271220784

Version                              : 11.2.1.2.6
Image activation date                : 2010-10-05 02:34:01 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2010-05-11 13:37:31 -0700
Image type                           : production
Image node                           : STORAGE
Internal version                     : 999999
Internal label                       : OSS_11.2.1.2.6_LINUX.X64_100511
Internal cell rpm version            : undefined
Internal cell software version       : undefined
Image id                             : 1273610251

Version                              : 11.2.2.3.1.110429.1
Image activation date                : 2011-05-09 16:26:31 -0400
Imaging mode                         : out of partition upgrade
Imaging status                       : success
Imaging type                         : out of partition
Image creation date                  : 2011-04-29 21:20:48 -0700
Image type                           : production
Image node                           : STORAGE
Internal version                     : undefined
Internal label                       : OSS_11.2.0.3.0_LINUX.X64_110429.1
Internal cell rpm version            : OSS_11.2.2.3.1_LINUX.X64_110429.1
Internal cell software version       : OSS_11.2.0.3.0_LINUX.X64_110429.1
Image id                             : 1304137248

Note: the imageinfo and imagehistory may also be used on the database nodes in a database machine but the output is somewhat different for imageinfo. Here is imageinfo on a database server node ni the database machine:

[root@dmorldb03 ~]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.4.el5 #1 SMP Sat Feb 19 03:38:37 EST 2011 x86_64
Image version: 11.2.2.3.5.110815
Image created: 2011-08-16 13:35:02 -0700
Image activated: 2011-10-08 17:28:56 -0400
Image image type: production
Image status: success
Internal version:
Image label: OSS_11.2.2.3.5_LINUX.X64_110815
Node type: COMPUTE
System partition on device: /dev/sda1

And here is the imagehistory output on a database node:

[root@dmorldb03 ~]# imagehistory -a
Version                              : 11.2.1.2.1
Image activation date                : 2010-02-03 22:59:09 -0800
Imaging mode                         : fresh
Imaging status                       : success
Imaging type                         : image
Image creation date                  : 2010-02-01 02:53:01 -0800
Image type                           : production
Image node                           : COMPUTE
Internal version                     : 100131
Internal label                       : OSS_11.2.1.2.1_LINUX.X64_100131
Image id                             : 1265021581

Version                              : 11.2.1.2.3
Image activation date                : 2010-04-30 14:29:15 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2010-03-05 11:38:07 -0800
Image type                           : production
Image node                           : COMPUTE
Internal version                     : 999999
Internal label                       : OSS_11.2.1.2.3_LINUX.X64_100305
Image id                             : 1267817887

Version                              : 11.2.1.3.1
Image activation date                : 2010-08-24 18:50:13 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2010-08-18 21:49:31 -0700
Image type                           : production
Image node                           : COMPUTE
Internal version                     : 100818.1
Internal label                       : OSS_11.2.1.3.1_LINUX.X64_100818.1
Image id                             : 1282193371

Version                              : 11.2.2.3.1.110429.1
Image activation date                : 2011-05-09 17:58:47 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2011-04-29 21:14:31 -0700
Image type                           : production
Image node                           : COMPUTE
Internal version                     : undefined
Internal label                       : OSS_11.2.0.3.0_LINUX.X64_110429.1
Image id                             : 1304136871

Version                              : 11.2.2.3.5.110815
Image activation date                : 2011-10-08 16:15:41 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2011-08-16 13:35:02 -0700
Image type                           : production
Image node                           : COMPUTE
Internal version                     : undefined
Internal label                       : OSS_11.2.2.3.5_LINUX.X64_110815
Image id                             : 1313526902

Version                              : 11.2.2.3.1.110429.1
Image activation date                : 2011-10-08 17:01:49 -0400
Imaging mode                         : rollback
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2011-04-29 21:14:31 -0700
Image type                           : production
Image node                           : COMPUTE
Internal version                     : undefined
Internal label                       : OSS_11.2.0.3.0_LINUX.X64_110429.1
Image id                             : 1304136871

Version                              : 11.2.2.3.5.110815
Image activation date                : 2011-10-08 17:28:56 -0400
Imaging mode                         : patch
Imaging status                       : success
Imaging type                         : in partition
Image creation date                  : 2011-08-16 13:35:02 -0700
Image type                           : production
Image node                           : COMPUTE
Internal version                     : undefined
Internal label                       : OSS_11.2.2.3.5_LINUX.X64_110815
Image id                             : 1313526902

I hope this helps in the understanding of how cells maintain their images!

Joel

October 2011


Posted in Oracle | 8 Comments »

Train Your DBAs – From DBATRAIN

Posted by Joel Goodman on 12/10/2011

Last chance to contribute to the footprint of the new Oracle DBA training curriculum and certification track footprint.

The Job Task Analysis project survey closes at end of the month and will be used to help define the structure and content of the Oracle University curriculum for the next release. This covers the core DBA training curriculum, advanced topics such as Grid Infrastructure, RAC, Data Guard, Partitioning, Exadata Database Machine, Performance Management and much more.

There will of course be a certification track for OCA and OCP based on the core curriculum and several Oracle Certified Expert (OCE) exams based on advanced topics, and of course the Oracle Certified Master (OCM) exam.

I will be meeting with Harald van Breederode, Uwe Hesse and Branislav Valny from the Oracle University EMEA SME team, and with the DBA curriculum Managers from Oracle University to help create the blueprint for the next generation of training and certification paths.

Share your views by taking  this online survey and tell us what tasks are important to you.

Your input will assure that Oracle University courses and certification tracks, reflect the evolving skill set and job requirements of today’s DBA 2.0

Joel

October 2011

Posted in Oracle | Leave a Comment »

Calling All DBAs to Survey Their Tasks

Posted by Joel Goodman on 09/09/2011

You may have been to an Oracle DBA class, or done your certification. Perhaps you have considered doing so.

Either way, you can contribute to the definition of  the depth and scope of future Oracle Database training courses and Oracle certification exams?

Share your views by taking  this online survey and tell us what tasks are important to you.

Your input will assure that Oracle University courses and certification tracks, reflect the evolving skill set and job requirements of today’s DBA 2.0

Joel

September 2011

Posted in Oracle | Leave a Comment »

Control How You Share Your Snaps

Posted by Joel Goodman on 01/09/2011

On a recent Oracle Certified Masters (OCM) prep workshop, a question arose concerning the purpose and use of the RMAN Snapshot control file. This in turn led to some interesting exploration of a change in Oracle 11g R2 beginning in 11.2.0.2.

To understand this change it is first a good idea to review the snapshot controlfile which in turn requires an understanding of RMAN Resync operations.

 

RMAN Resync Operations
RMAN performs Resync operations when the RMAN Client is connected to a target and an RMAN catalog. There are also two ways for Resync to be invoked:

  • Manual Resync – using the RESYNC CATALOG command;
  • Automatic Resync – this may be partial or full as explained below.

There are two types of Resync:

  • Full Resync
  • Partial Resync

To understand when these are done, consider that the controlfile contains certain records that are able to be overwritten based on the CONTROLFILE_RECORD_KEEP_TIME parameter. These are records in the controlfile, that contain details of archivelogs produced and which may be overwritten once those details are copied to the catalogue. But many of the other controlfile record types are not overwritten as they represent persistent configuration details. These records include but are not restricted to things such as:

  • Database Record
  • Redo Thread Records
  • Log File Records
  • Checkpoint Progress Records
  • Tablespace Records
  • Filename Records
  • RMAN Persistent Configuration Records

Details of all the record types, including the number of records per type is available in the V$CONTROLFILE_RECORD_SECTION view.

When changes occur only to the records that may be overwritten, then a Partial Resync is performed automatically the next time that the RMAN client connects to the target and the catalogue and does any of the following:

  • Backup
  • Restore
  • Recover

A Full Resync occurs in the following cases:

  • When changes occur to any non-overwriteable section of the controlfile and any of the three commands listed above is issued
  • When a database is first registered in the catalogue using the REGISTER DATABASE command
  • When a RESYNC command is issued from the RMAN Prompt

Snapshot Controlfile Locations

Prior to Oracle 11.2.0.2, The Snapshot Controlfile was used only during a Full Resync to reduce the duration of the CF ENQUEUE held exclusively on the controlfile. The enqueue prevents updates to any of the control file records, because there are dependencies between records in some sections and others. For example, there are tablespace records and file records, and during a Full Resync, the consistency between these two record types must be maintained until all the controlfile records are resync’ed to the catalogue. This prevents Oracle processes making changes to anything in the controlfile that might cause the records to become inconsistent.

But the controlfile is frequently updated in a busy system, so instead of holding the CF ENQUEUE for the duration of the resync, the controlfile is essentially copied to the “Snapshot” controlfile location in one go whilst holding the CF ENQUEUE, and then it is dequeued. The catalogue is then resync’ed from the “Snapshot”.

By default, the “Snapshot” is written to the $ORACLE_HOME/dbs directory which may be seen by doing an RMAN SHOW ALL command. But this may be configured to another location if required.

In RAC for example, the directory for the snapshots, must be a valid path name on any node in the cluster which has an instance of the database, to guarantee that the snapshot may be created regardless of which instance is serving the RMAN client. But here is where the change occurs.

Prior to 11.2.0.2, the Snapshot Controlfile location did NOT require shared storage, only that the path was valid from all nodes as stated above. From 11.2.0.2, it must be in shared storage when using RAC. If not then one might see the following:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of Control File and SPFILE Autobackup command on
ORA_DISK_1 channel at 31/08/2011 13:34:07
ORA-00245: control file backup operation failed

What is going on?  The answer is that the Snapshot Controlfile is being used for ordinary controlfile backups from 11.2.0.2 for:

  • Control file backups taken manually from SQL*PLUS
  • Control file auto backups taken by the Oracle server when configured from the RMAN client

For example here is what happens if a control file backup is attempted in RAC in 11.2.0.2 when not using shared storage:

SQL> alter database backup controlfile to ‘/u01/app/oracle/product/11.2.0.2/dbhome_i/dbs/ctlbackup.ctl’ ;
ERROR at line 1:
ORA-00245: control file backup operation failed

The $ORACLE_HOME/dbs directory is normally local in RAC, since the entire ORACLE_HOME is on local storage. This facilitates rolling patches, but from 11.2.0.2 the dbs directory is no longer useful for the snapshots.

The reason for this change is that control file backups traditionally used the CF ENQUEUE to lock the controlfile for the duration of the backup but in some high workload systems this was a performance bottleneck. The solution, was to take the controlfile backup from the snapshot and avoid holding the CF ENQUEUE whilst doing it. But that required maintaining the snapshot as changes occur to the controlfile. Since changes could occur from any instance of the database, the snapshot must be on shared storage, just like the proper control file is.

Look at the OERR output for the error:

> oerr ora 00245

00245, 00000, “control file backup failed; target is likely on a local file system”
// *Cause: Failed to create a control file backup because some process
// signaled an error during backup creation. This is likely caused
// by the backup target being on a local file system so it could not
// be accessed by other instances. It can also be caused by other
// I/O errors to the backup target. Any process of any instance that
// starts a read/write control file transaction must have access
// to the backup control file during backup creation.
// *Action: Check alert files of all instances for further information.

When upgrading to 11.2.0.2 under RAC, make sure to avoid this problem by using RMAN to configure the Snapshot Controlfile to a shared location:

RMAN> CONFIGURE SNAPSHOT CONTROLFILE NAME TO ‘<a_shared_location>/snapcf_?.f’;

Joel

01/09/2011

Posted in Oracle | 1 Comment »

Add Bulk to your Parallel Updates

Posted by Joel Goodman on 01/07/2011

 

Oracle Database server 11gR2 added several new Parallel Execution features. I had several questions on a recent Exadata administration course concerning parallelism, and an interesting discussion arose about these new features. Although parallel execution is useful in all Oracle Database Data Warehouse implementations, and often in batch systems, it is particularly relevant to Exadata. In Exadata, “smart” storage operations, which offload filtering, column projections and some join overheads to the cells may be done for:

  • Full Table Scans
  • Fast Full Index Scans

But these “smart” scans, only occur if a “direct path read” access method is done. This occurs in the following situations:

  • On a Parallel Query
  • On a Serial Query – when the size of the table exceeds the “_small_table_threshold” hidden parameter
  • On a Serial Query – when “_serial_direct_read” hidden parameter is set to “always” regardless of table size

Note: both these parameters are session modifiable.

In addition for Exadata, “direct path insert” is required when loading rows into Hybrid Columnar Compressed (HCC) tables, in order to get the HCC compression done into multi-block compression units. Direct path insert may occur when:

  • When inserting in parallel
  • When doing “Create table as select” (CTAS) in parallel
  • When inserting serially using the “append” hint

Since Exadata is used for Data Warehouses, Parallel Execution is likely to be used very often, both for queries of large tables and for bulk inserts and updates of existing tables.

The new features of Parallel Execution in Oracle 11gR2 are:

  1. Automatic Degree of Parallelism – to set Degree of Parallelism automatically for a statement
  2. Parallel Statement Queuing – to queue a statement if inadequate execution slave processes are not available
  3. In Memory Parallel Updates – to utilise the RAC buffer caches for Parallel queries as an alternative to cache fusion overheads
  4. Parallel Bulk Updates – to allow updates to be done in sections which an succeed or fail independently of other sections

In this post I will focus on Parallel Bulk updates which:

  • Are useful for bulk updates of large data volumes
  • Avoids the “All or Nothing” approach of traditional Parallel DML
  • Groups sets of rows into smaller sized chunks
  • Chunk updates run by a scheduler job which commits on completion
  • Chunking is based on one of three possible techniques:
    • Using ROWID ranges of rows in the queried table
    • Ranges of values in a numeric column of  the queried table
    • Results of User provided SQL statements

Parallel Bulk update is done using the DBMS_PARALLEL_EXECUTE package, which permits the chunking of rows using any of the 3 methods listed. Then the updates are done with several benefits:

  • Only one chunk of rows need be locked at one time, rather than having the entire table locked as in normal PDML
  • Chunks commit independently of of one another
  • Failure of one chunk update to commit, does not cause all the updates to roll back
  • Less Undo space is required since the chunks may commit their updates independently
  • Performance may benefit from the combination of reduced simultaneous locking, and reduced undo consumption
  • Avoids having a huge rollback of a nearly complete bulk update due to failure near the end
  • Failed chunks may be retried after correcting the problem causing the failure

Here is the DBMS_PARALLEL_EXECUTE package described in Oracle 11.2.0.2:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

SQL>  desc dbms_parallel_execute
PROCEDURE ADM_DROP_CHUNKS
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_OWNER                     VARCHAR2                IN
 TASK_NAME                      VARCHAR2                IN
PROCEDURE ADM_DROP_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_OWNER                     VARCHAR2                IN
 TASK_NAME                      VARCHAR2                IN
PROCEDURE ADM_STOP_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_OWNER                     VARCHAR2                IN
 TASK_NAME                      VARCHAR2                IN
FUNCTION ADM_TASK_STATUS RETURNS NUMBER
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_OWNER                     VARCHAR2                IN
 TASK_NAME                      VARCHAR2                IN
PROCEDURE CREATE_CHUNKS_BY_NUMBER_COL
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 TABLE_OWNER                    VARCHAR2                IN
 TABLE_NAME                     VARCHAR2                IN
 TABLE_COLUMN                   VARCHAR2                IN
 CHUNK_SIZE                     NUMBER                  IN
PROCEDURE CREATE_CHUNKS_BY_ROWID
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 TABLE_OWNER                    VARCHAR2                IN
 TABLE_NAME                     VARCHAR2                IN
 BY_ROW                         BOOLEAN                 IN
 CHUNK_SIZE                     NUMBER                  IN
PROCEDURE CREATE_CHUNKS_BY_SQL
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 SQL_STMT                       CLOB                    IN
 BY_ROWID                       BOOLEAN                 IN
PROCEDURE CREATE_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 COMMENT                        VARCHAR2                IN     DEFAULT
PROCEDURE DROP_CHUNKS
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
PROCEDURE DROP_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
FUNCTION GENERATE_TASK_NAME RETURNS VARCHAR2
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 PREFIX                         VARCHAR2                IN     DEFAULT
PROCEDURE GET_NUMBER_COL_CHUNK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 CHUNK_ID                       NUMBER                  OUT
 START_ID                       NUMBER                  OUT
 END_ID                         NUMBER                  OUT
 ANY_ROWS                       BOOLEAN                 OUT
PROCEDURE GET_ROWID_CHUNK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 CHUNK_ID                       NUMBER                  OUT
 START_ROWID                    ROWID                   OUT
 END_ROWID                      ROWID                   OUT
 ANY_ROWS                       BOOLEAN                 OUT
PROCEDURE PURGE_PROCESSED_CHUNKS
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
PROCEDURE RESUME_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 SQL_STMT                       CLOB                    IN
 LANGUAGE_FLAG                  NUMBER                  IN
 EDITION                        VARCHAR2                IN     DEFAULT
 APPLY_CROSSEDITION_TRIGGER     VARCHAR2                IN     DEFAULT
 FIRE_APPLY_TRIGGER             BOOLEAN                 IN     DEFAULT
 PARALLEL_LEVEL                 NUMBER                  IN     DEFAULT
 JOB_CLASS                      VARCHAR2                IN     DEFAULT
 FORCE                          BOOLEAN                 IN     DEFAULT
PROCEDURE RESUME_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 FORCE                          BOOLEAN                 IN     DEFAULT
PROCEDURE RUN_INTERNAL_WORKER
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 JOB_NAME                       VARCHAR2                IN
PROCEDURE RUN_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 SQL_STMT                       CLOB                    IN
 LANGUAGE_FLAG                  NUMBER                  IN
 EDITION                        VARCHAR2                IN     DEFAULT
 APPLY_CROSSEDITION_TRIGGER     VARCHAR2                IN     DEFAULT
 FIRE_APPLY_TRIGGER             BOOLEAN                 IN     DEFAULT
 PARALLEL_LEVEL                 NUMBER                  IN     DEFAULT
 JOB_CLASS                      VARCHAR2                IN     DEFAULT
PROCEDURE SET_CHUNK_STATUS
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
 CHUNK_ID                       NUMBER                  IN
 STATUS                         NUMBER                  IN
 ERR_NUM                        NUMBER                  IN     DEFAULT
 ERR_MSG                        VARCHAR2                IN     DEFAULT
PROCEDURE STOP_TASK
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN
FUNCTION TASK_STATUS RETURNS NUMBER
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TASK_NAME                      VARCHAR2                IN

To divide the table into chucks, one of three procedures may be used:

  1. DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID
  2. DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_NUMBER_COL
  3. DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_SQL

Note: use of this package attempts to create jobs, so using this requires the CREATE_JOB system privilege.

Using ROWID based chunks

To chunk by rowid,  requires the CHUNK_SIZE parameter to be passed in addition to the table owner and table name. The CHUNK_SIZE is the rough number of rows or blocks in each chunk and therefore in each separately committed transaction. The boolean parameter BY_ROW determines of CHUNK_SIZE refers to rows or blocks.

Using numeric column value based chunks

To chunk by numeric column, requires the TABLE_COLUMN parameter, which names the numeric column used to determine the chunks and the CHUNK_SIZE parameter. For this method, the minimum and maximum values in the column are obtained from the Data Dictionary, and the table is then chunked evenly by CHUNK_SIZE. For popular values, there could be more than one chunk, and conversely a chunk could contain multiple values.

Using user provided SQL based chunks

To chunk using SQL requires that the user executing the package have execute privileges for the DBMS_SQL package, as this is called by the DBMS_PARALLEL_EXECUTE package to execute the SQL statement used to do the chunking. This method requires the SQL_STATEMENT parameter containing the text of the statement whose output is used to chunk the data, either by rowids or by numeric columns. It uses the BY_ROWID parameter. If this is  TRUE, then the SQL statement must have columns called START_ID and END_ID of type ROWID. If BY_ROWID is false, then then the SQL statement must have columns called START_ID and END_ID of type NUMBER.

Using the package

The package is generally used as follows:

  1. Call CREATE_TASK to establish the task providing a name
  2. Chunk the table using one of the three methods described above
  3. Call RUN_TASK, providing the SQL used to update the table
  4. Call DROP_TASK when complete

Some other procedures in the package allow the following operations:

  • Stopping a task and its job slaves using STOP_TASK
  • Resuming a stopped task and its job slaves using RESUME TASK
  • Managing and inquiring about a task’s or chunk’s status using TASK_STATUS or SET_CHUNK_STATUS
  • the ADM procedures are for DBAs and permit stopping and dropping tasks of other users, dropping chunks of other users and checking the task status of other users.

Note: to use RUN_TASK or RESUME_TASK also requires execute privileges on DBMS_SQL.

For a full list of all the procedures see the documentation here.The doc also contains some useful examples using the HR schema.

My Oracle support note 1066555.1 is also a useful reference.
Dictionary views used for this feature are:

  • DBA_PARALLEL_EXECUTE_TASKS
  • DBA_PARALLEL_EXECUTE_CHUNKS
  • USER_PARALLEL_EXECUTE_TASKS
  • USER_PARALLEL_EXECUTE_CHUNKS

Here are the DBA views:

SQL> desc dba_parallel_execute_tasks

 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 TASK_OWNER                                NOT NULL VARCHAR2(30)
 TASK_NAME                                 NOT NULL VARCHAR2(128)
 CHUNK_TYPE                                         VARCHAR2(12)
 STATUS                                             VARCHAR2(19)
 TABLE_OWNER                                        VARCHAR2(30)
 TABLE_NAME                                         VARCHAR2(30)
 NUMBER_COLUMN                                      VARCHAR2(30)
 TASK_COMMENT                                       VARCHAR2(4000)
 JOB_PREFIX                                         VARCHAR2(30)
 SQL_STMT                                           CLOB
 LANGUAGE_FLAG                                      NUMBER
 EDITION                                            VARCHAR2(30)
 APPLY_CROSSEDITION_TRIGGER                         VARCHAR2(30)
 FIRE_APPLY_TRIGGER                                 VARCHAR2(10)
 PARALLEL_LEVEL                                     NUMBER
 JOB_CLASS                                          VARCHAR2(30)

SQL> desc dba_parallel_execute_chunks
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CHUNK_ID                                  NOT NULL NUMBER
 TASK_OWNER                                NOT NULL VARCHAR2(30)
 TASK_NAME                                 NOT NULL VARCHAR2(128)
 STATUS                                             VARCHAR2(20)
 START_ROWID                                        ROWID
 END_ROWID                                          ROWID
 START_ID                                           NUMBER
 END_ID                                             NUMBER
 JOB_NAME                                           VARCHAR2(30)
 START_TS                                           TIMESTAMP(6)
 END_TS                                             TIMESTAMP(6)
 ERROR_CODE                                         NUMBER
 ERROR_MESSAGE                                      VARCHAR2(4000)

Have a go at reducing the bulk of your updates!
Joel
July 2011


			

Posted in Oracle | 2 Comments »

“Plugging” in the Database Machine

Posted by Joel Goodman on 06/06/2011

I am often asked about monitoring and management when teaching The Exadata Database Machine administration course, both to customers and to Oracle internal staff. This is a popular topic due to the nature of the Database Machine, the monitoring of which requires knowledge and skill in several areas. These include:

  • database administration
  • storage administration
  • operating system administration
  • network administration
  • hardware knowledge

Some monitoring and performance management may be done using command line tools. Some of these may be familiar to DBAs whilst others are more familiar to OS and network administrators. The course discusses there only briefly is focuses primarily on concepts, architecture, and cell administration. These include:

  • the cellcli utility to monitor the Exadata cells and to set thresholds for alerts.
  • the Automated Diagnostics Repository (ADR) and the adrci command line tool for managing and monitoring traces, dumps and alert logs on database servers and on cells.
  • snmp traps to receive alerts from, and to configure device parameters on the database server
  • snmp traps to receive alerts from, and to configure device parameters on the Exadata cells
  • snmp traps to receive alerts from, and to configure device parameters on the Ethernet switch
  • snmp traps to receive alerts from, and to configure device parameters on the Infiniband switch
  • snmp traps to receive alerts from, and to configure device parameters on the Keyboard, Video and Mouse (KVM) switch
  • snmp traps to receive alerts from, and to configure device parameters on the Power Distribution Unit (PDU)
  • ILOM command line and GUI interfaces to monitor hardware, control power state and provide remote console for Exadata cells and DB servers

Monitoring may be made easier however, by using some or all of the Enterprise Manager Plug-ins for the Database machine.Each plug-in contains logic for a specific DBM component and extends the GUI interface of the Enterprise Manager Oracle Management Server (OMS) and the Enterprise Management Agent to manage the custom targets contained in a Database Machine.

The plug-ins available are:

  • Exadata Storage Server or Cell plug-in
  • Database Server ILOM plug-in
  • Infiniband Switch System Monitoring plug-in
  • Cisco Switch System Monitoring plug-in
  • Avocent MergePoint (KVM) plug-in
  • Power Distribution Unit (PDU)  plug-in

Enterprise Manager is well suited to this type of monitoring, as one may define an Aggregate Service comprising the various targets in a Database Machine.In an X2-2 full rack, there are more than 60 targets and having a single “dashboard” to represent the entire Database Machine simplifies monitoring. Each type of component may be grouped into an abstract collection called a System, and then the Systems are aggregated together into the Aggregate Service. Finally a monitoring “dashboard“, based on the Aggregate Service is created to facilitate monitoring from a single screen in Enterprise Manager.

Here is a brief summary of each plug-in, discussing the purpose and general use.

1. Exadata Storage Server plug-in

This extends  the monitoring of exadata cells in addition to providing a GUI interface. The plug-in uses an SSH connection to the cellmonitor user on the cells and uses list commands only. This is for interactive monitoring. One may also set thresholds using the plug-in which are distinct from any thresholds set using cellcli utility as the celladmin user. For alerts to be sent to the plug-in, SNMP traps are used as follows:

  1. Cell ILOM alerts are sent to the cell Management Server (MS) via an SNMP trap. The MS then send SNMP notifications onward to the plug-in.
  2. Cell alerts flagged by MS itself, such as cell thresholds being exceeded, or ADR software alerts, are sent to the plug-in using SNMP.

Note: cells may also send SNMP notifications to multiple locations including third party tools. This may be configured by the celladmin user using the cellcli utility, or by running the distributed command line interface (dcli) utility.

2.  Database Server ILOM plug-in

Monitoring databases and their instances, ASM environments, the Grid Infrastructure, and the host software environment are done by Enterprise Manager in the usual way as these are standard targets. But monitoring the hardware for the database servers requires the ILOM plug-in, as there is no Management Server (MS) on the database servers to receive SNMP traps from the ILOM. The plug-in will receive sensor state and availability data from the ILOM including alerts based on pre-set ILOM thresholds.

3.  Infiniband Switch System Monitoring plug-in

Infiniband switches running software from version 1.1.3 onwards can send SNMP data and the plug-in will show if the switches are up and if any alerts have been sent. If necessary one may then ssh to the switch ILOM and use native command line tools to diagnose problems.

Note: There are also numerous command line tools on the database servers and cells to check the infiniband ports, which are beyond the scope of this overview.

4. Cisco Switch System Monitoring plug-in

Clients and middle tiers connect to the database instances on the Database nodes over  the Client Access Network. The Database nodes use the Infiniband network for the Cluster Interconnect, and Database nodes and Exadata Storage servers use the Infiniband network as the storage network. As the Cisco switch is used on DBM Management network, it is not a critical component in the same way as are the components already listed, but the switch is needed for admin and monitoring traffic in the DBM. All the ILOM SNMP traffic for example uses this network.

Monitoring of the switch using the plug-in is for switch availability and hardware sensor thresholds. The  default values are set in the plug-in to monitor for switch availability. SNMP traps are sent by the switch for asynchronous notification of alerts.

Note: Customers may replace the Cisco switch if they prefer as it is not considered a critical component but then they would be responsible for monitoring.

5. Avocent MergePoint (KVM) plug-in

All models of DBM have the KVM switch, except for the X2-8 which has no room for it in the rack. Console access to the machines on the x2-8 is done using the ILOM. The plug-in is simply to monitor for availability. Alerts are sent to the plug-in via SNMP traps.

Note: Customers may replace the KVM if they prefer but then they would be responsible for monitoring.

6. Power Distribution Unit (PDU)  plug-in

There are two PDUs per rack. Monitoring them using the plug-in is optional but if one wishes to do so, then each PDU will require an ethernet cable to connect to the PDU. On the X2-2 Full rack, all 48 ports on the Cisco switch are used, so a network drop from an external switch will be required to connect.  And it is good practice to do this on the X2-8 despite the availability of ports, because the failure of the PDU could also affect the Cisco switch in the DBM preventing monitoring and alerts from being sent on the management network. Remote monitoring using the plug-in allows one to observe the electric power used in the DBM and any alerts as well.  The factors that affect power consumption are:

  • Type of DBM – quarter, half or full rack
  • Version of hardware – V2, X2-2 or full rack X2-8
  • Type of power supply – low or high voltage; one or three phase power

Note: My Oracle Support note 1299851.1 covers the threshold settings for the PDUs depending on the variations in model and power.

Installing and configuring all these plug-ins can be time consuming, especially to people who  are new to Enterprise Manager, the Database Machine or both.  As of release 11.2.2.3.x of the Exadata software, an Enterprise Manager setup scripts are provided to help customers install and configure the monitoring tools and plug-ins. It contains:

  • A script to set up EM agents on the Database Machine
  • A script to set up a new Grid Control OMS and OMR for customers new to EM

These scripts will save time by:

  • Installing the latest Exadata DBM plug-ins
  • Discovering all Exadata DBM system targets
  • Configuring all the Exadata DBM targets.

The Plug-in bundle may be downloaded from http://www.oracle.com/technetwork/oem/grid-control/downloads/devlic-188770.html

Oracle University will soon offer a seminar on Exadata Database Machine Monitoring where the plug-ins mentioned here and various command line monitoring tools will be examined in detail.

Joel

June 2011

Posted in Oracle | Leave a Comment »

How Smart is Your Smart Scan?

Posted by Joel Goodman on 21/04/2011

I was teaching the Oracle Database Machine Administration Workshop in London last week, when an interesting discussion arose concerning Exadata Smart Scan processing when the following situations arise:

  • Migrated or Chained Rows
  • Consistent Read (CR) Requests

The discussion followed on from a situation that a delegate encountered when doing some routine maintenance on tables and indexes in a data warehouse stored in an Exadata based database. Here is the scenario:

  1. A certain warehouse table had a index dropped, prior to bulk update operations on the table during regular maintenance windows.
  2. After the maintenance, the index was re-created using CREATE INDEX….
  3. During the index creation the WAIT EVENT seen by monitoring V$SESSION is mostly for “direct path read”
  4. During a recent maintenance window, several new columns were added to the table for an application upgrade after dropping the index.
  5. The table was then bulk updated including values for the new columns.
  6. When the index was re-created, it took more than 100 times as long to finish as it normally did.
  7. The WAIT EVENT seen by monitoring V$SESSION  during the index creation is mostly for “cell single block physical read“.

The initial questions were:

  1. Why is this taking so long?
  2. Why are there so many “cell single block physical read” waits?
  3. Why not have “cell multiblock physical read” waits.

In order to understand the answer,  a few facts should be made clear about how Exadata processes query requests. Much has been written on blogs and in presentations about Exadata “Smart Scan” operations and other “offload” capabilities. But they tend to discuss queries without deep diving into the effects of updates or other less common scenarios, on the operation of the “Smart Scan“.

The best in depth treatment I have seen comes from a new book “Expert Oracle Exadata” written by my “Oak Table” colleagues  Tanel Poder, Kerry Osborne and Randy Johnson published by APRESS.

Smart Scan” operations only work for DIRECT PATH READs, because they return the results of queries from the Exadata cells and not block images. The decision to do  buffered reads or direct path reads, is made at run time, based on the size of the object, the number of blocks already in the buffer cache for that object and the size of the buffer cache to name some. Conventional buffered reads would require blocks to go to the buffer cache, meaning that whole block images would be required. So the index re-creation I mentioned did a “Smart Scan” against the underlying table under normal maintenance conditions. Creating an index accesses all table rows, but only the index key columns are required, so “Column Projection” filtering was done by the cells, which is one of the standard “Smart Scan” capabilities.

When columns were added to the table during the upgrade maintenance window, the ensuing bulk update of the table caused many rows to migrate from their original blocks, to other blocks in the same segment. These blocks could be in any part of the table segment, based on available free space and may be in ASM allocation units in the same cell or in a different cell. One may detect when migrated (or chained) rows are accessed, by examining the “table fetch continued row” value in v$sysstat or in an AWR report. In this case, the value increased by a huge number; in effect nearly every table block had one or more migrated rows.

The huge increase in index build time was caused by the inability of “Smart Scans” to do offload processing for migrated rows. Migrated rows are a special case of chained rows where the entire row is in a different block than where it was originally inserted, but still fits into a single block. The original row is reduced to a 3 byte row header, plus a six byte rowid.

When a “Smart Scan” detects a migrated row, it passes the rowid back to the database session and for all such rows, the database session will access it using a buffered single block read. Since nearly all the table blocks had migrated rows, the database session needed to issue I/O requests for nearly all  the blocks one at a time. This was the cause for the huge number of “cell single block physical read” wait events that were seen when monitoring the recreation of the index. This indicates that the presence of many migrated (or chained) rows will degrade performance  in Exadata by preventing “Smart Scans” from being so smart.  This situation arose on an index creation, but it would also keep occurring until the DBA deleted and reinserted the migrated rows in the usual manner to solve this problem.

Note:  There is an Exadata specific situation that causes migrated rows. When a row is updated in a Hybrid Columnar Compressed (HCC) table, then it is migrated to another block in the segment that is managed using “OLTP compression“. Any HCC Compression Unit (CU) containing at least one migrated row, will also cause the block containing that row to be accessed by the server using a “cell single block physical read“.

The discussion then evolved to other situations that might interfere with “Smart Scan” operations on objects for which “Smart Scans” are normally done and which might be caused by updates. Consistent Read requests can force “Smart Scans” to work differently if any of the following is true:

  1. A block is updated and the update is not yet committed.
  2. A block is updated and committed, but the query System Change number (SCN) is older than the commit SCN.
  3. A block is updated and committed but not cleaned out, because it was no longer in the buffer cache at commit time.

In the first two cases, a CR image of the block must be found in the buffer cache of the instance issuing the read, or by RAC cache fusion in the case of RAC databases. If not found, then one must be constructed by accessing the Undo segment for the active transaction and creating an older block image by reference to the Undo blocks.  Whichever is done, requires that the CR processing be done in the database instances not the cells. So each cell performing part of the “Smart Scan” operation on part of the table may stop sending results temporarily and ship block images for any such block that it finds, for CR processing in  the database instance(s), before resuming smart scan on the next block. Performance will be more affected if there are many such blocks.

In the third case, when a cell sends its first block image to the database instance. the Undo will indicate that the transaction committed prior to the query SCN and the block may be cleaned out. The transaction ID may then be returned to the cell indicating that it was already committed. Any future references by any query to any blocks showing the same transaction to be active, and  whose query SCN is older than the commit SCN, may be cleaned out in the cell. This is done by reference to a cached list of committed transaction identifiers maintained in each cell. This means that the overhead of delayed block cleanout is not as serious as the case of an uncommitted transaction or one that committed after a query began, as outlined above.

The lessons learned from this are:

  • Migrated rows still affect performance, as they do in conventional storage situations, but with the additional overhead of reducing the beneficial effects of “Smart Scan” in addition to increasing the number of I/Os.
  • All queries may be affected by the migrated rows, including queries performed implicitly for DDL.
  • Uncommitted bulk updates performed at the same time as queries, may increase overheads due to Consistent Read processing
  • Bulk updates committed after a query starts, may cause the same CR processing overheads

DBAs looking after database on a DB machine should monitor for migrated rows and rectify the situation. There are numerous posts outlining the method for doing this, so i wont reproduce them here. Alternatively the Enterprise Manager Reorg Wizard may be used and details may be found in the EM documentation for the tuning pack reorg wizard which is a cost option.

For optimal “Smart Scan” efficiency, it is advisable to avoid warehouse queries done at the same time as bulk updates, as is the case with conventional storage solutions.

Joel

London 04/11

Posted in Oracle | 2 Comments »

Mirror Mirror on the Exadata….

Posted by Joel Goodman on 28/03/2011

In past posts, I have discussed several aspects of the Oracle Exadata Database Machine, most recently concerning the methods used by Advanced Customer Services (ACS) to configure the network and software. But ASM must be used for Exadata and requires the administrator to use ASM redundancy, because Exadata does not do Array Based Mirroring.

The feedback I get in the classroom and when presenting at user group conferences on either ASM or on Exadata is that customers don’t use ASM mirroring because the ASM Disks are LUNS on storage arrays and the mirroring is done by the storage system. Most ASM diskgroups therefore, use external redundancy where by default each ASM disk is considered a separate failgroup.

With Exadata, ASM automatically groups griddisks discovered on the same Exadata storage server into a single failgroup. For example, there are 14 storage servers (also known as cells) in an X2-2 or X2-8 full rack, and each cell would be considered a separate failgroup by ASM, assuming that at least one griddisk was discovered by ASM on each cell. No failgroup clauses are required when creating the diskgroup. The path name for the Grid Disks discovered in the cells have the following format:

0/<cell ip address>/<grid disk name>

These path names may be seen in one of the following ways:

  1. By querying the PATH column from V$ASM_DISK
  2. By using the LSDSK command from the ASMCMD tool
  3. By using Enterprise Manager ASM pages.

Here is an example of using V$ASM_DISK:

SQL> select path
            from v$asm_disk
            where path like 'o/%/%';
PATH
------------------------------------------------------------------------------
o/192.168.10.20/data_CD_04_mycell1
o/192.168.10.20/data_CD_05_mycell1
o/192.168.10.21/reco_CD_01_mycell2
o/192.168.10.21/reco_CD_02_mycell2

Note that the ip address in the path name, is the address of the cell on the infiniband storage network within the database machine. The grid disk name is chosen by the administrator, when creating the grid disk on an exadata cell disk. Alternatively, the name may be system generated, by concatenating an administrator specified prefix to the name of the cell disk on which the grid disk resides. Since all grid disks on the same cell  share the same ip address in their path name, it is easy for ASM to group the grid disks into failgroups.

Unlike diskgroups based on SAN provided LUNs, the cells do not mirror, and to provide for redundancy, the ASM diskgroups, must be created using normal or high redundancy so that ASM performs the mirroring. But there are some important implications for administering diskgroups when the mirroring is done by ASM relating to failure scenarios:

  • Single disk failure in a single cell
  • Multiple disk failure in a single cell
  • Single cell failure
  • Overlapping disk failures in multiple cells
  • Pro-Active Disk Quarantine
  • Required mirror free space to accommodate failures

1. Loss of a single disk in a cell

Loss of a disk means that any griddisk on that physical device will go offline within the diskgroup containing that griddisk. ASM normally controls the disposition of the offline disk based on an the disk repair time attribute of the diskgroup, which defaults to 3.6 hours. After this time, the disk if not placed back online, will be dropped and the failgroup to which it belongs will get rebalanced. This involves copying all primary or secondary ASM allocation units from other failgroups, to the failgroup that has lost a disk and also re-balancing the content of the failgroup so that the asm disks have roughly the same fullness percentage.

The delay specified by disk repair time is an 11g ASM feature to help avoid a re-balance operation at disk drop and then another, once the disk is repaired and added back to the failgroup.

2. Multiple disk failures in a single cell

This is similar to the loss of a single disk, except that there are more allocation units to recover, and less space remaining in the failgroup in which to copy  the missing allocation units from other failgroups.

3. Single cell failure

In this case,  the failgroup is entirely off line, and the missing allocation units must be copied from other failgroups to space on surviving failgroups.

4. Overlapping disk failures in multiple cells

In this case,  it is possible that the second disk failure may occur before the allocation units lost in the first disk failure have been successfully copied back to the failgroup suffering the first disk failure. If any allocation unit lost due to the first disk failure has its secondary copy on the disk which failed in the second diskgroup, then the mirror copies will have both been lost, and data loss will have occurred. If the diskgroup uses high redundancy instead of normal redundancy, then three copies of each allocation unit will exist, and the chances of the loss of all copies of any one AU is reduced, at the cost of an extra 50% overhead in disk space.

5. Pro-Active Disk Quarantine

The disk repair time feature mentioned above was introduced as a diskgroup attribute in Oracle 11g to cut possible  re-balance overheads, but it increases the risk that all copies of certain allocation units may be  lost if overlapping disk failures on different cells occur. Delaying the disk drop, increases the chances of another disk failure occurring before the lost AUs are copied.

Patch 11.2.1.3.1 introduced  Pro-Active Disk Quarantine, which overrides the  disk repair time property of a diskgroup when the diskgroup is based on Exadata griddisks. This forces the griddisk to be dropped from the diskgroup immediately, so that copying and re-balancing can be performed as soon as possible.  This reduces the exposure time for another disk failure in another failgroup, to cause the loss of the second copy of an AU, whose first copy has not yet been remirrored.

This means that ASM diskgroups now react differently to the loss of an ASM disk, if Exadata is used, compared to the behaviour when the ASM disks are based on other storage technologies.

6. Required Mirror Free Space to Accommodate Disk Failures

As mentioned earlier, many ASM administrators have little or no experience with ASM mirroring, because they use External Redundancy for their diskgroups and ASM generally maintains one copy for each AU in this case. tracking free space within a diskgroup in such a case is simple. Use the following details from view V$ASM_DISKGROUP to examine the redundancy, offline state and free space requirements for mirror recovery:

  • TOTAL_MB indicates the total space in the diskgroup
  • FREE_MB indicates the total free space  in the diskgroup
  • REQUIRED_MIRROR_FREE_MB indicates free space required in the diskgroup to restore redundancy by copying allocation units as described above.
  • USABLE_FILE_MB indicates how much of the FREE_MB may be safely used whilst leaving enough free space for mirror copy recovery in the case of disk failure
  • TYPE indicates the redundancy attribute of the diskgroup.
  • OFFLINE_DISKS indicates how many disks are offline in the diskgroup

Note: the LSDG command in the ASMCMD utility will provide the same information.

The crucial aspect of administering the diskgroups when ASM mirrors the AUs, is having enough free space to re-copy the AUs when a loss occurs. This is not a concern when external redundancy is used, but it is for normal and high redundancy,  and requires that the ASM administrator be aware of the free space needs.

For the database machine, the implications depend on the level of redundancy and also on the type of rack. For example,  a quarter rack has only three cells and therefore three failgroups. The worst case scenario would be loss of an entire cell whereby the two surviving cells would require enough free space to recover all the missing AU copies from the lost cell in order to restore redundancy. On a half rack, which contains seven cells, the overhead of the required free space is spread across six surviving cells rather than just two, and on a full rack, which contains fourteen cells, there are thirteen survivors in the case of the loss of a complete cell and failgroup.

Most ASM administrators who have attended the Oracle Database Machine and Exadata seminar or course, have realised that they must review their existing ASM skills and add ASM redundancy management to their skillset.

Joel

03/2011

Posted in Oracle | 9 Comments »

Installing a Clean DB Machine

Posted by Joel Goodman on 25/02/2011

In past posts, I have written about some of the functionality of Exadata and recently, my colleague Uwe Hesse has added some very fine new posts about specific Exadata features. This post discusses the installation of the Oracle Database Machine at a customer location, as carried out by a combination of Oracle Sun field engineers and Oracle Advanced Customer Services (ACS) engineers.

The planning stages for installation of a Database machine involve the following requirements:

  • Environmental setup for space and flooring
  • Ventilation, humidity and temperature considerations
  • Power requirements
  • Network cabling
  • Network address assignments for the customer network

Much of this may be found in the Oracle Exadata Database Machine Owner’s Guide for 11g Release 2 which is made available to customers who order the machine.

The Oracle Sun field engineers are involved with the setup of the Database Machine at the customer site, and handle the following types of activities:

  • Connecting power to the racks
  • Cabling multiple full racks together or half and full racks on the same Infiniband fabric if required
  • Configuring the Infiniband switches
  • Configuring the Cisco Ethernet switch
  • Configuring the Keyboard, Video and Mouse (KVM) hardware if present
  • Verifying the configured network topology

Once  the Sun engineers are finished, then the ACS engineers configure the software on all the Database Servers and Storage Servers within the Database machine. In the original V1 Exadata days, each machine was configured for the customer network during the first boot of the machine. This first time boot prompted for the following:

  • Name server details
  • Time zone details
  • NTP server details
  • DNS details
  • Network Interface IP address assignment for management network on adaptor “eth0
  • Other network details for client network if one is used for accessing the database servers
  • Network details for the bonded Infiniband adaptor  “bond0” used for the storage network and cluster interconnect
  • Configuration details for the Integrated Lights Out Manager (ILOM)
    • Hostname
    • Network details
    • DNS details
    • NTP details
    • Timezone details

Once this had been completed, the servers could be connected to the customer network. Then the Grid Infrastructure software and database software were installed and configured, followed by the creation of a database on the Database machine.

This interactive method is no longer used, as ACS have developed automated techniques to facilitate installation and configuration. The process, which has evolved over time, begins with the filling in of a configuration spreadsheet in consultation with the customer. Details of this may be found in the Oracle Exadata Database Machine Configuration Worksheets for 11g Release 2 documentation. The ACS engineer fills in details of all the information listed above for network and ILOM using ranges of IP addresses, plus all the other details mentioned.

In addition, the following is done on the spreadsheet:

  • Selecting Full, Half or Quarter Rack to match hardware ordered
  • Selecting High Performance or High Capacity SAS disks to match hardware ordered
  • Entering database node and cell node name prefixes
  • Entering rack name prefix
  • Entering details for notification of alerts by the cells
    • SMTP details for email notification of alerts
    • SNMP Subscriber details for SNMP traps
  • Entering OS user details for the Grid Infrastructure software install
  • Entering OS user details for the Oracle Database software install
  • Entering the number, names and size of ASM diskgroups to be created
  • Entering details of databases to be created

Once the spreadsheet is complete, the config details may be generated as output lower down in the sheet for review. Then the data is stored in a set of files generated by some spreadsheet macros. These files are loaded onto the first database server by the ACS engineer at the customer site and used for configuring the machines in the rack.

The machines are shipped with OS images pre-installed with default IP addresses and subnet masks. This permits the ACS engineer to access the first database node using a PC connected to that network, via the ILOM console. Alternatively for X2-2 configurations, the KVM may be used. The spreadsheet config files are copied to the “/opt/oracle.Supporttools/onecommand” directory and used as input to automation scripts:

  • /opt/oracle.Supporttools/firstconf/applyconfig.sh updates the IP addresses for servers in rack and reboots them
  • The Ethernet cables from the built in switch are then connected for the management network to each server
  • The Ethernet cables from the built in switch are then connected for the ILOM cards on each server
  • If a client access network is used, then Ethernet cables from another corporate switch are connected to each server

Now that all the servers are on the required networks, the installation and configuration of the software and creation of the databases may be done. This uses /opt/oracle.Supporttools/onecommand/deploy<nnn>.sh script, where nnn is a number that is release specific and tailoring enhancements may be  made to the process. This command has many steps which are contained in a config file. The ACS engineer may edit files to tailor this generic solution to the customer requirements.

The Oracle Grid Infrastructure is installed with a silent install and configured automatically as is the Oracle Database software. Finally, if a database is created, then this too will be done silently.

My ACS Colleagues inform me that a full rack required about 5 hours to do this including time required for downloading any patches required in addition to the base software releases.

If you are planning on Exadata, then I hope this helps in appreciating the process involved.

Joel

2/2011

Posted in Oracle | 4 Comments »

 
Follow

Get every new post delivered to your Inbox.

Join 53 other followers