SCAN you LISTEN to this?
Posted by Joel Goodman on 18/03/2010
My good friend and colleague Harald van Breederode recently posted an article on How to Set Up a Private DNS for Your Virtual Cluster, in which he discussed DNS and the steps he took to create both master and slave DNS servers on his virtual RAC cluster. The article then showed how to statically define IP addresses in the DNS configuration files and how to generate the configuration and zone files. One of the “hostnames” listed had three IP addresses: that of the “Single Client Access Name”, or SCAN as it is known.
I taught the 11gR2 Grid Infrastructure and RAC course for the first time in EMEA last week, and many questions arose about SCANs and SCAN vips, as many of the delegates on the course were experienced DBAs who had been using either Oracle 10g or 11g Clusterware and RAC. Some of the questions were conceptual, and some interesting discussions arose about the SCAN vips and SCAN listeners in contrast to vips and listeners in 10g and 11g, so I thought it worth sharing.
Architecture for vips and listeners in 10g and 11gR1
In Oracle 10g and 11gR1 each node in the cluster has a Virtual IP Address (VIP) which is activated on the public adaptor and is used by the listener on that node. The vip is also a resource registered in the Oracle Cluster Registry (OCR) file, and the listener usually runs from the ASM home if ASM is used, although this can be changed with SRVCTL.
Here is some output from the LSNRCTL utility:
LSNRCTL for Linux: Version 10.2.0.4.0 – Production on 30-SEP-
Copyright (c) 1991, 2008, Oracle. All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.226.201)(PORT=1521))
STATUS of the LISTENER
Version TNSLSNR for Linux: Version 10.2.0.4.0 – Production
Start Date 30-SEP-2009 14:42:04
Uptime 0 days 0 hr. 55 min. 11 sec
Trace Level off
Security ON: Local OS Authentication
Listener Parameter File /u01/app/10.2.0/asm_1/network/admin/listener.ora
Listener Log File /u01/app/10.2.0/network/log/lsnr.log
Listening Endpoints Summary…
Service “+ASM” has 1 instance(s).
Instance “+ASM1”, status READY, has 1 handler(s) for this service…
Service “ora10g.mynode.com” has 1 instance(s).
Instance “ora10g1”, status READY, has 1 handler(s) for this service…
Service “ora10gXDB.mynode.com” has 1 instance(s).
Instance “ora10g1”, status READY, has 1 handler(s) for this service…
The command completed successfully
10g and 11gR1 VIP Usage and Implications
In the output above, IP address 192.168.226.201 is a vip normally used by the listener on this node, say node 1. But if node 1 fails, then the vip is failed over to a surviving cluster node, which activates the vip. If adaptor ETH0 is the public interface on the surviving node, say node 3, then ETH0:1 will be the virtual adaptor for the vip used by the listener on node 3 and ETH0:2 will be the failed-over vip from node 1.
This failover of vips facilitates connect-time failover: a client or middle tier trying to connect to that listener using a load-balanced connection will get an error returned from the node to which the vip failed over, as no listener is using that vip. In my example node 3 has ETH0:2 for IP address 192.168.226.201, but no listener is listening on that address on node 3, so an error is returned to the client, which then tries another listener address at random from the tnsnames.ora entry.
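As an illustration, a pre-11gR2 client-side tnsnames.ora entry for such load-balanced connections might look like the following sketch (the alias, host names and service name are hypothetical; note that only vip names appear, never physical host addresses):

```
ORA10G =
  (DESCRIPTION =
    (LOAD_BALANCE = ON)
    (FAILOVER = ON)
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip.mynode.com)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip.mynode.com)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node3-vip.mynode.com)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = ora10g.mynode.com))
  )
```

With LOAD_BALANCE = ON the client picks an address at random, and with FAILOVER = ON the error returned from a failed-over vip simply causes the next address in the list to be tried.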
Once the failed node 1 is restarted, the vip is deactivated on node 3 (so adaptor ETH0:2 is no longer active) and reactivated on node 1 as adaptor ETH0:1, and a listener on node 1 once again listens on that vip.
So a vip is used differently when failed over than it is when on its normal “home” node.
Note: This is true for Database vips. Application vips behave the same on any node where they are activated but that is beyond the scope of this discussion.
10g and 11gR1 Listener Usage and Implications
RAC Listeners from Oracle 8i onwards have had several jobs to do in supporting connection requests made by clients and middle tiers to RAC Databases:
1. Managing incoming connection requests, including normal and failover connection situations
Connection requests may arrive at any time but can peak if many login requests occur simultaneously. This happens at instance failure, for example, if Transparent Application Failover (TAF) is used with the BASIC method rather than pre-connected sessions. At such a time the listeners must handle many requests, depending on how many connections existed to the failed instance and how many surviving nodes and listeners exist. For each connection request the listener must perform a load balance decision (see 2 below) and then either spawn and bequeath or redirect the request (see 3 and 4 below). Occasionally listeners may reject requests if too many are queued; this may be seen on a per-service basis using the “lsnrctl services” command, and it is why the TAF parameters RETRIES and DELAY exist. The implication is delayed reconnection, which affects availability for the user.
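To illustrate the TAF parameters just mentioned, here is a hypothetical connect descriptor (names and values invented) using the BASIC method; with RETRIES = 30 and DELAY = 5 the client retries the failover connection every five seconds, up to thirty times:

```
ORA10G_TAF =
  (DESCRIPTION =
    (LOAD_BALANCE = ON)
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip.mynode.com)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip.mynode.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = ora10g.mynode.com)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 30)
        (DELAY = 5)
      )
    )
  )
```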
2. Making Load Balancing decisions
The Listener makes “Connection Time” load balancing decisions as to which instance will be used for the connection. This decision is made by whichever listener the client or middle tier connects to as all the listeners on all the nodes should be referenced in the REMOTE_LISTENER parameter of each database instance. Since Oracle 10g, the load balancing decision may be made on a “per service” basis as each service can choose from amongst different load balancing methods:
- Session count for long duration sessions
- Node run queue length for short sessions with no metrics
- Service metrics based on either ELAPSED TIME PER CALL or CPU TIME PER CALL
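These per-service choices map onto the service goal attributes, which can be set through the DBMS_SERVICE package. A sketch, using an invented service name: CLB_GOAL_LONG selects session-count balancing for long sessions, while CLB_GOAL_SHORT combined with a runtime goal selects metric-based balancing for short sessions.

```
-- Balance on session count, for long-lived sessions:
BEGIN
  DBMS_SERVICE.MODIFY_SERVICE('oltp.mynode.com',
                              clb_goal => DBMS_SERVICE.CLB_GOAL_LONG);
END;
/

-- Balance short sessions using the ELAPSED TIME PER CALL service metric:
BEGIN
  DBMS_SERVICE.MODIFY_SERVICE('oltp.mynode.com',
                              goal     => DBMS_SERVICE.GOAL_SERVICE_TIME,
                              clb_goal => DBMS_SERVICE.CLB_GOAL_SHORT);
END;
/
```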
3. Performing a “Spawn and Bequeath” request if the connection is to be made to an instance on the same cluster node
If the listener decides to connect the client to an instance on the same node, due to a load balancing decision or because the only instance accepting logins for the service is on the same node, then the traditional spawn and bequeath method may be used. The listener spawns an oracle process from the Oracle home associated with the instance hosting the service and bequeaths the transport connection used between the client and the listener over to the new process, which then communicates with the client.
4. Performing a “Redirect” request if the connection is to be made to an instance on another cluster node
Since Spawn and Bequeath is not possible for cases where the connection is made to an instance on a remote node, the listener making the load balancing decision must employ the help of the listener on the node hosting the chosen instance. The listener first contacted by the client returns the address of the listener on the chosen node to the client, which in turn contacts that listener, specifying that it must connect to the specific instance on that node.
The behaviour of the vips and the listeners prior to 11gR2 has the following implications:
- Database vips do not behave as normal vips: they fail over only to prevent the TCP timeout delay, but are not used in the same way when failed over as they are on their home node.
- The listeners perform many different functions, and when a “login storm” occurs after a failure there may be delays and retries because the same listeners are both making load balancing decisions and performing spawn and bequeath connection processing.
- The code path and time required to establish a connection to an instance varies depending on whether the chosen instance for the service is hosted on the same node as the chosen listener.
Enter SCAN VIPS and SCANs in 11gR2
From 11gR2 the behaviour of the vips and the listeners is modified with the introduction of SCAN Technology. This includes:
- Single Client Access Name
- SCAN VIPS
- SCAN Listeners
- Node or Local Listeners
Part of this architecture follows on from the implications listed above. Essentially there are now two layers of listeners:
- Up to three SCAN listeners – which manage the original client connection requests and make load balancing decisions. Client connect requests always go to the SCAN listeners (except for the upgrade situation mentioned below). Much has already been written about the SCAN addresses, but to summarise there can be:
- A single SCAN address, vip resource and listener, if the server-side hosts file is used to resolve the SCAN name; but in this case the SCAN listener is not highly available if the hosting node fails or the listener goes down.
- Three SCAN addresses statically defined in a DNS configuration with the same SCAN name as was demonstrated in Harald’s article. Note: It is also possible to have the three static SCAN addresses managed in the corporate DNS and not in a qualified subdomain or zone. This will work perfectly for rather static clusters and is probably a good choice for many customers.
- Three SCAN addresses automatically acquired from DHCP and managed in the clusterware using GNS and mDNS. This requires a delegated subdomain in the corporate DNS server configuration, as the IP addresses and host names will not be known outside the cluster until resolved at least once and cached in the DNS servers for their TTL duration. This choice facilitates Grid Plug and Play (GPnP), whereby new nodes may be added to the cluster without requiring any changes to the corporate DNS configuration: DHCP assigns the IP addresses, and the host name to IP address mapping is updated dynamically using the mDNS protocol so that GNS can resolve the new host names properly. It requires only a one-off setup of the delegated subdomain, and of DHCP, in the corporate DNS configuration. Note that this requires the “Advanced” configuration to be chosen when installing the Grid Infrastructure.
- Each SCAN listener has a SCAN vip, and both are so-called “CLUSTER RESOURCES” in the OCR, which may be displayed using the CRSCTL command.
- Three SCAN vips and SCAN listeners were chosen as the optimal number, to ensure that at least two remain active to handle login storms if a node hosting a SCAN fails.
- Node listeners on each node, which connect clients to the chosen instance – these listeners are not normally contacted by clients at the initial connection request and are not referenced by the REMOTE_LISTENER parameter, except for RAC databases upgraded from prior releases. Each local listener also has a vip, and these are examples of “LOCAL RESOURCES” in the OCR.
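For the statically defined DNS option above, the SCAN name simply carries three A records, in the style of Harald's article. A hypothetical zone-file fragment (name and addresses invented):

```
; one SCAN name, three addresses: the DNS server rotates the order
; of the returned records, giving a round robin across the SCAN vips
cluster01-scan   IN A   192.168.226.221
cluster01-scan   IN A   192.168.226.222
cluster01-scan   IN A   192.168.226.223
```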
The behaviour of these components differs from the description of listeners and vips prior to 11gR2.
- Clients always connect to SCAN listeners (except for the upgrade situation mentioned above), and the DNS resolution achieves a round-robin, client-side load balance. This relieves the client from requiring a tnsnames.ora entry with an address list for the listeners and permits load balancing using simple connect strings, such as “EZCONNECT”-style or Java-style syntax.
- SCAN Listeners always use the REDIRECT method for passing the connection request on to a Node listener even if the Node Listener is on the same node as the SCAN Listener. This has the effect of evening out the code path for connection requests.
- If a node fails, then the SCAN vip fails over to another node as in the pre-11gR2 case, but the SCAN listener fails over as well. This means that the SCAN vip behaves the same on all nodes, i.e. it always has a listener listening on it, in marked contrast to the case prior to 11gR2.
- Node Listeners only connect clients or middle tiers to the instance hosting the service chosen by the SCAN listener performing load balancing. The separation of load balancing and initial connection handling from the “spawn and bequeath” processing means that SCAN listeners have less to do when a login storm occurs since all they do is send out redirect packets. By having two layers of listeners, the system can be more responsive than with one layer of listeners by effectively parallelising all the connection requests.
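For example, the simple connect strings mentioned above need only the SCAN name and a service name (both invented here); no client-side address list is required in either style:

```
EZCONNECT style:
  sqlplus scott/tiger@cluster01-scan.example.com:1521/myservice.example.com

Java thin-driver style:
  jdbc:oracle:thin:@//cluster01-scan.example.com:1521/myservice.example.com
```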
A final word about “CLUSTER RESOURCES” and “LOCAL RESOURCES”.
When running “crsctl stat res -t”, the output is divided into these two categories.
- CLUSTER RESOURCES are those which can run on any node of a cluster and may fail over. Examples are SCAN vips, SCAN Listeners, GNS vip, GNS, Database instances, Database services and more. There is no requirement for multiple occurrences of a resource for it to be considered a CLUSTER resource. For example there is only ever one GNS and one GNS Vip but they both can be anywhere on the cluster. Or there are three SCAN vips and SCAN Listeners regardless of the number of cluster nodes and they may run anywhere. Or the database instances for a database may run on any servers in the server pool to which it is assigned.
- LOCAL RESOURCES are those which either run on a specific node or do not run at all. Examples are local listeners, which either run or don’t run, but which don’t fail over as SCAN listeners do; or ASM instances, which run one per node, simply don’t run if the node is down, and do not fail over. A local resource may appear on only one node when another node is down or when the resource on another node is down.
So both Local and Cluster resources may run on multiple nodes, while some may run on only one node.
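To see this division on a running 11gR2 cluster, the resource and SCAN configuration can be inspected with commands such as these (output omitted here):

```
crsctl stat res -t            # resources grouped under Local Resources and Cluster Resources
srvctl config scan            # the SCAN name and its (up to) three SCAN vips
srvctl config scan_listener   # the SCAN listeners and their ports
srvctl status scan_listener   # which node each SCAN listener is currently running on
```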
We certainly had fun last week discussing and debating the SCAN Architecture and the changes from previous releases and it is hoped that readers will enjoy some of these observations.