TLS, RAC and externally signed certificates

Sven Illert -

Recently I was working with a customer on a problem that once again was solved with an easy fix but which took quite some time to be found. So first let’s describe the situation: The environment consists of a Real Applciation Cluster with two nodes and several container databases. The software (all latest 19c) is set up with no role separation and a simple basic network setup consisting of 2 node ip addresses, 2 node vips and 3 SCAN ip addreses resolving to one SCAN name. A pretty basic setup and well known with many installations.

Customers experience

The customer wanted to setup TLS by himself following existing how-to’s created from Oracle and the community. After setup for some reason they got an error (listener could not find available handler with matching protocol stack) when trying to access a database using the SCAN listener via TCPS endpoint and their supporter couldn’t find a solution within three weeks. To check nothing was overseen the whole configuration was overthrown and I held a 2 day workshop with the customer about TLS configuration in Oracle environments.

Certificates are managed centrally with a self service support portal which is common in larger companies and the internal root CA is deployed to all systems including the RAC servers. So the customer requested a certificate for the SCAN and added any hostnames and ip addresses as subject alternative names. This should avoid any DN matching problems later. So we went through basic setup of a wallet in one location (WALLET_LOCATION=…) and disabled mTLS (SSL_CLIENT_AUTHENTICATION=FALSE) in the sqlnet.ora and listener.ora of the GI home and in the sqlnet.ora of the database home. We restarted every component and were able to connect using the local listener. When we tried to connect using the SCAN we got the same issue mentioned above and then we fiddled around with various combinations of different locations and combinations. But with no success.

One thing to mention that becomes relevant later again is that Oracle seems to have backported the feature for validation of a server certificate at client side against a systems trusted root CAs installed at the OS level to 19c. We were able to connect to the local listener from a separate client machine with the 19c client including latest patches without creating a wallet. That at least was a nice finding and as noted in MOS note 2889040.1 it was already added in 19.14 which I totally missed.

Lab setup

So I went to my company’s lab and tried to investigate the behaviour on my own. I built up a CA on a linux system and registered it in the trust store in the OSes involved. Did I already mention that openssl and certificate management is really a mess? Anyways … I set up the test environment as close to the customers system as possible (just one IP for the SCAN) and configured a wallet as follows.

% WRL=/u01/app/oracle/admin/wallet
% orapki wallet create -wallet $WRL -auto_login
% orapki wallet import_pkcs12 -wallet $WRL -pkcs12file vm100.p12
% orapki wallet display -wallet $WRL
Oracle PKI Tool Release 19.0.0.0.0 - Production
Version 19.4.0.0.0
Copyright (c) 2004, 2024, Oracle and/or its affiliates. All rights reserved.

Requested Certificates: 
User Certificates:
Subject:        C=DE,CN=vm100.support.robotron.de
Trusted Certificates: 

With the file vm100.p12 being the CA signed certificate including private key and the root CA installed at OS level I thought I was ready to go. So I configured the GI and database home with both sqlnet.ora and the responding listener.ora to use the wallet.

SSL_CLIENT_AUTHENTICATION=FALSE
 
WALLET_LOCATION=
        (SOURCE=
                (METHOD=file)
                (METHOD_DATA=(DIRECTORY=/u01/app/oracle/admin/wallet))
        )

The outcome of this configuration is that from a separate machine the 19c SQL*Plus was able to connect to the local listener (vm103) running on the virtual IP address but not to the SCAN (vm100).

% sqlplus system@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=SI02V100.support.robotron.de))(ADDRESS=(PROTOCOL=tcps)(HOST=vm103.support.robotron.de)(PORT=1522)))"
SQL*Plus: Release 19.0.0.0.0 - Production on Fri Nov 1 19:51:53 2024
Version 19.24.0.0.0
SQL> exit
% sqlplus system@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=SI02V100.support.robotron.de))(ADDRESS=(PROTOCOL=tcps)(HOST=vm100.support.robotron.de)(PORT=1522)))"
SQL*Plus: Release 19.0.0.0.0 - Production on Fri Nov 1 19:53:13 2024
Version 19.24.0.0.0
ERROR:
ORA-12516: TNS:listener could not find available handler with matching protocol stack

This was a quite weird situation. The trust chain seems to be validated by SQL*Plus somehow as we can see from the first try with the local listener but something seems to be wrong with the SCAN. So I checked services and saw that the endpoints for the TCPS protocol were in state blocked.

% lsnrctl services LISTENER_SCAN1
Service "SI02V100.support.robotron.de" has 2 instance(s).
  Instance "SI02V1001", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:1 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.16.36.103)(PORT=1521)))
      "DEDICATED" established:0 refused:0 state:blocked
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=10.16.36.103)(PORT=1522)))
  Instance "SI02V1002", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:0 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.16.36.104)(PORT=1521)))
      "DEDICATED" established:0 refused:0 state:blocked
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=10.16.36.104)(PORT=1522)))

Any research regarding this error pattern went into a dead end. I played around with all the suggestions regarding local_listener and remote_listener instance parameters and tried several wallet locations like /ect/ORACLE/WALLETS/oracle or the instance parameter variant wallet_location. Nothing helped. I also set up the wallet various times as described in any how-to I could find in MOS or the internet.

There seemed to be a problem at the point where the scan listener wants to hand over the connection to the local listener. Since the certificate is used at any involved endpoints and the CA is registered in the OS and Oracle 19c uses the systems PKI for validation that should work, right? Right?

Wrong. The assumption that any component would validate against an operating systems CA truststore wasn’t backed by the documentation for the 23ai, since it mentions only client validation. So is the SCAN listener a client to the local listener or other way around? Doesn’t seem so. Additionally I dug into the documentation of the orapki utility and read about trust flags. They describe how validation works with Oracle. These flags typically are set automatically, but you may change your wallet to have fine grained control to them. But this is not what I wanted, I just wanted to understand. So there are basically 3 flags of interest in this case.

Since I and the customer only imported a signed certificate the wallet contained one element with the flag NULL. So the communication between the listeners wasn’t trustworthy at all, as they didn’t seem to have a flagged element that would validate (SERVER_AUTH or VALID_PEER) and didn’t validate against system CAs in the end.

The solution

So how would we fix that? We need to add an element to the wallet that would establish trust for the leaf certificate that’s used by the database and the listeners. This of course meant to add the CA to the wallet.

% orapki wallet add -wallet $WRL -trusted_cert -cert vm100-RootCA.crt
% orapki wallet display -wallet $WRL 
Oracle PKI Tool Release 19.0.0.0.0 - Production
Version 19.4.0.0.0
Copyright (c) 2004, 2024, Oracle and/or its affiliates. All rights reserved.

Requested Certificates: 
User Certificates:
Subject:        C=DE,CN=vm100.support.robotron.de
Trusted Certificates: 
Subject:        CN=RootCA,OU=Technologie & Services,O=Robotron,L=Dresden,ST=Saxony,C=DE

We can see that the connection now works for the SCAN too, after a restart of all components.

$ sqlplus system@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=SI02V100.support.robotron.de))(ADDRESS=(PROTOCOL=tcps)(HOST=vm100.support.robotron.de)(PORT=1522)))"
SQL*Plus: Release 19.0.0.0.0 - Production on Fri Nov 1 20:37:45 2024
Version 19.24.0.0.0
SQL> 

Conclusion

So what’s to learn about in the end? Above is a combination of situations that made things a little bit more complicated for debugging as usual.

I hope this helps people who might run into the same problem.