TLS, RAC and externally signed certificates
Sven Illert -
Recently I was working with a customer on a problem that once again was solved with an easy fix but which took quite some time to be found. So first let’s describe the situation: The environment consists of a Real Applciation Cluster with two nodes and several container databases. The software (all latest 19c) is set up with no role separation and a simple basic network setup consisting of 2 node ip addresses, 2 node vips and 3 SCAN ip addreses resolving to one SCAN name. A pretty basic setup and well known with many installations.
Customers experience
The customer wanted to setup TLS by himself following existing how-to’s created from Oracle and the community. After setup for some reason they got an error (listener could not find available handler with matching protocol stack) when trying to access a database using the SCAN listener via TCPS endpoint and their supporter couldn’t find a solution within three weeks. To check nothing was overseen the whole configuration was overthrown and I held a 2 day workshop with the customer about TLS configuration in Oracle environments.
Certificates are managed centrally with a self service support portal which is common in larger companies and the internal root CA is deployed to all systems including the RAC servers. So the customer requested a certificate for the SCAN and added any hostnames and ip addresses as subject alternative names. This should avoid any DN matching problems later. So we went through basic setup of a wallet in one location (WALLET_LOCATION=…
) and disabled mTLS (SSL_CLIENT_AUTHENTICATION=FALSE
) in the sqlnet.ora
and listener.ora
of the GI home and in the sqlnet.ora
of the database home. We restarted every component and were able to connect using the local listener. When we tried to connect using the SCAN we got the same issue mentioned above and then we fiddled around with various combinations of different locations and combinations. But with no success.
One thing to mention that becomes relevant later again is that Oracle seems to have backported the feature for validation of a server certificate at client side against a systems trusted root CAs installed at the OS level to 19c. We were able to connect to the local listener from a separate client machine with the 19c client including latest patches without creating a wallet. That at least was a nice finding and as noted in MOS note 2889040.1 it was already added in 19.14 which I totally missed.
Lab setup
So I went to my company’s lab and tried to investigate the behaviour on my own. I built up a CA on a linux system and registered it in the trust store in the OSes involved. Did I already mention that openssl and certificate management is really a mess? Anyways … I set up the test environment as close to the customers system as possible (just one IP for the SCAN) and configured a wallet as follows.
% WRL=/u01/app/oracle/admin/wallet
% orapki wallet create -wallet $WRL -auto_login
% orapki wallet import_pkcs12 -wallet $WRL -pkcs12file vm100.p12
% orapki wallet display -wallet $WRL
Oracle PKI Tool Release 19.0.0.0.0 - Production
Version 19.4.0.0.0
Copyright (c) 2004, 2024, Oracle and/or its affiliates. All rights reserved.
Requested Certificates:
User Certificates:
Subject: C=DE,CN=vm100.support.robotron.de
Trusted Certificates:
With the file vm100.p12
being the CA signed certificate including private key and the root CA installed at OS level I thought I was ready to go. So I configured the GI and database home with both sqlnet.ora
and the responding listener.ora
to use the wallet.
SSL_CLIENT_AUTHENTICATION=FALSE
WALLET_LOCATION=
(SOURCE=
(METHOD=file)
(METHOD_DATA=(DIRECTORY=/u01/app/oracle/admin/wallet))
)
The outcome of this configuration is that from a separate machine the 19c SQL*Plus
was able to connect to the local listener (vm103) running on the virtual IP address but not to the SCAN (vm100).
% sqlplus system@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=SI02V100.support.robotron.de))(ADDRESS=(PROTOCOL=tcps)(HOST=vm103.support.robotron.de)(PORT=1522)))"
SQL*Plus: Release 19.0.0.0.0 - Production on Fri Nov 1 19:51:53 2024
Version 19.24.0.0.0
…
SQL> exit
% sqlplus system@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=SI02V100.support.robotron.de))(ADDRESS=(PROTOCOL=tcps)(HOST=vm100.support.robotron.de)(PORT=1522)))"
SQL*Plus: Release 19.0.0.0.0 - Production on Fri Nov 1 19:53:13 2024
Version 19.24.0.0.0
…
ERROR:
ORA-12516: TNS:listener could not find available handler with matching protocol stack
This was a quite weird situation. The trust chain seems to be validated by SQL*Plus
somehow as we can see from the first try with the local listener but something seems to be wrong with the SCAN. So I checked services and saw that the endpoints for the TCPS protocol were in state blocked.
% lsnrctl services LISTENER_SCAN1
…
Service "SI02V100.support.robotron.de" has 2 instance(s).
Instance "SI02V1001", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:1 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.16.36.103)(PORT=1521)))
"DEDICATED" established:0 refused:0 state:blocked
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=10.16.36.103)(PORT=1522)))
Instance "SI02V1002", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.16.36.104)(PORT=1521)))
"DEDICATED" established:0 refused:0 state:blocked
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=10.16.36.104)(PORT=1522)))
…
Any research regarding this error pattern went into a dead end. I played around with all the suggestions regarding local_listener
and remote_listener
instance parameters and tried several wallet locations like /ect/ORACLE/WALLETS/oracle
or the instance parameter variant wallet_location
. Nothing helped. I also set up the wallet various times as described in any how-to I could find in MOS or the internet.
There seemed to be a problem at the point where the scan listener wants to hand over the connection to the local listener. Since the certificate is used at any involved endpoints and the CA is registered in the OS and Oracle 19c uses the systems PKI for validation that should work, right? Right?
Wrong. The assumption that any component would validate against an operating systems CA truststore wasn’t backed by the documentation for the 23ai, since it mentions only client validation. So is the SCAN listener a client to the local listener or other way around? Doesn’t seem so. Additionally I dug into the documentation of the orapki
utility and read about trust flags. They describe how validation works with Oracle. These flags typically are set automatically, but you may change your wallet to have fine grained control to them. But this is not what I wanted, I just wanted to understand. So there are basically 3 flags of interest in this case.
SERVER_AUTH
- This is assigned to trusted root CAs and during validation of the chain if this flag is reached, it is passed.VALID_PEER
- This is assigned to self-signed certificates and also means a pass during validation of the chain. For self-signed certificates the amount of elements in the chain is obviously 1.NULL
- This is assigned to intermediate CAs and leaf certificates. When the last element in the chain validation has this flag, the validation fails.
Since I and the customer only imported a signed certificate the wallet contained one element with the flag NULL
. So the communication between the listeners wasn’t trustworthy at all, as they didn’t seem to have a flagged element that would validate (SERVER_AUTH
or VALID_PEER
) and didn’t validate against system CAs in the end.
The solution
So how would we fix that? We need to add an element to the wallet that would establish trust for the leaf certificate that’s used by the database and the listeners. This of course meant to add the CA to the wallet.
% orapki wallet add -wallet $WRL -trusted_cert -cert vm100-RootCA.crt
% orapki wallet display -wallet $WRL
Oracle PKI Tool Release 19.0.0.0.0 - Production
Version 19.4.0.0.0
Copyright (c) 2004, 2024, Oracle and/or its affiliates. All rights reserved.
Requested Certificates:
User Certificates:
Subject: C=DE,CN=vm100.support.robotron.de
Trusted Certificates:
Subject: CN=RootCA,OU=Technologie & Services,O=Robotron,L=Dresden,ST=Saxony,C=DE
We can see that the connection now works for the SCAN too, after a restart of all components.
$ sqlplus system@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=SI02V100.support.robotron.de))(ADDRESS=(PROTOCOL=tcps)(HOST=vm100.support.robotron.de)(PORT=1522)))"
SQL*Plus: Release 19.0.0.0.0 - Production on Fri Nov 1 20:37:45 2024
Version 19.24.0.0.0
…
SQL>
Conclusion
So what’s to learn about in the end? Above is a combination of situations that made things a little bit more complicated for debugging as usual.
- A RAC involves more than one listener and they don’t seem to trust each other even if they are using the same wallet. Single system setups wouldn’t have failed.
- Enforcing trust with a company wide CA seems to make wallet and pki layout a little bit more complex.
- The backport of features is a nice thing, but be aware of any implications you make that might not be documented anywhere.
- Tracing the listener communication that leads to the blocked state wasn’t obviously possible. Any hints?
I hope this helps people who might run into the same problem.