SSLException: closing inbound before receiving peer's close_notify #181

Closed
AntonSmolkov opened this issue May 18, 2021 · 7 comments · Fixed by #274
Labels
bug Something isn't working

Comments

@AntonSmolkov

Greetings! I've managed to install piraeus-operator with SSL enabled for all components.
Everything seems to work fine, but when I run kubectl linstor error-reports list I see this:
[image: screenshot of the error-reports list]
A new report appears almost every second.

One such error report:

ERROR REPORT 60A3B5F0-236CD-000285

============================================================

Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.12.3
Build ID: d4e2cbfcb3819600208b3e4849e9efa6ddb50a52
Build time: 2021-05-07T06:20:48+00:00
Error time: 2021-05-18 13:28:59
Node: okd-sds-hcqw8-worker-northeurope1-new-lbm46
Peer: 10.0.32.10:48378

============================================================

Reported error:

Category: Exception
Class name: SSLException
Class canonical name: javax.net.ssl.SSLException
Generated at: Method 'createSSLException', Source file 'Alert.java', Line #133

Error message: closing inbound before receiving peer's close_notify

Error context:
I/O exception while attempting to receive data from the peer

Call backtrace:

Method                                   Native Class:Line number
createSSLException                       N      sun.security.ssl.Alert:133
createSSLException                       N      sun.security.ssl.Alert:117
fatal                                    N      sun.security.ssl.TransportContext:336
fatal                                    N      sun.security.ssl.TransportContext:292
fatal                                    N      sun.security.ssl.TransportContext:283
closeInbound                             N      sun.security.ssl.SSLEngineImpl:733
doHandshake                              N      com.linbit.linstor.netcom.ssl.SslTcpConnectorHandshaker:118
read                                     N      com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:162
run                                      N      com.linbit.linstor.netcom.TcpConnectorService:543
run                                      N      java.lang.Thread:829

END OF ERROR REPORT.

Questions:
Is there any way to fix it?
Can it consume all of my node's free space?
I have no idea where these logs are stored. I found the file /var/log/linstor-controller/error-report.mv.db and suspect that is it.
It is 16 MiB in size and doesn't grow (rotation?).

Info:
Operator Version: 1.5.0
Environment: OKD 4.6, FCOS 33

@WanzenBug added the bug label on May 18, 2021
@WanzenBug
Member

Hi!

I believe this is caused by a bad interaction between the LINSTOR Satellite's SSL connection and the container's liveness probe. It looks like the Java 11 SSL implementation throws this exception rather often when a client just opens a connection and closes it again.

I'm not really sure what can be done about that. You can open an issue here; I guess this is something that needs to be fixed upstream.

> I have no idea where this logs are stored, i found /var/log/linstor-controller/error-report.mv.db file and suspect it.

You probably need to look at /var/log/linstor-satellite in one of the satellite containers for that.
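
For reference, the open-and-close pattern described above can be reproduced by hand. The following is only a sketch: `tcp_probe` is a hypothetical helper name, it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command, and it assumes that a Kubernetes tcpSocket probe behaves the same way (connect, then close without sending any TLS data).

```shell
#!/usr/bin/env bash
# tcp_probe HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: open a plain TCP connection and close it at once,
# without ever starting a TLS handshake.
tcp_probe() {
  local host=$1 port=$2 secs=${3:-5}
  # /dev/tcp/HOST/PORT is a bash redirection feature, not a real file.
  # The inner shell exits right after connecting, which closes the socket.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

The function returns 0 when the port accepted the connection and non-zero otherwise. If the explanation above is right, calling it against port 3367 of a satellite with SSL enabled should show up as one more entry in the error report list.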

@AntonSmolkov
Author

Thanks for the fast response, @WanzenBug.
I see only a readiness probe on the satellite container, and it's plain TCP.

        name: linstor-satellite
        ports:
        - containerPort: 3367
          hostPort: 3367
          protocol: TCP
        readinessProbe:
          failureThreshold: 10
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 3367
          timeoutSeconds: 5

Maybe this is the reason for the Java 11 behaviour, and if we change the probe to HTTPS everything will be fine?
Something like:

        name: linstor-satellite
        ports:
        - containerPort: 3367
          hostPort: 3367
          protocol: TCP
        readinessProbe:
          failureThreshold: 10
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          httpGet:
            path: /
            port: 3367
            scheme: HTTPS

@WanzenBug
Member

Unfortunately this will not work, as the satellites do not use HTTP(S) as their transport protocol but rather a ProtoBuf-based RPC. So the TCP probe is the best thing we've got :-/

@Boca13

Boca13 commented May 25, 2021

I was seeing the same messages, and I also needed a livenessProbe so that satellites can restart and reconnect automatically.

For now, and until we find a better solution, I'm testing this sketchy probe:

livenessProbe:
  exec:
    command:
    - sh
    - "-c"
    - "linstor node list | grep `hostname` | grep -v OFFLINE || (linstor node restore `hostname`; exit 1)"
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 10

I don't like it, but so far it has proven very useful, because it detects when the satellite has crashed and is not connected to the controller, even if it is still accepting connections (as in my case). It also tries a node restore if the node is offline, since sometimes the controller doesn't try to reconnect automatically. If the livenessProbe fails (and the restore doesn't help) and the container is restarted, the node restore command will make the controller connect to the now-working satellite.

It's not too elegant, but it solved many of my problems. I hope it helps.
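
The check half of that probe can be separated from the `node restore` side effect. Below is a rough sketch, not the probe actually deployed: `node_online` is a hypothetical helper, and it assumes that `linstor node list` prints one row per node with the node name as the second whitespace-separated token (right after the leading border character) and that every unhealthy state of interest contains the word OFFLINE. Unlike the plain grep, it compares the node name exactly, so a host named node1 does not match a row for node10.

```shell
#!/bin/sh
# node_online NAME
# Hypothetical helper: reads `linstor node list` output on stdin and
# succeeds only if a row for exactly NAME exists and is not OFFLINE.
node_online() {
  awk -v name="$1" '
    # $1 is the table border, $2 is assumed to be the node name.
    $2 == name { found = 1; if ($0 ~ /OFFLINE/) offline = 1 }
    END { if (found && !offline) exit 0; exit 1 }
  '
}
```

Inside the probe it would be used as `linstor node list | node_online "$(hostname)"`, with the restore-and-fail branch from the original command kept after `||`.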

@AntonSmolkov
Author

Thank you, @Boca13.
This is apparently also a nice workaround for LINBIT/linstor-server#187.
@kvaps, take a look.

@kvaps
Member

kvaps commented Aug 9, 2021

@AntonSmolkov thanks for the pointer, I just implemented this workaround in kube-linstor v0.14.0:
kvaps/kube-linstor@3cec5ad

@kvaps
Member

kvaps commented Sep 14, 2021

I just found that this workaround is putting my nodes into the OFFLINE(OTHER_CONTROLLER) state for some reason:

┊ m13c42 ┊ SATELLITE ┊ 10.36.130.132:3367 (SSL) ┊ OFFLINE(OTHER_CONTROLLER) ┊
┊ m13c43 ┊ SATELLITE ┊ 10.36.130.133:3367 (SSL) ┊ OFFLINE(OTHER_CONTROLLER) ┊
