Prometheus failing to reload Probe TLS cert and key from disk #598

lpetrazickisupgrade · 2024-03-06T17:25:20Z

I'm running Prometheus Operator 0.71.2 with Prometheus 2.49.1 on EKS

I have metric endpoints protected by TLS cert and key. Teleport Tbot rotates the cert and key every n hours and writes them to a secret. There's a Probe resource that refers to that secret. Prometheus Operator loads the Probe into a Prometheus instance and rewrites the secret for that instance. Prometheus uses the rewritten secret to access the endpoint

What I'm seeing is that:

Prometheus fails to reload the cert and key and hits a 403 Forbidden for either a couple hours or indefinitely after a cert rotation
Triggering a config reload does not reload the cert and key
Sending a SIGHUP to the Prometheus process does not reload the cert and key
Sending a SIGTERM to the Prometheus process does reload the cert and key by restarting that pod

The secrets look up to date on the Prometheus pod filesystem during the issue

Probe definition:

apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: probe-foo
  namespace: monitoring
spec:
  interval: 30s
  jobName: probe-foo
  prober:
    path: /metrics
    scheme: https
    url: foo-exporter.access-proxy.example.com
  scrapeTimeout: 20s
  targets:
    staticConfig:
      static:
      - foo:443
  tlsConfig:
    cert:
      secret:
        key: tlscert
        name: tbot-prometheus-foo
    insecureSkipVerify: true
    keySecret:
      key: key
      name: tbot-prometheus-foo

Generated config:

- job_name: probe/monitoring/probe-foo
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 30s
  scrape_timeout: 20s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /metrics
  scheme: https
  enable_compression: true
  tls_config:
    cert_file: /etc/prometheus/certs/secret_monitoring_tbot-prometheus-foo_tlscert
    key_file: /etc/prometheus/certs/secret_monitoring_tbot-prometheus-foo_key
    insecure_skip_verify: true
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (.*)
    target_label: __tmp_prometheus_job_name
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: job
    replacement: office-metrics-foo
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: foo-exporter.access-proxy.example.com
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    modulus: 1
    target_label: __tmp_hash
    replacement: $1
    action: hashmod
  - source_labels: [__tmp_hash]
    separator: ;
    regex: "0"
    replacement: $1
    action: keep
  static_configs:
  - targets:
    - foo:443
    labels:
      namespace: monitoring

This sounds similar to #345 but still happening today

lpetrazickisupgrade · 2024-03-07T21:31:58Z

What seems to be happening is that Prometheus loads updated certs into new connections but not existing connections:
https://github.com/prometheus/common/blob/v0.50.0/config/http_config.go#L979

Connections are set to remain open unless they are idle for 5 minutes. As long as the scrape interval is significantly shorter than 5 minutes, they remain open indefinitely:
https://github.com/prometheus/common/blob/main/config/http_config.go#L54

One possible enhancement could be for Prometheus to flush any connection that hits a 403 error

filippog · 2024-04-30T10:03:09Z

We are seeing the same too, namely k8s tls_config certs are not used for existing connections and eventually prometheus ends up using expired certificates for existing connections.

+1 to flush connections on 403 and/or on cert reload

alpine-algo · 2024-09-16T18:06:36Z

Did any of you find a solution/workaround to this, besides increasing scrape durations for tbot targets?
I'm running into the same issue and haven't found any robust solution...

I'll also +1 on implementing the 403 flush/reload... would help tremendously.

Edit:
After I wrote this, I did find a workaround that I'm currently using...

Instead of connecting to the tunnel endpoint directly, I just put nginx in front of it as the endpoint prometheus connects to. Now, no matter how quickly tbot rotates certs, there's always a connection. Works for me, for now, until a fix can be implemented in prometheus itself.

The big negative.... my scrape durations almost tripled from 40ms on the native tbot tunnel, to about 120ms through the additional reverse proxy layer.

roidelapluie · 2024-10-14T13:52:02Z

Hello

I tried to reproduce this by launching:

 ./prometheus --config.file prom-scrape-tls.yml

And the following code:

package main

import (
   "crypto/rand"
   "crypto/rsa"
   "crypto/tls"
   "crypto/x509"
   "crypto/x509/pkix"
   "encoding/pem"
   "flag"
   "fmt"
   "io/ioutil"
   "log"
   "math/big"
   "net"
   "net/http"
   "os"
   "sync"
   "time"
)

var (
   reloadInterval = flag.Duration("reload-interval", 15*time.Second, "Certificate reload interval")
   mutex          sync.Mutex
   latestCert     *x509.Certificate
   caCert         *x509.Certificate
   caKey          *rsa.PrivateKey
   serverCert     tls.Certificate
)

func main() {
   flag.Parse()

   // Load or generate CA certificate and key
   var err error
   caCert, caKey, err = loadOrCreateCA()
   if err != nil {
   	log.Fatalf("Failed to load or create CA certificate: %v", err)
   }

   // Generate initial server and client certificates
   err = generateServerAndClientCertificates()
   if err != nil {
   	log.Fatalf("Failed to generate certificates: %v", err)
   }

   // Generate Prometheus config
   err = generatePrometheusConfig()
   if err != nil {
   	log.Fatalf("Failed to generate Prometheus config: %v", err)
   }

   // Start the certificate reload goroutine
   go func() {
   	ticker := time.NewTicker(*reloadInterval)
   	for {
   		<-ticker.C
   		err := generateServerAndClientCertificates()
   		if err != nil {
   			log.Printf("Failed to regenerate certificates: %v", err)
   		} else {
   			log.Println("Certificates regenerated")
   		}
   	}
   }()

   // Start HTTPS server
   http.HandleFunc("/", handler)
   server := &http.Server{
   	Addr: ":8443",
   	TLSConfig: &tls.Config{
   		ClientAuth: tls.RequireAndVerifyClientCert,
   		ClientCAs:  x509.NewCertPool(),
   		GetCertificate: func(info *tls.ClientHelloInfo) (*tls.Certificate, error) {
   			mutex.Lock()
   			defer mutex.Unlock()
   			return &serverCert, nil
   		},
   	},
   }

   // Add CA certificate to ClientCAs
   server.TLSConfig.ClientCAs.AddCert(caCert)

   log.Println("Starting HTTPS server on :8443")
   log.Fatal(server.ListenAndServeTLS("", ""))
}

func handler(w http.ResponseWriter, r *http.Request) {
   if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
   	http.Error(w, "Forbidden", http.StatusForbidden)
   	return
   }

   clientCert := r.TLS.PeerCertificates[0]
   mutex.Lock()
   isValid := clientCert.Equal(latestCert)
   mutex.Unlock()

   if !isValid {
   	http.Error(w, "Forbidden", http.StatusForbidden)
   	return
   }

   fmt.Fprintln(w, "ssl_ok 1")
}

func loadOrCreateCA() (*x509.Certificate, *rsa.PrivateKey, error) {
   caCertFile := "ca.cert"
   caKeyFile := "ca.key"

   // Check if CA certificate and key files exist
   if _, err := os.Stat(caCertFile); err == nil {
   	// Load existing CA certificate and key
   	certPEM, err := ioutil.ReadFile(caCertFile)
   	if err != nil {
   		return nil, nil, err
   	}
   	keyPEM, err := ioutil.ReadFile(caKeyFile)
   	if err != nil {
   		return nil, nil, err
   	}

   	caCert, err := loadCertificate(certPEM)
   	if err != nil {
   		return nil, nil, err
   	}
   	caKey, err := loadPrivateKey(keyPEM)
   	if err != nil {
   		return nil, nil, err
   	}

   	return caCert, caKey, nil
   }

   // Generate new CA certificate and key
   key, err := rsa.GenerateKey(rand.Reader, 2048)
   if err != nil {
   	return nil, nil, err
   }

   tmpl := &x509.Certificate{
   	SerialNumber: big.NewInt(1),
   	Subject: pkix.Name{
   		Organization: []string{"My CA"},
   	},
   	NotBefore:             time.Now(),
   	NotAfter:              time.Now().Add(10 * 365 * 24 * time.Hour),
   	KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
   	BasicConstraintsValid: true,
   	IsCA:                  true,
   }

   derBytes, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
   if err != nil {
   	return nil, nil, err
   }

   certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: derBytes})
   keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})

   // Save CA certificate and key
   err = ioutil.WriteFile(caCertFile, certPEM, 0644)
   if err != nil {
   	return nil, nil, err
   }
   err = ioutil.WriteFile(caKeyFile, keyPEM, 0600)
   if err != nil {
   	return nil, nil, err
   }

   cert, err := x509.ParseCertificate(derBytes)
   if err != nil {
   	return nil, nil, err
   }

   return cert, key, nil
}

func generateServerAndClientCertificates() error {
   mutex.Lock()
   defer mutex.Unlock()

   // Generate server private key
   serverKey, err := rsa.GenerateKey(rand.Reader, 2048)
   if err != nil {
   	return err
   }

   // Generate server certificate
   serialNumber, err := rand.Int(rand.Reader, big.NewInt(1<<62))
   if err != nil {
   	return err
   }

   tmpl := &x509.Certificate{
   	SerialNumber: serialNumber,
   	Subject: pkix.Name{
   		Organization: []string{"My Server"},
   	},
   	NotBefore:   time.Now(),
   	NotAfter:    time.Now().Add(*reloadInterval * 2),
   	KeyUsage:    x509.KeyUsageKeyEncipherment | x509.KeyUsageDigitalSignature,
   	ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
   	IsCA:        false,
   	DNSNames:    []string{"localhost"},
   	IPAddresses: []net.IP{net.ParseIP("127.0.0.1")},
   }

   derBytes, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &serverKey.PublicKey, caKey)
   if err != nil {
   	return err
   }

   certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: derBytes})
   keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(serverKey)})

   // Save cert and key
   err = ioutil.WriteFile("cert.crt", certPEM, 0644)
   if err != nil {
   	return err
   }
   err = ioutil.WriteFile("priv.key", keyPEM, 0600)
   if err != nil {
   	return err
   }

   // Update server certificate
   serverCert, err = tls.X509KeyPair(certPEM, keyPEM)
   if err != nil {
   	return err
   }

   // Generate client certificate
   clientKey, err := rsa.GenerateKey(rand.Reader, 2048)
   if err != nil {
   	return err
   }

   clientSerialNumber, err := rand.Int(rand.Reader, big.NewInt(1<<62))
   if err != nil {
   	return err
   }

   clientTmpl := &x509.Certificate{
   	SerialNumber: clientSerialNumber,
   	Subject: pkix.Name{
   		Organization: []string{"Prometheus Client"},
   	},
   	NotBefore:   time.Now(),
   	NotAfter:    time.Now().Add(*reloadInterval * 2),
   	KeyUsage:    x509.KeyUsageDigitalSignature,
   	ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
   	IsCA:        false,
   }

   clientDerBytes, err := x509.CreateCertificate(rand.Reader, clientTmpl, caCert, &clientKey.PublicKey, caKey)
   if err != nil {
   	return err
   }

   latestCert, err = x509.ParseCertificate(clientDerBytes)
   if err != nil {
   	return err
   }

   clientCertPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: clientDerBytes})
   clientKeyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(clientKey)})

   // Save client cert and key
   err = ioutil.WriteFile("client.crt", clientCertPEM, 0644)
   if err != nil {
   	return err
   }
   err = ioutil.WriteFile("client.key", clientKeyPEM, 0600)
   if err != nil {
   	return err
   }

   return nil
}

func generatePrometheusConfig() error {
   config := `scrape_configs:
 - job_name: 'my_https_job'
   scrape_interval: 5s
   scheme: https
   tls_config:
     ca_file: 'ca.cert'
     cert_file: 'client.crt'
     key_file: 'client.key'
     insecure_skip_verify: false
   static_configs:
     - targets: ['localhost:8443']
`
   return ioutil.WriteFile("prom-scrape-tls.yml", []byte(config), 0644)
}

func loadCertificate(certPEM []byte) (*x509.Certificate, error) {
   block, _ := pem.Decode(certPEM)
   if block == nil || block.Type != "CERTIFICATE" {
   	return nil, fmt.Errorf("failed to decode PEM block containing certificate")
   }
   cert, err := x509.ParseCertificate(block.Bytes)
   if err != nil {
   	return nil, err
   }
   return cert, nil
}

func loadPrivateKey(keyPEM []byte) (*rsa.PrivateKey, error) {
   block, _ := pem.Decode(keyPEM)
   if block == nil || block.Type != "RSA PRIVATE KEY" {
   	return nil, fmt.Errorf("failed to decode PEM block containing private key")
   }
   key, err := x509.ParsePKCS1PrivateKey(block.Bytes)
   if err != nil {
   	return nil, err
   }
   return key, nil
}

That code is a HTTP server in go that rotates priv and public key on disk and checks that the latest is used.

Prometheus seems to use the latest certificate all the time.

This is checked by:

common/config/http_config.go

Lines 1345 to 1347 in a9d2e3f

    
           equal := bytes.Equal(caHash[:], t.hashCAData) && 
        
           	bytes.Equal(certHash[:], t.hashCertData) && 
        
           	bytes.Equal(keyHash[:], t.hashKeyData)

So I do not see where is the problem for you.

roidelapluie · 2024-10-14T14:11:06Z

Oh, this is because my test uses a CA. If I set insecure and remove the CA, I have the bug.

The fix is here:

https://github.com/prometheus/common/pull/707/files

I might code a test for this.

This was referenced Oct 14, 2024

Reload certificates even when no CA is used #707

Merged

Ensure connection is re-established on 403 status by closing Keep-Alive. prometheus/prometheus#15132

Closed

SuperQ closed this as completed in #707 Oct 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prometheus failing to reload Probe TLS cert and key from disk #598

Prometheus failing to reload Probe TLS cert and key from disk #598

lpetrazickisupgrade commented Mar 6, 2024 •

edited

Loading

lpetrazickisupgrade commented Mar 7, 2024 •

edited

Loading

filippog commented Apr 30, 2024

alpine-algo commented Sep 16, 2024 •

edited

Loading

roidelapluie commented Oct 14, 2024 •

edited

Loading

roidelapluie commented Oct 14, 2024

Prometheus failing to reload Probe TLS cert and key from disk #598

Prometheus failing to reload Probe TLS cert and key from disk #598

Comments

lpetrazickisupgrade commented Mar 6, 2024 • edited Loading

lpetrazickisupgrade commented Mar 7, 2024 • edited Loading

filippog commented Apr 30, 2024

alpine-algo commented Sep 16, 2024 • edited Loading

roidelapluie commented Oct 14, 2024 • edited Loading

roidelapluie commented Oct 14, 2024

lpetrazickisupgrade commented Mar 6, 2024 •

edited

Loading

lpetrazickisupgrade commented Mar 7, 2024 •

edited

Loading

alpine-algo commented Sep 16, 2024 •

edited

Loading

roidelapluie commented Oct 14, 2024 •

edited

Loading