Rebuild Hawkular Metrics after a cluster deploy on OpenShift 3.11.

All operations are performed on the management node.

Assume that a change was made that requires running the deploy cluster playbook.

$ ansible-playbook -i hosts playbooks/deploy_cluster.yml

Afterwards, no metrics are available.

Log in to the OpenShift cluster.

$ oc login https://openshift-example.example.org:8443 --token=GWJqvER5-N3PBEGt14bwjx9K39ztqUqoUGQxki19kud -n openshift-infra
Logged into "https://openshift-example.example.org:8443" as "admin" using the token provided.
You have access to the following projects and can switch between them with 'oc project <projectname>':
    default
    development-milosz
    kube-public
    kube-service-catalog
    kube-system
    management-infra
    openshift
    openshift-ansible-service-broker
    openshift-console
    openshift-descheduler
  * openshift-infra
    openshift-logging
    openshift-metrics-server
    openshift-monitoring
    openshift-node
    openshift-sdn
    openshift-template-service-broker
    openshift-web-console
    ops-view
Using project "openshift-infra".

Notice that the hawkular-metrics and heapster pods report Running but are not ready (0/1).

$ oc get pods -n openshift-infra
NAME                            READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-kwgrf      1/1       Running   0          10m
hawkular-metrics-jxr42          0/1       Running   1          10m
hawkular-metrics-schema-psqgr   1/1       Running   0          10m
heapster-ckfpw                  0/1       Running   1          10m
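
In a project with many pods it can help to list only the ones that are not ready. The following is a minimal sketch and not part of the original procedure. The function name not_ready_pods is made up for illustration, and it assumes the default table layout of oc get pods shown above, with READY as ready/total in the second column.

```shell
# Hypothetical helper: print the names of pods whose READY column (ready/total)
# shows fewer ready containers than expected. Completed pods are skipped.
# Reads the default `oc get pods` table from standard input.
not_ready_pods() {
    awk 'NR > 1 && $3 != "Completed" {
        split($2, r, "/")
        if (r[1] + 0 < r[2] + 0) print $1
    }'
}

# Usage (assumes an active oc session):
#   oc get pods -n openshift-infra | not_ready_pods
```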

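The logs show a chain of failures: hawkular-metrics keeps waiting for the hawkular_metrics keyspace, Cassandra fails TLS handshakes with "null cert chain", the schema job cannot connect to Cassandra, and heapster cannot reach hawkular-metrics.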

$ oc logs --tail=10 hawkular-metrics-jxr42 -n openshift-infra
2020-04-15 10:17:38,031 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
2020-04-15 10:17:38,031 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: WildFly Full 11.0.0.Final (WildFly Core 3.0.8.Final) started in 9055ms - Started 343 of 593 services (340 services are lazy, passive or on-demand)
2020-04-15 10:17:47,406 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:17:47,406 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
2020-04-15 10:17:57,410 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:17:57,411 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
2020-04-15 10:18:07,414 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:18:07,414 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
2020-04-15 10:18:17,417 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:18:17,417 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
$ oc logs --tail=15 hawkular-cassandra-1-kwgrf -n openshift-infra
Caused by: javax.net.ssl.SSLHandshakeException: null cert chain
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) ~[na:1.8.0_181]
	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1666) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:330) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:318) ~[na:1.8.0_181]
	at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1935) ~[na:1.8.0_181]
	at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:237) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1052) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:992) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:989) ~[na:1.8.0_181]
	at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1467) ~[na:1.8.0_181]
	at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1256) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1169) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
	... 16 common frames omitted
$ oc logs --tail=10 hawkular-metrics-schema-psqgr -n openshift-infra
INFO  2020-04-15 10:22:06,740 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:06,746 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:11,746 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:11,755 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:16,755 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:16,761 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:21,761 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:21,768 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:26,769 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:26,775 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
$ oc logs --tail=10 heapster-ckfpw -n openshift-infra
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
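
The log excerpts above repeat the same few messages many times. As a rough sketch, a log can be condensed to a count per known failure message. The function name failure_summary and the list of patterns are assumptions taken only from the messages seen in this incident. It relies on grep -o, which GNU and BSD grep provide.

```shell
# Hypothetical helper: read a log on standard input and print how often each
# failure message from this incident occurs, as "COUNT: MESSAGE".
failure_summary() {
    grep -o -E 'Keyspace hawkular_metrics does not exist|SSLHandshakeException|Cassandra may not be up yet|is not accessible' |
        sort | uniq -c | awk '{ n = $1; $1 = ""; print n ":" $0 }'
}

# Usage (assumes an active oc session):
#   oc logs --tail=100 hawkular-metrics-jxr42 -n openshift-infra | failure_summary
```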

The Hawkular Metrics route returns 503 Service Unavailable.

$ curl --head https://hawkular-metrics.openshift-example.example.org/hawkular/metrics
HTTP/1.0 503 Service Unavailable
Pragma: no-cache
Cache-Control: private, max-age=0, no-cache, no-store
Connection: close
Content-Type: text/html

To rebuild Hawkular Metrics, delete the hawkular-metrics-schema job.

$ oc get jobs -n openshift-infra
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         0            19m
$ oc delete jobs/hawkular-metrics-schema -n openshift-infra
job.batch "hawkular-metrics-schema" deleted

Run the openshift-metrics/schema.yml playbook to recreate the job.

$ ansible-playbook -i hosts playbooks/openshift-metrics/schema.yml
[...]
PLAY RECAP *******************************************************************************************************************************************************
localhost                              : ok=12   changed=0    unreachable=0    failed=0    skipped=4    rescued=0    ignored=0
openshift-example-infra-1.example.org  : ok=0    changed=0    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
openshift-example-lb-1.example.org     : ok=1    changed=0    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
openshift-example-master-1.example.org : ok=53   changed=2    unreachable=0    failed=0    skipped=37   rescued=0    ignored=0
openshift-example-node-1.example.org   : ok=0    changed=0    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
openshift-example-node-2.example.org   : ok=0    changed=0    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
INSTALLER STATUS *************************************************************************************************************************************************
Initialization  : Complete (0:00:11)
Wednesday 15 April 2020  12:25:24 +0200 (0:00:00.143)       0:00:15.571 *******
===============================================================================
Gathering Facts ------------------------------------------------------------------------------------------------------------------------------------------- 1.75s
openshift_metrics : generate hawkular-metrics schema job -------------------------------------------------------------------------------------------------- 0.76s
Gather Cluster facts -------------------------------------------------------------------------------------------------------------------------------------- 0.64s
get openshift_current_version ----------------------------------------------------------------------------------------------------------------------------- 0.59s
openshift_metrics : Applying /tmp/openshift-metrics-ansible-7Eo6Im/templates/hawkular_metrics_schema_job.yaml --------------------------------------------- 0.54s
openshift_metrics : Checking generation of Job hawkular-metrics-schema ------------------------------------------------------------------------------------ 0.43s
openshift_metrics : Create temp directory for all our templates ------------------------------------------------------------------------------------------- 0.40s
openshift_metrics : Determine change status of Job hawkular-metrics-schema -------------------------------------------------------------------------------- 0.39s
openshift_control_plane : slurp --------------------------------------------------------------------------------------------------------------------------- 0.39s
openshift_metrics : list installed jobs ------------------------------------------------------------------------------------------------------------------- 0.38s
Detecting Operating System from ostree_booted ------------------------------------------------------------------------------------------------------------- 0.38s
openshift_metrics : Create temp directory for doing work in on target ------------------------------------------------------------------------------------- 0.36s
Initialize openshift.node.sdn_mtu ------------------------------------------------------------------------------------------------------------------------- 0.33s
openshift_metrics : Create temp directory local on control node ------------------------------------------------------------------------------------------- 0.25s
Fetch ca.crt from cluster if exists ----------------------------------------------------------------------------------------------------------------------- 0.21s
openshift_metrics : Copy the admin client config(s) ------------------------------------------------------------------------------------------------------- 0.20s
set_fact -------------------------------------------------------------------------------------------------------------------------------------------------- 0.19s
openshift_control_plane : stat ---------------------------------------------------------------------------------------------------------------------------- 0.18s
openshift_metrics : slurp --------------------------------------------------------------------------------------------------------------------------------- 0.18s
openshift_sanitize_inventory : Check for usage of deprecated variables ------------------------------------------------------------------------------------ 0.17s
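
The two rebuild steps, deleting the job and running the schema playbook, could be wrapped in one function. This is only a sketch under assumptions: the names run and rebuild_metrics_schema and the DRY_RUN switch are made up for illustration, and the function must be called from the directory that contains the playbooks, as in the commands above.

```shell
# Hypothetical wrapper: with DRY_RUN=1 commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Delete the schema job, then run the schema playbook to recreate it.
# The first argument is the inventory file (defaults to "hosts").
rebuild_metrics_schema() {
    inventory=${1:-hosts}
    # --ignore-not-found keeps the delete from failing if the job is already gone
    run oc delete job hawkular-metrics-schema -n openshift-infra --ignore-not-found &&
        run ansible-playbook -i "$inventory" playbooks/openshift-metrics/schema.yml
}

# Usage: preview first, then run for real.
#   DRY_RUN=1; rebuild_metrics_schema hosts
#   DRY_RUN=0; rebuild_metrics_schema hosts
```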

Inspect the hawkular-metrics-schema job. It should report one successful completion.

$ oc get jobs -n openshift-infra
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         1            15s
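
Instead of repeating oc get jobs by hand, the job status can be polled. The sketch below is an assumption rather than part of the original procedure. The function name wait_for_schema_job and its default retry values are made up, and it reads the job's .status.succeeded field through the jsonpath output format.

```shell
# Hypothetical helper: poll the job until .status.succeeded reports at least
# one completion. Gives up after TRIES attempts spaced DELAY seconds apart.
wait_for_schema_job() {
    tries=${1:-30}
    delay=${2:-10}
    while [ "$tries" -gt 0 ]; do
        succeeded=$(oc get job hawkular-metrics-schema -n openshift-infra \
            -o jsonpath='{.status.succeeded}' 2>/dev/null)
        # An empty field means the job has not succeeded yet
        [ "${succeeded:-0}" -ge 1 ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage (assumes an active oc session):
#   wait_for_schema_job 30 10 && echo "schema job completed"
```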

Inspect the pods. All of them should be ready, and the schema pod should be Completed.

$ oc get pods -n openshift-infra
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-kwgrf      1/1       Running     0          20m
hawkular-metrics-jxr42          1/1       Running     3          20m
hawkular-metrics-schema-td29l   0/1       Completed   0          1m
heapster-ckfpw                  1/1       Running     2          20m

The Hawkular Metrics route now returns 200 OK.

$ curl --head https://hawkular-metrics.openshift-example.example.org/hawkular/metrics
HTTP/1.1 200 OK
Cache-Control: no-cache
Vary: Origin,Accept-Encoding
X-Powered-By: Undertow/1
Server: WildFly/11
Content-Type: application/json
Content-Length: 132
Date: Wed, 15 Apr 2020 10:27:12 GMT
Set-Cookie: a054b5d9e987bf679f10c9d29be39478=3ce5579d1b00caa62afe078c982aca15; path=/; HttpOnly; Secure
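
It may take a short while until the route answers with 200 OK, so the check can be retried in a loop. This is a minimal sketch: the function name wait_for_ok is hypothetical, and it assumes the command passed to it prints an HTTP status line first, as curl --head does above.

```shell
# Hypothetical helper: wait_for_ok TRIES DELAY CMD...
# Runs CMD repeatedly until the first line of its output reports HTTP status 200.
wait_for_ok() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        # The status code is the second field of the status line, e.g. "HTTP/1.1 200 OK"
        status=$("$@" 2>/dev/null | awk 'NR == 1 { print $2 }')
        [ "$status" = "200" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage:
#   wait_for_ok 30 10 curl --silent --head \
#       https://hawkular-metrics.openshift-example.example.org/hawkular/metrics
```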

Inspect the logs.

$ oc logs --tail=10 hawkular-metrics-jxr42 -n openshift-infra
2020-04-15 10:25:40,102 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-5) gc_grace_seconds for locks is set to 864000. Resetting to 0
2020-04-15 10:25:40,192 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-4) gc_grace_seconds for metrics_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,331 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-6) gc_grace_seconds for metrics_tags_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,332 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-6) gc_grace_seconds for retentions_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,391 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-8) gc_grace_seconds for scheduled_jobs_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,497 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-1) gc_grace_seconds for sys_config is set to 864000. Resetting to 0
2020-04-15 10:25:40,497 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-1) gc_grace_seconds for tasks is set to 864000. Resetting to 0
2020-04-15 10:25:40,543 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-3) gc_grace_seconds for tenants is set to 864000. Resetting to 0
2020-04-15 10:25:40,799 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-6) Finished gc_grace_seconds updates in 1344 ms
2020-04-15 10:27:12,678 WARN  [org.jboss.resteasy.resteasy_jaxrs.i18n] (default task-23) RESTEASY002142: Multiple resource methods match request "HEAD /". Selecting one. Matching methods: [public javax.ws.rs.core.Response org.hawkular.metrics.api.jaxrs.handler.BaseHandler.baseJSON(), public void org.hawkular.metrics.api.jaxrs.handler.BaseHandler.baseHTML(javax.servlet.ServletContext) throws java.lang.Exception]

Hawkular Metrics is working again.