Tuning OpenSearch for HCL Connections
HCL uses OpenSearch as part of the Component Pack for HCL Connections 8, for metrics and the typeahead functionality. The default OpenSearch settings are rather conservative and might not work for your environment. If your environment is heavily used in particular, you might run into problems. HCL’s Helm chart does let you tune these settings, but the options aren’t documented, so let’s help them a bit.
The problem
The status of our OpenSearch cluster was always yellow. OpenSearch was unable to create replicas of two of the larger metrics indices and of the quickresults index. In a previous article, I explained how to query your OpenSearch cluster, and I am going to assume that you installed the little sendRequest.sh script from that article on your Kubernetes control plane to query the cluster.
With the command sendRequest GET "/_cluster/health?pretty" you can query your cluster status. This is what a healthy cluster looks like:
sendRequest GET "/_cluster/health?pretty"
{
  "cluster_name" : "opensearch-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 56,
  "active_shards" : 135,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
In our case, there were a couple of unassigned shards. If that’s the case for you too, you can use sendRequest GET "/_cat/indices?v" to see which indices are affected. The risk of missing replicas, especially for indices with only one replica, is that if the node holding the primary shard crashes, your index might be corrupted and unrecoverable unless you have a backup. You might lose important metrics data as a result.
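To pick out the affected indices without scanning the table by eye, you can ask for JSON output instead (/_cat/indices?format=json) and filter it. The snippet below is a minimal sketch, assuming you saved or piped that output into Python. The function name and the sample index names are made up for illustration.

```python
import json


def non_green_indices(cat_indices_json: str) -> list[tuple[str, str]]:
    """Return (index, health) pairs for every index that is not green."""
    rows = json.loads(cat_indices_json)
    return [
        (row["index"], row.get("health", ""))
        for row in rows
        if row.get("health") != "green"
    ]


# Hypothetical sample in the shape returned by /_cat/indices?format=json
sample = """[
  {"health": "green",  "status": "open", "index": "example-green-index",  "pri": "1", "rep": "1"},
  {"health": "yellow", "status": "open", "index": "example-yellow-index", "pri": "1", "rep": "1"}
]"""

for index, health in non_green_indices(sample):
    print(f"{index}: {health}")
```

Anything this prints is an index with at least one shard copy that is not assigned.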
The cause
To see why your indices have the status ‘yellow’, you can run the command sendRequest GET "/_cluster/allocation/explain?pretty" to get an explanation of what’s going wrong. For us this showed:
[internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [440052426/419.6mb], which is larger than the limit of [429496729/409.5mb]
What this means: during shard recovery (creating a replica), the parent memory circuit breaker trips. The recovery request would bring the tracked memory use to ~420 MB, while the limit is ~409 MB. As a result, recovery fails, the shard stays unassigned, and the cluster status goes yellow.
The circuit breaker limit (409 MB) is configured as a percentage (default ~80%) of the JVM heap size: 80% of a 512 MB heap gives the ~409 MB in the error above. Fielddata has its own breaker, but its usage also counts toward the parent circuit breaker. These percentages can be set using this statement (in the example below, they are set to 40% and 80%):
sendRequest PUT "/_cluster/settings" -H "Content-Type: application/json" -d '{ "persistent": { "indices.breaker.fielddata.limit": "40%", "indices.breaker.total.limit": "80%" } }'
The real problem is not the percentages, though, but the small Java heap size that HCL sets by default.
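The numbers in the error message line up with this: 80% of a 512 MB heap is exactly the limit reported by the breaker. The sketch below redoes that arithmetic and, for comparison, repeats it for a 2 GB heap. It only compares the figure from the error message against the limit. Real memory use varies from moment to moment, so treat it as an illustration rather than a prediction. The function name is my own.

```python
MIB = 1024 * 1024


def breaker_limit_bytes(heap_bytes: int, percent: int) -> int:
    """Parent breaker limit as a whole-number percentage of the JVM heap."""
    return heap_bytes * percent // 100


# Memory use the recovery would have caused, taken from the error message
requested = 440_052_426

for heap_mib in (512, 2048):
    limit = breaker_limit_bytes(heap_mib * MIB, 80)
    verdict = "trips the breaker" if requested > limit else "fits"
    print(f"heap {heap_mib} MiB -> limit {limit} bytes: {verdict}")
```

With the 512 MiB heap the limit comes out at 429496729 bytes, the same value as in the exception. Raising the percentage would buy only a few MB, which is why the heap itself is the better thing to change.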
The solution
The solution, therefore, is to increase the heap size. It’s important to know that shards are assigned in the opensearch-cluster-data pods, so that’s where you need to make the change.
HCL’s Helm chart looks for a value for “opensearchJavaOpts” in your Helm values. If this line is not found, it uses the default, which is 512 MB. I made two changes to the default Helm values (the memory limit and opensearchJavaOpts):
resources:
  limits:
    cpu: "2"
    memory: "6144Mi"
  requests:
    cpu: "0.5"
    memory: "4096Mi"
replicas: 3
opensearchJavaOpts: "-Xms2g -Xmx2g"
So I increased the default Java heap size from 512 MB to 2 GB, which is a good starting point for most larger Connections environments. This is on the edge of what fits within the previous memory limit of 5 GB for the entire pod, so I increased that limit to 6 GB. After this change, you need to roll out the Helm chart again:
helm upgrade opensearch-data oci://hclcr.io/cnx/opensearch -i --version <current version in your environment> --namespace connections -f <path to>/opensearch_data.yml --wait --timeout 10m
You can check whether it worked with this command: sendRequest GET "/_nodes/jvm?pretty"
Look for this info for the data pods:
"jvm" : {
  "pid" : 12,
  "version" : "21.0.6",
  "vm_name" : "OpenJDK 64-Bit Server VM",
  "vm_version" : "21.0.6+7-LTS",
  "vm_vendor" : "Eclipse Adoptium",
  "bundled_jdk" : true,
  "using_bundled_jdk" : true,
  "start_time_in_millis" : 1781019408450,
  "mem" : {
    "heap_init_in_bytes" : 2147483648,
    "heap_max_in_bytes" : 2147483648,
    "non_heap_init_in_bytes" : 7667712,
    "non_heap_max_in_bytes" : 0,
    "direct_max_in_bytes" : 0
  },
When this is done, get OpenSearch to retry creating the replicas:
sendRequest POST "/_cluster/reroute?retry_failed=true"
Wait a couple of minutes for OpenSearch to create the replicas. After this, your cluster status should go green.
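If you would rather not read both JSON responses by eye, the two checks from this article (heap size per node, then cluster status) can be scripted. This is a rough sketch, assuming you have the output of /_nodes/jvm and /_cluster/health loaded as Python dictionaries. The function names and the sample node names are hypothetical, and the samples are trimmed to the fields used.

```python
import json


def heap_report(nodes_jvm: dict, expected_heap_bytes: int) -> dict[str, bool]:
    """Map each node name to whether its max heap is at least the expected size."""
    return {
        node["name"]: node["jvm"]["mem"]["heap_max_in_bytes"] >= expected_heap_bytes
        for node in nodes_jvm["nodes"].values()
    }


def is_green(health: dict) -> bool:
    """True when the cluster is green and no shards are left unassigned."""
    return health["status"] == "green" and health["unassigned_shards"] == 0


# Hypothetical, trimmed samples in the shape of the two API responses
nodes_sample = json.loads("""{"nodes": {
  "node-id-1": {"name": "example-data-0", "jvm": {"mem": {"heap_max_in_bytes": 2147483648}}},
  "node-id-2": {"name": "example-data-1", "jvm": {"mem": {"heap_max_in_bytes": 536870912}}}
}}""")
health_sample = {"status": "green", "unassigned_shards": 0}

print(heap_report(nodes_sample, 2 * 1024**3))
print(is_green(health_sample))
```

The report covers every node in the response, so expect the non-data pods to show a smaller heap. Only the data pods need the new value.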
