Tips for Troubleshooting the Target Allocator

Adriana Villela
8 min read · Jun 25, 2024


Glass pyramid at the Louvre
Looking up at the glass pyramid at the Louvre, as seen from the inside. Photo by Adriana Villela.

If you’ve enabled Target Allocator service discovery on the OTel Operator, and the Target Allocator is failing to discover scrape targets, then there are a few troubleshooting steps that you can take to help you understand what’s going on and to get things back on track. I put these together based on some of my own experience. May these help you on your own journey!

Troubleshooting Steps

Before we start, be sure to check out this repo, which, among other things, includes examples of configuring the OpenTelemetryCollector custom resource (CR) to use the Target Allocator’s service discovery functionality, along with examples of ServiceMonitor and PodMonitor resource definitions.

1- Did you deploy all of your resources to Kubernetes?

Okay…you may be laughing at me for how obvious this sounds, but it totally happened to me. In fact, it happened while I was adding the PodMonitor example to my repo.

After checking to see if the service discovery was working per step 2 below (spoiler: it wasn’t), I went through all of the other troubleshooting steps. Except for this one, of course. 🤬 According to the API documentation, all of my configurations looked correct. Yeah…too bad the resource wasn’t actually deployed.

In a flash of inspiration, I decided to check to make sure that the PodMonitor was actually deployed to my Kubernetes cluster, and lo and behold…it was missing. After I deployed the PodMonitor (for realsies, this time), it worked. At least I take comfort in the fact that my configurations were correct the whole time! 🫠

So yeah…moral of the story: make sure you actually deploy your resources.

2- Do you know if metrics are actually being scraped?

After you’ve deployed all of your resources to Kubernetes, check to make sure that the Target Allocator is actually discovering scrape targets from your ServiceMonitor(s) and/or PodMonitor(s). Fortunately, you can check this pretty easily.

Let’s suppose that you have this ServiceMonitor definition:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sm-example
  namespace: opentelemetry
  labels:
    app.kubernetes.io/name: py-prometheus-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - opentelemetry
  endpoints:
    - port: prom
      path: /metrics
    - port: py-client-port
      interval: 15s
    - port: py-server-port

and this Service definition:

apiVersion: v1
kind: Service
metadata:
  name: py-prometheus-app
  namespace: opentelemetry
  labels:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
spec:
  selector:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
  ports:
    - name: prom
      port: 8080

First, set up a port-forward in Kubernetes, so that you can expose the Target Allocator service:

kubectl port-forward svc/<otel_collector_resource_name>-targetallocator -n <namespace> 8080:80

Where <otel_collector_resource_name> is the value of metadata.name in your OpenTelemetryCollector CR, and <namespace> is the namespace to which the OpenTelemetryCollector CR is deployed.

NOTE: You can also get the service name by running kubectl get svc -l app.kubernetes.io/component=opentelemetry-targetallocator -n <namespace>.

Based on the example repository, yours would look like this:

kubectl port-forward svc/otelcol-targetallocator -n opentelemetry 8080:80

Next, get a list of jobs registered with the Target Allocator:

curl localhost:8080/jobs | jq

Your sample output should look something like this:

{
  "serviceMonitor/opentelemetry/sm-example/1": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F1/targets"
  },
  "serviceMonitor/opentelemetry/sm-example/2": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F2/targets"
  },
  "otel-collector": {
    "_link": "/jobs/otel-collector/targets"
  },
  "serviceMonitor/opentelemetry/sm-example/0": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F0/targets"
  },
  "podMonitor/opentelemetry/pm-example/0": {
    "_link": "/jobs/podMonitor%2Fopentelemetry%2Fpm-example%2F0/targets"
  }
}

Where serviceMonitor/opentelemetry/sm-example/0 represents one of the Service ports that the ServiceMonitor picked up:

  • opentelemetry is the namespace in which the ServiceMonitor resource resides
  • sm-example is the name of the ServiceMonitor
  • 0 is one of the port endpoints matched between the ServiceMonitor and the Service

We see a similar story with the PodMonitor, which shows up as podMonitor/opentelemetry/pm-example/0 in the curl output.

This is good news, because it tells us that the scrape config discovery is working!
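If you run this check often, it can be scripted. Below is a minimal sketch that greps a saved copy of the /jobs response for each monitor you expect; the monitor names are the ones from the example repo, so swap in your own:

```shell
# Save the /jobs response, then check that each expected monitor appears.
# Against a live cluster, replace the heredoc with:
#   curl -s localhost:8080/jobs > jobs.json
cat > jobs.json <<'EOF'
{
  "serviceMonitor/opentelemetry/sm-example/0": {"_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F0/targets"},
  "podMonitor/opentelemetry/pm-example/0": {"_link": "/jobs/podMonitor%2Fopentelemetry%2Fpm-example%2F0/targets"}
}
EOF

for monitor in "serviceMonitor/opentelemetry/sm-example" "podMonitor/opentelemetry/pm-example"; do
  if grep -q "\"$monitor/" jobs.json; then
    echo "found: $monitor"
  else
    echo "MISSING: $monitor"
  fi
done
```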

You might also be wondering about the otel-collector entry. It shows up because spec.config.receivers.prometheus in the example OpenTelemetryCollector resource (which is named otel-collector) includes a self-scrape configuration:

prometheus:
  config:
    scrape_configs:
      - job_name: 'otel-collector'
        scrape_interval: 10s
        static_configs:
          - targets: [ '0.0.0.0:8888' ]

We can take a deeper look into serviceMonitor/opentelemetry/sm-example/0 to see which scrape targets are getting picked up, by running curl against the _link value from the output above:

curl localhost:8080/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F0/targets | jq

Sample output:

{
  "otelcol-collector-0": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F0/targets?collector_id=otelcol-collector-0",
    "targets": [
      {
        "targets": [
          "10.244.0.11:8082"
        ],
        "labels": {
          "__meta_kubernetes_endpointslice_name": "py-otel-client-svc-znvrz",
          "__meta_kubernetes_pod_label_app": "my-app",
          "__meta_kubernetes_pod_node_name": "otel-target-allocator-talk-control-plane",
          "__meta_kubernetes_endpointslice_label_endpointslice_kubernetes_io_managed_by": "endpointslice-controller.k8s.io",
          "__meta_kubernetes_service_labelpresent_app": "true",
          "__meta_kubernetes_endpointslice_address_target_kind": "Pod",
          "__meta_kubernetes_endpointslice_endpoint_conditions_terminating": "false",
          "__meta_kubernetes_pod_container_port_number": "8082",
          "__meta_kubernetes_endpointslice_labelpresent_app": "true",
          "__meta_kubernetes_pod_label_pod_template_hash": "776d6686bb",
          "__meta_kubernetes_pod_container_image": "otel-target-allocator-talk:0.1.0-py-otel-client",
          "__meta_kubernetes_pod_ip": "10.244.0.11",
          "__meta_kubernetes_pod_controller_name": "py-otel-client-776d6686bb",
          "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
          "__meta_kubernetes_pod_label_app_kubernetes_io_name": "py-otel-client",
          "__meta_kubernetes_endpointslice_annotationpresent_endpoints_kubernetes_io_last_change_trigger_time": "true",
          "__meta_kubernetes_service_annotationpresent_kubectl_kubernetes_io_last_applied_configuration": "true",
          "__meta_kubernetes_pod_ready": "true",
          "__meta_kubernetes_endpointslice_endpoint_conditions_serving": "true",
          "__meta_kubernetes_pod_annotation_instrumentation_opentelemetry_io_inject_python": "true",
          "__meta_kubernetes_endpointslice_port_protocol": "TCP",
          "__meta_kubernetes_endpointslice_label_app": "my-app",
          "__meta_kubernetes_pod_name": "py-otel-client-776d6686bb-7mchc",
          "__meta_kubernetes_pod_annotationpresent_instrumentation_opentelemetry_io_inject_python": "true",
          "__meta_kubernetes_endpointslice_endpoint_conditions_ready": "true",
          "__meta_kubernetes_pod_host_ip": "172.24.0.2",
          "__meta_kubernetes_namespace": "opentelemetry",
          "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
          "__meta_kubernetes_endpointslice_port_name": "py-client-port",
          "__meta_kubernetes_pod_phase": "Running",
          "__meta_kubernetes_endpointslice_label_app_kubernetes_io_name": "py-otel-client",
          "__meta_kubernetes_endpointslice_port": "8082",
          "__meta_kubernetes_endpointslice_address_target_name": "py-otel-client-776d6686bb-7mchc",
          "__meta_kubernetes_pod_container_name": "py-otel-client",
          "__meta_kubernetes_pod_container_port_name": "py-client-port",
          "__meta_kubernetes_endpointslice_address_type": "IPv4",
          "__meta_kubernetes_pod_uid": "bd68fa78-13f6-4377-bcfd-9bb95553f1f4",
          "__meta_kubernetes_service_name": "py-otel-client-svc",
          "__meta_kubernetes_service_label_app_kubernetes_io_name": "py-otel-client",
          "__meta_kubernetes_pod_labelpresent_app": "true",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_name": "true",
          "__meta_kubernetes_endpointslice_label_kubernetes_io_service_name": "py-otel-client-svc",
          "__meta_kubernetes_endpointslice_annotation_endpoints_kubernetes_io_last_change_trigger_time": "2024-06-14T21:04:36Z",
          "__address__": "10.244.0.11:8082",
          "__meta_kubernetes_endpointslice_labelpresent_kubernetes_io_service_name": "true",
          "__meta_kubernetes_endpointslice_labelpresent_endpointslice_kubernetes_io_managed_by": "true",
          "__meta_kubernetes_service_annotation_kubectl_kubernetes_io_last_applied_configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Service\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"my-app\",\"app.kubernetes.io/name\":\"py-otel-client\"},\"name\":\"py-otel-client-svc\",\"namespace\":\"opentelemetry\"},\"spec\":{\"ports\":[{\"name\":\"py-client-port\",\"port\":8082,\"protocol\":\"TCP\",\"targetPort\":\"py-client-port\"}],\"selector\":{\"app.kubernetes.io/name\":\"py-otel-client\"}}}\n",
          "__meta_kubernetes_pod_labelpresent_app_kubernetes_io_name": "true",
          "__meta_kubernetes_pod_container_port_protocol": "TCP",
          "__meta_kubernetes_service_label_app": "my-app",
          "__meta_kubernetes_endpointslice_labelpresent_app_kubernetes_io_name": "true"
        }
      }
    ]
  }
}

NOTE: The collector_id query parameter in the _link field of the output above indicates that these targets pertain to otelcol-collector-0 (the name of the StatefulSet created for the OpenTelemetryCollector resource).
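You can put that query parameter to use directly. Here's a small sketch that builds the per-collector targets URL, reusing the job name and collector name from the sample output above; the commented-out curl assumes the port-forward from earlier is still running:

```shell
# Build the per-collector targets URL from a job name and a collector id.
# Job and collector names are taken from the sample output above.
job="serviceMonitor%2Fopentelemetry%2Fsm-example%2F0"
collector="otelcol-collector-0"
url="localhost:8080/jobs/${job}/targets?collector_id=${collector}"
echo "$url"
# curl -s "$url" | jq    # run this against a live port-forward
```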

PS: Shoutout to this blog post for educating me about this troubleshooting technique.

3- Is the Target Allocator enabled? Is Prometheus service discovery enabled?

If the curl commands above don’t show a list of expected ServiceMonitors and PodMonitors, then it’s time to dig a bit deeper.

One thing to remember is that just because you include the targetAllocator section in the OpenTelemetryCollector CR doesn’t mean that it’s enabled. You need to explicitly enable it. Furthermore, if you want to use Prometheus service discovery, you must explicitly enable it:

  • Set spec.targetAllocator.enabled to true
  • Set spec.targetAllocator.prometheusCR.enabled to true

So that your OpenTelemetryCollector resource looks like this:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otelcol
  namespace: opentelemetry
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    serviceAccount: opentelemetry-targetallocator-sa
    prometheusCR:
      enabled: true
...

📝 See the full OpenTelemetryCollector resource definition.

4- Did you configure a ServiceMonitor (or PodMonitor) selector?

If you configured a ServiceMonitor selector, the Target Allocator only looks for ServiceMonitors whose metadata.labels match the value in serviceMonitorSelector.

Suppose that you configured a serviceMonitorSelector for your Target Allocator, like in the following example:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otelcol
  namespace: opentelemetry
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    serviceAccount: opentelemetry-targetallocator-sa
    prometheusCR:
      enabled: true
      serviceMonitorSelector:
        matchLabels:
          app: my-app
...

By setting the value of spec.targetAllocator.prometheusCR.serviceMonitorSelector.matchLabels to app: my-app, your ServiceMonitor resource must in turn have that same value in metadata.labels:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sm-example
  labels:
    app: my-app
    release: prometheus
spec:
...

📝 For more detail, see the ServiceMonitor resource definition.

In this case, the OpenTelemetryCollector resource's spec.targetAllocator.prometheusCR.serviceMonitorSelector.matchLabels is looking only for ServiceMonitors having the label app: my-app, which we see in the previous example.

If your ServiceMonitor resource is missing that label, then the Target Allocator will fail to discover scrape targets from that ServiceMonitor.

NOTE: The same applies if you’re using a PodMonitor. In that case, you would use a podMonitorSelector instead of a serviceMonitorSelector.
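The matching rule can be illustrated with a tiny shell sketch. The label values below are hard-coded from the examples in this step; on a live cluster you could pull the real labels with kubectl get servicemonitor sm-example -n opentelemetry -o jsonpath='{.metadata.labels}':

```shell
# Does the selector's matchLabels entry appear among the monitor's labels?
selector_label="app=my-app"                     # from serviceMonitorSelector.matchLabels
monitor_labels="app=my-app release=prometheus"  # from the ServiceMonitor's metadata.labels

case " $monitor_labels " in
  *" $selector_label "*) result="match" ;;
  *)                     result="no-match" ;;
esac
echo "$result"  # "match" means the Target Allocator will watch this ServiceMonitor
```

Remove the app=my-app label from monitor_labels and the result flips to no-match, which is exactly the silent failure described above.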

5- Did you leave out the serviceMonitorSelector and/or podMonitorSelector configuration altogether?

As we learned above, setting mismatched values for serviceMonitorSelector and podMonitorSelector results in the Target Allocator failing to discover scrape targets from your ServiceMonitors and PodMonitors, respectively.

Similarly, in v1beta1 of the OpenTelemetryCollector CR, leaving out this configuration altogether also results in the Target Allocator failing to discover scrape targets from your ServiceMonitors and PodMonitors.

As of v1beta1 of the OpenTelemetry Operator, you must include a serviceMonitorSelector and podMonitorSelector, even if you don’t intend to use them, like this:


prometheusCR:
  enabled: true
  podMonitorSelector: {}
  serviceMonitorSelector: {}

This configuration matches all PodMonitor and ServiceMonitor resources. See the full example.

I just learned this today, as I was updating my OpenTelemetryCollector YAML from v1alpha1 to v1beta1.

6- Do your labels, namespaces, and ports match for your ServiceMonitor and your Service (or PodMonitor and your Pod)?

The ServiceMonitor is configured to pick up Kubernetes Services that match on:

  • Labels
  • Namespaces (optional)
  • Ports (endpoints)

Suppose that you have this ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sm-example
  labels:
    app: my-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - opentelemetry
  endpoints:
    - port: prom
      path: /metrics
    - port: py-client-port
      interval: 15s
    - port: py-server-port

The previous ServiceMonitor is looking for any Services that:

  • have the label app: my-app
  • reside in a namespace called opentelemetry
  • expose a port named prom, py-client-port, or py-server-port

So for example, the following Service resource would get picked up by the ServiceMonitor, because it matches the above criteria:

apiVersion: v1
kind: Service
metadata:
  name: py-prometheus-app
  namespace: opentelemetry
  labels:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
spec:
  selector:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
  ports:
    - name: prom
      port: 8080

Conversely, the following Service resource would NOT get picked up, because the ServiceMonitor is looking for ports named prom, py-client-port, or py-server-port, and this Service’s port is called bleh.

apiVersion: v1
kind: Service
metadata:
  name: py-prometheus-app
  namespace: opentelemetry
  labels:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
spec:
  selector:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
  ports:
    - name: bleh
      port: 8080

NOTE: If you’re using PodMonitor, the same applies, except that it picks up Kubernetes pods that match on labels, namespaces, and named ports. For example, see this PodMonitor resource definition.
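The port-matching rule from this step can be sketched the same way. The port names below are hard-coded from the examples above; on a live cluster you could list a Service's port names with kubectl get svc py-prometheus-app -n opentelemetry -o jsonpath='{.spec.ports[*].name}':

```shell
# Does any of the Service's port names appear in the ServiceMonitor's endpoints?
monitor_ports="prom py-client-port py-server-port"  # from the ServiceMonitor's endpoints
service_ports="bleh"                                # from the second Service above

match=""
for sp in $service_ports; do
  case " $monitor_ports " in
    *" $sp "*) match="$sp" ;;
  esac
done

if [ -n "$match" ]; then
  echo "matched port: $match"
else
  echo "no matching port: this Service will not be scraped"
fi
```

Rename the Service port from bleh to prom and the loop finds a match, mirroring the first Service example.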

Final Thoughts

With a little know-how, troubleshooting Target Allocator issues goes from scary to manageable. And don’t forget to actually deploy your resources first, to save yourself a lot of heartache and embarrassment. 🫥

I’d also like to add that I have contributed this guide to the OTel docs, because I think that contributing stuff like this back to the source of truth for open source projects is important.

If you’d like to dig into other aspects of the OpenTelemetry Operator, such as OTel Operator’s auto-instrumentation capability, along with some troubleshooting tips, be sure to check out my post on this topic. I’ve also got a PR on the troubleshooting guide for this.

And now, I will leave you with a rare photo in which you can see both of my rats, Katie and Buffy, TOGETHER! Pardon the fuzziness. It’s a screen cap of a video. 🙃

Man in maroon shirt cradling two rats in his hands. The rat on the left is light brown. The rat on the right is very dark brown.
Katie and Buffy were still(ish) enough for a photo op together. Photo by Adriana Villela.

Until next time, peace, love, and code. ✌️💜👩‍💻
