Dude, Where’s My Error?
co-written with Reese Lee
Depending on the language(s) you’re used to developing in, you may have certain ideas about what an error is, as well as what constitutes an exception and how it should be handled. For example, Go does not have exceptions, partly to discourage programmers from labeling too many ordinary errors as exceptional. On the other hand, languages such as Java and Python provide built-in support for throwing and catching exceptions.
When different languages disagree about what an error or exception even is, and about how each should be handled, what do you use when you need standardized telemetry and error reporting across microservices written in those languages? OpenTelemetry is the tool with which we'll address the following, and more:
- Where an error is visualized in a backend may not be where you think it'll be, and it may not look how you expect.
- How span kind affects error reporting.
- Errors reported by spans vs. logs.
What is OpenTelemetry?
OpenTelemetry (OTel for short) is a project of the Cloud Native Computing Foundation (CNCF). It's an open source, vendor-neutral observability framework for instrumenting, generating, collecting, and exporting telemetry data.
Handling errors in OTel
So how does OTel deal with all these conceptual differences across languages? This is where the specification (or “spec” for short) comes in. The spec provides a blueprint for developers working on various parts of the project, and standardizes implementation across all languages.
Since language APIs and SDKs are implementations of the spec, there are general rules against implementing anything that isn't covered in the spec. This provides a guiding principle to help organize contributions to the project. In practice, there are a few exceptions; for example, a language might prototype a new feature as part of adding it to the spec, and the feature may be published (usually as alpha or experimental) before the corresponding spec change is finalized.
Another exception is when a language decides to diverge from the spec. Although it is generally not advised, sometimes there are strong language-specific reasons to do something different. In this way, the spec allows for some flexibility for each language to implement features as idiomatically as possible. For example, most languages (such as Python) have implemented RecordException, while Go has implemented RecordError, which does the same thing.
You can view this compliance matrix of the spec across all languages, but you’ll get the most updated info by checking the individual language repository. Now we can begin figuring out how to handle errors in OTel, starting with how to report them:
- Spans
- Logs
Errors in spans
In OTel, spans are the building blocks of distributed traces, representing individual units of work within a distributed system. Spans are related to each other and to a trace through context. Put simply, context is the glue that turns a pile of individual spans into a unified trace. Context propagation allows us to pass that information across multiple systems, tying them together. Traces can tell us all sorts of things about our applications through metadata and span events.
Enhancing spans with metadata
OTel enables you to enhance spans with metadata (attributes) in the form of key-value pairs. By attaching relevant information to spans, such as user IDs, request parameters, or environment variables, you can gain deeper insights into the circumstances surrounding an error and quickly identify its root cause. This metadata-rich approach to error handling can significantly reduce the time and effort required to diagnose and resolve issues, ultimately improving the reliability and maintainability of your applications.
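As a minimal sketch of what this looks like with the OTel Python API (the span name and the app.* attribute keys below are illustrative, not official semantic conventions):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_request(user_id: str, request_params: dict):
    # Start a span for this unit of work and attach metadata as attributes
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.request.param_count", len(request_params))
        # ... the actual work happens here ...
```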
Spans also have a span kind field, which gives us some additional metadata that can help developers troubleshoot errors. OTel defines several span kinds, each of which has unique implications for error reporting:
- client: For outgoing synchronous remote calls; for example, outgoing HTTP request or DB call.
- server: For incoming synchronous remote calls; for example, incoming HTTP request or remote procedure call.
- internal: For operations that do not cross process boundaries; for example, instrumenting a function call.
- producer: For the creation of a job which may be asynchronously processed later; for example, job inserted into a job queue.
- consumer: For the processing of a job created by a producer, which may start long after the producer span has ended.
Span kind is determined automatically by the instrumentation libraries used.
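If you're instrumenting by hand rather than relying on an instrumentation library, you can pass the kind yourself. Here's a small sketch using the Python API (the span name is illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# An outgoing synchronous call, so we mark the span as a client span
with tracer.start_as_current_span("fetch-inventory", kind=trace.SpanKind.CLIENT) as span:
    # ... make the outgoing HTTP request or DB call here ...
    pass
```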
Spans can be further enhanced with span status. By default, span status is marked as Unset unless otherwise specified. You can mark a span status as Error if the resulting span depicts an error, and Ok if the resulting span is error-free.
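Here's a minimal sketch of setting span status with the Python API; charge_card and the span name are stand-ins for your own business logic:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_card():
    raise RuntimeError("card declined")  # stand-in for real business logic

with tracer.start_as_current_span("charge-card") as span:
    try:
        charge_card()
        span.set_status(Status(StatusCode.OK))  # explicitly mark success
    except RuntimeError as err:
        # Mark the span as an error; the description shows up in most backends
        span.set_status(Status(StatusCode.ERROR, str(err)))
```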
Enhancing spans with span events
A span event is a structured log message embedded within a span. Span events enhance spans with descriptive information, and can also have attributes of their own. New Relic synthesizes span events into its own data type, called SpanEvent.

When a span status is set to Error, a span event is created automatically, capturing the span's resulting error message and stack trace as an event on that span. You can further enhance this span error by adding attributes to it.
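You can also add your own span events directly. A small sketch with the Python API; the event name and attribute keys below are made up for illustration:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("roll-dice") as span:
    # Attach a structured, timestamped note to the span, with its own attributes
    span.add_event(
        "cache.miss",
        attributes={"cache.key": "dice.sides", "app.retry": False},
    )
```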
Earlier, we mentioned a method called RecordException. Per the spec (emphasis our own), "To facilitate recording an exception languages SHOULD provide a RecordException method if the language uses exceptions. … The signature of the method is to be determined by each language and can be overloaded as appropriate."

Since Go doesn't support the "conventional" concept of exceptions, it instead supports RecordError, which essentially does the same thing idiomatically. However, you have to make an additional call to set the span's status to Error if that's what it should be, as it won't be set automatically. Similarly, RecordException can be used to record span events without setting the span's status to Error, which means you can use it to record any additional data about a span.
By decoupling the span status from being automatically set to Error when a span exception occurs, you can support the use case where you have an exception event with a status of Ok or Unset. This gives instrumentation authors the most flexibility.
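Putting that together, here's a minimal Python sketch of recording an exception event and then setting the status separately (the span name and the app.retryable attribute are illustrative):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("do-roll") as span:
    try:
        int("not-a-number")  # contrived failure
    except ValueError as err:
        # record_exception adds an "exception" span event with the type,
        # message, and stack trace...
        span.record_exception(err, attributes={"app.retryable": True})
        # ...but the span status stays Unset unless you set it yourself:
        span.set_status(Status(StatusCode.ERROR, "dice roll failed"))
```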
Errors depicted in logs
In OTel, a log is a structured, timestamped message emitted by a service or other component. The recent addition of logs to OTel gives us yet another way of reporting errors. Logs have traditionally had different severity levels for representing the type of message being emitted, such as DEBUG, INFO, WARNING, ERROR, and CRITICAL.
OTel allows logs to be correlated with traces: via trace context, a log message can be associated with a span within a trace. Hence, looking for a log message with a log level of ERROR or CRITICAL can yield further information about what led to that error by pulling up the correlated trace.
To record an error on a log, either exception.type or exception.message is required, while exception.stacktrace is recommended. You can view more information about the semantic conventions for log exceptions here.
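For example, in Python, logging.exception captures the active exception's type, message, and stack trace; assuming you've wired up an OTel logging bridge or handler, that information can be mapped onto the exception.* attributes of the emitted log record. The function below is a contrived illustration:

```python
import logging

logger = logging.getLogger(__name__)

def do_roll():
    try:
        return int("not-a-number")  # contrived failure
    except ValueError:
        # Emits an ERROR-severity log with exc_info attached; an OTel logging
        # bridge can translate that into exception.type, exception.message,
        # and exception.stacktrace on the log record
        logger.exception("dice roll failed")
        raise
```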
Logs or spans to capture errors?
After all this, you might be wondering which signal to use to capture errors: spans or logs? The answer is: “It depends!” Perhaps your team primarily uses traces; perhaps it primarily uses logs.
Spans can be great for capturing errors, because if the operation errors out, marking a span as an error makes it stand out and therefore easier to spot. On the other hand, if you’re not filtering or tail sampling your traces and your system is producing thousands of spans per minute, you could miss errors that aren’t occurring frequently, but that still need to be handled.
What about using span events versus logs? Again, this depends. It may be convenient to use span events, because when a span status is set to Error, a span event with the exception message (and other metadata you may wish to capture) is automatically created.
Another consideration is your observability backend. Does your backend render both logs and traces? How easily queryable or discoverable are your logs, spans, and span events? Is log and trace correlation supported?
Visualizing errors in different backends
While OTel provides us with the raw telemetry data emitted by our systems, it doesn't provide data visualization or interpretation. This is done by an observability backend. Because OTel is vendor-neutral, the same telemetry can be visualized and interpreted by different backends without re-instrumenting your application.
Jaeger
Let’s take a look at what an OTel error looks like in Jaeger. The error data was generated by the code in this repository. Here is a trace view for the service py-otel-server. As you can see below, the error spans show up as red dots:
And if we drill down and zero in on the error span, we can click into Logs, which is how span events are expressed in Jaeger, and view the information that was captured on it:
The span is clearly marked as error, and includes a span event with the exception captured. Jaeger expresses the span event as a log, but does not visualize logs outside of spans.
Proprietary backends
If you've been using a proprietary agent to monitor your applications and have recently migrated to OTel, you might notice that an OTel error isn't expressed the way you expect in your observability backend, compared to the same error captured by the proprietary agent. This is most likely because OTel models errors differently than vendors have traditionally modeled them.
As a broad example, vendors might have their own notion of what constitutes a logical unit of work in an application. You may be familiar with the term transaction, which means something slightly different from vendor to vendor; in OTel, this is represented by a trace. You've likely noticed differences in your data visualization experience as vendors adjust their platforms to accommodate OTLP data as a first-class citizen.
As a more specific example, OTel's notion of span kinds may affect how your OTel error is expressed in your backend. For instance, if you have a trace that has one exception and it's on an internal span with its status set to Error, you should see the trace marked with an error, but it may not be counted toward your overall app error rate. This is because the vendor might have an opinion that only errors on entry point spans (server spans) and consumer spans should be counted toward your error rate.
If your backend supports trace and log correlation, you should be able to navigate to the associated trace from the log, and vice versa. Furthermore, while Jaeger visualizes span events as logs, some vendors might synthesize span events as their own data type instead of as a log data type, which affects the way you query that data.
To give you an idea of what this can look like, let's look at the same spans that we saw in Jaeger in two different vendor backends: ServiceNow Cloud Observability (SNCO) and New Relic.
ServiceNow Cloud Observability
The screenshot below shows a high-level view of spans for the do_roll operation in ServiceNow Cloud Observability (SNCO) via Notebooks. There's a mixture of green dots and red triangles: the green dots represent spans without errors, and the red triangles represent spans with errors.
If we drill down and click on one of the red triangles, we get this trace waterfall view of the span. Because the span was marked “error”, it is depicted as red in the UI. You can also see the associated span error (span event), just like we saw in Jaeger:
Here is an example of an error log. Because the log is marked as “error”, it shows up as red in the UI. Its correlated trace ID is also listed, so you can see in what trace this error originated.
Here’s a screenshot of the above log relative to other logs captured. You can also see the other logs that are a part of the same trace:
New Relic
In New Relic, you can click into the trace group, called send_requests, and see which traces contain spans with errors:
Similar to what you see in SNCO, selecting one of the error traces allows you to see a trace waterfall of all the spans involved. We can see that our do_roll span contains an exception as a span event. You can also click Attributes to see the associated metadata, including any custom attributes you've added:
You can click into the span events to view further details about the exception, as well as any custom attributes you’ve added to your span event — in this case, you can see our example dummy attributes. Because we’re manually recording a span event, you will also see a second span event captured in addition to the exception:
You can access any correlated logs directly from the trace, by clicking Logs from the selected trace. Here, you’ll see three logs, and at which point in the trace they were generated. The chart shows how many were error logs:
Conclusion
We’ve just explored the challenges of handling errors and exceptions across different programming languages within a microservices architecture, and introduced OTel as a solution for standardized telemetry and error reporting. The OTel specification serves as a blueprint for standardizing error handling across various languages, providing guidelines for implementation, but allowing for a degree of flexibility.
You can record errors on spans by making use of your language SDK's RecordException or its equivalent, and enrich the span events further by adding custom attributes. You can also record errors on logs by adding exception.type or exception.message, and capture the stack trace by adding exception.stacktrace to yield further information about what happened.
Once that data is in your observability backend, if you have previously used their proprietary monitoring agent, you might notice that there is a difference in how OTel-instrumented errors are visualized versus how the agent-instrumented errors were visualized. This is mainly because OTel models errors differently than vendors might have previously done.
By leveraging OTel’s capabilities to record errors through logs and spans and to enhance them with metadata, you can gain deeper insights into your applications’ behavior and more effectively troubleshoot issues. You’ll be better equipped to build and maintain resilient, reliable, and high-performing software applications in today’s dynamic and demanding environments. Learn more about error handling in OTel here.
Additional Resources
If you'd like to see a full-fledged example of OTel errors in action, check out Adriana's GitHub repo here. It features a Python example that illustrates errors in OTel. You can also check out the accompanying explainer video here.
You can also catch the talk version of this blog post, which Adriana and Reese gave at Observability Day EU 2024 in Paris, below: