Unpacking Observability


Picture This

Sandy’s company has a high-profile flagship application. The development team has deployed a microservice to Prod. A few hours after the deployment, the company’s Twitter feed is flooded with angry Tweets from users. The anger spreads to the company’s Facebook page too. The app is acting really weird. It’s not an issue that’s been seen before, so none of the usual alerts get triggered. Sandy is on-call, and therefore gets paged to look into this issue. Sandy starts troubleshooting.

A Better Way

Sandy’s company has a high-profile flagship application. The development team has deployed a microservice to Prod. The company has Observability practices and tooling in place, so right after deployment, they check the app’s health by logging into the Observability tool. Sandy has gone into the system enough times to know what normal looks like, and right away, they notice that something looks out of place. They do a bit of digging, and within 10 minutes, they find the culprit. Damn. That was a weird use case, and if left unchecked, the company would’ve ended up with a lot of unhappy users in a few hours’ time.

So what did Observability buy Sandy in that second scenario?

  • It enabled Sandy to troubleshoot quickly. Sandy had never encountered this particular issue before, but having proper instrumentation in place enabled them to quickly identify the culprit.
  • It gave Sandy their sanity back! On-call doesn’t have to be a horribly stressful experience if you have Observability on your side!

What is Observability?

Observability (or o11y, for short) is a paradigm shift. Much of the literature out there talks about Observability in terms of the so-called “Three Pillars”: logs, metrics, and traces. But the pillars are just data; the real shift is in what that data lets you do:

“You can understand the inner workings of a system […] by asking questions from the outside […], without having to ship new code every time. It’s easy to ship new code to answer a specific question that you found that you need to ask. But instrumenting so that you can ask any question and understand any answer is both an art and a science, and your system is observable when you can ask any question of your system and understand the results without having to SSH into a machine.”

In a nutshell, Observability lets you easily deal with unknown unknowns. To help achieve this, you must instrument your code properly. If you instrument your code properly, you don’t need to keep adding log lines (and therefore redeploying your code) every time there is an issue, just to figure out what’s happening in Prod. You should also not be SSHing into a machine as a first line of defence. That should be your very last resort.

Observability Best Practices

Observability is a real paradigm shift, and it takes a while to wrap your head around it, so don’t expect to do it perfectly right out of the gate. Like all new things, it’s an iterative process, and there will be some failures. But we learn best from failure, right? So don’t be hard on yourself. Start with some of the guidelines below to help you succeed in your Observability journey.

Focus on Observability-Driven Development (ODD)

Just like Test-Driven Development (TDD) puts an emphasis on writing unit tests as you write code, Observability-Driven Development puts an emphasis on instrumenting as you code. Get your developers in the habit of instrumenting as they code.
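
To make ODD a bit more concrete, here’s a minimal sketch of what “instrumenting as you code” might look like, using the OpenTelemetry Python API (OpenTelemetry comes up again in the recap below). The service name, function, and attributes are made up for illustration:

    from opentelemetry import trace

    # Hypothetical service name; in ODD, the tracer lives right next to the code.
    tracer = trace.get_tracer("checkout-service")

    def process_order(order):
        # The span is written at the same time as the business logic,
        # not bolted on after something breaks in Prod.
        with tracer.start_as_current_span("process_order") as span:
            span.set_attribute("order.id", order["id"])
            span.set_attribute("order.item_count", len(order["items"]))
            total = sum(item["price"] for item in order["items"])
            span.set_attribute("order.total", total)
            return total

The point isn’t this particular API; it’s the habit of writing the telemetry and the logic together, the same way TDD pairs tests with code.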

Instrument, instrument, instrument!

Instrumenting your code is super-important if you want to practice ODD. Getting your instrumentation right is also important. While it might require some tweaking as you go along, here are some guidelines that you and your team can follow when instrumenting code:

  • Events should be wide. This means you should send as much info as you can in a single event, rather than breaking it up into many log lines (the sketch after this list shows what a wide event, and the deep trace around it, might look like). “Measure everything, and figure out what matters later.” (o11ycast, Episode 18, 18:03)
  • Traces should be deep. As stated in the Guide to Achieving Observability, “Tracing shows the relationships among various services and pieces in a distributed system, and tying them together helps give a more holistic view of what’s happening in production.”
  • Instrument the stuff that’s broken first. Wouldn’t you want to tackle the lowest-hanging fruit first? No, you wouldn’t, because the lowest-hanging fruit isn’t the stuff that keeps paging your SREs in the middle of the night, and it isn’t the stuff that makes your customers send you angry Tweets and Facebook messages. Start with the parts of the system that actually hurt.
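
Putting the first two guidelines together, here’s a hypothetical sketch of a wide event with a deep trace: one attribute-rich span per request, with a child span for the downstream call. All of the names, attributes, and the inventory helper are invented for illustration:

    from opentelemetry import trace

    tracer = trace.get_tracer("checkout-service")  # hypothetical service name

    def check_inventory(items):
        # Stand-in for a real downstream call.
        return all(item.get("in_stock", True) for item in items)

    def handle_checkout(user, cart):
        # Wide event: pile the context for this request onto a single span,
        # instead of scattering it across many small log lines.
        with tracer.start_as_current_span("handle_checkout") as span:
            span.set_attribute("user.id", user["id"])
            span.set_attribute("user.plan", user["plan"])
            span.set_attribute("cart.item_count", len(cart["items"]))
            span.set_attribute("cart.total", cart["total"])

            # Deep trace: the downstream call gets its own child span, so the
            # relationships between the pieces stay visible end to end.
            with tracer.start_as_current_span("check_inventory") as child:
                in_stock = check_inventory(cart["items"])
                child.set_attribute("inventory.all_in_stock", in_stock)

            span.set_attribute("checkout.all_in_stock", in_stock)
            return in_stock

Either way you slice it, the idea is the same: one rich event per unit of work, with child spans tying the distributed pieces together.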

Know what “Normal” looks like in Prod

As we saw in the second scenario of our use case, Sandy checked the Observability tool right after the app was deployed to Prod. In doing so, they were able to identify an issue before it became catastrophic. Rule of thumb: When you deploy your code to Prod, look at it. Don’t wait for bad things to happen.


Get rid of the noise

I recently attended an Observability vendor presentation in which the vendor proudly boasted about a feature that filters through log noise. I was in utter shock, because my immediate thought was that it caters to bad practices. Observability is a paradigm shift, and part of that shift is refactoring your logs into meaningful, structured events so that you don’t have to dig through noisy log data to answer questions. Is that easy to do? No. But nothing worth doing ever is.

Choose the right tool for the job

There are many tools out there claiming to be “Observability tools”, but not all of them are. That’s why it’s important to choose the right one. I can’t tell you what tool to choose, but I can tell you that asking the following questions will help you figure out whether a tool truly delivers Observability:

  • How well does the tool do at answering questions you didn’t even know you had (i.e. unknown unknowns)?
  • Does the tool enable you to be proactive (i.e. does it help you identify things in Prod before they become an issue for your customers)?
  • Does the tool replace having to stitch a bunch of separate tools together to allow you to achieve Observability?

Where do Monitoring, Alerting, and Metrics fit in?

Monitoring

According to the Guide to Achieving Observability, “Monitoring systems collect, aggregate, and analyze periodic metrics to systematically sift through known patterns that indicate failures might be occurring. Observability takes a different approach that allows you to identify new and unexpected failures.”

Alerting

Alerts are usually triggered when a certain threshold is reached. For example: low disk space, high CPU, high RAM. There will always be a need for alerts, but too many of them become overwhelming, and it gets hard to tell what’s important and what’s not.
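
For illustration, here’s a hypothetical sketch of what threshold-based alerting boils down to (the limits and the notify callback are invented, and psutil is just one common way to read system stats). Every check encodes a failure mode someone already anticipated, which is exactly why alerts alone can’t cover unknown unknowns:

    import shutil

    import psutil  # third-party library for reading system stats

    # Hypothetical thresholds; real ones would live in your alerting config.
    CPU_LIMIT_PCT = 90
    DISK_FREE_LIMIT_GB = 5

    def check_known_conditions(notify):
        # Each check encodes a failure mode someone anticipated in advance.
        cpu = psutil.cpu_percent(interval=1)
        if cpu > CPU_LIMIT_PCT:
            notify(f"High CPU: {cpu:.0f}%")

        free_gb = shutil.disk_usage("/").free / 1e9
        if free_gb < DISK_FREE_LIMIT_GB:
            notify(f"Low disk space: {free_gb:.1f} GB free")

A novel failure that doesn’t trip any of these conditions sails right past, which is the gap Observability fills.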

Metrics

Metrics measure something, which means you have to know what you’re measuring. Metrics require foresight into what’s going to happen later on. (o11ycast, Episode 18, 07:23) In the world of Observability, we’re dealing with unknown unknowns, so we don’t know ahead of time what we’ll need to measure. Which means…hasta la vista, metrics!
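
To show what that foresight looks like in code, here’s a hypothetical counter using the prometheus_client library; the metric name and labels are invented, and they have to be decided before the interesting failure ever happens:

    from prometheus_client import Counter

    # The metric name and its labels are chosen up front, ahead of any incident.
    CHECKOUT_FAILURES = Counter(
        "checkout_failures_total",             # hypothetical metric name
        "Number of failed checkout attempts",
        ["reason"],                            # labels must be known in advance
    )

    def record_failure(reason: str) -> None:
        # Anything that doesn't fit a predeclared name or label is invisible here.
        CHECKOUT_FAILURES.labels(reason=reason).inc()

If the weird thing happening in Prod doesn’t map to a reason anyone thought to define, this counter never sees it.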

Final Thoughts

Observability is really hard to wrap your head around. I hope that this post has clarified some things around Observability for you. I have to admit that it was self-serving too. I’ve been taking a lot of notes on Observability lately, and I wanted to organize them into a cohesive narrative. 😊

To recap:

  • Observability-Driven Development is the practice of instrumenting as you code.
  • When we instrument our code, we should focus on wide Events and deep Traces.
  • Check your Observability system often, so you know what “normal” looks like, and check it again right after you deploy to Prod, so you can identify issues before they become problematic.
  • OpenTelemetry is a great way to instrument your code, as it’s open-source and vendor agnostic.
  • Observability leaves little room for metrics and monitoring, as these deal with known unknowns, and Observability deals with unknown unknowns.
  • Good Observability practices reduce your alerts to the meaningful ones only.

“I can’t predict it, and I’m not even gonna try”

I shall now reward you with a picture of some cows chillin’ in France.

Photo by Stijn te Strake on Unsplash

Related Reading

Be sure to check out my follow-up post in the Unpacking Observability series:
