Using OpenTelemetry to Create Web-Based Service Level Objectives


Introduction

OpenTelemetry is a powerful open-source observability framework that enables organizations to collect, process, and analyze telemetry data across distributed systems. Its standardized approach to gathering metrics, logs, and traces makes it easier to monitor performance and detect issues in real time. This is particularly valuable for defining Service Level Objectives (SLOs) because it allows teams to set reliability targets based on end-user impact data from each part of the application. This makes OpenTelemetry a strong foundation for building meaningful SLOs that reflect real user experience and help teams make smarter, faster decisions about reliability.

System health and performance are not just technical concerns. They directly impact the customer experience and, ultimately, the success of the business. When a website or app is slow, unreliable, or unavailable, customers notice and leave. But it’s not just the obvious outages or major failures that drive people away. Small delays, inconsistent behavior, or moments where the service doesn’t work quite the way a user expects can chip away at trust and satisfaction. These subtle issues often go unnoticed by the business until it’s too late. By using OpenTelemetry to collect real-time data and pairing it with Service Level Objectives (SLOs), companies can turn both the big problems and the small annoyances into measurable signals. This allows teams across the organization, not just engineers, to monitor and improve the entire customer experience, down to the fine details.

Running a local OTel demo

ℹ️
If you are already familiar with the OpenTelemetry stack, you can skip ahead to the next section, where we demonstrate examples of defining meaningful web service-level objectives.

Note: the official OTel demo, running locally on a MacBook, was used for this example.

Running a local OpenTelemetry demo allows teams to experiment in a controlled environment before applying anything to real production systems. Essentially, we want to test what we are doing, and make sure the right data is collected to measure and improve reliability, without disrupting actual users.

Install the demo using the Helm chart

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install my-otel-demo open-telemetry/opentelemetry-demo

Once installed, you can access the application in your browser, but first you need to port-forward the frontend proxy service.

kubectl port-forward svc/my-otel-demo-frontendproxy 8080:8080

Now the application should be available at http://localhost:8080:

[Screenshot: the OpenTelemetry Astronomy Shop frontend in the browser]

Configuring OpenSearch as Jaeger storage

We will make one change to the demo application to set OpenSearch as the storage layer for Jaeger. This change minimizes the amount of configuration work needed on the Nobl9 side.

The my-otel-demo-jaeger deployment has SPAN_STORAGE_TYPE and ES_SERVER_URLS set as shown below.

# my-otel-demo-jaeger deployment
- name: SPAN_STORAGE_TYPE
  value: elasticsearch
- name: ES_SERVER_URLS
  value: http://otel-demo-opensearch:9200/

The elasticsearch storage type was used primarily because it is compatible with OpenSearch, and OpenSearch is already part of the OTel demo's K8s cluster. Cassandra, Elasticsearch, and OpenSearch are Jaeger's primary supported storage backends. Nobl9 is the only platform that can manage data agnostically, meaning you can import your data via one of our out-of-the-box integrations or via the agent, and Elasticsearch/OpenSearch is one of the integration options.

ℹ️
The OpenSearch deployment configuration is not production-ready: the indexes have yellow status because the cluster cannot satisfy the replication factor.

How web traces are collected

For the demo, the UI (frontend-web) was instrumented using the OpenTelemetry browser auto-instrumentation. It automatically creates trace spans for resource fetches and user interactions, such as click events. In the opentelemetry-demo repository, you can see how the UI was instrumented. Here is another great example of using OTel auto-instrumentation.

Traces let teams measure exactly what users experience when they browse the site. Instead of relying on synthetic tests or assumptions, companies can see how fast pages load, whether key features like checkout or login work smoothly, and what problems users actually face.

Adding instrumentation based on OTel means staying vendor-agnostic, which can be preferable for these sorts of deployments.

Anytime you load the OpenTelemetry Astronomy Shop, traces are collected and automatically sent to the backend. Using this mechanism, we can create service-level objectives based on real user monitoring. This is not synthetic load; these traces show what an actual user is experiencing in the user interface. Having browser traces means you can see the experience from the user's point of view. In this post, a few examples of web-based SLOs will be shown.
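
For reference, here is a minimal sketch of what such browser auto-instrumentation looks like with the OpenTelemetry JavaScript SDK. This is an illustration, not the demo's exact wiring: the exporter URL is an assumption, and API details vary between SDK versions (SDK 2.x passes span processors via the provider constructor instead of addSpanProcessor).

// Minimal browser auto-instrumentation sketch (OpenTelemetry JS SDK 1.x).
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';

const provider = new WebTracerProvider();
// Batch spans and export them over OTLP/HTTP (this URL is an assumption).
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'http://localhost:8080/otlp-http/v1/traces' })
  )
);
provider.register();

// Automatically creates spans for document load, resource fetches,
// XHR/fetch requests, and user interactions such as clicks.
registerInstrumentations({
  instrumentations: [getWebAutoInstrumentations()],
});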

[Screenshot: trace search in the Jaeger UI]

In the Jaeger UI, you can explore traces that show how the frontend of the application performs. For example, you will see a trace when the cart is fetched from the API, along with other requests for static files like images or scripts. These traces help you understand how each part of the user journey is performing, including how quickly the CDN responds. With this level of detail, you can define web-based SLOs that are based on real user activity. While storing large volumes of trace data can be costly, using that data to drive SLOs helps teams stay focused on what truly affects users and gives them a clear signal when something needs to be investigated.

[Screenshots: example frontend traces in the Jaeger UI, including the /api/cart request and static asset fetches]

Defining the first SLOs using OpenTelemetry

Setting up Nobl9 integration

Businesses need more than a binary signal that something is broken; they need context on how the problem affects users, relative to those users' expectations. In short, businesses need to measure how well their services are meeting customer expectations. By combining OpenTelemetry with Nobl9, companies can define clear reliability goals, called SLOs, and proactively manage user experience by prioritizing the right fixes and understanding how the customer actually feels.

To do this, we first deploy the Nobl9 Agent inside the OTel demo cluster.

The OpenSearch instance can be accessed via the Kubernetes service at http://otel-demo-opensearch:9200.
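
Before wiring up the data source, it can help to confirm that the OpenSearch endpoint answers. Here is a quick, hypothetical check against OpenSearch's _cluster/health API (run from a pod inside the cluster, or locally after port-forwarding the otel-demo-opensearch service), using Node 18+:

// Hypothetical connectivity check against the in-cluster OpenSearch service.
const res = await fetch('http://otel-demo-opensearch:9200/_cluster/health');
const health = await res.json();
console.log(health.status); // expect "green" or "yellow" (yellow in this demo)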

More details about setting up the integration are available in the documentation.

Good vs. Total Ratio SLOs

Context

In the first example, we want to track the ratio of successful requests to total HTTP requests.
For demonstration purposes, the Cart deployment will be scaled up and down to simulate Cart Service unavailability.

# turn off Cart Service
kubectl scale deployment --replicas=0 -l app.kubernetes.io/component=cartservice
# turn on Cart Service
kubectl scale deployment --replicas=1 -l app.kubernetes.io/component=cartservice

SLI

To calculate the SLO, we need to query for a count of good events and a count of total events. Nobl9 will handle the time periods; it just needs the queries.

Good Count Query - the number of successful GET /api/cart requests. In this example, only status code 200 is treated as a successful request.

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "operationName": "HTTP GET"
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "bool": {
                "must": [
                  {
                    "bool": {
                      "must": [
                        {
                          "match": {
                            "tags.key": "http.url"
                          }
                        },
                        {
                          "regexp": {
                            "tags.value": "http://localhost:8080/api/cart.*"
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "bool": {
                "must": [
                  {
                    "bool": {
                      "must": [
                        {
                          "match": {
                            "tags.key": "http.status_code"
                          }
                        },
                        {
                          "term": {
                            "tags.value": "200"
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "range": {
                  "startTimeMillis": {
                    "gte": "{‌{.BeginTimeInMilliseconds}‌}",
                    "lte": "{‌{.EndTimeInMilliseconds}‌}"
                  }
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  },
  "aggs": {
    "resolution": {
      "date_histogram": {
        "field": "startTimeMillis",
        "interval": "{‌{.Resolution}‌}",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "{‌{.BeginTimeInMilliseconds}‌}",
          "max": "{‌{.EndTimeInMilliseconds}‌}"
        }
      },
      "aggs": {
        "n9-val": {
          "value_count": {
            "field": "_id"
          }
        }
      }
    }
  }
}

Total Count Query - the total number of GET /api/cart requests

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "operationName": "HTTP GET"
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "tags.key": "http.url"
                    }
                  },
                  {
                    "regexp": {
                      "tags.value": "http://localhost:8080/api/cart\\?.*"
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "range": {
                  "startTimeMillis": {
                    "gte": "{‌{.BeginTimeInMilliseconds}‌}",
                    "lte": "{‌{.EndTimeInMilliseconds}‌}"
                  }
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  },
  "aggs": {
    "resolution": {
      "date_histogram": {
        "field": "startTimeMillis",
        "interval": "{‌{.Resolution}‌}",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "{‌{.BeginTimeInMilliseconds}‌}",
          "max": "{‌{.EndTimeInMilliseconds}‌}"
        }
      },
      "aggs": {
        "n9-val": {
          "value_count": {
            "field": "_id"
          }
        }
      }
    }
  }
}

During the annotated period, the Cart Service was unavailable, and there were no K8s Pods to handle GET /api/cart requests. Reliability dropped immediately because all requests failed: there were 50 requests, and every one of them failed.
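
To make that burn concrete, here is a small illustrative calculation; the 99% target is an assumption for this example, so substitute your actual SLO target:

// Illustrative error-budget arithmetic for the outage window.
function ratioSli(good: number, total: number): number {
  // In this sketch, a window with no traffic reports 1 (no bad events).
  return total === 0 ? 1 : good / total;
}

const reliability = ratioSli(0, 50);               // 0 good of 50 total => 0
const target = 0.99;                               // assumed SLO target
const burnRate = (1 - reliability) / (1 - target); // 1.0 / 0.01 = 100x burn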

[Screenshot: Nobl9 SLO details showing the error budget burning during the Cart Service outage]

Threshold-based SLOs

Context

Let’s look at a scenario where we want to create a latency-based SLO for something like fetching the cart, or even loading a single static image from the CDN. Using real metrics from a user’s browser allows us to model the actual user experience without relying on synthetic testing.

If a user has a poor internet connection, the selected SLI will reflect that reality. In another case, the user's connection may be fine, but the application itself could be better optimized for slower connections. Which one should you measure? The answer, as engineers often say, is that it depends.

What matters here is recognizing that you can now build SLOs based on real browser metrics. For the purposes of this blog post, we’re simplifying things and treating any HTTP status 200 response as a "good" event. In a production setting, you would take a more detailed look at what counts as good performance depending on the specific user journey you are measuring.
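
As an illustration only (the field names here are hypothetical), a journey-specific definition of a good event might combine status and latency:

// Hypothetical "good event" predicate; production criteria would vary
// by user journey (e.g., checkout vs. loading static assets).
interface SpanSummary {
  statusCode: number;
  durationMs: number;
}

const isGood = (s: SpanSummary): boolean =>
  s.statusCode === 200 && s.durationMs <= 1000; // 1 s, the threshold used later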

SLI

For a threshold metric, we have to define a single query. In this example, I decided to use the 99th percentile of GET /api/cart request duration. You can also use other aggregation functions like average, min, or max, but as SREs, we love percentiles.
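
As a quick intuition for what the query's percentiles aggregation computes, here is a nearest-rank sketch. Note that OpenSearch actually uses an approximate t-digest algorithm, so its results can differ slightly, and Jaeger stores span durations in microseconds:

// Nearest-rank percentile over span durations (microseconds in Jaeger).
function percentile(durations: number[], p: number): number {
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const p99 = percentile([87_000, 98_000, 120_000, 410_000], 99); // 410_000 µs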

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "operationName": "HTTP GET"
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "tags.key": "http.url"
                    }
                  },
                  {
                    "regexp": {
                      "tags.value": "http://localhost:8080/api/cart.*"
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "range": {
                  "startTimeMillis": {
                    "gte": "{‌{.BeginTimeInMilliseconds}‌}",
                    "lte": "{‌{.EndTimeInMilliseconds}‌}"
                  }
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  },
  "aggs": {
    "resolution": {
      "date_histogram": {
        "field": "startTimeMillis",
        "interval": "{‌{.Resolution}‌}",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "{‌{.BeginTimeInMilliseconds}‌}",
          "max": "{‌{.EndTimeInMilliseconds}‌}"
        }
      },
      "aggs": {
        "n9-val": {
          "percentiles": {
            "field": "duration",
            "percents": [99]
          }
        }
      }
    }
  }
}

In this example, the browser first ran without any throttling; response times were low and reliability was a perfect 100%. Fast 4G throttling was then enabled for some time, which pushed response times above the expected threshold of 1 second and made the error budget burn. When the throttling was turned off, response times were acceptable again. In the last part of the experiment, slow 3G throttling was enabled; latency immediately jumped to around 8 seconds, which again caused the error budget to burn.

[Screenshot: Nobl9 SLO details showing error budget burn during the 4G and 3G throttling periods]

Summary

As teams move faster and ship more code with the help of AI and automation, understanding how changes impact real users has never been more critical. By pairing OpenTelemetry with Nobl9, organizations can go beyond basic uptime checks and start measuring what truly matters: how reliable and responsive the experience is for every customer. Whether it's a full outage or a subtle delay in loading a shopping cart, web-based SLOs built on browser traces give you visibility into the user journey and the power to take action.

With OpenTelemetry capturing real-time data from actual browsers and Nobl9 turning that data into clear, trackable service-level objectives, reliability becomes a shared responsibility across engineering, product, and business teams. This isn’t just about finding problems. It’s about focusing on what users actually feel and experience.
