Upgrading to Telemetry API

How to migrate your Istio deployment to the new Telemetry API

Istio now has a first-class API for configuring telemetry. Previously, you'd configure telemetry in the meshConfig section of your Istio configuration. Switching across wasn't seamless, so I figured I'd share the gotchas I ran into here.

Prometheus

If, like me, you were removing some dimensions from Istio metrics to reduce cardinality, you'll have had a section in your config that looks like this:

telemetry:
  enabled: true
  v2:
    enabled: true
    prometheus:
      configOverride:
        outboundSidecar:
          debug: false
          stat_prefix: istio
          metrics:
          - tags_to_remove:
            - destination_canonical_service
            ...

istiooperator.yaml

When you're writing your new Telemetry CR, the above config would look like this:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - disabled: true
      match:
        metric: ALL_METRICS
        mode: SERVER
    - tagOverrides:
        destination_canonical_service:
          operation: REMOVE

telemetry.yaml

WARNING! ALL_METRICS doesn't seem to work!

Yeah... for some reason, I had to explicitly disable each metric on SERVER. It feels like a bug, and I've raised it here.

spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - disabled: true
      match:
        metric: REQUEST_COUNT
        mode: SERVER
    - disabled: true
      match:
        metric: REQUEST_DURATION
        mode: SERVER
    - disabled: true
      match:
        metric: REQUEST_SIZE
        mode: SERVER
    - disabled: true
      match:
        metric: RESPONSE_SIZE
        mode: SERVER

telemetry.yaml

I won't go into the rest of the Custom Resource spec in too much detail, as it's quite well documented in the Telemetry Metrics Overrides documentation. The slightly confusing part, though, is that in your istiooperator.yaml you need to disable telemetry:

telemetry:
  enabled: false

istiooperator.yaml
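For context, here's a minimal sketch of where that flag sits in a full IstioOperator resource. I'm assuming the telemetry settings live under spec.values, as is usual for these Helm-style values; adjust the path to wherever your configOverride block from earlier lives.

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio
  namespace: istio-system
spec:
  values:
    telemetry:
      # Stop the operator creating the telemetry v2 EnvoyFilters,
      # leaving the Telemetry CR in charge of metrics config.
      enabled: false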

It's a bit counterintuitive if you ask me, but without this I found that the EnvoyFilters for telemetry V2 would still get created, and I ended up with two sets of metrics with different tag configurations being emitted on my sidecars' metrics endpoints. That increased the scrape size of my cluster pretty significantly until I'd rolling-restarted all of the Istio workloads.

Another thing I did previously was to remove the destination metrics and only record metrics at the source. I did this by editing the EnvoyFilter to remove the SIDECAR_INBOUND section. In the new Telemetry API the terminology is CLIENT and SERVER, so in our example we want to remove the SERVER metrics:

spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - disabled: true
      match:
        metric: ALL_METRICS
        mode: SERVER

telemetry.yaml
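Pulling the Prometheus pieces together, here's a rough sketch of a single Telemetry CR that combines the per-metric SERVER disables (the workaround for the ALL_METRICS issue above) with the tag removal. The metric list is abbreviated; extend it with whichever metrics and tags apply to you.

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    # Disable destination-side metrics one by one, since ALL_METRICS
    # didn't take effect for me.
    - disabled: true
      match:
        metric: REQUEST_COUNT
        mode: SERVER
    - disabled: true
      match:
        metric: REQUEST_DURATION
        mode: SERVER
    # ...repeat for REQUEST_SIZE, RESPONSE_SIZE, and any others.
    # Then strip high-cardinality tags from what's left.
    - tagOverrides:
        destination_canonical_service:
          operation: REMOVE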

That's about it for Prometheus!

Tracing

My original driver for moving to the Telemetry API was to explore sending traces to a backend other than Jaeger. The documentation gives the following example:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: <name>

telemetry.yaml

And here's the gotcha: notice that providers is an array, but it doesn't actually behave like one. You can only send to a single tracer at a time, and if you put multiple items in here, only the first will be used. I would assume (and hope) that the plan is to add support for multiple providers in the future.

Ignoring the gotcha above, these are the changes I needed to make in my meshConfig:

spec:
  meshConfig:
    enableTracing: false
    extensionProviders:
    - name: zipkin
      zipkin:
        service: zipkin.istio-system.svc.cluster.local
        port: 9411
        maxTagLength: 1024
    - name: opencensus
      opencensus:
        service: app.apm.svc.cluster.local
        port: 80
        maxTagLength: 1024
        context:
        - B3

istiooperator.yaml

This replaced this section:

spec:
  meshConfig:
    defaultConfig:
      tracing:
        max_path_tag_length: 1024
        zipkin:
          address: zipkin.istio-system.svc.cluster.local

istiooperator.yaml

Things to note here: similar to the example above where I disabled telemetry, I've also set enableTracing: false. Extension providers are well documented; think of them as configuration options for potential destinations, which we'll reference in the Telemetry CRD. In the example above you can see that I've configured two options. Note how I'm using B3 as the context for opencensus; that enables me to switch between the two and use the same header propagation method as Zipkin.

Another gotcha here is that the new Telemetry system uses EDS to discover endpoints for the extensionProvider. That means if you use a Sidecar resource, you must have the host defined. Here's an example mesh global Sidecar:

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system
spec:
  egress:
  - hosts:
    - istio-system/zipkin.istio-system.svc.cluster.local
    - kube-system/kube-dns.kube-system.svc.cluster.local
    - core-system/logstash.core-system.svc.cluster.local

sidecar.yaml

Without this, your sidecar will fail to load the new config, as it'll complain about an unknown cluster. I don't think the UX there is great, and I've raised it on GitHub.

In order to activate your new configuration, you add the following section to your telemetry config:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 10

telemetry.yaml

Here you can see I've referenced the provider I created above, with a sampling rate of 10%.
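Because the opencensus provider was also configured with B3 context, switching backends should just be a case of pointing this at the other provider; a sketch of the swap would look like this:

spec:
  tracing:
  - providers:
    # Reference the opencensus extension provider defined in
    # istiooperator.yaml instead of zipkin; propagation stays B3.
    - name: opencensus
    randomSamplingPercentage: 10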

Conclusion

That's broadly it. Having a first-class API for telemetry is great. Being able to create application-specific overrides with targeted CRs is lovely.
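For instance, a namespace-scoped Telemetry resource can bump the sampling rate for a single application, and it takes precedence over the mesh-wide default in istio-system. The namespace and rate below are made up for illustration:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-override
  namespace: my-app
spec:
  tracing:
  - providers:
    - name: zipkin
    # Sample more aggressively for this namespace only.
    randomSamplingPercentage: 50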

Migrating wasn't too bad; hopefully this doc makes it a little simpler for the next person.