Upgrading to Telemetry API
How to migrate your Istio deployment to the new Telemetry API
Istio now has a first-class API for configuring telemetry. Previously, you'd configure telemetry in the meshConfig section of your Istio configuration. Switching across wasn't seamless, so I figured I'd share the gotchas I ran into here.
Prometheus
If, like me, you were removing some dimensions from Istio metrics to reduce cardinality, you'll have had a section in your config that looks like this:
telemetry:
  enabled: true
  v2:
    enabled: true
    prometheus:
      configOverride:
        outboundSidecar:
          debug: false
          stat_prefix: istio
          metrics:
          - tags_to_remove:
            - destination_canonical_service
            ...
istiooperator.yaml
When you're writing your new Telemetry CR, the above config would look like this:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - disabled: true
      match:
        metric: ALL_METRICS
        mode: SERVER
    - tagOverrides:
        destination_canonical_service:
          operation: REMOVE
telemetry.yaml
WARNING! ALL_METRICS doesn't seem to work!
Yeah... for some reason, I had to explicitly disable each metric on SERVER. It feels like a bug, and I've raised it here.
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - disabled: true
      match:
        metric: REQUEST_COUNT
        mode: SERVER
    - disabled: true
      match:
        metric: REQUEST_DURATION
        mode: SERVER
    - disabled: true
      match:
        metric: REQUEST_SIZE
        mode: SERVER
    - disabled: true
      match:
        metric: RESPONSE_SIZE
        mode: SERVER
telemetry.yaml
I won't go into the rest of the Custom Resource spec in too much detail, as it's quite well documented in the Telemetry Metrics Overrides documentation. The slightly confusing part, though, is that in your istiooperator.yaml you need to disable telemetry:
telemetry:
  enabled: false
istiooperator.yaml
It's a bit counterintuitive if you ask me, but without this I found that the EnvoyFilters for telemetry v2 would still get created, and I ended up with two sets of metrics with different tag configurations being output on my sidecars' metrics endpoints. That increased the scrape size of my cluster pretty significantly until I'd rolling-restarted all of the Istio workloads.
Another thing I did previously was to remove the Destination metrics, and only record metrics at the source. I did this by editing the EnvoyFilter to remove the SIDECAR_INBOUND section. In the new Telemetry API the terminology is CLIENT and SERVER, so in our example we want to remove the SERVER metrics:
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - disabled: true
      match:
        metric: ALL_METRICS
        mode: SERVER
telemetry.yaml
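Putting the pieces together, a complete telemetry.yaml combining the per-metric SERVER disables (the ALL_METRICS workaround) with the tag removal would look something like this. This is a sketch assembled from the snippets above; your list of tags to remove will obviously differ:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    # SERVER-side metrics disabled individually, per the ALL_METRICS workaround
    - disabled: true
      match:
        metric: REQUEST_COUNT
        mode: SERVER
    - disabled: true
      match:
        metric: REQUEST_DURATION
        mode: SERVER
    - disabled: true
      match:
        metric: REQUEST_SIZE
        mode: SERVER
    - disabled: true
      match:
        metric: RESPONSE_SIZE
        mode: SERVER
    # cardinality reduction, replacing the old tags_to_remove config
    - tagOverrides:
        destination_canonical_service:
          operation: REMOVE
telemetry.yaml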
That's about it for Prometheus!
Tracing
My original driver for moving to the Telemetry API was to explore sending traces to a backend other than Jaeger. The documentation gives the following example:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: <name>
telemetry.yaml
And here's the gotcha: notice that providers is an array; however, it can't actually take multiple entries. You can only send to a single tracer at any time, and if you put multiple items in here, only the first will be used. I would assume (and hope) that the plan is to add support for multiple providers in the future.
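To illustrate (a hypothetical spec, using the provider names defined in the meshConfig below), declaring two providers is accepted, but only the first actually receives spans:
spec:
  tracing:
  - providers:
    - name: zipkin      # used
    - name: opencensus  # accepted by the API, but silently ignored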
Ignoring the gotcha above, these are the changes I needed to make in my meshConfig:
spec:
  meshConfig:
    enableTracing: false
    extensionProviders:
    - name: zipkin
      zipkin:
        service: zipkin.istio-system.svc.cluster.local
        port: 9411
        maxTagLength: 1024
    - name: opencensus
      opencensus:
        service: app.apm.svc.cluster.local
        port: 80
        maxTagLength: 1024
        context:
        - B3
istiooperator.yaml
This replaces the following section:
spec:
  meshConfig:
    defaultConfig:
      tracing:
        max_path_tag_length: 1024
        zipkin:
          address: zipkin.istio-system.svc.cluster.local
istiooperator.yaml
Things to note here: similar to the example above where I disabled telemetry, I've also set enableTracing: false. Extension providers are well documented; think of these as configuration options for potential destinations, which we'll reference in the Telemetry CRD. In the example above you can see that I've configured two options. Note how I'm using B3 as the context for opencensus; that enables me to switch between the two and use the same header propagation method as Zipkin.
Another gotcha here is that the new Telemetry system uses EDS to discover endpoints for the extensionProviders. That means if you use a Sidecar resource, you must have the host defined. Here's an example mesh-global Sidecar:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system
spec:
  egress:
  - hosts:
    - istio-system/zipkin.istio-system.svc.cluster.local
    - kube-system/kube-dns.kube-system.svc.cluster.local
    - core-system/logstash.core-system.svc.cluster.local
sidecar.yaml
Without this, your sidecar will fail to load the new config, as it'll complain about an unknown Cluster. I don't think the UX there is great, and I've raised it on GitHub.
In order to activate your new configuration, you add the following section to your telemetry config:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 10
telemetry.yaml
Here you can see I've referenced the provider I created above, with a sample rate of 10%.
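And since the opencensus provider is configured with B3 context, matching Zipkin's propagation headers, switching backends should just be a case of swapping the provider name. A sketch, referencing the opencensus provider defined in meshConfig above:
spec:
  tracing:
  - providers:
    - name: opencensus  # swapped in for zipkin; B3 context keeps header propagation compatible
    randomSamplingPercentage: 10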
Conclusion
That's broadly it. Having a first-class API for telemetry is great. Being able to create application-specific overrides with targeted CRs is lovely.
Migrating wasn't too bad; hopefully this doc will make it a little simpler for the next person.