
Routing by origin - service is not good enough #35320


Closed
dmedinag opened this issue Sep 20, 2024 · 4 comments · Fixed by #36567


dmedinag commented Sep 20, 2024

Component(s)

exporter/loadbalancing

Is your feature request related to a problem? Please describe.

There exist two keys for routing spans today: service and traceID. traceID is great for tail sampling, while service is the only one that produces reliable spanmetrics without insane cardinality at high volumes.

Routing spans by service, however, brings a big problem to the table if the different services generating spans produce a very uneven volume. In an environment where a small percentage of services produce >99% of the spans, instances of the otelcol running the spanmetrics connector (or any other trace pipeline receiving spans routed by service) sustain a very asymmetric load, overloading a few instances while others stay idle.
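
For illustration, a minimal first-layer configuration for this kind of setup looks roughly like the sketch below (the receiver, resolver and hostname are placeholder assumptions, not our actual deployment). With routing_key: service, every span of a given service hashes to the same second-layer collector, which is what concentrates the load:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    # All spans with the same service.name hash to the same backend, so a
    # service emitting >99% of the spans pins that load to a single instance.
    routing_key: service
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      # Placeholder: resolves the second-layer collectors running spanmetrics.
      dns:
        hostname: otelcol-spanmetrics.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```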

Describe the solution you'd like

service does a good job for spanmetrics because all of a given service's metric series are issued by a single collector in the layer behind the load-balancing collector, which simplifies aggregations and range functions that need multiple datapoints. For the same reason, pod_name would work just as well in a k8s environment, as would any other proxy for a task instance (where a task is an application that generates spans).

In the context of Kubernetes, the IP of the incoming connection uniquely identifies a task instance. My proposal is to add source_ip as a routing key for signals, which would considerably dilute the problem we have with service as a routing key by segregating the load of these big services across as many workers as the service has instances (see the configuration sketch below the notes). Since big services are typically also the ones with the most instances, both in our ecosystem and elsewhere in the industry, this is a much more efficient routing key than service and an effective solution for our problem, even if it is not as good as traceID in terms of randomness.

Note 1: this solution would also apply to any other environment where the IP uniquely identifies a task instance.
Note 2: this routing key would be applicable not only to spans, but also to metrics and logs.
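
To illustrate the proposal, only the routing key on the first layer would need to change. Note that source_ip below is the value proposed in this issue, not an option the exporter supports today; everything else mirrors the sketch above:

```yaml
exporters:
  loadbalancing:
    # Proposed value, not yet implemented: hash on the IP of the incoming
    # connection so a big service is spread across as many second-layer
    # collectors as it has instances.
    routing_key: source_ip
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelcol-spanmetrics.observability.svc.cluster.local
```

Each second-layer collector would still receive every span of a given task instance, so the metric series for that instance are produced by a single collector, the same property that makes service work well for spanmetrics today.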

Describe alternatives you've considered

Ad-hoc pipeline for offenders

Again within a Kubernetes environment, we've considered excluding the span traffic of these big services from our central spanmetrics pipeline and instead deploying otelcols as sidecars that run their own spanmetrics pipeline.

But this just hacks around the problem, increases our architectural complexity, and does not solve the real issue: service is an inherently bad key for symmetric load.

Hack the routing key

Since we have the ability to modify service before it's used as the routing key, we introduced a transformer that rewrites service to {serviceName}%{ip}, forcing the loadbalancing exporter to route by this composite key. On the second layer of collectors, we recover the original value of service by removing the artificial suffix and proceed with processing. This has mitigated the problem for now (we see a much more symmetric load on our spanmetrics collectors), but it's still just hacking around a limitation of the exporter.
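
Roughly, the workaround looks like the sketch below using the transform processor; the source.ip resource attribute is an assumption standing in for however the client IP ends up on the resource in a particular setup:

```yaml
# First layer, applied before the loadbalancing exporter (routing_key: service).
processors:
  transform/compose-routing-key:
    trace_statements:
      - context: resource
        statements:
          # Rewrites service.name to "{serviceName}%{ip}" so the exporter
          # hashes on the composite value instead of the service alone.
          # Assumes an earlier processor stored the client IP in "source.ip".
          - set(attributes["service.name"], Concat([attributes["service.name"], attributes["source.ip"]], "%"))
---
# Second layer, applied before the spanmetrics connector: drops the
# artificial suffix to restore the original service.name.
processors:
  transform/restore-service-name:
    trace_statements:
      - context: resource
        statements:
          - replace_pattern(attributes["service.name"], "%.*$", "")
```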

We preferred this solution over maintaining a custom adaptation of the loadbalancingexporter on our side, to keep the process of maintaining our collectors simple.

Additional context

No response

dmedinag added the enhancement and needs triage labels on Sep 20, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling

I believe @dhftah addressed this problem in the following PR: #35125. I just reopened it and will try to get it into a state that we can merge. Do you think this would address your use-case?

@dmedinag

> I believe @dhftah addressed this problem in the following PR: #35125. I just reopened it and will try to get it into a state that we can merge. Do you think this would address your use-case?

Theirs is a similar problem, and although the solution proposed here is a bit different, it could address our use-case as well, yes.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

The github-actions bot added the Stale label on Jan 27, 2025
atoulme added a commit that referenced this issue Mar 6, 2025
…nce on composite keys in lb (#36567)

Right now, there's a problem at high throughput using the load balancer and the `service.name` resource attribute: the load balancers themselves get slow. While it's possible to vertically scale them up to a point (e.g. about 100k req/sec), as they get slow they start to back up traffic and block on requests. Applications then can't write as many spans out, and start dropping spans.

This commit seeks to address that by extending the load balancing collector to allow creating a composite key from attributes that can still keep the load balancing decision "consistent enough" to reduce cardinality, but still spread the load across ${N} collectors.

It doesn't make too many assumptions about how operators will use this, except that the underlying data (the spans) are unlikely to be complete in all cases, and the key generation is "best effort". This is a deviation from the existing design, which hard-requires "span.name".

== Design Notes
=== Contributor Skill

As a contributor, I'm very much new to the opentelemetry collector, and
do not anticipate I will be contributing much except for as needs
require to tune the collectors that I am responsible for. Given this,
the code may violate certain assumptions that are otherwise "well
known".

=== Required Knowledge

The biggest surprise in this code was that, despite accepting a slice, the routingIdentifierFromTraces function assumes spans have been processed with the batchpersignal.SplitTraces() function, which appears to ensure that each "trace" contains only a single span (thus allowing them to be multiplexed effectively).

This allows the function to be simplified quite substantially.

=== Use case

The primary use case I am thinking about when writing this work is
calculating metrics in the spanmetricsconnector component. Essentially,
services drive far too much traffic for a single collector instance to
handle, so we need to multiplex it in a way that still allows them to be
calculated in a single place (limiting cardinality) but also spreads the load across ${N} collectors.

=== Traces only implementation

This commit addresses this only for traces, as I only care about traces.
The logic can likely be extended easily, however.

Fixes #35320
Fixes #33660

---------

Signed-off-by: Juraci Paixão Kröhling <[email protected]>
Co-authored-by: Andrew Howden <[email protected]>
Co-authored-by: Antoine Toulme <[email protected]>
Co-authored-by: Antoine Toulme <[email protected]>