implement cardinality limit for spanmetrics #38990

Closed
povilasv opened this issue Mar 26, 2025 · 7 comments · Fixed by #39084

@povilasv
Contributor

Component(s)

connector/spanmetrics

Is your feature request related to a problem? Please describe.

It's very easy for instrumentations to accidentally put UUIDs or unique URLs into the span name, which causes spanmetrics to create high-cardinality metrics.

I would like to have cardinality limit protections similar to what is available in the OTel SDKs (https://opentelemetry.io/docs/specs/otel/metrics/sdk/#cardinality-limits).

Describe the solution you'd like

Ideally an aggregation_cardinality_limit field, disabled by default or behind a feature flag, which would limit metric cardinality.

This limit should be applied per unique resource.

I.e. if I have two applications that send spans, and only one of them sends spans with UUIDs, only that application's metrics should be limited.

Additionally, to make it similar to the OTel Metrics SDK cardinality limit, each metric should get its own limit, i.e. calls_total has its own limit and duration_bucket_ms has its own limit.

So the limit is per resource, per metric.

Metrics limited by the cardinality limit should keep all of their resource attributes, but the dimensions (span.name, span.status_code, etc.) should be limited: instead of the dimensions you would get an otel.metric.overflow="true" attribute.
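To make the intended behavior concrete, here is a rough, hypothetical Go sketch (not actual connector code; the limiter type and the aggregationCardinalityLimit constant are made up for illustration). Once a metric for a given resource has used up its budget of distinct dimension sets, further data points keep their resource attributes but their dimensions collapse into a single otel.metric.overflow="true" series:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hypothetical limit, kept small so the example below overflows.
const aggregationCardinalityLimit = 3

// seriesKey identifies one metric within one resource.
type seriesKey struct {
	resource string // e.g. the service.name of the resource
	metric   string // e.g. "calls" or "duration"
}

// limiter tracks the distinct dimension sets seen per resource + metric.
type limiter struct {
	seen map[seriesKey]map[string]struct{}
}

// dimensionsKey builds a stable string key from a dimension set.
func dimensionsKey(dims map[string]string) string {
	parts := make([]string, 0, len(dims))
	for k, v := range dims {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// limit returns the dimension set to actually use for a data point.
// Resource attributes are untouched by this function; only the
// dimensions are replaced once the limit is exceeded.
func (l *limiter) limit(resource, metricName string, dims map[string]string) map[string]string {
	k := seriesKey{resource: resource, metric: metricName}
	if l.seen[k] == nil {
		l.seen[k] = map[string]struct{}{}
	}
	dk := dimensionsKey(dims)
	if _, ok := l.seen[k][dk]; ok || len(l.seen[k]) < aggregationCardinalityLimit {
		// Known series, or still under the limit: keep span.name, span.status_code, ...
		l.seen[k][dk] = struct{}{}
		return dims
	}
	// Over the limit: collapse the dimensions into the overflow marker.
	return map[string]string{"otel.metric.overflow": "true"}
}

func main() {
	l := &limiter{seen: map[seriesKey]map[string]struct{}{}}
	for i := 0; i < 5; i++ {
		dims := map[string]string{"span.name": fmt.Sprintf("GET /users/%d", i)}
		fmt.Println(l.limit("service-a", "calls", dims))
	}
}

With a limit of 3, the loop above keeps three distinct span.name series and folds the remaining two data points into the overflow series.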

Example of how the Go SDK cardinality limit works, with the limit set to 5 via the experimental environment variable:

export OTEL_GO_X_CARDINALITY_LIMIT=5

Resulting Prometheus output:

requests_total{otel_scope_name="example-meter",otel_scope_version="",request_id="43026296-fa6e-4bff-86c6-47490764389f"} 1
requests_total{otel_scope_name="example-meter",otel_scope_version="",request_id="60f17106-56d7-4aa7-85f2-57004c03682b"} 1
requests_total{otel_scope_name="example-meter",otel_scope_version="",request_id="73c8fb59-59f8-486b-b733-c3d4af7fab7a"} 1
requests_total{otel_scope_name="example-meter",otel_scope_version="",request_id="f12cdbdb-edc0-4b73-bc37-f140559c389e"} 1
requests_total{otel_metric_overflow="true",otel_scope_name="example-meter",otel_scope_version=""} 5

Code used to produce these metrics:

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/google/uuid"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Create stdout exporter
	stdoutExporter, err := stdoutmetric.New()
	if err != nil {
		log.Fatalf("failed to create stdout exporter: %v", err)
	}

	// Create Prometheus exporter
	promExporter, err := prometheus.New()
	if err != nil {
		log.Fatalf("failed to create prometheus exporter: %v", err)
	}

	// Create a meter provider with both exporters
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(stdoutExporter)),
		sdkmetric.WithReader(promExporter),
	)
	defer func() {
		if err := provider.Shutdown(context.Background()); err != nil {
			log.Fatalf("failed to shutdown meter provider: %v", err)
		}
	}()

	otel.SetMeterProvider(provider)
	meter := provider.Meter("example-meter")

	// Create counters
	requestCounter, err := meter.Int64Counter("requests_total")
	if err != nil {
		log.Fatalf("failed to create request counter: %v", err)
	}

	bytesCounter, err := meter.Int64Counter("bytes_processed_total")
	if err != nil {
		log.Fatalf("failed to create bytes counter: %v", err)
	}

	ctx := context.Background()

	// Start a goroutine to continuously record metrics with UUIDs
	go func() {
		for {
			// Generate a new UUID for each request
			requestID := uuid.New().String()
			// Create attributes with UUIDs
			attrs := attribute.NewSet(
				attribute.String("request_id", requestID),
			)

			// Record metrics with the UUID attributes
			requestCounter.Add(ctx, 1, metric.WithAttributes(attrs.ToSlice()...))
			// Simulate some bytes processed (random number between 1000 and 10000)
			bytesCounter.Add(ctx, 1000+time.Now().UnixNano()%9000, metric.WithAttributes(attrs.ToSlice()...))

			time.Sleep(1 * time.Second)
		}
	}()

	// Create HTTP server to expose Prometheus metrics
	http.Handle("/metrics", promhttp.Handler())
	server := &http.Server{
		Addr:    ":8080",
		Handler: nil,
	}

	fmt.Println("Starting server on :8080")
	if err := server.ListenAndServe(); err != nil {
		log.Fatalf("failed to start server: %v", err)
	}
}

Describe alternatives you've considered

No response

Additional context

No response

povilasv added the enhancement and needs triage labels on Mar 26, 2025
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@iblancasa
Contributor

I can work on it.

@Frapschen
Contributor

What does the value 5 mean for requests_total{otel_metric_overflow="true",otel_scope_name="example-meter",otel_scope_version=""}?

@Frapschen
Contributor

I have another perspective on handling high-cardinality label sets for span metrics. While I believe that cardinality limit protections are a good idea, there might be more suitable tools for addressing this issue, such as the Transform Processor.

Perhaps we should also consider removing the meaningless parts of the span?
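For example (purely illustrative; the regex and the placeholder value are made up, and the exact statements would depend on the user's data), span names could be normalized with the transform processor before spans reach the spanmetrics connector, so UUIDs never become distinct dimensions:

processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Replace UUIDs embedded in span names with a stable placeholder.
          - replace_pattern(name, "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", "{uuid}")

The transform processor would be placed before the spanmetrics connector in the traces pipeline.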

@iblancasa
Contributor

> What does the value 5 mean for requests_total{otel_metric_overflow="true",otel_scope_name="example-meter",otel_scope_version=""}?

This line indicates there were 5 additional requests beyond the cardinality limit, so they were grouped together without a distinct request_id.

> I have another perspective on handling high-cardinality label sets for span metrics. While I believe that cardinality limit protections are a good idea, there might be more suitable tools for addressing this issue, such as the Transform Processor.

Do you mean implementing this logic at the config level using the Transform Processor?

> Perhaps we should also consider removing the meaningless parts of the span?

I'm not sure about this. Depending on the user's settings, the information used for creating the metrics will vary, right? Like the dimensions to use. Something that we consider "meaningless" can be important in the user's environment. Or are you proposing something different?

@povilasv
Contributor Author

povilasv commented Apr 2, 2025

Personally, I think the OTel SDK cardinality limits are at the SDK level because they protect the application from using too much memory. I.e. if I create a metric that has a UUID in an attribute value, the application's memory will explode.

If we do it in the transform processor, IMO it is a bit too late, since spanmetrics has already created a bunch of cumulative time series and the memory has exploded.

My proposal would be to first do it here, then maybe package it up into some kind of generic mechanism for metric data producers.

JaredTan95 removed the needs triage label on Apr 5, 2025
@Frapschen
Contributor

I personally support this feature, and we need to apply a cardinality limit to all metrics emitted by spanmetrics.
