Skip to content

[receiver/hostmetrics] Use native API instead of WMI to get process.handles on Windows #38886

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

pjanotti
Copy link
Contributor

@pjanotti pjanotti commented Mar 23, 2025

Using the native GetProcessHandleCount Win32 API via NumFDsWithContext already implemented by gopsutil gives a noticeable performance improvement over the current version using WMI.

  • For non-Windows OSes the same behavior is kept for the metric process.handles: it returns a not supported platform error.
  • Corrected the documentation of process.open_file_descriptors with the actual behavior - as future breaking change we can consider return not supported for this metric on Windows.

Expected performance improvement is shown on the image below. On the image the red line is the new implementation, the green the current one using WMI, and in blue is the WMI service handling the query made by the current implementation of the scraper.

image

@@ -957,15 +979,15 @@ func TestScrapeMetrics_ProcessErrors(t *testing.T) {
executableError = test.cmdlineError
}

expectedResourceMetricsLen, expectedMetricsLen := getExpectedLengthOfReturnedMetrics(test.nameError, executableError, test.timesError, test.memoryInfoError, test.memoryPercentError, test.ioCountersError, test.pageFaultsError, test.numThreadsError, test.numCtxSwitchesError, test.numFDsError, test.rlimitError, test.cgroupError, test.createTimeError)
expectedResourceMetricsLen, expectedMetricsLen := getExpectedLengthOfReturnedMetrics(test.nameError, executableError, test.timesError, test.memoryInfoError, test.memoryPercentError, test.ioCountersError, test.pageFaultsError, test.numThreadsError, test.numCtxSwitchesError, test.numFDsError, test.handleCountError, test.rlimitError, test.cgroupError, test.createTimeError)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the pattern already in place, but, the number of parameters is getting too big. I will follow up with a chore PR to pass all of these in a struct or map.

@songy23 songy23 added the Run Windows Enable running windows test on a PR label Mar 24, 2025
@pjanotti
Copy link
Contributor Author

Test failure is unrelated to this PR, following up in #38903

=== FAIL: k8sclient TestNewEndpointClient (1.75s)
    testing.go:1232: TempDir RemoveAll cleanup: remove C:\Users\RUNNER~1\AppData\Local\Temp\TestNewEndpointClient3930496495\001\kubeconfig1403065089: The process cannot access the file because it is being used by another process.

=== FAIL: k8sclient TestNewEndpointClient (re-run 1) (1.49s)
    testing.go:1232: TempDir RemoveAll cleanup: remove C:\Users\RUNNER~1\AppData\Local\Temp\TestNewEndpointClient3463307539\001\kubeconfig146371[80](https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/14038112541/job/39301144731?pr=38886#step:8:81)98: The process cannot access the file because it is being used by another process.

Copy link
Contributor

@braydonk braydonk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, left a comment about process.open_file_descriptors based on how we're going to be handling it in Semantic Conventions.

@@ -170,7 +170,10 @@ metrics:
process.open_file_descriptors:
enabled: false
description: Number of file descriptors in use by the process.
extended_documentation: This metric is only available on Linux.
extended_documentation: >-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric is kind of in flux in Semantic Conventions (see open-telemetry/semantic-conventions#1798) and will likely be designated Unix-exclusive for reasons stated in this comment. We're still in the process of figuring out how we'll handle all the semconv transition, but we like to get ahead of it where we can.

As a result, I think this metric should remain Linux-exclusive as it was already, and produce an error if enabled on Windows. I think this used to be the behaviour, but it inadvertently changed on a gopsutil upgrade when this was implemented on Windows upstream. I actually forgot this was done (and was kind of hoping they wouldn't do it this way I wanted them to introduce a different API for it but the codebase wasn't structured to allow for that at the time).

I think the best thing to do here would be to introduce some validation checks in the process scraper config to fail if enabling open_file_descriptors on Windows, or process.handles on Linux. However, this would be a breaking change; the previous behaviour is that the scrape would fail every time trying to write those metrics. Maybe I'll follow this up with a featuregated move to add validation for these metrics instead of allowing them and failing them every scrape.

Sorry for the long comment, but to summarize: I think we should change this to erroring on every scrape if process.open_file_descriptors is enabled on Linux, similar to process.handles. It can produce a platform not supported error similar to the handles one.

Copy link
Contributor Author

@pjanotti pjanotti Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context @braydonk - I agree open_file_descriptors should error on Windows, as process.handles should error on non-Windows. I tend to think that reporting the error once at startup is better than on every scrape, but, this is something that we can discuss later.

I liked your suggestion for the names:

process.unix.file_descriptor.count
process.windows.handle.count

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I subscribed to the spec issue, as soon as it is settled, let's move to implement it.

@pjanotti
Copy link
Contributor Author

pjanotti commented Mar 25, 2025

Errors on CI accounted by: #38961 and two hits of #38955 - updating the branch for a new run.

@pjanotti
Copy link
Contributor Author

Errors on CI accounted by: #38903 and two hits of #38955 - updating the branch for a new run.

@dmitryax dmitryax merged commit d042dd8 into open-telemetry:main Mar 25, 2025
201 of 204 checks passed
@github-actions github-actions bot added this to the next release milestone Mar 25, 2025
@pjanotti pjanotti deleted the processscraper-optimize-process.handles branch March 25, 2025 23:35
Fiery-Fenix pushed a commit to Fiery-Fenix/opentelemetry-collector-contrib that referenced this pull request Apr 24, 2025
…handles` on Windows (open-telemetry#38886)

Using the native
[`GetProcessHandleCount`](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-getprocesshandlecount)
Win32 API via `NumFDsWithContext` already implemented by `gopsutil`
gives a noticeable performance improvement over the current version
using WMI.

* For non-Windows OSes the same behavior is kept for the metric
`process.handles`: it returns a not supported platform error.
* Corrected the documentation of `process.open_file_descriptors` with
the actual behavior - as future breaking change we can consider return
not supported for this metric on Windows.

Expected performance improvement is shown on the image below. On the
image the red line is the new implementation, the green the current one
using WMI, and in blue is the WMI service handling the query made by the
current implementation of the scraper.


![image](https://github.com/user-attachments/assets/4db79415-9e38-4c7f-a27c-99c24ffb24ed)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants