hostmetrics: use WMI to fetch ppid #35337

Closed
wants to merge 12 commits into from
27 changes: 27 additions & 0 deletions .chloggen/braydonk_hostmetrics_wmi_parent_pid.yaml
@@ -0,0 +1,27 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: hostmetricsreceiver

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Use Windows Management Instrumentation (WMI) to fetch process.ppid by default. Add `disable_wmi` config option to fall back to old behaviour.

Review comment (Contributor): The rest of this PR references `wmi_enabled` rather than `disable_wmi`.


# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [32947]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: The parent process ID is also no longer fetched when the `process.parent_pid` resource attribute is disabled.

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [user]
14 changes: 14 additions & 0 deletions receiver/hostmetricsreceiver/README.md
@@ -123,6 +123,7 @@ process:
mute_process_io_error: <true|false>
mute_process_user_error: <true|false>
mute_process_cgroup_error: <true|false>
wmi_enabled: <true|false>
scrape_process_delay: <time>
```

@@ -133,6 +134,19 @@ The following settings are optional:
- `mute_process_cgroup_error` (default: false): mute the error encountered when trying to read the cgroup of a process the collector does not have permission to read. This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
- `mute_process_exe_error` (default: false): mute the error encountered when trying to read the executable path of a process the collector does not have permission to read (Linux only). This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
- `mute_process_user_error` (default: false): mute the error encountered when trying to read a uid which doesn't exist on the system, eg. is owned by a user that only exists in a container. This flag is ignored when `mute_process_all_errors` is set to true as all errors are muted.
- `wmi_enabled` (default: true): allow the scraper to use [Windows Management Instrumentation (WMI)](https://learn.microsoft.com/en-us/windows/win32/wmisdk/wmi-start-page) to fetch some information on Windows. This option has no effect on non-Windows environments.

#### High CPU Usage On Windows

Getting the Parent Process ID of all processes on Windows is a very expensive operation. There are two options to combat this:

Review comment (Contributor):

Is this still true after this change? Could you please capture a fresh perfmon image with the latest code?

On my box:

$ Measure-Command { Get-WmiObject Win32_Process | Select-Object ProcessId, ParentProcessId, HandleCount }
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 796
Ticks             : 7969745
TotalDays         : 9.22424189814815E-06
TotalHours        : 0.000221381805555556
TotalMinutes      : 0.0132829083333333
TotalSeconds      : 0.7969745
TotalMilliseconds : 796.9745

So, as a rough approximation, that is a bit less than 2 milliseconds per process (I had 437 processes on my box, and 796.97 ms / 437 ≈ 1.8 ms). I would guess that outside of PowerShell it should be a bit cheaper.

* Allow the collector to use WMI, this is the default behaviour with the `wmi_enabled` configuration option
* Disable Parent Process ID collection like so:
```yaml
process:
  resource_attributes:
    process.parent_pid:
      enabled: false
```
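
The second option works because, per the changelog entry in this PR, the parent PID is no longer fetched once the resource attribute is disabled. A minimal sketch of that gating idea follows. The names `ppidFetcher` and `resolveParentPid` are hypothetical and used only for illustration; this is not the scraper's actual code.

```go
package main

import "fmt"

// ppidFetcher is a hypothetical stand-in for the expensive parent PID lookup.
type ppidFetcher func(pid int32) (int32, error)

// resolveParentPid calls fetch only when the resource attribute is enabled,
// so a disabled attribute costs nothing per process.
func resolveParentPid(enabled bool, pid int32, fetch ppidFetcher) (int32, bool, error) {
	if !enabled {
		return 0, false, nil
	}
	ppid, err := fetch(pid)
	if err != nil {
		return 0, false, err
	}
	return ppid, true, nil
}

func main() {
	calls := 0
	fetch := func(pid int32) (int32, error) {
		calls++ // count how often the expensive path runs
		return 1, nil
	}

	_, ok, _ := resolveParentPid(false, 42, fetch)
	fmt.Println(ok, calls) // attribute disabled: lookup skipped

	_, ok, _ = resolveParentPid(true, 42, fetch)
	fmt.Println(ok, calls) // attribute enabled: lookup performed once
}
```

The point of the sketch is only that the check happens before the lookup, rather than fetching the value and discarding it afterwards.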

## Advanced Configuration

55 changes: 40 additions & 15 deletions receiver/hostmetricsreceiver/config_test.go
@@ -108,23 +108,48 @@ func TestLoadConfig(t *testing.T) {
	}
}

func TestLoadInvalidConfig_NoScrapers(t *testing.T) {
	factory := NewFactory()
	cfg := factory.CreateDefaultConfig()

	cm, err := confmaptest.LoadConf(filepath.Join("testdata", "config-noscrapers.yaml"))
	require.NoError(t, err)
func TestLoadInvalidConfig(t *testing.T) {
	testCases := []struct {
		name          string
		configPath    string
		failMarshal   bool
		errorContains string
	}{
		{
			name:          "no scrapers",
			configPath:    "config-noscrapers.yaml",
			errorContains: "must specify at least one scraper when using hostmetrics receiver",
		},
		{
			name:          "invalid scraper key",
			configPath:    "config-invalidscraperkey.yaml",
			errorContains: "invalid scraper key: invalidscraperkey",
			failMarshal:   true,
		},
		{
			name:          "handles enabled wmi disabled",
			configPath:    "config-handles-no-wmi.yaml",
			errorContains: processscraper.ErrProcessHandlesRequiresWMI.Error(),
		},
	}

	require.NoError(t, cm.Unmarshal(cfg))
	require.ErrorContains(t, xconfmap.Validate(cfg), "must specify at least one scraper when using hostmetrics receiver")
}
	for _, tc := range testCases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()

func TestLoadInvalidConfig_InvalidScraperKey(t *testing.T) {
	factory := NewFactory()
	cfg := factory.CreateDefaultConfig()
			factory := NewFactory()
			cfg := factory.CreateDefaultConfig()

	cm, err := confmaptest.LoadConf(filepath.Join("testdata", "config-invalidscraperkey.yaml"))
	require.NoError(t, err)
			cm, err := confmaptest.LoadConf(filepath.Join("testdata", tc.configPath))
			require.NoError(t, err)

	require.ErrorContains(t, cm.Unmarshal(cfg), "invalid scraper key: invalidscraperkey")
			if tc.failMarshal {
				require.ErrorContains(t, cm.Unmarshal(cfg), tc.errorContains)
			} else {
				require.NoError(t, cm.Unmarshal(cfg))
				require.ErrorContains(t, xconfmap.Validate(cfg), tc.errorContains)
			}
		})
	}
}
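
The consolidated test uses `failMarshal` to say at which stage the expected error should surface: during unmarshalling, or later during validation. A self-contained sketch of that two-stage check is below. The `unmarshal`, `validate`, `checkCase` and `expectContains` functions are hypothetical stand-ins that operate on plain strings; they are not the collector's `confmap` or `xconfmap` APIs.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// unmarshal is a hypothetical first stage: it rejects unknown keys.
func unmarshal(input string) error {
	if strings.Contains(input, "invalidscraperkey") {
		return errors.New("invalid scraper key: invalidscraperkey")
	}
	return nil
}

// validate is a hypothetical second stage: it rejects an empty config.
func validate(input string) error {
	if input == "" {
		return errors.New("must specify at least one scraper")
	}
	return nil
}

// expectContains returns nil only if err is non-nil and mentions want.
func expectContains(err error, want string) error {
	if err == nil {
		return fmt.Errorf("expected error containing %q, got nil", want)
	}
	if !strings.Contains(err.Error(), want) {
		return fmt.Errorf("expected error containing %q, got %q", want, err)
	}
	return nil
}

// checkCase returns nil when the expected error appears at the expected stage.
func checkCase(input string, failMarshal bool, errorContains string) error {
	err := unmarshal(input)
	if failMarshal {
		return expectContains(err, errorContains)
	}
	if err != nil {
		return fmt.Errorf("unexpected unmarshal error: %w", err)
	}
	return expectContains(validate(input), errorContains)
}

func main() {
	cases := []struct {
		name, input, errorContains string
		failMarshal                bool
	}{
		{name: "no scrapers", input: "", errorContains: "at least one scraper"},
		{name: "invalid scraper key", input: "invalidscraperkey", errorContains: "invalid scraper key", failMarshal: true},
	}
	for _, tc := range cases {
		fmt.Printf("%s: %v\n", tc.name, checkCase(tc.input, tc.failMarshal, tc.errorContains))
	}
}
```

Keeping the stage in the table makes a case fail loudly if an error that used to come from validation starts appearing at unmarshal time, or the other way round.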
73 changes: 73 additions & 0 deletions receiver/hostmetricsreceiver/integration_test.go
@@ -6,13 +6,18 @@
package hostmetricsreceiver

import (
	"context"
	"os/exec"
	"path/filepath"
	"runtime"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componenttest"
	"go.opentelemetry.io/collector/consumer/consumertest"
	"go.opentelemetry.io/collector/receiver/receivertest"

	"github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/scraperinttest"
	"github.com/open-telemetry/opentelemetry-collector-contrib/internal/filter/filterset"
@@ -116,3 +121,71 @@ func Test_ProcessScrapeWithBadRootPathAndEnvVar(t *testing.T) {
		),
	).Run(t)
}

func Test_Windows_ProcessScrapeWMIInformation(t *testing.T) {
	if runtime.GOOS != "windows" {
		t.Skip("this integration test is windows exclusive")
	}

	factory := NewFactory()

	rCfg := &Config{}
	rCfg.CollectionInterval = time.Second
	f := processscraper.NewFactory()
	pCfg := f.CreateDefaultConfig().(*processscraper.Config)
	pCfg.Metrics.ProcessHandles.Enabled = true
	pCfg.ResourceAttributes.ProcessParentPid.Enabled = true
	rCfg.Scrapers = map[component.Type]component.Config{
		f.Type(): pCfg,
	}
	cfg := component.Config(rCfg)

	sink := new(consumertest.MetricsSink)
	r, err := factory.CreateMetrics(context.Background(), receivertest.NewNopSettings(factory.Type()), cfg, sink)
	require.NoError(t, err)

	// Start the receiver and give it an extra 250 milliseconds after the
	// collection interval to perform the scrape.
	r.Start(context.Background(), componenttest.NewNopHost())
	time.Sleep(rCfg.CollectionInterval + 250*time.Millisecond)
	r.Shutdown(context.Background())

	// The actual results of the test are non-deterministic, but
	// all we want to know is whether the handles and parent PID
	// metrics are being retrieved successfully on at least some
	// metrics (it doesn't need to work for every process).
	metrics := sink.AllMetrics()
	var foundValidParentPid, foundValidHandles int
	for _, m := range metrics {
		rms := m.ResourceMetrics()
		for i := 0; i < rms.Len(); i++ {
			rm := rms.At(i)

			// Check if the resource attributes has the parent PID.
			ppid, ok := rm.Resource().Attributes().Get("process.parent_pid")
			if ok && ppid.Int() > 0 {
				foundValidParentPid++
			}

			sms := rm.ScopeMetrics()
			for j := 0; j < sms.Len(); j++ {
				sm := sms.At(j)
				ms := sm.Metrics()
				for k := 0; k < ms.Len(); k++ {
					m := ms.At(k)

					// Check if this is a process.handles metric
					// with a non-zero datapoint.
					if m.Name() == "process.handles" &&
						m.Sum().DataPoints().Len() > 0 &&
						m.Sum().DataPoints().At(0).IntValue() > 0 {
						foundValidHandles++
					}
				}
			}
		}
	}

	require.Positive(t, foundValidHandles)
	require.Positive(t, foundValidParentPid)
}
@@ -4,12 +4,15 @@
package processscraper // import "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver/internal/scraper/processscraper"

import (
	"errors"
	"time"

	"github.com/open-telemetry/opentelemetry-collector-contrib/internal/filter/filterset"
	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver/internal/scraper/processscraper/internal/metadata"
)

var ErrProcessHandlesRequiresWMI = errors.New("the process.handles metric requires WMI to be enabled")

// Config relating to Process Metric Scraper.
type Config struct {
	// MetricsBuilderConfig allows to customize scraped metrics/attributes representation.
@@ -53,6 +56,15 @@ type Config struct {
	// ScrapeProcessDelay is used to indicate the minimum amount of time a process must be running
	// before metrics are scraped for it. The default value is 0 seconds (0s).
	ScrapeProcessDelay time.Duration `mapstructure:"scrape_process_delay"`

	WMIEnabled bool `mapstructure:"wmi_enabled"`
}

func (cfg *Config) Validate() error {
	if !cfg.WMIEnabled && cfg.Metrics.ProcessHandles.Enabled {
		return ErrProcessHandlesRequiresWMI
	}
	return nil
}
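
The rule enforced by `Validate` can be exercised in isolation. The sketch below copies the sentinel error from this diff but replaces the real config with a flattened, hypothetical `sketchConfig`, whose `HandlesEnabled` field stands in for `cfg.Metrics.ProcessHandles.Enabled`. It is an illustration under those assumptions, not the scraper's actual type.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrProcessHandlesRequiresWMI mirrors the sentinel error added in this diff.
var ErrProcessHandlesRequiresWMI = errors.New("the process.handles metric requires WMI to be enabled")

// sketchConfig is a hypothetical, flattened stand-in for processscraper.Config.
type sketchConfig struct {
	WMIEnabled     bool
	HandlesEnabled bool // stands in for cfg.Metrics.ProcessHandles.Enabled
}

// Validate rejects the one inconsistent combination: handles on, WMI off.
func (cfg *sketchConfig) Validate() error {
	if !cfg.WMIEnabled && cfg.HandlesEnabled {
		return ErrProcessHandlesRequiresWMI
	}
	return nil
}

func main() {
	cases := []sketchConfig{
		{WMIEnabled: true, HandlesEnabled: true},
		{WMIEnabled: false, HandlesEnabled: true},
		{WMIEnabled: false, HandlesEnabled: false},
	}
	for _, c := range cases {
		err := c.Validate()
		// errors.Is lets callers match the sentinel even if it is wrapped later.
		fmt.Printf("wmi=%t handles=%t rejected=%t\n",
			c.WMIEnabled, c.HandlesEnabled, errors.Is(err, ErrProcessHandlesRequiresWMI))
	}
}
```

Note that a Go `bool` is `false` when unset, so a config built as a bare struct literal has WMI disabled; this is why the factory in this PR sets `WMIEnabled: true` explicitly in its default config.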

type MatchConfig struct {
@@ -114,7 +114,7 @@ Number of disk operations performed by the process.

Number of handles held by the process.

This metric is only available on Windows.
This metric is only available on Windows. It requires Windows Management Instrumentation (WMI) to be enabled.

| Unit | Metric Type | Value Type | Aggregation Temporality | Monotonic |
| ---- | ----------- | ---------- | ----------------------- | --------- |
@@ -35,6 +35,7 @@ func NewFactory() scraper.Factory {
func createDefaultConfig() component.Config {
	return &Config{
		MetricsBuilderConfig: metadata.DefaultMetricsBuilderConfig(),
		WMIEnabled:           true,
	}
}


This file was deleted.

This file was deleted.

This file was deleted.
