Scraper feedback, and what to do next? #11238


Open · bogdandrutu opened this issue Sep 20, 2024 · 16 comments

@bogdandrutu
Member

bogdandrutu commented Sep 20, 2024

Current state:

While #9473 was the right call given the strangeness of constructing an ID from a string, we should not pass a "type": the idea was that a "receiver" can have multiple scrapers, but if that is true, all of them will have the same receiver/component type, correct?

We have probably made a mistake with the design of the scrapers (a big part of the problem being the author of this issue). Here is a set of questions we need to answer; based on the answers we can change the design of the scraper package:

  1. Should the "scraperhelper" be a receiver that supports multiple "scrapers", or a receiver that supports one Scraper, with a "multi source" scraper available in case we really need one receiver to scrape multiple sources?
  2. Should we have a standalone "scraper" type instead of making this a receiver? Or should we maybe have only one "scraperreceiver" that allows users to configure multiple scrapers?
  3. Should we have a special receiver that allows configuring scrapers?

During the decision and discussion it should be considered that we currently have a big problem: for each scraper/receiver we create its own timer, and we lose the opportunity to batch at the source. Other feedback we got in the past with the current scraper-as-a-receiver approach is that the user has to configure the scrape interval individually for each receiver.

I like option 3. A proposed idea is to start from how the user should configure this, and see how we can achieve it while making sure we "batch" all these scrapes as much as possible:

receivers:
  scraper/30sec:
    collection_interval: 30s
    initial_delay: 10s
    timeout: 10s
    sources:
      sql:
        # ... things specific to sql, including enable/disable metrics
      nginx:
        # ... things specific to nginx, including enable/disable metrics

Later we can add the capability to do "source discovery", or build a different receiver like the current "receivercreator", which can be adapted to follow the same pattern/structure.

@dmitryax
Member

I like the idea. We can have just one scraper receiver eventually. I see a few opportunities with this approach:

  1. It'll remove the confusion between "push" and "pull" receivers. All the receivers will be "push" except for the scraper, and we don't need kafkametrics in addition to the kafka receiver.

  2. The batching, as you mentioned, with consistent timestamps. However, it can lead to huge payloads that potentially have to be split on the exporter side. So, batching can probably be a configurable behavior. Also, I think we should still initiate separate workers for each source scraping the data concurrently + one main worker batching them together if it's enabled.

  3. Built-in discovery would work perfectly with this new receiver. Each source can expose an optional Discoverable interface in addition to Start(...) and Scrape(...). Each source could have default built-in discovery rules for each observer, with an option for the user to override them (see the sketch after this list). In that case, we won't need the receiver_creator anymore.

  4. We can have additional global (applied to all sources) filtering and aggregation capabilities. Currently, each scraping receiver provides an option to disable metrics (and potentially aggregation), but it has to be configured for every metric separately, which is not always convenient. For example, I want to disable all metrics starting with system.processes. but don't want to list all of them individually. We tried to add wildcard support into the existing metrics config group. It would be much better to keep the scraper.sources.<scraper>.metrics config schema strict and move the filtering somewhere else, e.g., scraper.filter. cc @braydonk
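To make item 3 slightly more concrete, here is a purely hypothetical Go sketch. None of these names (`Source`, `Endpoint`, `Discoverable`, `DefaultRules`, `OnDiscovered`) exist in the collector today; they only illustrate the shape of an optional discovery hook next to the existing start/scrape/shutdown lifecycle.

```go
package scraper

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// Source is roughly a scraper as discussed above: start, scrape, shut down.
type Source interface {
	Start(ctx context.Context, host component.Host) error
	Scrape(ctx context.Context) (pmetric.Metrics, error)
	Shutdown(ctx context.Context) error
}

// Endpoint is a placeholder for whatever an observer reports
// (container, pod, host port, ...).
type Endpoint struct {
	Target string
	Labels map[string]string
}

// Discoverable is the optional extra interface from item 3: a source ships
// default discovery rules per observer type, and the user can override them.
type Discoverable interface {
	// DefaultRules returns built-in discovery rules keyed by observer type.
	DefaultRules() map[component.Type]string
	// OnDiscovered is called when an observer matches a rule; the source
	// decides whether to start scraping the reported endpoint.
	OnDiscovered(ctx context.Context, endpoint Endpoint) error
}
```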

@braydonk
Contributor

Re 4:

It would be much better to keep the scraper.sources.<scraper>.metrics config schema strict and move the filtering somewhere else, e.g., scraper.filter

Perhaps; I think it's a cleaner config interface, for sure. Provided I'm understanding this idea versus the current state of things, configuring enable/disable up front has one big benefit: scrapers can recognize at scrape time that certain metrics aren't required, and can subsequently opt to skip them. I have a hostmetrics PR where skipping collection for a particular metric when it's disabled yielded major performance benefits.
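As a rough illustration of that point (this is not the actual hostmetrics code; every name below is invented), a scraper that already knows a metric is disabled can skip the expensive collection step entirely rather than collecting and then dropping the data:

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical mdatagen-style config shape: one Enabled flag per metric.
type metricConfig struct{ Enabled bool }

type scraperConfig struct {
	ProcessesCount   metricConfig
	ProcessesCreated metricConfig
}

type processesScraper struct{ cfg scraperConfig }

// countProcesses stands in for the expensive part of collection
// (walking /proc, issuing syscalls, querying an endpoint, ...).
func countProcesses(ctx context.Context) (int, error) { return 42, nil }

func (s *processesScraper) scrape(ctx context.Context) error {
	// Because enable/disable is known up front, skip the expensive work
	// entirely when no enabled metric depends on it.
	if !s.cfg.ProcessesCount.Enabled && !s.cfg.ProcessesCreated.Enabled {
		return nil
	}
	n, err := countProcesses(ctx)
	if err != nil {
		return err
	}
	fmt.Println("process count:", n) // stand-in for recording data points
	return nil
}

func main() {
	s := &processesScraper{cfg: scraperConfig{ProcessesCount: metricConfig{Enabled: true}}}
	_ = s.scrape(context.Background())
}
```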

FWIW, the spec I wrote for that wildcard matching [1] has the potential to be applied in a way other than the current PR. I would happily welcome a better place for it to live than as generated mdatagen code like in the PR.

[1]: As of writing this is due for updates, since Dmitrii and I have already agreed on a major simplification I can make. But the spirit will remain.

@bogdandrutu
Member Author

However, it can lead to huge payloads that potentially have to be split on the exporter side. So, batching can probably be a configurable behavior.

Indeed, we can make this configurable, but it gives us the opportunity to do it at a lower cost than batching later (fewer allocations, for example).

Also, I think we should still initiate separate workers for each source scraping the data concurrently + one main worker batching them together if it's enabled.

We can have a pool of workers (also controlling how many goroutines are spawned), for example, or another approach. We definitely need something like this to be able to scale to lots of sources.
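A minimal sketch of that idea, assuming a hypothetical `source` interface (this is just the concurrency pattern, not the collector's API): a semaphore bounds the number of goroutines, every source is scraped concurrently, and the results are merged into a single batch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// source stands in for a scraper; string stands in for pmetric.Metrics.
type source interface {
	scrape(ctx context.Context) (string, error)
}

// scrapeAll scrapes every source concurrently with at most maxWorkers
// goroutines in flight and merges the results into one batch.
func scrapeAll(ctx context.Context, sources []source, maxWorkers int) []string {
	sem := make(chan struct{}, maxWorkers)
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		batch []string
	)
	for _, src := range sources {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(s source) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if data, err := s.scrape(ctx); err == nil {
				mu.Lock()
				batch = append(batch, data) // "batching at the source"
				mu.Unlock()
			}
		}(src)
	}
	wg.Wait()
	return batch
}

type fakeSource string

func (f fakeSource) scrape(ctx context.Context) (string, error) { return string(f), nil }

func main() {
	out := scrapeAll(context.Background(), []source{fakeSource("sql"), fakeSource("nginx")}, 4)
	fmt.Println(out)
}
```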

Built-in discovery would work perfectly with this new receiver.

Agree

@ChrsMark
Member

Built-in discovery would work perfectly with this new receiver. Each source can expose an optional Discoverable interface in addition to Start(...) and Scrape(...). Each source could have default built-in discovery rules for each observer with an option to override them by the user. In that case, we won't need the receiver_creator anymore.

Does this mean that people developing the scrapers/receivers will be responsible for implementing the Discoverable interface for each observer/technology? That might be tricky, because people willing to implement a specific receiver/scraper might not care about, or know how to handle, discovering it in the various technologies (Docker, K8s, etc.).

I would also like to highlight the proposal to add support for suggestion/hint based discovery:

@bogdandrutu
Member Author

bogdandrutu commented Sep 26, 2024

I think the "discovery" will be a piece of functionality in the scraper receiver that emits "discovery events" on an internal bus; the scrapers can "subscribe" to the bus, read the discovery events, and start scraping if needed.
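A tiny sketch of what such a bus could look like (all names are invented; nothing here is an existing collector API): the receiver publishes discovery events, and each scraper subscribes and decides whether to start or stop scraping.

```go
package main

import (
	"fmt"
	"sync"
)

// discoveryEvent is a placeholder event shape; a real one would carry the
// observer's endpoint details.
type discoveryEvent struct {
	Kind     string // e.g. "added" or "removed"
	Endpoint string
}

type bus struct {
	mu   sync.Mutex
	subs []chan discoveryEvent
}

// subscribe returns a channel on which a scraper receives discovery events.
func (b *bus) subscribe() <-chan discoveryEvent {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan discoveryEvent, 16) // buffered so a slow scraper does not immediately block the publisher
	b.subs = append(b.subs, ch)
	return ch
}

// publish fans an event out to every subscriber.
func (b *bus) publish(ev discoveryEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		ch <- ev
	}
}

func main() {
	b := &bus{}
	events := b.subscribe() // a scraper subscribing to the bus
	b.publish(discoveryEvent{Kind: "added", Endpoint: "localhost:2181"})
	ev := <-events
	fmt.Printf("start scraping %s\n", ev.Endpoint)
}
```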

@mx-psi
Member

mx-psi commented Oct 4, 2024

One small organizational PR over at #11166 will allow us to continue evolving the scraper API while letting us mark the rest of the receiver API as 1.0 first (and thus unblocking stabilization of receivers that are not scraper-based). It doesn't change anything about the scraper API surface other than moving things to a new folder, but please do take a look!

@sincejune
Contributor

Hi team, I'm wondering if we are planning to support the scraper helper collecting logs (passing data as scraped logs to the next consumer). This would be useful for use cases like collecting container logs. (Posting the question here since this is a thread for scraper feedback. :)

bogdandrutu added a commit that referenced this issue Nov 14, 2024
djaglowski pushed a commit to djaglowski/opentelemetry-collector that referenced this issue Nov 21, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 10, 2024

#### Description
Add scraper for logs

#### Link to tracking issue
Relates to #11238

#### Testing
Added unit tests

#### Documentation
Added
github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2025

#### Description
This PR adds observability for logs in scraper/scraperhelper and also introduces new metrics for scraping logs.

#### Link to tracking issue
Relates to #11238

#### Testing
Added

#### Documentation
Added
@djaglowski
Member

  1. Should we have a special receiver that allows configuring scrapers?

During the decision and discussion it should be considered that we currently have a big problem: for each scraper/receiver we create its own timer, and we lose the opportunity to batch at the source. Other feedback we got in the past with the current scraper-as-a-receiver approach is that the user has to configure the scrape interval individually for each receiver.

I like option 3. A proposed idea is to start from how the user should configure this, and see how we can achieve it while making sure we "batch" all these scrapes as much as possible:

receivers:
  scraper/30sec:
    collection_interval: 30s
    initial_delay: 10s
    timeout: 10s
    sources:
      sql:
        # ... things specific to sql, including enable/disable metrics
      nginx:
        # ... things specific to nginx, including enable/disable metrics

Why is batching at the source desirable? Generally speaking, isn't there an expectation that telemetry streams may be arriving from many independent services at essentially random times? Why in this case would we make such an effort to align them?

To me it seems like a downside to try to consolidate scrapers like this because the "shared" fields are often better individually. If I want to tweak a timeout for one scraper, now I have to worry about structural problems too. (e.g. split into two receivers, add both to pipelines) It feels much like saying that username is a common field so we should extract those too.

@sfc-gh-bdrutu

To me it seems like a downside to try to consolidate scrapers like this because the "shared" fields are often better individually. If I want to tweak a timeout for one scraper, now I have to worry about structural problems too. (e.g. split into two receivers, add both to pipelines) It feels much like saying that username is a common field so we should extract those too.

Everything that you said is "bad" when I need different configuration already happens right now. The proposed solution helps for the cases where that should not happen. Also, in general (not always), for metrics you kind of want to report them with the same resolution, so the interval will be the same.

@bogdandrutu
Member Author

Why is batching at the source desirable? Generally speaking, isn't there an expectation that telemetry streams may be arriving from many independent services at essentially random times? Why in this case would we make such an effort to align them?

Batching is only one thing that we want to add; we are also looking to have a "discovery" mechanism and to allow starting/stopping scrapers dynamically, plus the other examples in @dmitryax's comment.

@djaglowski
Member

Everything that you said is "bad" when I need different configuration already happens right now.

I disagree. It's true that right now the user would have to define multiple receivers and put them into pipelines. However, what I'm pointing out is that ongoing maintenance of a configuration becomes more complicated because you must untangle scrapers from each other in order to make a minor tweak to one of them.

I'm not necessarily opposed to the idea of offering a multiple-scraper receiver, but I think there's value in presenting (most) scrapers as standalone receivers too.

Maybe I missed it in the above, but I imagine we do the following:

  1. Make each scraper into a standalone module.
  2. Make each current "scraper receiver" into a thin wrapper around the corresponding scraper module (a rough sketch follows below).
  3. Offer a generic multi-scraper receiver which can operate multiple scrapers.

WDYT?
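For item 2, a purely illustrative sketch of the "thin wrapper" shape, with every type invented for the example (this is not the collector's receiver or scraperhelper API): the scraper module does the work, and the wrapper receiver only owns the timer and forwards the result.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scraper is the standalone module from item 1; string stands in for pmetric.Metrics.
type scraper interface {
	scrape(ctx context.Context) (string, error)
}

// consumer stands in for the next consumer in the pipeline.
type consumer func(data string)

// wrapperReceiver is what each current "scraper receiver" would shrink to:
// interval handling plus delegation to the scraper module.
type wrapperReceiver struct {
	interval time.Duration
	scraper  scraper
	next     consumer
}

func (r *wrapperReceiver) run(ctx context.Context) {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if data, err := r.scraper.scrape(ctx); err == nil {
				r.next(data)
			}
		}
	}
}

type zookeeperScraper struct{}

func (zookeeperScraper) scrape(ctx context.Context) (string, error) { return "zk metrics", nil }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	r := &wrapperReceiver{
		interval: 100 * time.Millisecond,
		scraper:  zookeeperScraper{},
		next:     func(d string) { fmt.Println(d) },
	}
	r.run(ctx)
}
```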


Followup question: If we were to move away from "scraper receivers" and instead use only a compound multi-scraper receiver, how does one use the builder to define which scrapers are included? (This is a must IMO, as scrapers often import large/complex dependencies.)

@bogdandrutu
Member Author

bogdandrutu commented Jan 21, 2025

Followup question: If we were to move away from "scraper receivers" and instead use only a compound multi-scraper receiver, how does one use the builder to define which scrapers are included? (This is a must IMO, as scrapers often import large/complex dependencies.)

Will show a prototype for the builder soon.

Make each current "scraper receiver" into a thin wrapper around corresponding scraper module.

Why is this needed? In my opinion if anyone wants to have a receiver definition per scraper, they can achieve the same thing by configuring the "multi-scraper" with only one scraper.

@bogdandrutu
Member Author

For example, in the "new world" for the zookeeper receiver:

**Before:**

receivers:
  zookeeper:
    endpoint: "localhost:2181"
    collection_interval: 20s
    initial_delay: 1s

**After (the exact config is still not final):**

receivers:
  scraper/zookeeper:
    collection_interval: 20s
    initial_delay: 1s
    sources:
      zookeeper:
        endpoint: "localhost:2181"

@djaglowski
Member

Why is this needed? In my opinion if anyone wants to have a receiver definition per scraper, they can achieve the same thing by configuring the "multi-scraper" with only one scraper.

Reasons would be 1) to not break existing users, and 2) arguably simpler getting started for new users.

I know some existing heavy users who will suffer from this but I can't say you're wrong about both being possible. I think it's a manageable migration.

@bogdandrutu
Member Author

Reasons would be 1) to not break existing users, and 2) arguably simpler getting started for new users.

We will provide automations to do as much as possible and keep the old path working for as long as possible. I will keep this in mind and offer a long migration period.

github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025

#### Description
This PR adds support for a logs scraper.

#### Link to tracking issue
Relevant to #11238

#### Testing
Added

#### Documentation
Added
bogdandrutu added a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Jan 30, 2025

Part of open-telemetry/opentelemetry-collector#11238

In this PR, we only check that the new scraper interface for component and factory can be used for this use case. This is already proven for hostmetrics, so there is no need for more than that.

Items TODO next:
* Add support in the integration tests to run the scraper directly. This will also make the tests more deterministic.
* In core, add support for a "scraperreceiver" and deprecate zookeeperreceiver when that is available.

Signed-off-by: Bogdan Drutu <[email protected]>
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this issue Feb 8, 2025
@alexandreLamarre

Are there plans to support histogram / exponential_histogram in the scraper generation?

zeck-ops pushed a commit to zeck-ops/opentelemetry-collector-contrib that referenced this issue Apr 23, 2025