Scraper feedback, and what to do next? #11238


Open · bogdandrutu opened this issue Sep 20, 2024 · 16 comments

@bogdandrutu
Member

bogdandrutu commented Sep 20, 2024

Current state:

While #9473 was the right call given the strangeness of constructing an ID from a string, we should not pass a "type": the idea was that a "receiver" can have multiple scrapers, but if that is true, all of them will have the same receiver/component type, correct?

We have probably made a mistake with the design of the scrapers (a big part of the problem being the author of this issue). Here is a set of questions we need to answer; based on the answers we can change the design of the scraper package:

  1. Should the "scraperhelper" be a receiver that supports multiple "scrapers", or a receiver that supports one Scraper, with a "multi source" scraper available in case we really need one receiver to scrape multiple sources?
  2. Should we have a standalone "scraper" type instead of making this a receiver? Or should we maybe have only one "scraperreceiver" that allows users to configure multiple scrapers?
  3. Should we have a special receiver that allows configuring scrapers?

During the decision and discussion it should be considered that we currently have a big problem: for each scraper/receiver we create its own timer, and we lose the opportunity to batch at the source. Other feedback we got in the past with the current scraper-as-a-receiver approach is that the user has to configure the scrape interval individually for each receiver.

I like option 3. A proposed idea is to start from how the user should configure this, and see how we can achieve it while making sure we "batch" all these scrapes as much as possible:

receivers:
  scraper/30sec:
    collection_interval: 30s
    initial_delay: 10s
    timeout: 10s
    sources:
      sql:
        # ... things specific to sql, including enable/disable metrics
      nginx:
        # ... things specific to nginx, including enable/disable metrics

Later we can add the capability to do "source discovery", or build a different receiver like the current "receivercreator", which can be adapted to follow the same pattern/structure.

@dmitryax
Member

I like the idea. We can have just one scraper receiver eventually. I see a few opportunities with this approach:

  1. It'll remove the confusion between "push" and "pull" receivers. All the receivers will be "push" except for the scraper, and we don't need kafkametrics in addition to the kafka receiver.

  2. The batching, as you mentioned, with consistent timestamps. However, it can lead to huge payloads that potentially have to be split on the exporter side. So, batching can probably be a configurable behavior. Also, I think we should still initiate separate workers for each source scraping the data concurrently + one main worker batching them together if it's enabled.

  3. Built-in discovery would work perfectly with this new receiver. Each source can expose an optional Discoverable interface in addition to Start(...) and Scrape(...). Each source could have default built-in discovery rules for each observer, with an option for the user to override them (see the sketch after this list). In that case, we won't need the receiver_creator anymore.

  4. We can have additional global (applied to all sources) filtering and aggregation capabilities. Currently, each scraping receiver provides an option to disable metrics (and potentially aggregation), but it has to be configured for every metric separately, which is not always convenient. For example, I want to disable all metrics starting with system.processes. but don't want to list all of them individually. We tried to add wildcard support into the existing metrics config group. It would be much better to keep the scraper.sources.<scraper>.metrics config schema strict and move the filtering somewhere else, e.g., scraper.filter. cc @braydonk
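To make item 3 slightly more concrete, here is a purely hypothetical Go sketch. None of these names (`Source`, `Endpoint`, `Discoverable`, `DefaultRules`, `OnDiscovered`) exist in the collector today; they only illustrate the shape of an optional discovery hook next to the existing start/scrape/shutdown lifecycle.

```go
package scraper

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// Source is roughly a scraper as discussed above: start, scrape, shut down.
type Source interface {
	Start(ctx context.Context, host component.Host) error
	Scrape(ctx context.Context) (pmetric.Metrics, error)
	Shutdown(ctx context.Context) error
}

// Endpoint is a placeholder for whatever an observer reports
// (container, pod, host port, ...).
type Endpoint struct {
	Target string
	Labels map[string]string
}

// Discoverable is the optional extra interface from item 3: a source ships
// default discovery rules per observer type, and the user can override them.
type Discoverable interface {
	// DefaultRules returns built-in discovery rules keyed by observer type.
	DefaultRules() map[component.Type]string
	// OnDiscovered is called when an observer matches a rule; the source
	// decides whether to start scraping the reported endpoint.
	OnDiscovered(ctx context.Context, endpoint Endpoint) error
}
```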

@braydonk
Contributor

Re 4:

It would be much better to keep the scraper.sources.<scraper>.metrics config schema strict and move the filtering somewhere else, e.g., scraper.filter

Perhaps; I think it's a cleaner config interface, for sure. Provided I'm understanding this idea versus the current state of things, configuring enable/disable up front has one big benefit: scrapers can recognize at scrape time that certain metrics aren't required, and can subsequently opt to skip them. I have a hostmetrics PR where skipping collection for a particular metric when it's disabled yielded major performance benefits.
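As a rough illustration of that point (this is not the actual hostmetrics code; every name below is invented), a scraper that already knows a metric is disabled can skip the expensive collection step entirely rather than collecting and then dropping the data:

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical mdatagen-style config shape: one Enabled flag per metric.
type metricConfig struct{ Enabled bool }

type scraperConfig struct {
	ProcessesCount   metricConfig
	ProcessesCreated metricConfig
}

type processesScraper struct{ cfg scraperConfig }

// countProcesses stands in for the expensive part of collection
// (walking /proc, issuing syscalls, querying an endpoint, ...).
func countProcesses(ctx context.Context) (int, error) { return 42, nil }

func (s *processesScraper) scrape(ctx context.Context) error {
	// Because enable/disable is known up front, skip the expensive work
	// entirely when no enabled metric depends on it.
	if !s.cfg.ProcessesCount.Enabled && !s.cfg.ProcessesCreated.Enabled {
		return nil
	}
	n, err := countProcesses(ctx)
	if err != nil {
		return err
	}
	fmt.Println("process count:", n) // stand-in for recording data points
	return nil
}

func main() {
	s := &processesScraper{cfg: scraperConfig{ProcessesCount: metricConfig{Enabled: true}}}
	_ = s.scrape(context.Background())
}
```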

FWIW, the spec I wrote for that wildcard matching [1] has the potential to be applied in a way other than the current PR. I would happily welcome a better place for it to live than as generated mdatagen code like in the PR.

[1]: As of writing this is due for updates, since Dmitrii and I have already agreed on a major simplification I can make. But the spirit will remain.

@bogdandrutu
Member Author

However, it can lead to huge payloads that potentially have to be split on the exporter side. So, batching can probably be a configurable behavior.

Indeed, we can make this configurable, but it gives us the opportunity to do it at a lower cost than batching later (fewer allocations, for example).

Also, I think we should still initiate separate workers for each source scraping the data concurrently + one main worker batching them together if it's enabled.

We can have a pool of workers (also controlling how many goroutines are spawned), for example, or another approach. We definitely need something like this to be able to scale to lots of sources.
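A minimal sketch of that idea, assuming a hypothetical `source` interface (this is just the concurrency pattern, not the collector's API): a semaphore bounds the number of goroutines, every source is scraped concurrently, and the results are merged into a single batch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// source stands in for a scraper; string stands in for pmetric.Metrics.
type source interface {
	scrape(ctx context.Context) (string, error)
}

// scrapeAll scrapes every source concurrently with at most maxWorkers
// goroutines in flight and merges the results into one batch.
func scrapeAll(ctx context.Context, sources []source, maxWorkers int) []string {
	sem := make(chan struct{}, maxWorkers)
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		batch []string
	)
	for _, src := range sources {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(s source) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if data, err := s.scrape(ctx); err == nil {
				mu.Lock()
				batch = append(batch, data) // "batching at the source"
				mu.Unlock()
			}
		}(src)
	}
	wg.Wait()
	return batch
}

type fakeSource string

func (f fakeSource) scrape(ctx context.Context) (string, error) { return string(f), nil }

func main() {
	out := scrapeAll(context.Background(), []source{fakeSource("sql"), fakeSource("nginx")}, 4)
	fmt.Println(out)
}
```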

Built-in discovery would work perfectly with this new receiver.

Agree

@ChrsMark
Member

Built-in discovery would work perfectly with this new receiver. Each source can expose an optional Discoverable interface in addition to Start(...) and Scrape(...). Each source could have default built-in discovery rules for each observer with an option to override them by the user. In that case, we won't need the receiver_creator anymore.

Does this mean that people developing the scrapers/receivers will be responsible for implementing the Discoverable interface for each observer/technology? That might be tricky, because people willing to implement a specific receiver/scraper might not care about, or know how to handle, discovering it in the various technologies (Docker, K8s, etc.).

I would also like to highlight the proposal to add support for suggestion/hint based discovery:

@bogdandrutu
Member Author

bogdandrutu commented Sep 26, 2024

I think the "discovery" will be a piece of functionality in the scraper receiver that emits "discovery events" on an internal bus; the scrapers can "subscribe" to the bus, read the discovery events, and start scraping if needed.
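A tiny sketch of what such a bus could look like (all names are invented; nothing here is an existing collector API): the receiver publishes discovery events, and each scraper subscribes and decides whether to start or stop scraping.

```go
package main

import (
	"fmt"
	"sync"
)

// discoveryEvent is a placeholder event shape; a real one would carry the
// observer's endpoint details.
type discoveryEvent struct {
	Kind     string // e.g. "added" or "removed"
	Endpoint string
}

type bus struct {
	mu   sync.Mutex
	subs []chan discoveryEvent
}

// subscribe returns a channel on which a scraper receives discovery events.
func (b *bus) subscribe() <-chan discoveryEvent {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan discoveryEvent, 16) // buffered so a slow scraper does not immediately block the publisher
	b.subs = append(b.subs, ch)
	return ch
}

// publish fans an event out to every subscriber.
func (b *bus) publish(ev discoveryEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		ch <- ev
	}
}

func main() {
	b := &bus{}
	events := b.subscribe() // a scraper subscribing to the bus
	b.publish(discoveryEvent{Kind: "added", Endpoint: "localhost:2181"})
	ev := <-events
	fmt.Printf("start scraping %s\n", ev.Endpoint)
}
```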

@mx-psi
Member

mx-psi commented Oct 4, 2024

One small organizational PR over at #11166 will allow us to continue evolving the scraper API while letting us mark the rest of the receiver API as 1.0 first (and thus unblocking stabilization of receivers that are not scraper-based). It doesn't change anything about the scraper API surface other than moving things to a new folder, but please do take a look!

@sincejune
Contributor

Hi team, I'm wondering if we are planning to support the scraper helper collecting logs (passing data as scraped logs to the next consumer). This would be useful for use cases like collecting container logs. (Posting the question here since this is a thread for scraper feedback. :)

bogdandrutu added a commit that referenced this issue Nov 14, 2024
djaglowski pushed a commit to djaglowski/opentelemetry-collector that referenced this issue Nov 21, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 10, 2024

#### Description
Add scraper for logs

#### Link to tracking issue
Relates to #11238

#### Testing
Added unit tests

#### Documentation
Added
github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2025

#### Description
This PR adds observability for logs in scraper/scraperhelper and also introduces new metrics for scraping logs.

#### Link to tracking issue
Relates to #11238

#### Testing
Added

#### Documentation
Added
@djaglowski
Member

  1. Should we have a special receiver that allows configuring scrapers?

During the decision and discussion it should be considered that we currently have a big problem: for each scraper/receiver we create its own timer, and we lose the opportunity to batch at the source. Other feedback we got in the past with the current scraper-as-a-receiver approach is that the user has to configure the scrape interval individually for each receiver.

I like option 3. A proposed idea is to start from how the user should configure this, and see how we can achieve it while making sure we "batch" all these scrapes as much as possible:

receivers:
  scraper/30sec:
    collection_interval: 30s
    initial_delay: 10s
    timeout: 10s
    sources:
      sql:
        # ... things specific to sql, including enable/disable metrics
      nginx:
        # ... things specific to nginx, including enable/disable metrics

Why is batching at the source desirable? Generally speaking, isn't there an expectation that telemetry streams may be arriving from many independent services at essentially random times? Why in this case would we make such an effort to align them?

To me it seems like a downside to try to consolidate scrapers like this because the "shared" fields are often better individually. If I want to tweak a timeout for one scraper, now I have to worry about structural problems too. (e.g. split into two receivers, add both to pipelines) It feels much like saying that username is a common field so we should extract those too.

@sfc-gh-bdrutu

To me it seems like a downside to try to consolidate scrapers like this because the "shared" fields are often better individually. If I want to tweak a timeout for one scraper, now I have to worry about structural problems too. (e.g. split into two receivers, add both to pipelines) It feels much like saying that username is a common field so we should extract those too.

Everything that you said is "bad" when I need different configuration already happens right now. The proposed solution helps for the cases where that should not happen. Also, in general (not always), for metrics you kind of want to report them with the same resolution, so the interval will be the same.

@bogdandrutu
Member Author

Why is batching at the source desirable? Generally speaking, isn't there an expectation that telemetry streams may be arriving from many independent services at essentially random times? Why in this case would we make such an effort to align them?

Batching is only one thing that we want to add; we are also looking to have a "discovery" mechanism and to allow starting/stopping scrapers dynamically, plus the other examples in @dmitryax's comment.

@djaglowski
Member

Everything that you said is "bad" when I need different configuration already happens right now.

I disagree. It's true that right now the user would have to define multiple receivers and put them into pipelines. However, what I'm pointing out is that ongoing maintenance of a configuration becomes more complicated because you must untangle scrapers from each other in order to make a minor tweak to one of them.

I'm not necessarily opposed to the idea of offering a multiple-scraper receiver, but I think there's value in presenting (most) scrapers as standalone receivers too.

Maybe I missed it in the above, but I imagine we do the following:

  1. Make each scraper into a standalone module.
  2. Make each current "scraper receiver" into a thin wrapper around the corresponding scraper module (a rough sketch follows below).
  3. Offer a generic multi-scraper receiver which can operate multiple scrapers.

WDYT?
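For item 2, a purely illustrative sketch of the "thin wrapper" shape, with every type invented for the example (this is not the collector's receiver or scraperhelper API): the scraper module does the work, and the wrapper receiver only owns the timer and forwards the result.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scraper is the standalone module from item 1; string stands in for pmetric.Metrics.
type scraper interface {
	scrape(ctx context.Context) (string, error)
}

// consumer stands in for the next consumer in the pipeline.
type consumer func(data string)

// wrapperReceiver is what each current "scraper receiver" would shrink to:
// interval handling plus delegation to the scraper module.
type wrapperReceiver struct {
	interval time.Duration
	scraper  scraper
	next     consumer
}

func (r *wrapperReceiver) run(ctx context.Context) {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if data, err := r.scraper.scrape(ctx); err == nil {
				r.next(data)
			}
		}
	}
}

type zookeeperScraper struct{}

func (zookeeperScraper) scrape(ctx context.Context) (string, error) { return "zk metrics", nil }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	r := &wrapperReceiver{
		interval: 100 * time.Millisecond,
		scraper:  zookeeperScraper{},
		next:     func(d string) { fmt.Println(d) },
	}
	r.run(ctx)
}
```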


Followup question: If we were to move away from "scraper receivers" and instead use only a compound multi-scraper receiver, how does one use the builder to define which scrapers are included? (This is a must IMO, as scrapers often import large/complex dependencies.)

@bogdandrutu
Member Author

bogdandrutu commented Jan 21, 2025

Followup question: If we were to move away from "scraper receivers" and instead use only a compound multi-scraper receiver, how does one use the builder to define which scrapers are included? (This is a must IMO, as scrapers often import large/complex dependencies.)

Will show a prototype for the builder soon.

Make each current "scraper receiver" into a thin wrapper around corresponding scraper module.

Why is this needed? In my opinion if anyone wants to have a receiver definition per scraper, they can achieve the same thing by configuring the "multi-scraper" with only one scraper.

@bogdandrutu
Member Author

For example, in the "new world" for the zookeeper receiver:

**Before:**

receivers:
  zookeeper:
    endpoint: "localhost:2181"
    collection_interval: 20s
    initial_delay: 1s

**After (the exact config is still not final):**

receivers:
  scraper/zookeeper:
    collection_interval: 20s
    initial_delay: 1s
    sources:
      zookeeper:
        endpoint: "localhost:2181"

@djaglowski
Member

Why is this needed? In my opinion if anyone wants to have a receiver definition per scraper, they can achieve the same thing by configuring the "multi-scraper" with only one scraper.

Reasons would be 1) to not break existing users, and 2) arguably simpler getting started for new users.

I know some existing heavy users who will suffer from this but I can't say you're wrong about both being possible. I think it's a manageable migration.

@bogdandrutu
Member Author

Reasons would be 1) to not break existing users, and 2) arguably simpler getting started for new users.

We will provide automations to do as much as possible and keep the old path working for as long as possible. I will keep this in mind and offer a long migration period.

github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025

#### Description
This PR adds support for a logs scraper.

#### Link to tracking issue
Relevant to #11238

#### Testing
Added

#### Documentation
Added
bogdandrutu added a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Jan 30, 2025

Part of open-telemetry/opentelemetry-collector#11238

In this PR, we only check that the new scraper interface for component and factory can be used for this use case. This is already proven for hostmetrics, so there is no need for more than that.

Items TODO next:
* Add support in the integration tests to run the scraper directly. This will also make the tests more deterministic.
* In core, add support for a "scraperreceiver" and deprecate zookeeperreceiver when that is available.

Signed-off-by: Bogdan Drutu <[email protected]>
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this issue Feb 8, 2025
@alexandreLamarre

Are there plans to support histogram / exponential_histogram in the scraper generation?

zeck-ops pushed a commit to zeck-ops/opentelemetry-collector-contrib that referenced this issue Apr 23, 2025