Reactive way to activate scroll to get around 10.000 limit #2023

Closed
Muenze opened this issue Dec 10, 2021 · 8 comments
Labels
status: feedback-provided Feedback has been provided status: waiting-for-triage An issue we've not yet triaged

Comments

@Muenze

Muenze commented Dec 10, 2021

Hello,

Sorry, I could not find any hint about this in your documentation.
I read that to get around the 10k limit with the repository tools, I have to declare the return type as Stream.
However, we are using Spring Data reactively with AWS OpenSearch 1.0 (works like a charm), and our findAllBy(searchterms): Flux is also capped at 10.000 elements. That makes something like a report or a cleanup nearly impossible: we store more than 350 GB of data, and 10.000 elements do not even cover one hour.

I thought that Flux is the reactive equivalent of Stream, so I'm wondering how to get around that hard barrier. Could you please help me with that?
I had to paginate my Flux because my search would return far more than 10k elements, more than fits into memory, since each element contains a few megabytes.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Dec 10, 2021
@sothawo
Collaborator

sothawo commented Dec 11, 2021

Which version of Spring Data Elasticsearch do you use? Spring Data Elasticsearch uses scrolling when searching with reactive repositories unless you add a paging request to the repository method or use a limiting query like findFirst500By..().

I just ran a test with the current version 4.3 against an index that contains 60.000 entries (dummy persons), of which more than 30.000 have their last name set to the same value.

The repository I use:

public interface PersonRepository extends ReactiveElasticsearchRepository<Person, Long> {
	Flux<Person> findAllByLastName(String name);
}

This returns a Flux that contains 30.019 persons. I have the connection between the application and Elasticsearch proxied with an intercepting proxy and see the following requests:

[screenshot: intercepted requests]

So currently this is not reproducible for me.

@sothawo sothawo added the status: waiting-for-feedback We need additional information before we can continue label Dec 11, 2021
@Muenze
Author

Muenze commented Dec 11, 2021

Hi, thanks for the response :).
I am using this repository:
[image: repository definition]

I am using the Spring Boot starter for Spring Data Elasticsearch, version 2.6.1, which pulls in Spring Data Elasticsearch 4.3.0.
We set spring.elasticsearch.webclient.max-in-memory-size to 20MB because we previously ran into the memory limit in the encoder.
To connect to AWS OpenSearch we also use a sidecar container, abutaha/aws-es-proxy:v1.1, which handles the AWS V4 signing for us and exposes OpenSearch on localhost port 9200.

When I ran my tests with this Flux (used to find data that was inserted with missing params), the stream didn't start at all. I usually aborted my tests after 5-10 min of inactivity, although the code was just
flux.log().subscribe(), so something should have happened.

We came up with a possible workaround and used pagination, so we could limit the size of each pull to something like 50 and keep the Elasticsearch JSON results small, but then we were capped at 10.000.
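
The cap on the paginated approach most likely comes from the Elasticsearch index setting index.max_result_window, which defaults to 10.000 and rejects a search when from + size exceeds it. A minimal sketch of that arithmetic in plain Java follows; the class and method names are illustrative assumptions and not part of Spring Data Elasticsearch:

```java
final class ResultWindow {

    // Default value of the Elasticsearch setting index.max_result_window.
    static final int DEFAULT_MAX_RESULT_WINDOW = 10_000;

    // A from/size page is accepted only if from + size stays within the window.
    static boolean fitsInWindow(int page, int size, int maxResultWindow) {
        long from = (long) page * size;
        return from + size <= maxResultWindow;
    }

    // Zero-based index of the last page that can still be requested
    // (-1 if not even the first page fits).
    static int lastReachablePage(int size, int maxResultWindow) {
        return maxResultWindow / size - 1;
    }

    public static void main(String[] args) {
        int size = 50; // page size mentioned in the comment above
        System.out.println(lastReachablePage(size, DEFAULT_MAX_RESULT_WINDOW)); // 199
        System.out.println(fitsInWindow(199, size, DEFAULT_MAX_RESULT_WINDOW)); // true
        System.out.println(fitsInWindow(200, size, DEFAULT_MAX_RESULT_WINDOW)); // false
    }
}
```

With a page size of 50, page 199 is the last one that fits, which is exactly 10.000 documents. Shrinking the page size does not help, because the window limits the offset, not the page.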

I would gladly have just one stream and not handle the pagination myself. Is there a way to set how much each scroll batch contains, something like a fetch size, that I can set in the configs, in the index annotation, or on my query?

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 11, 2021
@sothawo
Collaborator

sothawo commented Dec 12, 2021

Properties like spring.elasticsearch.webclient.max-in-memory-size are not processed by Spring Data Elasticsearch but by Spring Boot; I am not sure how Spring Boot uses this to configure Spring Data Elasticsearch.

Spring Data Elasticsearch is normally configured with a client configuration (https://docs.spring.io/spring-data/elasticsearch/docs/current/reference/html/#elasticsearch.clients.configuration). In my sample app I have the following:

@Configuration
public class ReactiveRestClientConfig extends AbstractReactiveElasticsearchConfiguration {

	@Bean
	public ClientConfiguration clientConfiguration() {
		return ClientConfiguration.builder() //
			.connectedTo("localhost:9200") //
			.withProxy("localhost:8080")
			.withClientConfigurer((ReactiveRestClients.WebClientConfigurationCallback) webClient -> {
				ExchangeStrategies exchangeStrategies = ExchangeStrategies.builder()
					.codecs(configurer -> configurer.defaultCodecs()
						.maxInMemorySize(-1))
					.build();
				return webClient.mutate().exchangeStrategies(exchangeStrategies).build();
			})
			.build();
	}
}

Notice that I remove the limit for the memory buffer here.

I had a deeper look at the value used to retrieve scroll batches: for any search request, whether a "normal" search or a "scroll" search, it is set to the maximum value for results when no paging or limit is defined, and this value is 10.000.

So this results in batches of 10.000 being retrieved when scroll searching, which might be too much depending on the size of the documents returned.
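
To make the sizing concern concrete, here is a back-of-the-envelope sketch in plain Java. It assumes that a whole batch response has to fit into the buffer, and it ignores JSON overhead. The 20 MB budget is the max-in-memory-size mentioned earlier in this thread; the average document sizes are made-up figures, and the helper name is hypothetical:

```java
final class ScrollBatchSizing {

    // Largest number of documents whose combined size stays within the buffer budget.
    // Returns at least 1, since a batch cannot be smaller than one document.
    static int maxBatchSize(long bufferBytes, long avgDocumentBytes) {
        if (avgDocumentBytes <= 0) {
            throw new IllegalArgumentException("avgDocumentBytes must be positive");
        }
        long docs = bufferBytes / avgDocumentBytes;
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, docs));
    }

    public static void main(String[] args) {
        long budget = 20L * 1024 * 1024; // 20 MB buffer
        System.out.println(maxBatchSize(budget, 2 * 1024));        // 2 KB documents: 10240
        System.out.println(maxBatchSize(budget, 2 * 1024 * 1024)); // 2 MB documents: 10
    }
}
```

Under these assumptions, a batch of 10.000 only fits when documents are around 2 KB; documents of a few megabytes leave room for about ten per batch.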

I think we could introduce a maxScrollBatchSize property that can be set explicitly on a query object and that would be used when a scroll query is sent. For queries from repository functions which are created under the hood this would default to a value that can be set in the configuration bean. If no value is defined, then the existing default of 10.000 would be used.
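
The fallback order proposed here could be sketched as follows. This is plain Java with hypothetical names; it only illustrates the proposal and is not an existing Spring Data Elasticsearch API:

```java
import java.util.OptionalInt;

final class ScrollBatchSizeResolver {

    // Existing default described above.
    static final int DEFAULT_BATCH_SIZE = 10_000;

    // Hypothetical resolution order: the value set on the query wins,
    // then the default from the configuration bean, then 10.000.
    static int resolve(OptionalInt queryValue, OptionalInt configuredDefault) {
        if (queryValue.isPresent()) {
            return queryValue.getAsInt();
        }
        return configuredDefault.orElse(DEFAULT_BATCH_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(resolve(OptionalInt.of(100), OptionalInt.of(500)));  // 100
        System.out.println(resolve(OptionalInt.empty(), OptionalInt.of(500)));  // 500
        System.out.println(resolve(OptionalInt.empty(), OptionalInt.empty()));  // 10000
    }
}
```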

What do you think about that?

@sothawo sothawo added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Dec 12, 2021
@Muenze
Author

Muenze commented Dec 12, 2021

Hi,

Our only concern with this approach is that we would like to configure everything else from properties (they get injected from the k8s vault), so defining the endpoint directly in code would override that and be useless for us. This is why we tried the webclient property, and it evidently worked: we debugged and found that the configured WebClient had the limit set to our specified value.

It would be awesome if we could have this setting either on query level or on index level (which would also be fine).
I think we might be using it wrong: some of the data we need to put into ES is binary data that is serialized into JSON and then loaded. We begin to struggle with a scroll size in the high hundreds to low thousands, so somewhere there must be some kind of barrier for us.

Is there a way we could improve the situation for us while waiting for the change? Maybe publish a hotfix version of spring-data-elasticsearch on our own Nexus and use that one, maybe one with a hardcoded scroll size of 1.000, so we could make the code changes and not use the pagination solution?

I think on Monday I will check out spring-data-elasticsearch and start looking for a code change that could help us do it in one single scroll Flux.

Thanks so much for your time and your help :)

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 12, 2021
@sothawo
Collaborator

sothawo commented Dec 13, 2021

Fork the repo, check out the 4.3.x branch. In org.springframework.data.elasticsearch.core.ReactiveElasticsearchTemplate change the following method:

	private Flux<SearchDocument> doFind(Query query, Class<?> clazz, IndexCoordinates index) {

		return Flux.defer(() -> {

			SearchRequest request = requestFactory.searchRequest(query, clazz, index);
			boolean useScroll = !(query.getPageable().isPaged() || query.isLimiting());
			request = prepareSearchRequest(request, useScroll);

			if (useScroll) {
				request.source().size(100);  // <=== set the desired scroll batch size
				return doScroll(request);
			} else {
				return doFind(request);
			}
		});
	}

Set the version to 4.3.0-your-fix in pom.xml line 8 and build it with

./mvnw verify

You need to have docker running for the integration tests. If you just want to build without running the tests, add -DskipTests to the mvn call.
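
As a rough illustration of what the hard-coded size(100) changes, the sketch below counts the non-empty batches a scroll would deliver for the 30.019 matching persons from the earlier test. It is plain Java with hypothetical names, does not call Elasticsearch, and leaves out any bookkeeping requests the client sends around the batches:

```java
import java.util.ArrayList;
import java.util.List;

final class ScrollBatches {

    // Sizes of the non-empty batches needed to deliver totalHits documents.
    static List<Integer> batchSizes(int totalHits, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = totalHits; remaining > 0; remaining -= batchSize) {
            sizes.add(Math.min(remaining, batchSize));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(batchSizes(30_019, 10_000));     // [10000, 10000, 10000, 19]
        System.out.println(batchSizes(30_019, 100).size()); // 301
    }
}
```

The smaller batch size trades a few large responses for many small ones, so each response needs far less buffer memory at the cost of more round trips.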

@sothawo sothawo added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Dec 13, 2021
@Muenze
Author

Muenze commented Dec 13, 2021

I'm so sorry to ask again, because you were so nice and helpful.
I cannot access libs-snapshot or even libs-release on spring. Is there any way for me to access it, like signing up for something on a webpage to get access?
Sadly the method is private, otherwise we would have just subclassed the template class.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 13, 2021
@sothawo
Collaborator

sothawo commented Dec 13, 2021

That's strange, this is openly accessible. I can access these repos with a Tor browser, where I definitely do not have any cookies or access credentials set. Some firewall/network rules on your side perhaps?

@Muenze
Author

Muenze commented Dec 13, 2021

Thanks, we were able to do it :)
