Reactive way to activate scroll to get around 10.000 limit #2023
Comments
Which version of Spring Data Elasticsearch do you use? Spring Data Elasticsearch uses scrolling when searching with reactive repositories unless you add a paging request to the repository method or use a limiting query.

I just had a test with the current version 4.3 against an index that contains 60,000 entries (dummy persons), of which more than 30,000 have their last name set to the same value. The repository I use:

```java
public interface PersonRepository extends ReactiveElasticsearchRepository<Person, Long> {
    Flux<Person> findAllByLastName(String name);
}
```

This returns all 30,000+ matching entries, so currently this is not reproducible for me.
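The rule described above (scroll unless paged or limiting) can be sketched as a tiny predicate. This is an illustrative plain-Java mirror of the rule, not the actual Spring Data Elasticsearch source:

```java
// Illustrative only: a reactive repository search falls back to scrolling
// exactly when the query is neither paged nor limiting.
public class UseScroll {

    static boolean useScroll(boolean paged, boolean limiting) {
        return !(paged || limiting);
    }

    public static void main(String[] args) {
        System.out.println(useScroll(false, false)); // plain Flux query -> scroll
        System.out.println(useScroll(true, false));  // Pageable parameter -> paged search
        System.out.println(useScroll(false, true));  // limiting query -> normal search
    }
}
```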
Spring Data Elasticsearch is normally configured with a client configuration (https://docs.spring.io/spring-data/elasticsearch/docs/current/reference/html/#elasticsearch.clients.configuration); in my sample app I have the following:

```java
@Configuration
public class ReactiveRestClientConfig extends AbstractReactiveElasticsearchConfiguration {

    @Bean
    public ClientConfiguration clientConfiguration() {
        return ClientConfiguration.builder() //
                .connectedTo("localhost:9200") //
                .withProxy("localhost:8080") //
                .withClientConfigurer((ReactiveRestClients.WebClientConfigurationCallback) webClient -> {
                    ExchangeStrategies exchangeStrategies = ExchangeStrategies.builder()
                            .codecs(configurer -> configurer.defaultCodecs()
                                    .maxInMemorySize(-1))
                            .build();
                    return webClient.mutate().exchangeStrategies(exchangeStrategies).build();
                })
                .build();
    }
}
```
Notice that I remove the limit for the memory buffer here.

I had a deeper look at the value used to retrieve scroll batches: for any search request, whether a "normal" search or a "scroll" search, the result size is set to the maximum value when no paging or limit is defined, and that maximum is 10,000. So scroll searches retrieve batches of 10,000 documents, which might be too much depending on the size of the documents returned. I think we could introduce a way to configure this scroll batch size. What do you think about that?
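To see why the default buffer is a problem at this batch size, here is a rough back-of-the-envelope calculation. The ~2 MB document size is an assumed example value from this thread; 256 KB is Spring WebFlux's default codec `maxInMemorySize`:

```java
// Rough illustration of why a 10,000-document scroll batch can blow past
// the WebClient codec buffer. The per-document size is an assumption.
public class BufferMath {

    public static void main(String[] args) {
        long batchSize = 10_000;
        long avgDocBytes = 2L * 1024 * 1024;   // assume ~2 MB per document
        long responseBytes = batchSize * avgDocBytes;
        long defaultBuffer = 256 * 1024;       // WebFlux default maxInMemorySize

        System.out.println(responseBytes / (1024 * 1024 * 1024)); // batch size in GiB
        System.out.println(responseBytes > defaultBuffer);        // exceeds the buffer?
    }
}
```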
Hi, our only concern with this approach is that we would like to configure everything else from configs (because they get injected with properties from a k8s vault), so defining the endpoint directly would override that and would be useless for us. This is why we tried it with this WebClient, and it obviously worked: we debugged and found that the configured WebClient had this limit set to our specified value.

It would be awesome if we could have this annotation either on query level or on index level (which would also be fine). Is there a way we could improve the situation for us while waiting for the change? Maybe publish a hotfix version of spring-data-elasticsearch on our own Nexus and use that one, perhaps one with a hardcoded 1,000 scroll size, so we could make the code changes and not use that pagination solution?

I think on Monday I will check out spring-data-elasticsearch and start looking for a code change that could help us do it in one single scroll Flux. Thanks so much for your time and your help :)
Fork the repo and check out the 4.3.x branch. In the `doFind` method, set the desired scroll batch size:

```java
private Flux<SearchDocument> doFind(Query query, Class<?> clazz, IndexCoordinates index) {

    return Flux.defer(() -> {

        SearchRequest request = requestFactory.searchRequest(query, clazz, index);
        boolean useScroll = !(query.getPageable().isPaged() || query.isLimiting());
        request = prepareSearchRequest(request, useScroll);

        if (useScroll) {
            request.source().size(100); // <=== set the desired scroll batch size
            return doScroll(request);
        } else {
            return doFind(request);
        }
    });
}
```

Set the version to 4.3.0-your-fix in your build.
You need to have Docker running for the integration tests. If you just want to build without running the tests, skip them in the build.
I'm so sorry to ask again, because you were so nice and helpful.
That's strange, this is openly accessible. I can access these repos with a Tor browser, where I definitely do not have any cookies or access credentials set. Some firewall/network rules on your side, perhaps?
Thanks, we were able to do it :)
Hello,

sorry, I could not find any hint for this in your documentation. I read that to get around the 10k limitation (with the data repository tools) I have to define the return type as Stream. However, we are using Spring Data in a reactive way with AWS OS 1.0 (works like a charm), and our findAllBy(searchterms): Flux is also capped at 10,000 elements, so doing something like a report or a cleanup is nearly impossible: we store >350 GB of data, and 10,000 elements is not even one hour's worth.

I thought that Flux is the Stream equivalent in the reactive world, so I'm wondering how to get around that hard barrier. Could you please help me with that?

I had to paginate my Flux, because my search would return way more than 10k elements, more like something that would exceed my memory, since each element contains a few megabytes.
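For reference, the paging workaround mentioned above boils down to a loop like the following. The Elasticsearch call is replaced by a stubbed fetch function so the sketch is self-contained; none of the names here are actual Spring Data API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hedged sketch of fetching all results page by page instead of in one
// capped search. fetch(page, size) stands in for a real paged query.
public class PagedFetch {

    // Keeps requesting pages until an empty page signals the end.
    static <T> List<T> fetchAll(BiFunction<Integer, Integer, List<T>> fetch, int pageSize) {
        List<T> all = new ArrayList<>();
        int page = 0;
        while (true) {
            List<T> batch = fetch.apply(page, pageSize);
            if (batch.isEmpty()) {
                break;
            }
            all.addAll(batch);
            page++;
        }
        return all;
    }

    public static void main(String[] args) {
        // Simulate 25 documents served in pages of 10.
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            docs.add(i);
        }
        List<Integer> result = fetchAll((page, size) -> {
            int from = Math.min(page * size, docs.size());
            int to = Math.min(from + size, docs.size());
            return docs.subList(from, to);
        }, 10);
        System.out.println(result.size()); // all documents collected
    }
}
```

In a reactive setup the same idea applies per page (issue one paged query, emit its elements, then request the next page), which trades the single scroll for bounded memory per request.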