Reduced allocations in merge_sort for short vectors #12029
Conversation
Awesome! LGTM

Cool! Will review in a little while.
I'm not sure about this change; maybe it would be better to handle the short case completely separately, and do the initial sort into the temporary space for longer vectors. (Of course this is getting a little non-DRY. :( )

There was some discussion of having both
```
// `compare` fails.
let mut working_space = with_capacity(2 * len);
// these both are buffers of length `len`.
let mut buf_dat = working_space.as_mut_ptr();
```
This assignment is never read, and could just be `let mut buf_dat = buf_v;`, with a comment.
It is read by `buf_tmp`, but that could be inlined if that is clearer.
Oh, yes, so it is; inlining would be good, or just `working_space.unsafe_mut_ref(len)` for `buf_tmp`.
In any case, I would prefer an out-of-place insertion sort for large vectors (or at least experimenting with it), which would mean this discussion is moot anyway.
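For readers following the discussion, here is a minimal sketch (in modern Rust, not the `uint`-era code under review) of the kind of in-place insertion sort the PR's short path uses; the `less` comparator stands in for the PR's `compare` closure, and the function name is illustrative:

```rust
// Illustrative in-place insertion sort for short runs: no temporary buffer,
// elements are moved left by adjacent swaps until the prefix is sorted.
fn insertion_sort<T, F>(v: &mut [T], mut less: F)
where
    F: FnMut(&T, &T) -> bool,
{
    for i in 1..v.len() {
        // Walk the element at `i` left until `v[..=i]` is sorted.
        let mut j = i;
        while j > 0 && less(&v[j], &v[j - 1]) {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}
```

Because it only swaps adjacent elements when the comparator says so, this variant is stable and never holds an element outside the original vector, matching the invariant discussed later in this thread.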
👍
I have updated the commit based on the minor comments that you had. I am certainly open to making the in-place sorting a separate path, but I am not sure what benefit it would bring if it would be doing almost the same thing. The in-place sorting does not appear to have a performance penalty (the benchmarks show a marginal improvement for the non-short case with in-place sorting, although that is hardly authoritative). Keeping the short path separate does afford more future flexibility to change them independently, but nothing prevents that from happening later. What is your motivation for doing so?
I was mainly thinking of the extra cost of doing two copies while insertion sorting (i.e. copying into […]). I imagine it won't be any different for "small" types like […].
Here are the results of some benchmarks that I ran locally:

- in-place insertion (32 elem max run)
- in-place insertion (16 elem max run)
- in-place insertion (8 elem max run)
- out-of-place (32 run insertion sort)
- out-of-place (16 run insertion sort)
- out-of-place (8 run insertion sort)
Ah, cool! Thanks for investigating. It looks like, as you say: (a) there's a small benefit to out-of-place sorting for large types, but (b) there's a much larger benefit to using a smaller insertion sort threshold. I imagine we could have something like

```
static BASE_INSERTION: uint = 32;
static LARGE_INSERTION: uint = 16;

let insertion = if size_of::<T>() <= 16 { BASE_INSERTION } else { LARGE_INSERTION };
```

But I'm slightly nervous to include something platform specific like this. If you add the […] I'm also happy to approve any combination of out-of-place for longer vectors and naively "tuned" insertion runs as above.

(I would like the bug filed no matter what you choose; and if you do decide to do the tuning, it should have a comment along the lines of "// FIXME # this adjustment seems to make sorting vectors of large elements a little faster on some platforms, but hasn't been tested/tuned extensively" (or something).)
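Translated to modern Rust (where `uint` became `usize`), the size-based tuning suggested above might look like the following sketch; the 16-byte cutoff and the two run lengths are just the values floated in the discussion, not tested constants, and `insertion_run_len` is a hypothetical name:

```rust
use std::mem::size_of;

// Run lengths from the discussion above; illustrative, not tuned.
const BASE_INSERTION: usize = 32;
const LARGE_INSERTION: usize = 16;

/// Pick the insertion-sort run length from the element size:
/// small elements get the longer run, large elements the shorter one.
fn insertion_run_len<T>() -> usize {
    if size_of::<T>() <= 16 {
        BASE_INSERTION
    } else {
        LARGE_INSERTION
    }
}
```

Since `size_of::<T>()` is a compile-time constant, the branch is resolved per monomorphization and costs nothing at runtime.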
In for a penny, in for a pound. I opened #12092, added the […]
Added a separate in-place insertion sort for short vectors. Increased the threshold for insertion sort from 8 to 32 elements for small types and 16 for larger types. Added benchmarks for sorting larger types.
@huonw Removed parentheses, and added comment.
This pull request:

1) Changes the initial insertion sort to be in-place, and defers allocation of the working set until a merge is needed.
2) Increases the maximum run length for which insertion sort is used from 8 to 32 elements. This increases the size of vectors that will not allocate, and reduces the number of merge passes by two. It seemed to be the sweet spot in the benchmarks that I ran.

Here are the results of some benchmarks. Note that they are sorting u64s, so types that are more expensive to compare or copy may have different behaviors.

Before changes:

```
test vec::bench::sort_random_large  bench: 719753 ns/iter (+/- 130173) = 111 MB/s
test vec::bench::sort_random_medium bench:   4726 ns/iter (+/- 742)    = 169 MB/s
test vec::bench::sort_random_small  bench:    344 ns/iter (+/- 76)     = 116 MB/s
test vec::bench::sort_sorted        bench: 437244 ns/iter (+/- 70043)  = 182 MB/s
```

Deferred allocation (8 element insertion sort):

```
test vec::bench::sort_random_large  bench: 702630 ns/iter (+/- 88158)  = 113 MB/s
test vec::bench::sort_random_medium bench:   4529 ns/iter (+/- 497)    = 176 MB/s
test vec::bench::sort_random_small  bench:    185 ns/iter (+/- 49)     = 216 MB/s
test vec::bench::sort_sorted        bench: 425853 ns/iter (+/- 60907)  = 187 MB/s
```

Deferred allocation (16 element insertion sort):

```
test vec::bench::sort_random_large  bench: 692783 ns/iter (+/- 165837) = 115 MB/s
test vec::bench::sort_random_medium bench:   4434 ns/iter (+/- 722)    = 180 MB/s
test vec::bench::sort_random_small  bench:    187 ns/iter (+/- 38)     = 213 MB/s
test vec::bench::sort_sorted        bench: 393783 ns/iter (+/- 85548)  = 203 MB/s
```

Deferred allocation (32 element insertion sort):

```
test vec::bench::sort_random_large  bench: 682556 ns/iter (+/- 131008) = 117 MB/s
test vec::bench::sort_random_medium bench:   4370 ns/iter (+/- 1369)   = 183 MB/s
test vec::bench::sort_random_small  bench:    179 ns/iter (+/- 32)     = 223 MB/s
test vec::bench::sort_sorted        bench: 358353 ns/iter (+/- 65423)  = 223 MB/s
```

Deferred allocation (64 element insertion sort):

```
test vec::bench::sort_random_large  bench: 712040 ns/iter (+/- 132454) = 112 MB/s
test vec::bench::sort_random_medium bench:   4425 ns/iter (+/- 784)    = 180 MB/s
test vec::bench::sort_random_small  bench:    179 ns/iter (+/- 81)     = 223 MB/s
test vec::bench::sort_sorted        bench: 317812 ns/iter (+/- 62675)  = 251 MB/s
```

This is the best I could manage with the basic merge sort while keeping the invariant that the original vector must contain each element exactly once when the comparison function is called. If one is not married to a stable sort, an in-place n*log(n) sorting algorithm may have better performance in some cases.

for #12011

cc @huonw
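The deferred-allocation strategy described above can be sketched in safe, modern Rust. This is a simplified, clone-based illustration, not the PR's raw-pointer implementation: `INSERTION` mirrors the 32-element run length, short vectors never allocate, and the merge buffer is created only once a merge is actually needed.

```rust
// Run length below which we sort entirely in place; mirrors the PR's choice.
const INSERTION: usize = 32;

fn insertion_sort<T: Ord>(v: &mut [T]) {
    for i in 1..v.len() {
        let mut j = i;
        while j > 0 && v[j] < v[j - 1] {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}

fn merge_sort<T: Ord + Clone>(v: &mut [T]) {
    if v.len() <= INSERTION {
        // Short path: one in-place insertion sort, no allocation at all.
        insertion_sort(v);
        return;
    }
    // Long path: sort each INSERTION-sized run in place, then merge runs
    // pairwise through a buffer that is only allocated now.
    for chunk in v.chunks_mut(INSERTION) {
        insertion_sort(chunk);
    }
    let mut buf: Vec<T> = Vec::with_capacity(v.len());
    let mut width = INSERTION;
    while width < v.len() {
        for start in (0..v.len()).step_by(2 * width) {
            let mid = (start + width).min(v.len());
            let end = (start + 2 * width).min(v.len());
            // Stable merge of v[start..mid] and v[mid..end]: on ties the
            // left element is taken first.
            buf.clear();
            let (mut i, mut j) = (start, mid);
            while i < mid && j < end {
                if v[j] < v[i] {
                    buf.push(v[j].clone());
                    j += 1;
                } else {
                    buf.push(v[i].clone());
                    i += 1;
                }
            }
            buf.extend_from_slice(&v[i..mid]);
            buf.extend_from_slice(&v[j..end]);
            v[start..end].clone_from_slice(&buf);
        }
        width *= 2;
    }
}
```

Each doubling of `width` is one merge pass, which is why raising the initial run length from 8 to 32 removes two passes, as the PR description notes.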