Reduced allocations in merge_sort for short vectors #12029
Conversation
Awesome! LGTM

Cool! Will review in a little while.
I'm not sure about this change; maybe it would be better to handle the short case completely separately, and do the initial sort into the temporary space for longer vectors. (Of course this is getting a little non-DRY. :( )

There was some discussion of having both
```
// `compare` fails.
let mut working_space = with_capacity(2 * len);
// these both are buffers of length `len`.
let mut buf_dat = working_space.as_mut_ptr();
```
This assignment is never read, and could just be `let mut buf_dat = buf_v;`, with a comment.
It is read by `buf_tmp`, but that could be inlined if that is clearer.
Oh, yes, so it is; inlining would be good, or just `working_space.unsafe_mut_ref(len)` for `buf_tmp`.
In any case, I would prefer an out-of-place insertion sort for large vectors (or at least experimenting with it), which would mean this discussion is moot anyway.
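For readers following the discussion, here is a minimal sketch (in modern Rust, not the `uint`-era code under review) of the kind of in-place insertion sort the PR's short path uses; the `less` comparator stands in for the PR's `compare` closure, and the function name is illustrative:

```rust
// Illustrative in-place insertion sort for short runs: no temporary buffer,
// elements are moved left by adjacent swaps until the prefix is sorted.
fn insertion_sort<T, F>(v: &mut [T], mut less: F)
where
    F: FnMut(&T, &T) -> bool,
{
    for i in 1..v.len() {
        // Walk the element at `i` left until `v[..=i]` is sorted.
        let mut j = i;
        while j > 0 && less(&v[j], &v[j - 1]) {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}
```

Because it only swaps adjacent elements when the comparator says so, this variant is stable and never holds an element outside the original vector, matching the invariant discussed later in this thread.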
👍
I have updated the commit based on the minor comments that you had. I am certainly open to making the in-place sorting a separate path, but I am not sure what benefit it would bring if it would be doing almost the same thing. The in-place sorting does not appear to have a performance penalty (the benchmarks show a marginal improvement for the non-short case with in-place sorting, although that is hardly authoritative). Keeping the short path separate does afford more future flexibility to change them independently, but nothing prevents that from happening later. What is your motivation for doing so?
I was mainly thinking of the extra cost of doing two copies while insertion sorting (i.e. copying into […]). I imagine it won't be any different for "small" types like […].
Here are the results of some benchmarks that I ran locally:

- in-place insertion (32 elem max run)
- in-place insertion (16 elem max run)
- in-place insertion (8 elem max run)
- out-of-place (32 run insertion sort)
- out-of-place (16 run insertion sort)
- out-of-place (8 run insertion sort)
Ah, cool! Thanks for investigating. It looks like, as you say: (a) there's a small benefit to out-of-place sorting for large types, but (b) there's a much larger benefit to using a smaller insertion sort threshold. I imagine we could have something like

```
static BASE_INSERTION: uint = 32;
static LARGE_INSERTION: uint = 16;

let insertion = if size_of::<T>() <= 16 { BASE_INSERTION } else { LARGE_INSERTION };
```

But I'm slightly nervous to include something platform specific like this. If you add the […] I'm also happy to approve any combination of out-of-place for longer vectors and naively "tuned" insertion runs as above.

(I would like the bug filed no matter what you choose; and if you do decide to do the tuning, it should have a comment along the lines of "// FIXME # this adjustment seems to make sorting vectors of large elements a little faster on some platforms, but hasn't been tested/tuned extensively" (or something).)
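Translated to modern Rust (where `uint` became `usize`), the size-based tuning suggested above might look like the following sketch; the 16-byte cutoff and the two run lengths are just the values floated in the discussion, not tested constants, and `insertion_run_len` is a hypothetical name:

```rust
use std::mem::size_of;

// Run lengths from the discussion above; illustrative, not tuned.
const BASE_INSERTION: usize = 32;
const LARGE_INSERTION: usize = 16;

/// Pick the insertion-sort run length from the element size:
/// small elements get the longer run, large elements the shorter one.
fn insertion_run_len<T>() -> usize {
    if size_of::<T>() <= 16 {
        BASE_INSERTION
    } else {
        LARGE_INSERTION
    }
}
```

Since `size_of::<T>()` is a compile-time constant, the branch is resolved per monomorphization and costs nothing at runtime.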
In for a penny, in for a pound. I opened #12092, added the […]
Added a separate in-place insertion sort for short vectors. Increased the threshold for insertion sort from 8 to 32 elements for small types and 16 for larger types. Added benchmarks for sorting larger types.
@huonw Removed parentheses, and added comment.
This pull request:

1) Changes the initial insertion sort to be in-place, and defers allocation of the working set until a merge is needed.
2) Increases the maximum run length for which insertion sort is used from 8 to 32 elements. This increases the size of vectors that will not allocate, and reduces the number of merge passes by two. It seemed to be the sweet spot in the benchmarks that I ran.

Here are the results of some benchmarks. Note that they are sorting u64s, so types that are more expensive to compare or copy may have different behaviors.

Before changes:

```
test vec::bench::sort_random_large  bench: 719753 ns/iter (+/- 130173) = 111 MB/s
test vec::bench::sort_random_medium bench:   4726 ns/iter (+/- 742)    = 169 MB/s
test vec::bench::sort_random_small  bench:    344 ns/iter (+/- 76)     = 116 MB/s
test vec::bench::sort_sorted        bench: 437244 ns/iter (+/- 70043)  = 182 MB/s
```

Deferred allocation (8 element insertion sort):

```
test vec::bench::sort_random_large  bench: 702630 ns/iter (+/- 88158)  = 113 MB/s
test vec::bench::sort_random_medium bench:   4529 ns/iter (+/- 497)    = 176 MB/s
test vec::bench::sort_random_small  bench:    185 ns/iter (+/- 49)     = 216 MB/s
test vec::bench::sort_sorted        bench: 425853 ns/iter (+/- 60907)  = 187 MB/s
```

Deferred allocation (16 element insertion sort):

```
test vec::bench::sort_random_large  bench: 692783 ns/iter (+/- 165837) = 115 MB/s
test vec::bench::sort_random_medium bench:   4434 ns/iter (+/- 722)    = 180 MB/s
test vec::bench::sort_random_small  bench:    187 ns/iter (+/- 38)     = 213 MB/s
test vec::bench::sort_sorted        bench: 393783 ns/iter (+/- 85548)  = 203 MB/s
```

Deferred allocation (32 element insertion sort):

```
test vec::bench::sort_random_large  bench: 682556 ns/iter (+/- 131008) = 117 MB/s
test vec::bench::sort_random_medium bench:   4370 ns/iter (+/- 1369)   = 183 MB/s
test vec::bench::sort_random_small  bench:    179 ns/iter (+/- 32)     = 223 MB/s
test vec::bench::sort_sorted        bench: 358353 ns/iter (+/- 65423)  = 223 MB/s
```

Deferred allocation (64 element insertion sort):

```
test vec::bench::sort_random_large  bench: 712040 ns/iter (+/- 132454) = 112 MB/s
test vec::bench::sort_random_medium bench:   4425 ns/iter (+/- 784)    = 180 MB/s
test vec::bench::sort_random_small  bench:    179 ns/iter (+/- 81)     = 223 MB/s
test vec::bench::sort_sorted        bench: 317812 ns/iter (+/- 62675)  = 251 MB/s
```

This is the best I could manage with the basic merge sort while keeping the invariant that the original vector must contain each element exactly once when the comparison function is called. If one is not married to a stable sort, an in-place n*log(n) sorting algorithm may have better performance in some cases.

for #12011

cc @huonw
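The deferred-allocation strategy described above can be sketched in safe, modern Rust. This is a simplified, clone-based illustration, not the PR's raw-pointer implementation: `INSERTION` mirrors the 32-element run length, short vectors never allocate, and the merge buffer is created only once a merge is actually needed.

```rust
// Run length below which we sort entirely in place; mirrors the PR's choice.
const INSERTION: usize = 32;

fn insertion_sort<T: Ord>(v: &mut [T]) {
    for i in 1..v.len() {
        let mut j = i;
        while j > 0 && v[j] < v[j - 1] {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}

fn merge_sort<T: Ord + Clone>(v: &mut [T]) {
    if v.len() <= INSERTION {
        // Short path: one in-place insertion sort, no allocation at all.
        insertion_sort(v);
        return;
    }
    // Long path: sort each INSERTION-sized run in place, then merge runs
    // pairwise through a buffer that is only allocated now.
    for chunk in v.chunks_mut(INSERTION) {
        insertion_sort(chunk);
    }
    let mut buf: Vec<T> = Vec::with_capacity(v.len());
    let mut width = INSERTION;
    while width < v.len() {
        for start in (0..v.len()).step_by(2 * width) {
            let mid = (start + width).min(v.len());
            let end = (start + 2 * width).min(v.len());
            // Stable merge of v[start..mid] and v[mid..end]: on ties the
            // left element is taken first.
            buf.clear();
            let (mut i, mut j) = (start, mid);
            while i < mid && j < end {
                if v[j] < v[i] {
                    buf.push(v[j].clone());
                    j += 1;
                } else {
                    buf.push(v[i].clone());
                    i += 1;
                }
            }
            buf.extend_from_slice(&v[i..mid]);
            buf.extend_from_slice(&v[j..end]);
            v[start..end].clone_from_slice(&buf);
        }
        width *= 2;
    }
}
```

Each doubling of `width` is one merge pass, which is why raising the initial run length from 8 to 32 removes two passes, as the PR description notes.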