Save bubble sort with iterator and unseve with raw pointers strange speed

292 views Asked by At

Recently I was playing with bubble sorting and saw a strange phenomenon that I can't understand.

For some reason, code on raw pointers is slower than if I use iterator (if run as cargo run without release). Here is a typical output of the program:

PS D:\RustProjects\bubble> cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target\debug\bubble.exe`
time_unoptimal   = 20.432832ms
time_unsafe_one = 17.976703ms
time_unsafe_two = 24.748904ms
time_iterator   = 16.05494ms 
number = 136280

If run as cargo run --release bubble_sort_unsafe_one, and bubble_sort_iterator show almost equal times, but bubble_sort_iterator faster a little. And the slowest is bubble_sort_unsafe_two. Why it is so?

PS D:\RustProjects\bubble> cargo run --release
   Compiling bubble v0.1.0 (D:\RustProjects\bubble)
    Finished release [optimized] target(s) in 1.32s
     Running `target\release\bubble.exe`
time_unoptimal   = 561.994µs
time_unsafe_one = 274.795µs
time_unsafe_two = 597.33µs 
time_iterator   = 265.299µs
number = 136280

Below is the code for the program. Some excess code to prevent the optimizer from throwing out all the sorting when building with the --release flag.

fn bubble_sort_unoptimal<T: Ord>(arr: &mut [T]) {
    let len = arr.len();
    for i in 0..len {
        for j in 0..(len - 1 - i) {
            if arr[j] > arr[j + 1] {
                arr.swap(j, j + 1);
            }
        }
    }
}

fn bubble_sort_unsafe_one<T: Ord>(arr: &mut [T]) {
    let len = arr.len();
    for i in 0..len {
        let ptr = arr.as_mut_ptr();
        for j in 0..(len - 1 - i) {
            unsafe {
                let left = ptr.add(j);
                let right = ptr.add(j+1);
                if *left > *right {
                    std::ptr::swap(left, right);
                }
            }
        }
    }
}

fn bubble_sort_unsafe_two<T: Ord>(arr: &mut [T]) {
    let len = arr.len();
    for i in 0..len {     
        for j in 0..(len - 1 - i) {
            unsafe {
                if arr.get_unchecked(j) > arr.get_unchecked(j + 1) {
                    let ptr = arr.as_mut_ptr();
                    std::ptr::swap(ptr.add(j), ptr.add(j + 1));
                }
            }
        }
    }
}

fn bubble_sort_iterator<T: Ord>(array: &mut [T]) {
    let len = array.len();
    for i in 0..len {
        let mut iter = array[..(len - i)].iter_mut();
        match iter.next() {
            Some(mut left) => loop {
                match iter.next() {
                    Some(right) => {
                        if left > right {
                            std::mem::swap(left, right)
                        }
                        left = right;
                    }
                    None => break
                }
            },
            None => {},
        }
    }
}

fn main() {
    use std::time::Instant;
    use std::time::Duration;
    
    let mut number: i32 = 0;

    let mut time_vec_iterator: Vec<Duration> = Vec::with_capacity(510);
    for i in 0..500 {
        let mut vec1: Vec<i32> = (1..=1000).rev().collect();
        let start = Instant::now();
        bubble_sort_iterator(&mut vec1);
        let duration = start.elapsed();
        time_vec_iterator.push(duration);
        number += vec1.into_iter().take(100).sum::<i32>() / (i + 1);
    }
    let time_iterator: Duration = time_vec_iterator.into_iter().sum::<Duration>() / 500;

    let mut time_vec_unoptimal: Vec<Duration> = Vec::with_capacity(510);
    for i in 0..500 {
        let mut vec1: Vec<i32> = (1..=1000).rev().collect();
        let start = Instant::now();
        bubble_sort_unoptimal(&mut vec1);
        let duration = start.elapsed();
        time_vec_unoptimal.push(duration);
        number += vec1.into_iter().take(100).sum::<i32>() / (i + 1);
    }
    let time_unoptimal: Duration = time_vec_unoptimal.into_iter().sum::<Duration>() / 500;

    let mut time_vec_unsafe_one: Vec<Duration> = Vec::with_capacity(510);
    for i in 0..500 {
        let mut vec1: Vec<i32> = (1..=1000).rev().collect();
        let start = Instant::now();
        bubble_sort_unsafe_one(&mut vec1);
        let duration = start.elapsed();
        time_vec_unsafe_one.push(duration);
        number += vec1.into_iter().take(100).sum::<i32>() / (i + 1);
    }
    let time_unsafe_one: Duration = time_vec_unsafe_one.into_iter().sum::<Duration>() / 500;

    let mut time_vec_unsafe_two: Vec<Duration> = Vec::with_capacity(510);
    for i in 0..500 {
        let mut vec1: Vec<i32> = (1..=1000).rev().collect();
        let start = Instant::now();
        bubble_sort_unsafe_two(&mut vec1);
        let duration = start.elapsed();
        time_vec_unsafe_two.push(duration);
        number += vec1.into_iter().take(100).sum::<i32>() / (i + 1);
    }
    let time_unsafe_two: Duration = time_vec_unsafe_two.into_iter().sum::<Duration>() / 500;

    println!("time_unoptimal   = {:?}", time_unoptimal);
    println!("time_unsafe_one = {:?}", time_unsafe_one);
    println!("time_unsafe_two = {:?}", time_unsafe_two);
    println!("time_iterator   = {:?}", time_iterator);

    println!("number = {}", number);
}

2

There are 2 answers

0
Esdeseserdt On

I believe the reason bubble_sort_two() is so slow is because of the number of pointers you grab from the array in the function- in bubble_sort_one() is much less. in bubble_sort_one, the mutable pointer to the array is grabbed once and then used on every element not yet sorted before being dropped. The references to the array elements are also reused in the swap statement, which improves performance. In bubble_sort_two, a new array pointer is grabbed every time two elements need to swap, and a new pointer is grabbed both for comparison and the swap statement.

0
Aiden4 On

I set up a benchmark with criterion and on my machine at least unsafe_one is a tiny bit faster than the iterator version.

Benchmarking bubble sort/unoptimal: Collecting 100 samples in estimated 5.9871 s (10k it

bubble sort/unoptimal   time:   [575.58 us 581.99 us 588.69 us]
Found 12 outliers among 100 measurements (12.00%)
  10 (10.00%) high mild
  2 (2.00%) high severe
Benchmarking bubble sort/unsafe 1: Collecting 100 samples in estimated 6.4862 s (20k it

bubble sort/unsafe 1    time:   [299.70 us 301.84 us 303.88 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking bubble sort/unsafe 2: Collecting 100 samples in estimated 6.6958 s (10k it

bubble sort/unsafe 2    time:   [606.89 us 607.86 us 608.88 us]
Found 19 outliers among 100 measurements (19.00%)
  13 (13.00%) high mild
  6 (6.00%) high severe
Benchmarking bubble sort/iterator: Collecting 100 samples in estimated 5.5519 s (15k it

bubble sort/iterator    time:   [323.87 us 324.12 us 324.41 us]

My benchmarking code:

use bubble_sort::*;
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

fn bench_bubble_sort(c: &mut Criterion) {
    let mut group = c.benchmark_group("bubble sort");
    group.bench_function("unoptimal", |bencher| {
        bencher.iter_batched_ref(setup, |v| bubble_sort_unoptimal(v), BatchSize::SmallInput)
    });
    group.bench_function("unsafe 1", |bencher| {
        bencher.iter_batched_ref(setup, |v| bubble_sort_unsafe_one(v), BatchSize::SmallInput)
    });
    group.bench_function("unsafe 2", |bencher| {
        bencher.iter_batched_ref(setup, |v| bubble_sort_unsafe_two(v), BatchSize::SmallInput)
    });
    group.bench_function("iterator", |bencher| {
        bencher.iter_batched_ref(setup, |v| bubble_sort_iterator(v), BatchSize::SmallInput)
    });
}
fn setup() -> Vec<i32> {
    (1..=1000).rev().collect()
}
criterion_group!(benches, bench_bubble_sort);
criterion_main!(benches);

Note that the various bubble sort functions were placed in lib.rs with a #[inline(never)] attribute tacked on.

Now to your main question: why is the iterator version so fast? The reason is that rustc is really good at optimizing iterators. As the comparison with the unoptimal version shows, iterators are faster than a for loop with explicit indexing because they don't have to deal with bounds checking. By the time all the fluff is compiled away, it will look very similar to unsafe_one, albeit with a slightly different algorithm. The fact that the relative performances are different on my and your computer probably comes down to differences in compiler versions and our computers.

I'm not entirely sure why unsafe_two is so slow, but you have somehow managed to trick the optimizer into doing more work than it needs to.

Moral of the story? Prefer iterators over indexing, and if you're concerned about performance always benchmark, especially if you are considering using some unsafe code to attempt to speed things up.