random thoughts

Go back to the index.

Someone's Wrong! Membership testing in practical cases

By Antoine Grondin, Saturday March 22, 2014 - Is there an error in this post? Submit a fix here.

Obviously, when someone’s wrong on the internet, you’ve gotta do something about it. In this case, an individual on StackOverflow mentionned that using maps for membership testing was slower than simply iterating over an array, for $n$ small enough.

Their argument was that the cost of hashing the value was higher than doing a lookup in an array, for most practical sizes. In their words, practical size meant until over a million values.

Now you might be wondering why I’m not directly linking to the comment in question. The reason is that I can’t find it anymore. I only remember the comment troubled me enough to make me shout “Someone’s wrong on the internet!”

Someone's wrong on the internet!

But I had doubts; their argument could have made sense. I mean, maybe the cost of hashing is greater than iterating and comparing for $n$ smaller than $something$. I had a gut feeling it was crap, but I’m not pretentious enough to affirm it was crap before actually verifying it was crap.

All in all, finding the comment (and telling the wrongdoer they’re wrong!) doesn’t matter. What matters is The Truth. In this post, we explore the truth using the Go Programming Language, the greatest language of all (no hyperbole here).

TL;DR

Obviously this individual was wrong. Verified for $n>1$, membership testing on a map is always faster than on an slice. This means, in all cases you should not use a slice instead of a map.

<insert fancy graph here>

update: the tests that follow assume sets of string types. The same tests with int types reveal that slices are slightly faster until $n \approx 30$.

Problem

Membership testing consists of asking a datastructure whether it contains a value or not. Of the many ways to implement this, two are discussed here:

Map

Use a map[value]bool, then check if a value is in the map:

func isMapMember(m map[string]bool, key string) bool {
  _, ok := m[key]
  return ok
}

Slice

Use a []value, then iterate over all the values to check if one of them is in the slice:

func isSliceMember(s []string, key string) bool {
  for _, entry := range s {
    if key == entry {
      return true
    }
  }
  return false
}

Question

Which one of them is fastest?

Hypothesis

My hypothesis is that using a slice will be faster. I like trying to prove I’m wrong.

Prediction

If my hypothesis is indeed right, there will be a $n$ for which using a slice will be faster than using a map. This will mean the individual was right. For many use cases, membership testing will be done on sets that contain a few values.

Testing

I can think of 3 4 dimensions that might affect the results.

  1. $n$, the size of the set being tested. The claim here is that for $n$ small enough, a slice will be faster.
  2. $valSize$, the size of the individual values stored in the set. As $valSize$ increase, it is possible that the structures will perform differently than with smaller $valSize$.
  3. Whether or not the entry is in the set. It could be that map are faster at determining non-membership. Or slices. Who knows!
  4. update: the type of the values held in the set.

Methodology

Taking into consideration the above dimensions, we will benchmark the two methods of testing for membership.

The first step is to create a set of $n$ unique entries, each randomly generated strings of length $valSize$. The actual benchmarking code separates those in two funcs, so to run one type of benchmark at a time.

func makeRandomSet(n, valSize int) (map[string]bool, []string) {
  mapset := make(map[string]bool, n)
  sliceset := make([]string, 0, n)

  for len(mapset) != n {
    str := makeRandomString(valSize)
    if _, ok := mapset[str]; !ok {
      mapset[str] = true
      sliceset = append(sliceset, str)
    }
  }
  return mapset, sliceset
}

Then, we have two paths: asserting membership, asserting non-membership.

Membership

Take a sample from the set:

func sampleInSlice(sliceset []string) {
  randIdx := rand.Intn(len(sliceset))
  return sliceset[randIdx]
}

Sampling in the map is done the same way, but you first need to extract the map keys into a slice. For benchmarking purposes, this function actually returns a slice of unique samples, and then measure how much time it takes to assert the membership of each.

We will use this helper to benchmark maps:

func benchMapMembers(b *testing.B, size int, keySize int) {
  m := makeRandomMap(size, keySize)
  samples := sampleFromMap(m, b.N)

  var member string
  var ok bool
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    member = samples[i]
    ok = isMapMember(m, member)
  }
  _ = ok
}

And this one to benchmark slices:

func benchSliceMembers(b *testing.B, size int, keySize int) {
  s := makeRandomSlice(size, keySize)
  samples := sampleFromSlice(s, b.N)

  var member string
  var ok bool
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    member = samples[i]
    ok = isSliceMember(s, member)
  }
  _ = ok
}

Non-membership

Find entries that aren’t in the set:

func sampleNotInMap(mapset map[string]bool, valSize int) (sample string) {
  for {
    str := makeRandomString(valSize)
    if !mapset[str] {
      return str
    }
  }
}

Sampling in the slice is done the same way. Then we also create benchmark helpers that are quite similar to the membership helpers.

For maps:

func benchMapNotMembers(b *testing.B, size int, keySize int) {
  m := makeRandomMap(size, keySize)
  samples := sampleNotInMap(m, b.N, keySize)

  var member string
  var ok bool
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    member = samples[i]
    ok = isMapMember(m, member)
  }
  _ = ok
}

For slices:

func benchSliceMembers(b *testing.B, size int, keySize int) {
  s := makeRandomSlice(size, keySize)
  samples := sampleFromSlice(s, b.N)

  var member string
  var ok bool
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    member = samples[i]
    ok = isSliceMember(s, member)
  }
  _ = ok
}

Faceoff!

We run the membership helpers using a series of benchmarks, with $n$ in $(2,3,4,5,6,7,8,9,10, 100, 1000, 10000, 100000, 1000000)$ and $valSize$ in $(10, 100)$.

func BenchmarkMap_2key_10bytes(b *testing.B)        { benchMapMembers(b, 2, 10) }
func BenchmarkMap_3key_10bytes(b *testing.B)        { benchMapMembers(b, 3, 10) }
...
func BenchmarkMap_1000000key_100bytes(b *testing.B) { benchMapMembers(b, 1000000, 100) }

func BenchmarkSlice_2key_10bytes(b *testing.B)        { benchSliceMembers(b, 2, 10) }
func BenchmarkSlice_3key_10bytes(b *testing.B)        { benchSliceMembers(b, 3, 10) }
...
func BenchmarkSlice_1000000key_100bytes(b *testing.B) { benchSliceMembers(b, 1000000, 100) }

Testing for non-membership is much more expensive than testing for membership, since we need to create random strings that aren’t in the original set. This is very time consuming to setup, so I’ve limited this part of the test to a few values, and assume that the behavior of non-membership will be consistent with the membership ones. This is an acceptable assumption given that the experiment I did measure suggest that they are indeed.

Also, Go will kill benchmarks running for more than 600s (10min). The above combinations run is just short of 600s (584s).

func BenchmarkMapNot_10key_10bytes(b *testing.B)      { benchMapNotMembers(b, 10, 10) }
func BenchmarkMapNot_1000key_10bytes(b *testing.B)    { benchMapNotMembers(b, 1000, 10) }
func BenchmarkMapNot_1000000key_10bytes(b *testing.B) { benchMapNotMembers(b, 1000000, 10) }

func BenchmarkSliceNot_10key_10bytes(b *testing.B)      { benchSliceNotMembers(b, 10, 10) }
func BenchmarkSliceNot_10000key_10bytes(b *testing.B)   { benchSliceNotMembers(b, 10000, 10) }
func BenchmarkSliceNot_1000000key_10bytes(b *testing.B) { benchSliceNotMembers(b, 1000000, 10) }

Results

The code to run the benchmarks yourself is on github. The results of running this benchmark on my Mac follow.

Membership

$n$ $valSize$ Measurements (slice) Measurements (map) ns/op (slice) ns/op (map)
2 10 100000000 100000000 17.0 14.4
3 10 100000000 100000000 23.6 18.8
4 10 100000000 100000000 28.6 22.5
5 10 50000000 100000000 33.5 26.3
6 10 50000000 100000000 40.8 30.1
7 10 50000000 50000000 45.4 33.5
8 10 50000000 50000000 52.4 36.0
9 10 50000000 50000000 54.2 36.6
10 10 50000000 50000000 60.4 38.1
100 10 5000000 50000000 492 39.3
1000 10 500000 50000000 4829 39.3
10000 10 50000 50000000 48172 48.4
100000 10 5000 50000000 480773 77.0
1000000 10 500 20000000 4853658 149
2 100 100000000 100000000 16.9 12.4
3 100 100000000 100000000 23.9 15.9
4 100 100000000 100000000 29.4 18.0
5 100 50000000 100000000 36.2 20.8
6 100 50000000 100000000 42.0 22.6
7 100 50000000 100000000 45.4 24.6
8 100 50000000 100000000 50.9 26.8
9 100 50000000 50000000 56.5 67.7
10 100 50000000 50000000 61.3 68.0
100 100 5000000 50000000 518 68.8
1000 100 500000 50000000 4940 70.2
10000 100 50000 20000000 49396 81.0
100000 100 5000 10000000 608481 172
1000000 100 200 10000000 6361138 199

Non-membership

$n$ $valSize$ Measurements (slice) Measurements (map) ns/op (slice) ns/op (map)
10 10 20000000 50000000 97.7 36.5
10000 10 20000 50000000 85053 36.9
1000000 10 200 20000000 9101328 123

The results are clear, my hypothesis was wrong. For $n > 1$, membership testing is always faster using a map.

So clearly, someone was wrong on the Internet!!!!

Update

The above was written considering string as the type held by the set. It turns out that for sets of int, slices are slightly faster than maps until $n \approx 30$.

Integer membership

$n$ Measurements (slice) Measurements (map) ns/op (slice) ns/op (map)
22000000001000000009.0211.3
310000000010000000011.114.0
410000000010000000012.316.0
510000000010000000013.216.4
610000000010000000013.717.4
710000000010000000014.519.4
810000000010000000015.120.5
91000000005000000016.029.9
101000000005000000016.729.9
201000000005000000024.629.8
30500000005000000031.128.5
40500000005000000035.331.6
50500000005000000039.530.7
100500000005000000056.230.6
100050000005000000034029.8
1000050000050000000321232.6
10000050000500000003105140.4
100000050005000000033163074.7

Integer non-membership

$n$ Measurements (slice) Measurements (map) ns/op (slice) ns/op (map)
1010000000010000000018.025.4
10000500000100000000622025.5
100000050002000000071800680.7

Someone was not so clearly wrong on the Internet!!!!