详解C++之unordered_map容器

Last Updated: 2023-12-30 12:38:13 Saturday

-- TOC --

unordered_map容器,就是内部对象没有顺序的map容器,其查找性能要比map容器要快,虽然遍历顺序不稳定,但它更加常用。C++对其的实现,是基于Hash表。本文总结一些日常编程遇到的操作unordered_map容器的代码写法。

按key访问或创建value对象

对于unordered_map对象,在使用[]操作符按key访问value的时候,key可以不存在,此时对应此key的value会被创建,是一个空对象(value在访问时创建的)。使用at接口时,如果key不存在,会throw。

测试代码:

#include <iostream>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<int,int> ii {{1,2},{3,4},{5,6},{7,8}};
    cout << ii.contains(2) << endl;  // 0, not exist
    cout << ii[2] << endl;           // 0, create at access
    cout << ii.contains(2) << endl;  // 1, exist
    try{
        cout << ii.at(9) << endl;    // throw std::out_of_range
    }catch(out_of_range&){
        cerr << __FILE__ << ":" << __LINE__ << ", out of range" << endl;
    }
    return 0;
}

Unlike with sequential containers, operator[] won’t cause undefined behavior if the key doesn’t exist; instead, it will (silently) default construct a Value and insert the corresponding key/­value pair into the map, even if you only intended to perform a read.

从C++20开始,可以使用contains成员接口判断key是否存在。

这个特性支持了一种很nice的代码风格:

unordered_map<string,size_t> wneeds;
for(auto &w: words)
    wneeds[w] += 1;  // directly +1 to count

Create by Access needs zero-parameter constructor

前一段的测试代码,说明了value对象在通过create by access机制创建的时候,需要有zero-parameter constructor,否则代码会异常。

测试代码:

#include <unordered_map>
using namespace std;

struct xyz{
    int a;
    xyz(int a): a{a} {}
};

int main() {
    unordered_map<int,xyz> ixyz;
    ixyz[1];   // call xyz's zero-parameter constructor
    return 0;
}

用clang++编译,它给出的错误信息比较容易看明白:

$ clang++ test.cpp
In file included from test.cpp:1:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/unordered_map:46:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/hashtable.h:35:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/hashtable_policy.h:34:
/usr/bin/../lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/tuple:1824:2: error: no matching constructor for initialization of 'xyz'
        second(std::forward<_Args2>(std::get<_Indexes2>(__tuple2))...)
...

error: no matching constructor for initialization of 'xyz'

插入相同key的entry

使用[]赋值的效果是覆盖原value,若key不存在,则是用默认值创建一个新元素,前文已经介绍完毕。而使用insert或用emplace接口插入entry的时候,如果key值已经在unordered_map容器中存在,则无效,操作不会成功,同时也没有异常!

测试代码:

#include <iostream>
#include <unordered_map>
using namespace std;

int main(void) {
    unordered_map<int,int> ii;
    ii[1] = 1;              // init to 1
    ii.insert({1,2});       // fail silently
    cout << ii[1] << endl;  // 1
    ii.emplace(1,3);        // fail silently
    cout << ii[1] << endl;  // 1
    ii[1] = 9;
    cout << ii[1] << endl;  // update to 9
    return 0;
}

用自定义的struct作key

如果使用自定义结构体类型作为unordered_map的key,此时需要自定义一个hash函数对象和此类型的operator==。注意:这两个接口的signature都有const属性,都是accessor。

#include <iostream>
#include <unordered_map>
using namespace std;

struct ball{
    string name;
    float price;
    bool operator ==(const ball &b) const noexcept{
        return name==b.name && price==b.price;
    }
};

struct ball_hash{
    size_t operator ()(const ball &b) const noexcept{
        size_t h1 {hash<string>{}(b.name)};
        size_t h2 {hash<float>{}(b.price)};
        return h1^h2;
    }
};

int main(){
    unordered_map<ball,string,ball_hash> udict;
    udict.insert({{"aa",1.11}, "bob"});
    udict.insert({{"bb",2.22}, "tom"});
    udict.insert({{"aa",3.33}, "cat"});
    udict[{"cc",4.5}] = "fish";  // it's ok
    for(auto &[k,v]: udict)
        cout << k.name << " " << k.price << " " << v << endl;
    return 0;
}

用指针做key是个非常不错的选择!这样就避免了所有这些麻烦事儿!

遍历unordered_map对象

通过iterator或者range-based loop,都可以实现遍历unordered_map对象,后者更优雅。

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;


int main(void) {
    unordered_map<int,string> umis;
    umis[0] = "0000";
    umis[1] = "1111";
    umis[2] = "2222";
    umis[3] = "3333";

    // traverse by iterator
    for(auto it{umis.begin()}; it!=umis.end(); ++it)
        cout << it->first << ":" << it->second << endl;

    // traverse in range-based loop
    for(const auto &[k,v]: umis)
        cout << k << ":" << v << endl;

    return 0;
}

在遍历过程中删除entry

erase成员接口返回下一个iterator,注意:下一个iterator有可能等于end。示例:

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main(void) {
    unordered_map<int,string> umis;
    umis[0] = "0000";
    umis[1] = "1111";
    umis[2] = "2222";
    umis[3] = "3333";

    for(auto it{umis.begin()}; it!=umis.end();){
        if(it->first == 1){
            it = umis.erase(it);  // delete it, move it to next valid entry
            continue;
        }
        cout << it->first << ":" << it->second << endl;
        ++it;
    }

    for(const auto &[k,v]: umis)
        cout << k << ":" << v << endl;

    return 0;
}

理解Bucket

Bucket表示unordered_map中的一个位置,如果hash发生冲突,一个bucket内会存放多个元素(用链表或红黑树实现)。

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main(void) {
    unordered_map<string,int> ump;
    for(int i=0; i<24; ++i)
        ump[to_string(i)] = i;
    cout << ump.bucket_count() << endl;
    for(int i=0; i<ump.bucket_count(); ++i)
        cout << i << ":" << ump.bucket_size(i) << endl;
    cout << ump.max_bucket_count() << endl;
    return 0;
}

输出:

29
0:0
1:0
2:0
3:1
4:0
5:2
6:0
7:2
8:1
9:0
10:2
11:1
12:1
13:0
14:0
15:0
16:1
17:1
18:1
19:0
20:1
21:0
22:2
23:4
24:2
25:0
26:2
27:0
28:0
164703072086692425

24个元素对应29个bucket,23号bucket内,挤了4个元素!好些个bucket都是空的,这就是hash的效果,冲突了。

负载因子(Load Factor)

这是unordered_map的负载因子。影响Hash表性能的重要因素是conflict,越多的conflict,性能就越差。load factor = size/bucket_count,这个比值越小,发生conflict的概率较小,性能就越好,但内存占用会更多。

默认的load factor = 1.0,可以设置,类型是float。从提高性能的角度看,调小这个值,性能越好,内存占用越多。

下面是测试代码:

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<int,int> ii;
    ii.max_load_factor(0.618);   // set
    cout << 0 << ", load factor:" << ii.load_factor()\  // read
              << ", size:" << ii.size()\
              << ", bucket:" << ii.bucket_count() << endl;
    for(int i{1}; i<64; ++i){
        ii[i] = i;
        cout << i << ", load factor:" << ii.load_factor()\
                  << ", size:" << ii.size()\
                  << ", bucket:" << ii.bucket_count() << endl;
    }
    return 0;
}

输出:

0, load factor:0, size:0, bucket:1
1, load factor:0.0526316, size:1, bucket:19
2, load factor:0.105263, size:2, bucket:19
3, load factor:0.157895, size:3, bucket:19
4, load factor:0.210526, size:4, bucket:19
5, load factor:0.263158, size:5, bucket:19
6, load factor:0.315789, size:6, bucket:19
7, load factor:0.368421, size:7, bucket:19
8, load factor:0.421053, size:8, bucket:19
9, load factor:0.473684, size:9, bucket:19
10, load factor:0.526316, size:10, bucket:19
11, load factor:0.578947, size:11, bucket:19
12, load factor:0.292683, size:12, bucket:41
13, load factor:0.317073, size:13, bucket:41
14, load factor:0.341463, size:14, bucket:41
15, load factor:0.365854, size:15, bucket:41
16, load factor:0.390244, size:16, bucket:41
17, load factor:0.414634, size:17, bucket:41
18, load factor:0.439024, size:18, bucket:41
19, load factor:0.463415, size:19, bucket:41
20, load factor:0.487805, size:20, bucket:41
21, load factor:0.512195, size:21, bucket:41
22, load factor:0.536585, size:22, bucket:41
23, load factor:0.560976, size:23, bucket:41
24, load factor:0.585366, size:24, bucket:41
25, load factor:0.609756, size:25, bucket:41
26, load factor:0.313253, size:26, bucket:83
27, load factor:0.325301, size:27, bucket:83
28, load factor:0.337349, size:28, bucket:83
29, load factor:0.349398, size:29, bucket:83
30, load factor:0.361446, size:30, bucket:83
31, load factor:0.373494, size:31, bucket:83
32, load factor:0.385542, size:32, bucket:83
33, load factor:0.39759, size:33, bucket:83
34, load factor:0.409639, size:34, bucket:83
35, load factor:0.421687, size:35, bucket:83
36, load factor:0.433735, size:36, bucket:83
37, load factor:0.445783, size:37, bucket:83
38, load factor:0.457831, size:38, bucket:83
39, load factor:0.46988, size:39, bucket:83
40, load factor:0.481928, size:40, bucket:83
41, load factor:0.493976, size:41, bucket:83
42, load factor:0.506024, size:42, bucket:83
43, load factor:0.518072, size:43, bucket:83
44, load factor:0.53012, size:44, bucket:83
45, load factor:0.542169, size:45, bucket:83
46, load factor:0.554217, size:46, bucket:83
47, load factor:0.566265, size:47, bucket:83
48, load factor:0.578313, size:48, bucket:83
49, load factor:0.590361, size:49, bucket:83
50, load factor:0.60241, size:50, bucket:83
51, load factor:0.614458, size:51, bucket:83
52, load factor:0.311377, size:52, bucket:167
53, load factor:0.317365, size:53, bucket:167
54, load factor:0.323353, size:54, bucket:167
55, load factor:0.329341, size:55, bucket:167
56, load factor:0.335329, size:56, bucket:167
57, load factor:0.341317, size:57, bucket:167
58, load factor:0.347305, size:58, bucket:167
59, load factor:0.353293, size:59, bucket:167
60, load factor:0.359281, size:60, bucket:167
61, load factor:0.365269, size:61, bucket:167
62, load factor:0.371257, size:62, bucket:167
63, load factor:0.377246, size:63, bucket:167

每当load factor突破0.618时,bucket都会自动增加,降低load factor。

reserve和rehash

对于unordered_map对象来说,每当load factor突破max_load_factor时,bucket都会自动增加到下一个档位。此时,会有一个看不见的rehash的过程,即将所有entry重新hash到新申请的更大的bucket空间中去。reserve和rehash成员,就是手动实现这一过程的接口。

reserve(count),Sets the number of buckets to the number needed to accommodate at least count elements without exceeding maximum load factor and rehashes the container. Effectively calls rehash(std::ceil(count / max_load_factor())).

rehash(n), changes the number of buckets to a value n that is not less than count and satisfies n >= size() / max_load_factor(), then rehashes the container.

rehash(0)有shrink_to_fit的效果。

本文链接:https://cs.pynote.net/sf/c/cpp/202210311/

-- EOF --

-- MORE --