
Running the BEiT v1 semantic segmentation task

GitHub repository

Configure the virtual environment

  1. First, make sure the environment runs CUDA 11.0; the CUDA version matters because the later install of mmcv-full 1.3.0 (which ships the full feature set and a rich collection of out-of-the-box CPU and CUDA ops) is built against it. The GPU driver version must support this CUDA version (I used a Tencent Cloud V100 GPU server; the image below shows the GPU driver and CUDA versions).

    [image: GPU driver and CUDA version]

  2. Create a Python 3.8 virtual environment with conda create, again in preparation for installing mmcv-full.

  3. pip install torch==1.7.1+cu110: this is the optimal route, but I didn't realize it at first; details follow below. So my actual second step was to enter the unilm/beit directory and run pip install -r requirements.txt.

  4. Install mmcv-full, mmseg, timm, and apex following the tutorial in unilm/beit/semantic_segmentation/README.md.

    [image: installation commands from the README]

  5. The installation broke at apex, though: pip had already been upgraded to the latest 23.2.1 when the virtual environment was created, --global-option has since been deprecated, and the build also complained that the packaging module could not be found. NVIDIA's official apex repository provides updated commands:

    ```bash
    git clone https://github.com/NVIDIA/apex
    cd apex
    # if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
    # otherwise
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    # only python
    pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
    ```
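
    After installation, one way to sanity-check what actually got built (a minimal sketch; `amp_C` and `fused_layer_norm_cuda` are, to my understanding, the extension modules a `--cpp_ext`/`--cuda_ext` build produces):

    ```python
    # Check the apex build: `apex.amp` exists even in a Python-only install,
    # while the compiled extensions only import if --cpp_ext/--cuda_ext succeeded.
    import apex
    from apex import amp  # pure-Python Amp frontend

    for ext in ("amp_C", "fused_layer_norm_cuda"):
        try:
            __import__(ext)
            print(f"{ext}: OK")
        except ImportError as err:
            print(f"{ext}: missing ({err})")
    ```
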
  6. With the first command, however, import apex still failed after installation: AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'. This is because the torch version is too old, but torch==1.7.1 is required by the experiment and cannot be changed. Per apex issue 1532 on GitHub, the fix is to use the older apex from the branches -> apex-22.04-dev.

  7. Reinstalling apex then behaved oddly: this apex version does not support --config-settings, yet the corresponding --global-option command installs the full build (cuda + cpp + python) without errors and without the packaging module not found complaint.

    The packaging module not found error is discussed in issue 1679, and pull request 1680 resolves it by modifying pyproject.toml:

    [image: pyproject.toml change from pull request 1680]

    With my earlier install order, making this change and then installing apex raises torch not found. I haven't investigated thoroughly, but I suspect it relates to torch 1.7.1, which is exactly the next point.

  8. One more installation error remained, this time about torch: torch.version.cuda (10.2) differed from the CUDA version actually in use (11.0), so a torch build matching the CUDA version had to be installed. My approach was to download the corresponding .whl from the official PyTorch site and pip install it; after that, apex installed fine.
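
    A quick way to spot this mismatch up front (a minimal check; apex's setup compares torch.version.cuda against the nvcc toolkit it compiles with):

    ```python
    import torch

    # The CUDA version torch was *built* against; it must match the toolkit
    # used to compile apex's extensions (11.0 in this setup).
    print(torch.version.cuda)         # e.g. '10.2' before the fix, '11.0' after
    print(torch.cuda.is_available())  # True once driver and runtime line up
    ```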

  9. What follows is fairly routine:

    1. Add beit/semantic_segmentation/mmcv_custom to the environment path, or simply copy it into the virtual environment's site-packages folder (see the sketch after this list). This is needed because mmcv_custom builds on apex to implement the training components ['IterBasedRunnerAmp', 'LayerDecayOptimizerConstructor', 'SETR_Resize', 'DistOptimizerHook', 'train_segmentor'].
    2. Also change np.float to float in mmseg/core/evaluation/metrics.py, since np.float is no longer supported (deprecated in NumPy 1.20 and removed in 1.24).
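
    For the first point, a minimal sketch of making mmcv_custom importable without copying files (the path is a placeholder for wherever your checkout lives):

    ```python
    import sys

    # Hypothetical local path to the BEiT segmentation code; adjust to your checkout.
    sys.path.insert(0, "/path/to/unilm/beit/semantic_segmentation")

    # Importing the package triggers its registration side effects, so components
    # like IterBasedRunnerAmp and LayerDecayOptimizerConstructor land in mmcv's registries.
    import mmcv_custom  # noqa: F401
    ```
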
  10. Parameters: batch size (samples_per_gpu) = 4, workers = 8; GPU memory usage reaches 98%.

  11. Problems still open (updated 8/20 9:50, Beijing time):

    1. Running directly fails with Invoked 'with amp.scale_loss', but internal Amp state has not been initialized. According to the official apex documentation, Amp can be enabled elegantly with just three lines:

      [image: the three-line Amp example from the apex documentation]
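
      Roughly, the pattern shown in the apex docs is (model, optimizer, and loss come from your own training loop):

      ```python
      from apex import amp

      # 1. Wrap the model and optimizer once, before the training loop starts.
      model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

      # 2. Replace loss.backward() with the loss-scaled version.
      with amp.scale_loss(loss, optimizer) as scaled_loss:
          scaled_loss.backward()
      ```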

      We found the initialization in mmcv_custom/train_api.py. One thing puzzles me here: why does the Amp initialization to mixed precision sit under cfg.optimizer_config.get("use_fp16", False)? Per the official docs, opt_level="O1" is itself mixed precision (fp16 for multiplications, fp32 for additions; mixed precision cuts memory use and training time substantially without noticeably hurting accuracy), yet it only takes effect here because use_fp16 == True. I haven't figured this point out yet.

      Moreover, judging from the 10000 iterations so far, training oscillates noticeably and runs slowly (the 160k-iteration schedule is projected to take 4 days), and the batch size can only go up to 4 (single 32 GB GPU). It does not look like Amp is actually accelerating training, which leads to the second problem.

      ```python
      # use apex fp16 optimizer
      if cfg.optimizer_config.get("type", None) and cfg.optimizer_config["type"] == "DistOptimizerHook":
          if cfg.optimizer_config.get("use_fp16", False):
              model, optimizer = apex.amp.initialize(
                  model.cuda(), optimizer, opt_level="O1")
              for m in model.modules():
                  if hasattr(m, "fp16_enabled"):
                      m.fp16_enabled = True
      ```
    2. Can this experiment really be reproduced with single-GPU training? The apex toolkit is entirely about distributed training, whose core idea is to run each compute device in its own process to increase parallelism. I am currently running the distributed path on a single GPU, because the code, especially mmcv_custom, is hard to decouple from distributed training. Inspecting the processes with ps aux:

      [images: ps aux process listings]

      Many of the listed tasks are multiprocessing-related, but only the process with PID 32444 actually serves training, so distributed training can hardly take real effect here, since there is only one compute device. Even if the first problem is fixed and mixed precision works, whether performance actually improves remains unknown.
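
      One way to confirm that this "distributed" run really is a single worker (a minimal check with torch.distributed, assuming the launcher has initialized the process group):

      ```python
      import torch.distributed as dist

      # Under a one-GPU launch, the process group exists but holds a single rank,
      # so DDP adds coordination overhead without any real parallelism.
      if dist.is_available() and dist.is_initialized():
          print("world size:", dist.get_world_size())  # 1 on a single-GPU run
          print("rank:", dist.get_rank())              # 0
      ```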

  12. Fixing the problem that fp16 is not used (updated 8/20 14:28):

    1. Running the dist_train.sh script kept raising:

      ```
      RuntimeError: Invoked 'with amp.scale_loss', but internal Amp state has not been initialized.  model, optimizer = amp.initialize(model, optimizer, opt_level=...) must be called before with `amp.scale_loss`.
      ```

      The error says that model and optimizer, the internal Amp state, must be initialized before with amp.scale_loss. The with amp.scale_loss call sits in train_segmentor in mmseg/apis/train.py under the site-packages folder, so that is what has to be modified. In mmcv_custom (which provides a set of APIs suited to this experiment) we found a fitting train_segmentor:

      ```python
      '''
      1. Move `optimizer = build_optimizer(model, cfg.optimizer)` before `if distributed:`
      2. Initialize the Amp state between the optimizer creation and `if distributed:`
      '''
      # use apex fp16 optimizer
      if cfg.optimizer_config.get("type", None) and cfg.optimizer_config["type"] == "DistOptimizerHook":
          if cfg.optimizer_config.get("use_fp16", False):  # why this gate defaults to False is still unclear
              model, optimizer = apex.amp.initialize(
                  model.cuda(), optimizer, opt_level="O1")
              for m in model.modules():
                  if hasattr(m, "fp16_enabled"):
                      m.fp16_enabled = True
      ```

      In mmseg/apis/train.py I wrote it as follows (as long as training is distributed, I use Amp):

      ```python
      optimizer = build_optimizer(model, cfg.optimizer)
      # put model on gpus
      if distributed:
          find_unused_parameters = cfg.get('find_unused_parameters', False)
          # Sets the `find_unused_parameters` parameter in
          # torch.nn.parallel.DistributedDataParallel
          model, optimizer = apex.amp.initialize(model.cuda(), optimizer, opt_level="O1")
          model = MMDistributedDataParallel(
              model.cuda(),
              device_ids=[torch.cuda.current_device()],
              broadcast_buffers=False,
              find_unused_parameters=find_unused_parameters)
      ```

      Result: training time dropped to half, but some losses hit gradient overflow, and memory consumption remains heavy; there is still a lot to improve.

      Gradient overflow right at the start of training -> my guess: the loss itself is too large at the beginning.

      Common causes of gradient overflow mentioned online:

      1. mask_fill masking with -1e9
      2. overflow in softmax
      3. overflow in sum

      Why I don't believe mine is one of these common cases: this model implementation never really runs in fp16 but stays in fp32 throughout, and only the optimizer touches fp16. That is also why memory usage sticks at 98% and the batch size cannot be raised.
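
      For context on the first cause: fp16's largest finite magnitude is about 65504, so a constant like -1e9 overflows to -inf the moment it is cast. A minimal illustration:

      ```python
      import torch

      # fp16 represents magnitudes only up to ~65504, so the common
      # attention-mask constant -1e9 saturates to -inf under half precision.
      mask_value = torch.tensor([-1e9], dtype=torch.float16)
      print(mask_value)                      # tensor([-inf], dtype=torch.float16)
      print(torch.finfo(torch.float16).max)  # 65504.0
      ```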

  13. Fixing the broken validation:

    IterBasedRunnerAmp is the runner in use; following mmcv_custom's convention it is registered into mmcv's registry class. But around line 110 of mmseg's apis/train.py there is eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner', and IterBasedRunnerAmp is not recognized as an IterBasedRunner there even though it effectively is one. So I simply changed the line to eval_cfg['by_epoch'] = False.
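
    A slightly more general tweak would be a prefix check, so any IterBasedRunner variant counts as iteration-based (a sketch, not the change I actually ran with):

    ```python
    # Original check in mmseg/apis/train.py: an exact string match misses variants.
    # eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'

    # Looser check that also covers IterBasedRunnerAmp:
    eval_cfg['by_epoch'] = not cfg.runner['type'].startswith('IterBasedRunner')
    ```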

    Out-of-memory after validation -> new issue -> for now validation is set to run once every 160000 iterations; the plan is to spend 2 days finishing training first and only look at the final validation result.
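
    In config terms this amounts to something like the following (a sketch; interval/metric follow the usual mmseg evaluation config keys):

    ```python
    # Validate only once, at the very end of the 160k-iteration schedule,
    # so validation memory pressure cannot interrupt training.
    evaluation = dict(interval=160000, metric='mIoU')
    ```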

Results

[image: final evaluation summary]

The gap is mainly due to batch size: mine is 4 samples/GPU × 1 GPU = 4, while the GitHub setup uses 2 samples/GPU × 8 GPUs = 16.

```
+---------------------+-------+-------+
| Class | IoU | Acc |
+---------------------+-------+-------+
| wall | 79.13 | 88.61 |
| building | 83.17 | 93.27 |
| sky | 94.67 | 97.6 |
| floor | 82.83 | 89.8 |
| tree | 75.83 | 87.13 |
| ceiling | 85.1 | 92.23 |
| road | 85.48 | 91.92 |
| bed | 90.11 | 96.59 |
| windowpane | 63.83 | 77.87 |
| grass | 64.76 | 78.89 |
| cabinet | 62.13 | 73.76 |
| sidewalk | 69.88 | 82.98 |
| person | 82.44 | 92.74 |
| earth | 35.82 | 49.7 |
| door | 51.1 | 65.85 |
| table | 60.89 | 74.4 |
| mountain | 59.21 | 72.16 |
| plant | 53.45 | 64.69 |
| curtain | 74.71 | 87.09 |
| chair | 61.9 | 74.23 |
| car | 85.93 | 93.92 |
| water | 60.88 | 75.76 |
| painting | 74.08 | 88.03 |
| sofa | 70.52 | 85.48 |
| shelf | 46.76 | 70.3 |
| house | 48.95 | 65.93 |
| sea | 66.64 | 83.46 |
| mirror | 70.01 | 79.44 |
| rug | 68.24 | 78.04 |
| field | 27.0 | 45.61 |
| armchair | 49.67 | 68.81 |
| seat | 61.57 | 85.05 |
| fence | 50.39 | 66.54 |
| desk | 52.51 | 73.18 |
| rock | 47.69 | 77.52 |
| wardrobe | 52.41 | 74.16 |
| lamp | 64.4 | 76.15 |
| bathtub | 79.66 | 84.88 |
| railing | 38.76 | 55.33 |
| cushion | 58.95 | 72.0 |
| base | 35.35 | 47.07 |
| box | 30.96 | 39.84 |
| column | 48.65 | 59.64 |
| signboard | 39.84 | 52.53 |
| chest of drawers | 42.18 | 58.8 |
| counter | 25.42 | 31.12 |
| sand | 50.92 | 78.12 |
| sink | 71.8 | 81.29 |
| skyscraper | 54.8 | 75.97 |
| fireplace | 71.98 | 89.09 |
| refrigerator | 74.0 | 86.95 |
| grandstand | 51.51 | 82.02 |
| path | 27.65 | 38.79 |
| stairs | 24.23 | 32.11 |
| runway | 69.69 | 93.39 |
| case | 54.84 | 72.89 |
| pool table | 93.59 | 97.55 |
| pillow | 60.04 | 71.62 |
| screen door | 78.78 | 84.89 |
| stairway | 32.42 | 44.96 |
| river | 12.36 | 26.78 |
| bridge | 47.98 | 56.49 |
| bookcase | 42.2 | 61.27 |
| blind | 44.14 | 52.09 |
| coffee table | 56.83 | 81.54 |
| toilet | 77.65 | 90.81 |
| flower | 40.64 | 56.71 |
| book | 50.39 | 68.83 |
| hill | 8.76 | 12.16 |
| bench | 44.48 | 52.77 |
| countertop | 53.86 | 71.89 |
| stove | 73.33 | 85.42 |
| palm | 54.7 | 81.22 |
| kitchen island | 36.69 | 77.11 |
| computer | 76.68 | 93.44 |
| swivel chair | 50.45 | 70.19 |
| boat | 55.72 | 70.42 |
| bar | 46.97 | 63.47 |
| arcade machine | 77.32 | 83.73 |
| hovel | 44.66 | 47.13 |
| bus | 88.91 | 96.9 |
| towel | 66.36 | 81.26 |
| light | 54.84 | 61.62 |
| truck | 38.17 | 51.35 |
| tower | 27.19 | 45.83 |
| chandelier | 67.73 | 84.32 |
| awning | 34.73 | 40.09 |
| streetlight | 29.22 | 38.42 |
| booth | 42.93 | 48.72 |
| television receiver | 71.22 | 81.2 |
| airplane | 62.18 | 69.12 |
| dirt track | 12.35 | 26.75 |
| apparel | 36.96 | 50.77 |
| pole | 22.87 | 30.81 |
| land | 4.12 | 7.3 |
| bannister | 15.48 | 20.49 |
| escalator | 58.48 | 80.13 |
| ottoman | 53.12 | 72.89 |
| bottle | 38.93 | 60.38 |
| buffet | 54.7 | 64.62 |
| poster | 41.08 | 49.67 |
| stage | 15.45 | 27.8 |
| van | 44.31 | 57.28 |
| ship | 52.51 | 77.15 |
| fountain | 28.03 | 29.07 |
| conveyer belt | 64.03 | 93.28 |
| canopy | 40.2 | 55.69 |
| washer | 82.07 | 86.46 |
| plaything | 27.36 | 41.42 |
| swimming pool | 79.88 | 92.08 |
| stool | 46.4 | 56.69 |
| barrel | 41.77 | 64.89 |
| basket | 35.7 | 47.11 |
| waterfall | 70.1 | 82.53 |
| tent | 90.61 | 99.4 |
| bag | 16.3 | 18.63 |
| minibike | 69.95 | 84.3 |
| cradle | 79.32 | 97.42 |
| oven | 53.71 | 66.81 |
| ball | 51.44 | 63.86 |
| food | 56.48 | 71.72 |
| step | 13.23 | 14.56 |
| tank | 55.92 | 67.91 |
| trade name | 23.79 | 26.3 |
| microwave | 84.24 | 93.93 |
| pot | 40.29 | 45.21 |
| animal | 61.39 | 64.23 |
| bicycle | 55.97 | 73.42 |
| lake | 53.18 | 63.79 |
| dishwasher | 60.84 | 72.38 |
| screen | 61.71 | 87.5 |
| blanket | 16.01 | 19.36 |
| sculpture | 64.79 | 82.8 |
| hood | 63.25 | 70.51 |
| sconce | 52.4 | 63.6 |
| vase | 42.19 | 56.82 |
| traffic light | 30.93 | 54.64 |
| tray | 4.9 | 7.15 |
| ashcan | 41.46 | 55.92 |
| fan | 61.34 | 76.1 |
| pier | 35.28 | 43.99 |
| crt screen | 17.12 | 23.36 |
| plate | 56.08 | 73.78 |
| monitor | 52.86 | 64.91 |
| bulletin board | 53.05 | 63.52 |
| shower | 0.0 | 0.0 |
| radiator | 59.34 | 65.24 |
| glass | 16.52 | 17.8 |
| clock | 41.97 | 48.47 |
| flag | 60.88 | 69.39 |
+---------------------+-------+-------+
```