
Running the BEiT v1 semantic segmentation task

GitHub repository

Configure the virtual environment

  1. First, make sure the environment runs CUDA 11.0; the CUDA version matters because the later install of mmcv-full 1.3.0 (which ships the full feature set and a rich collection of out-of-the-box CPU and CUDA ops) is built against it. The GPU driver version must support this CUDA version (I used a Tencent Cloud V100 GPU server; the image below shows the GPU driver and CUDA versions).

    [image: GPU driver and CUDA version]

  2. Create a Python 3.8 virtual environment with conda create, again in preparation for installing mmcv-full.

  3. pip install torch==1.7.1+cu110: this is the optimal route, but I didn't realize it at first; details follow below. So my actual second step was to enter the unilm/beit directory and run pip install -r requirements.txt.

  4. Install mmcv-full, mmseg, timm, and apex following the tutorial in unilm/beit/semantic_segmentation/README.md.

    [image: installation commands from the README]

  5. The installation broke at apex, though: pip had already been upgraded to the latest 23.2.1 when the virtual environment was created, --global-option has since been deprecated, and the build also complained that the packaging module could not be found. NVIDIA's official apex repository provides updated commands:

    ```bash
    git clone https://github.com/NVIDIA/apex
    cd apex
    # if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
    # otherwise
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    # only python
    pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
    ```
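
    After installation, one way to sanity-check what actually got built (a minimal sketch; `amp_C` and `fused_layer_norm_cuda` are, to my understanding, the extension modules a `--cpp_ext`/`--cuda_ext` build produces):

    ```python
    # Check the apex build: `apex.amp` exists even in a Python-only install,
    # while the compiled extensions only import if --cpp_ext/--cuda_ext succeeded.
    import apex
    from apex import amp  # pure-Python Amp frontend

    for ext in ("amp_C", "fused_layer_norm_cuda"):
        try:
            __import__(ext)
            print(f"{ext}: OK")
        except ImportError as err:
            print(f"{ext}: missing ({err})")
    ```
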
  6. With the first command, however, import apex still failed after installation: AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'. This is because the torch version is too old, but torch==1.7.1 is required by the experiment and cannot be changed. Per apex issue 1532 on GitHub, the fix is to use the older apex from the branches -> apex-22.04-dev.

  7. Reinstalling apex then behaved oddly: this apex version does not support --config-settings, yet the corresponding --global-option command installs the full build (cuda + cpp + python) without errors and without the packaging module not found complaint.

    The packaging module not found error is discussed in issue 1679, and pull request 1680 resolves it by modifying pyproject.toml:

    [image: pyproject.toml change from pull request 1680]

    With my earlier install order, making this change and then installing apex raises torch not found. I haven't investigated thoroughly, but I suspect it relates to torch 1.7.1, which is exactly the next point.

  8. One more installation error remained, this time about torch: torch.version.cuda (10.2) differed from the CUDA version actually in use (11.0), so a torch build matching the CUDA version had to be installed. My approach was to download the corresponding .whl from the official PyTorch site and pip install it; after that, apex installed fine.
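
    A quick way to spot this mismatch up front (a minimal check; apex's setup compares torch.version.cuda against the nvcc toolkit it compiles with):

    ```python
    import torch

    # The CUDA version torch was *built* against; it must match the toolkit
    # used to compile apex's extensions (11.0 in this setup).
    print(torch.version.cuda)         # e.g. '10.2' before the fix, '11.0' after
    print(torch.cuda.is_available())  # True once driver and runtime line up
    ```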

  9. What follows is fairly routine:

    1. Add beit/semantic_segmentation/mmcv_custom to the environment path, or simply copy it into the virtual environment's site-packages folder (see the sketch after this list). This is needed because mmcv_custom builds on apex to implement the training components ['IterBasedRunnerAmp', 'LayerDecayOptimizerConstructor', 'SETR_Resize', 'DistOptimizerHook', 'train_segmentor'].
    2. Also change np.float to float in mmseg/core/evaluation/metrics.py, since np.float is no longer supported (deprecated in NumPy 1.20 and removed in 1.24).
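
    For the first point, a minimal sketch of making mmcv_custom importable without copying files (the path is a placeholder for wherever your checkout lives):

    ```python
    import sys

    # Hypothetical local path to the BEiT segmentation code; adjust to your checkout.
    sys.path.insert(0, "/path/to/unilm/beit/semantic_segmentation")

    # Importing the package triggers its registration side effects, so components
    # like IterBasedRunnerAmp and LayerDecayOptimizerConstructor land in mmcv's registries.
    import mmcv_custom  # noqa: F401
    ```
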
  10. Parameters: batch size (samples_per_gpu) = 4, workers = 8; GPU memory usage reaches 98%.

  11. Problems still open (updated 8/20 9:50, Beijing time):

    1. Running directly fails with Invoked 'with amp.scale_loss', but internal Amp state has not been initialized. According to the official apex documentation, Amp can be enabled elegantly with just three lines:

      [image: the three-line Amp example from the apex documentation]
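
      Roughly, the pattern shown in the apex docs is (model, optimizer, and loss come from your own training loop):

      ```python
      from apex import amp

      # 1. Wrap the model and optimizer once, before the training loop starts.
      model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

      # 2. Replace loss.backward() with the loss-scaled version.
      with amp.scale_loss(loss, optimizer) as scaled_loss:
          scaled_loss.backward()
      ```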

      We found the initialization in mmcv_custom/train_api.py. One thing puzzles me here: why does the Amp initialization to mixed precision sit under cfg.optimizer_config.get("use_fp16", False)? Per the official docs, opt_level="O1" is itself mixed precision (fp16 for multiplications, fp32 for additions; mixed precision cuts memory use and training time substantially without noticeably hurting accuracy), yet it only takes effect here because use_fp16 == True. I haven't figured this point out yet.

      Moreover, judging from the 10000 iterations so far, training oscillates noticeably and runs slowly (the 160k-iteration schedule is projected to take 4 days), and the batch size can only go up to 4 (single 32 GB GPU). It does not look like Amp is actually accelerating training, which leads to the second problem.

      ```python
      # use apex fp16 optimizer
      if cfg.optimizer_config.get("type", None) and cfg.optimizer_config["type"] == "DistOptimizerHook":
          if cfg.optimizer_config.get("use_fp16", False):
              model, optimizer = apex.amp.initialize(
                  model.cuda(), optimizer, opt_level="O1")
              for m in model.modules():
                  if hasattr(m, "fp16_enabled"):
                      m.fp16_enabled = True
      ```
    2. Can this experiment really be reproduced with single-GPU training? The apex toolkit is entirely about distributed training, whose core idea is to run each compute device in its own process to increase parallelism. I am currently running the distributed path on a single GPU, because the code, especially mmcv_custom, is hard to decouple from distributed training. Inspecting the processes with ps aux:

      [images: ps aux process listings]

      Many of the listed tasks are multiprocessing-related, but only the process with PID 32444 actually serves training, so distributed training can hardly take real effect here, since there is only one compute device. Even if the first problem is fixed and mixed precision works, whether performance actually improves remains unknown.
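
      One way to confirm that this "distributed" run really is a single worker (a minimal check with torch.distributed, assuming the launcher has initialized the process group):

      ```python
      import torch.distributed as dist

      # Under a one-GPU launch, the process group exists but holds a single rank,
      # so DDP adds coordination overhead without any real parallelism.
      if dist.is_available() and dist.is_initialized():
          print("world size:", dist.get_world_size())  # 1 on a single-GPU run
          print("rank:", dist.get_rank())              # 0
      ```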

  12. Fixing the problem that fp16 is not used (updated 8/20 14:28):

    1. Running the dist_train.sh script kept raising:

      ```
      RuntimeError: Invoked 'with amp.scale_loss', but internal Amp state has not been initialized.  model, optimizer = amp.initialize(model, optimizer, opt_level=...) must be called before with `amp.scale_loss`.
      ```

      The error says that model and optimizer, the internal Amp state, must be initialized before with amp.scale_loss. The with amp.scale_loss call sits in train_segmentor in mmseg/apis/train.py under the site-packages folder, so that is what has to be modified. In mmcv_custom (which provides a set of APIs suited to this experiment) we found a fitting train_segmentor:

      ```python
      '''
      1. Move `optimizer = build_optimizer(model, cfg.optimizer)` before `if distributed:`
      2. Initialize the Amp state between the optimizer creation and `if distributed:`
      '''
      # use apex fp16 optimizer
      if cfg.optimizer_config.get("type", None) and cfg.optimizer_config["type"] == "DistOptimizerHook":
          if cfg.optimizer_config.get("use_fp16", False):  # why this gate defaults to False is still unclear
              model, optimizer = apex.amp.initialize(
                  model.cuda(), optimizer, opt_level="O1")
              for m in model.modules():
                  if hasattr(m, "fp16_enabled"):
                      m.fp16_enabled = True
      ```

      In mmseg/apis/train.py I wrote it as follows (as long as training is distributed, I use Amp):

      ```python
      optimizer = build_optimizer(model, cfg.optimizer)
      # put model on gpus
      if distributed:
          find_unused_parameters = cfg.get('find_unused_parameters', False)
          # Sets the `find_unused_parameters` parameter in
          # torch.nn.parallel.DistributedDataParallel
          model, optimizer = apex.amp.initialize(model.cuda(), optimizer, opt_level="O1")
          model = MMDistributedDataParallel(
              model.cuda(),
              device_ids=[torch.cuda.current_device()],
              broadcast_buffers=False,
              find_unused_parameters=find_unused_parameters)
      ```

      Result: training time dropped to half, but some losses hit gradient overflow, and memory consumption remains heavy; there is still a lot to improve.

      Gradient overflow right at the start of training -> my guess: the loss itself is too large at the beginning.

      Common causes of gradient overflow mentioned online:

      1. mask_fill masking with -1e9
      2. overflow in softmax
      3. overflow in sum

      Why I don't believe mine is one of these common cases: this model implementation never really runs in fp16 but stays in fp32 throughout, and only the optimizer touches fp16. That is also why memory usage sticks at 98% and the batch size cannot be raised.
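
      For context on the first cause: fp16's largest finite magnitude is about 65504, so a constant like -1e9 overflows to -inf the moment it is cast. A minimal illustration:

      ```python
      import torch

      # fp16 represents magnitudes only up to ~65504, so the common
      # attention-mask constant -1e9 saturates to -inf under half precision.
      mask_value = torch.tensor([-1e9], dtype=torch.float16)
      print(mask_value)                      # tensor([-inf], dtype=torch.float16)
      print(torch.finfo(torch.float16).max)  # 65504.0
      ```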

  13. Fixing the broken validation:

    IterBasedRunnerAmp is the runner in use; following mmcv_custom's convention it is registered into mmcv's registry class. But around line 110 of mmseg's apis/train.py there is eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner', and IterBasedRunnerAmp is not recognized as an IterBasedRunner there even though it effectively is one. So I simply changed the line to eval_cfg['by_epoch'] = False.
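
    A slightly more general tweak would be a prefix check, so any IterBasedRunner variant counts as iteration-based (a sketch, not the change I actually ran with):

    ```python
    # Original check in mmseg/apis/train.py: an exact string match misses variants.
    # eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'

    # Looser check that also covers IterBasedRunnerAmp:
    eval_cfg['by_epoch'] = not cfg.runner['type'].startswith('IterBasedRunner')
    ```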

    Out-of-memory after validation -> new issue -> for now validation is set to run once every 160000 iterations; the plan is to spend 2 days finishing training first and only look at the final validation result.
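
    In config terms this amounts to something like the following (a sketch; interval/metric follow the usual mmseg evaluation config keys):

    ```python
    # Validate only once, at the very end of the 160k-iteration schedule,
    # so validation memory pressure cannot interrupt training.
    evaluation = dict(interval=160000, metric='mIoU')
    ```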

Results

[image: final evaluation summary]

The gap is mainly due to batch size: mine is 4 samples/GPU × 1 GPU = 4, while the GitHub setup uses 2 samples/GPU × 8 GPUs = 16.

```
+---------------------+-------+-------+
| Class | IoU | Acc |
+---------------------+-------+-------+
| wall | 79.13 | 88.61 |
| building | 83.17 | 93.27 |
| sky | 94.67 | 97.6 |
| floor | 82.83 | 89.8 |
| tree | 75.83 | 87.13 |
| ceiling | 85.1 | 92.23 |
| road | 85.48 | 91.92 |
| bed | 90.11 | 96.59 |
| windowpane | 63.83 | 77.87 |
| grass | 64.76 | 78.89 |
| cabinet | 62.13 | 73.76 |
| sidewalk | 69.88 | 82.98 |
| person | 82.44 | 92.74 |
| earth | 35.82 | 49.7 |
| door | 51.1 | 65.85 |
| table | 60.89 | 74.4 |
| mountain | 59.21 | 72.16 |
| plant | 53.45 | 64.69 |
| curtain | 74.71 | 87.09 |
| chair | 61.9 | 74.23 |
| car | 85.93 | 93.92 |
| water | 60.88 | 75.76 |
| painting | 74.08 | 88.03 |
| sofa | 70.52 | 85.48 |
| shelf | 46.76 | 70.3 |
| house | 48.95 | 65.93 |
| sea | 66.64 | 83.46 |
| mirror | 70.01 | 79.44 |
| rug | 68.24 | 78.04 |
| field | 27.0 | 45.61 |
| armchair | 49.67 | 68.81 |
| seat | 61.57 | 85.05 |
| fence | 50.39 | 66.54 |
| desk | 52.51 | 73.18 |
| rock | 47.69 | 77.52 |
| wardrobe | 52.41 | 74.16 |
| lamp | 64.4 | 76.15 |
| bathtub | 79.66 | 84.88 |
| railing | 38.76 | 55.33 |
| cushion | 58.95 | 72.0 |
| base | 35.35 | 47.07 |
| box | 30.96 | 39.84 |
| column | 48.65 | 59.64 |
| signboard | 39.84 | 52.53 |
| chest of drawers | 42.18 | 58.8 |
| counter | 25.42 | 31.12 |
| sand | 50.92 | 78.12 |
| sink | 71.8 | 81.29 |
| skyscraper | 54.8 | 75.97 |
| fireplace | 71.98 | 89.09 |
| refrigerator | 74.0 | 86.95 |
| grandstand | 51.51 | 82.02 |
| path | 27.65 | 38.79 |
| stairs | 24.23 | 32.11 |
| runway | 69.69 | 93.39 |
| case | 54.84 | 72.89 |
| pool table | 93.59 | 97.55 |
| pillow | 60.04 | 71.62 |
| screen door | 78.78 | 84.89 |
| stairway | 32.42 | 44.96 |
| river | 12.36 | 26.78 |
| bridge | 47.98 | 56.49 |
| bookcase | 42.2 | 61.27 |
| blind | 44.14 | 52.09 |
| coffee table | 56.83 | 81.54 |
| toilet | 77.65 | 90.81 |
| flower | 40.64 | 56.71 |
| book | 50.39 | 68.83 |
| hill | 8.76 | 12.16 |
| bench | 44.48 | 52.77 |
| countertop | 53.86 | 71.89 |
| stove | 73.33 | 85.42 |
| palm | 54.7 | 81.22 |
| kitchen island | 36.69 | 77.11 |
| computer | 76.68 | 93.44 |
| swivel chair | 50.45 | 70.19 |
| boat | 55.72 | 70.42 |
| bar | 46.97 | 63.47 |
| arcade machine | 77.32 | 83.73 |
| hovel | 44.66 | 47.13 |
| bus | 88.91 | 96.9 |
| towel | 66.36 | 81.26 |
| light | 54.84 | 61.62 |
| truck | 38.17 | 51.35 |
| tower | 27.19 | 45.83 |
| chandelier | 67.73 | 84.32 |
| awning | 34.73 | 40.09 |
| streetlight | 29.22 | 38.42 |
| booth | 42.93 | 48.72 |
| television receiver | 71.22 | 81.2 |
| airplane | 62.18 | 69.12 |
| dirt track | 12.35 | 26.75 |
| apparel | 36.96 | 50.77 |
| pole | 22.87 | 30.81 |
| land | 4.12 | 7.3 |
| bannister | 15.48 | 20.49 |
| escalator | 58.48 | 80.13 |
| ottoman | 53.12 | 72.89 |
| bottle | 38.93 | 60.38 |
| buffet | 54.7 | 64.62 |
| poster | 41.08 | 49.67 |
| stage | 15.45 | 27.8 |
| van | 44.31 | 57.28 |
| ship | 52.51 | 77.15 |
| fountain | 28.03 | 29.07 |
| conveyer belt | 64.03 | 93.28 |
| canopy | 40.2 | 55.69 |
| washer | 82.07 | 86.46 |
| plaything | 27.36 | 41.42 |
| swimming pool | 79.88 | 92.08 |
| stool | 46.4 | 56.69 |
| barrel | 41.77 | 64.89 |
| basket | 35.7 | 47.11 |
| waterfall | 70.1 | 82.53 |
| tent | 90.61 | 99.4 |
| bag | 16.3 | 18.63 |
| minibike | 69.95 | 84.3 |
| cradle | 79.32 | 97.42 |
| oven | 53.71 | 66.81 |
| ball | 51.44 | 63.86 |
| food | 56.48 | 71.72 |
| step | 13.23 | 14.56 |
| tank | 55.92 | 67.91 |
| trade name | 23.79 | 26.3 |
| microwave | 84.24 | 93.93 |
| pot | 40.29 | 45.21 |
| animal | 61.39 | 64.23 |
| bicycle | 55.97 | 73.42 |
| lake | 53.18 | 63.79 |
| dishwasher | 60.84 | 72.38 |
| screen | 61.71 | 87.5 |
| blanket | 16.01 | 19.36 |
| sculpture | 64.79 | 82.8 |
| hood | 63.25 | 70.51 |
| sconce | 52.4 | 63.6 |
| vase | 42.19 | 56.82 |
| traffic light | 30.93 | 54.64 |
| tray | 4.9 | 7.15 |
| ashcan | 41.46 | 55.92 |
| fan | 61.34 | 76.1 |
| pier | 35.28 | 43.99 |
| crt screen | 17.12 | 23.36 |
| plate | 56.08 | 73.78 |
| monitor | 52.86 | 64.91 |
| bulletin board | 53.05 | 63.52 |
| shower | 0.0 | 0.0 |
| radiator | 59.34 | 65.24 |
| glass | 16.52 | 17.8 |
| clock | 41.97 | 48.47 |
| flag | 60.88 | 69.39 |
+---------------------+-------+-------+
```