diff --git a/huawei/pytorch/modellink/README.md b/huawei/pytorch/modellink/README.md index cac04d1fbf67b6c01684398a4b0eebe6d61f1299..6f1597510d273d7d5b57ad0b9387f13ef629977f 100644 --- a/huawei/pytorch/modellink/README.md +++ b/huawei/pytorch/modellink/README.md @@ -1,9 +1,12 @@ # ModelLink 负载导航 ## ModelLink训练负载包取包链接 ### v0 版本 +branch: master +commit id: cbf2db3ed9d27a2885558bf40e0957ab5cea2881 |模型|负载包链接| | ----- | ------------------------------- | -|LLaMA2 7B|x86_64: xxxxxxxxxxxx
aarch64: xxxxxxxxxxxx| +|LLaMA2 7B|[x86_64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_7b-v0.tar.gz)
[aarch64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-aarch64-2.0-training-ModelLink-llama2_7b-v0.tar.gz)| +|LLaMA2 13B|[x86_64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_13b-v0.tar.gz)
[aarch64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-aarch64-2.0-training-ModelLink-llama2_13b-v0.tar.gz)|
## 贡献指南
### 使用build.sh出负载包
diff --git a/huawei/pytorch/modellink/models/llama2_13b/README.md b/huawei/pytorch/modellink/models/llama2_13b/README.md
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..31443d6ad4de8589f5f797dc6b1c809cd5c769f9 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/README.md
+++ b/huawei/pytorch/modellink/models/llama2_13b/README.md
@@ -0,0 +1,112 @@
+# llama2 13b 训练负载包使用指南
+本文主要介绍使用基于ModelLink LLaMA2大模型训练业务代码构建的AISBench负载包"Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz"进行服务器性能测试的流程。
+## 名词定义
+|名词| 定义|
+| --- | ----------------------------------- |
+|管理节点|运行大模型训练负载的环境,只有一个,执行ais-bench-stubs二进制的环境|
+|计算节点|执行训练任务的环境,可以有多个,管理节点也是计算节点之一|
+## 查看llama2 13b 训练负载包目录结构,简单确认完整性
+解压负载包"Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz"(如果在包中看到本文档忽略此步)
+```bash
+tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz"
+```
+查看目录结构
+```bash
+├── ais-bench-stubs # 启动测试的二进制文件
+├── code/
+│   ├── benchmark.sh
+│   ├── launch_config.sh
+│   ├── ModelLink # 嵌入了logging打点接口的ModelLink代码
+│   ├── multi_nodes_run.sh
+│   ├── registed_tasks.sh # 注册了可用的ModelLink脚本
+│   └── single_node_run.sh
+├── config/
+│   ├── config.json
+│   └── system.json
+├── log/
+├── result/
+├── README.md # 本文档
+└── STUBS_PACKAGE_INTRO.md
+```
+## ModelLink运行环境准备
+**注意**:请根据负载包名中的{ModelLink version},在[ModelLink负载主页](https://gitee.com/aisbench/training/blob/master/huawei/pytorch/modellink/)的“ModelLink训练负载包取包链接”章节获取ModelLink源码仓库对应的branch和commit id,后续提及的ModelLink源码仓库相关链接均需切换到对应的branch和commit id。
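+以v0负载包为例的示意命令(仅为获取源码的一种参考方式,实际branch与commit id请以负载主页发布信息为准):
+```bash
+# 示意:获取ModelLink源码并切换到v0负载包对应的commit id
+git clone https://gitee.com/ascend/ModelLink.git
+cd ModelLink
+git checkout cbf2db3ed9d27a2885558bf40e0957ab5cea2881
+```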
+ +请参考[ModelLink llama2 13b](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的"LLAMA2-13B"章节准备好运行环境、转换好的数据集、转换好的权重文件和词表文件。 + +## AISBench负载运行环境准备 +### 单机运行 +单机运行需要保证运行环境的python版本`>=3.7`。
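+可先确认当前解释器版本满足要求(示意命令,假设使用python3,与launch_config.sh中AIS_PYTHON的默认值一致):
+```bash
+python3 --version        # 应输出3.7及以上版本
+python3 -m pip --version # 确认pip可用,用于后续安装logging模块
+```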
+获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),安装logging模块:
+```bash
+pip install ais_bench_logging-{version}-py3-none-linux_{arch}.whl --force-reinstall
+```
+### 多机运行
+多机运行需要保证**所有计算节点(含管理节点)**的python版本`>=3.7`。
+获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),在**所有计算节点(含管理节点)**上安装logging模块:
+```bash
+pip install ais_bench_logging-{version}-py3-none-linux_{arch}.whl --force-reinstall
+```
+获取分布式运行组件cluster_tools的[最新发行版alpha版本](https://gitee.com/aisbench/cluster_tools/releases),在管理节点上安装cluster_tools工具:
+```bash
+pip install ais_bench_cluster-{version}-py3-none-linux_{arch}.whl --force-reinstall
+```
+
+**管理节点上**使用cluster_tools需要自建集群节点配置文件node_file.json,格式请参考[AISBench分布式运行组件cluster_tools使用说明](https://gitee.com/aisbench/cluster_tools/)的“集群节点信息文件内容格式”章节。
+
+
+## 如何确认训练任务是单机还是多机运行?
+查看`code/registed_tasks.sh`文件:
+```bash
+#!/bin/bash
+# 单机运行的任务
+SINGLE_NODE_LAUNCH=( \ # 单机执行的任务
+    "pretrain_llama2_13b_ptd_8p" \
+    "tune_llama2_13b_ptd"
+)
+# 多机运行的任务
+MULTI_NODES_LAUNCH=( \ # 多机执行的任务
+    "test_distributed_run"
+)
+```
+查看`code/ModelLink/`文件夹中是否有`code/registed_tasks.sh`中注册的任务所对应的shell脚本:
+```shell
+pretrain_llama2_13b_ptd_8p.sh # 运行在单机8张64G显存加速卡的预训练启动脚本
+tune_llama2_13b_ptd.sh # 运行在单机8张64G显存加速卡的微调启动脚本
+
+test_distributed_run # 测试多机环境logging与cluster是否正确部署的脚本,与加速卡无关
+```
+
+## 启动前配置
+编辑`code/launch_config.sh`启动文件:
+```bash
+#!/bin/bash
+export AIS_PYTHON=python3 # 使用的python解释器
+export AIS_NODE_FILE_PATH=/home/xx/xx/xx/node_file.json # 分布式运行使用cluster_tools所需包含节点信息和ssh key路径的文件,单机训练不用填
+export AIS_TRAIN_TASK="pretrain_llama2_13b_ptd_8p" # 请从code/registed_tasks.sh中注册的任务中选择一个填入
+export AIS_CKPT_SAVE_DIR="" # 结果权重保存路径
+export AIS_DATA_PATH="" # 数据集路径
+export AIS_TOKENIZER_MODEL="" # tokenizer路径
+export AIS_CKPT_LOAD_DIR="" # 加载的权重路径,预训练不需要,但是不能为空
+export AIS_TRAIN_ITERS=5000 # 训练迭代次数,default 5000
+export AIS_NUM_LAYERS=40 # 调试使用,模型layer层数,7B:32 13B:40 70B:80
+```
+**备注:**
+`code/launch_config.sh`的以下环境变量路径对应ModelLink启动脚本中如下变量:
+|AISBench路径配置|ModelLink启动脚本变量|
+| ---- | ---- |
+|AIS_CKPT_SAVE_DIR|SAVE_CHECKPOINT_PATH|
+|AIS_DATA_PATH|DATA_PATH|
+|AIS_TOKENIZER_MODEL|TOKENIZER_PATH|
+|AIS_CKPT_LOAD_DIR|LOAD_CHECKPOINT_PATH|
+请参考[ModelLink llama2主页](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的“5.预训练”或“6.微调”章节配置具体路径。
+
+## 启动测试
+### 在线测试
+在线测试的前置准备请参考`STUBS_PACKAGE_INTRO.md`文档。启动命令:
+```bash
+./ais-bench-stubs
+```
+### 轻量化离线测试
+启动命令:
+```bash
+./ais-bench-stubs test
+```
+
diff --git a/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh b/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
index 22d13be89eb04e1f779f3a79ef5f3487ed7900a5..5abf9e8058f3f0e758bb01cf220a42c769cb7550 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
+++ b/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
@@ -1,11 +1,10 @@
#!/bin/bash
# 单机运行的任务
SINGLE_NODE_LAUNCH=( \
-    "pretrain_llama2_7b_ptd" \
-    "pretrain_llama2_13b_ptd_8p"
+    "pretrain_llama2_13b_ptd_8p" \
+    "tune_llama2_13b_ptd"
)
# 多机运行的任务
MULTI_NODES_LAUNCH=( \
-    "pretrain_llama2_70b_ptd" \
    "test_distributed_run"
)
\ No newline at end of file
diff --git a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
index 58421561120ada4b9a97b48100362c2ffe1d003c..4a26d5054ae9d0390bc40c0a3aeb3cdd71b1af95 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
+++ b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
@@ -1,5 +1,5 @@
#!/bin/bash
-
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
@@ -81,7 +81,7 @@ OUTPUT_ARGS="
    --eval-iters 10 \
"
-python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
+python -m torch.distributed.launch 
$DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \ $GPT_ARGS \ $DATA_ARGS \ $OUTPUT_ARGS \ diff --git a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh new file mode 100644 index 0000000000000000000000000000000000000000..9ed65b26b5418d328e8f4505e4cd58ea95184f0e --- /dev/null +++ b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh @@ -0,0 +1,98 @@ +#!/bin/bash +CUR_DIR=$(cd "$(dirname "$0")";pwd) +export CUDA_DEVICE_MAX_CONNECTIONS=1 +export NPU_ASD_ENABLE=0 + +GPUS_PER_NODE=8 +MASTER_ADDR=localhost +MASTER_PORT=6001 +NNODES=1 +NODE_RANK=0 +WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES)) + +SAVE_CHECKPOINT_PATH=$AIS_CKPT_SAVE_DIR # "your model save ckpt path" +DATA_PATH=$AIS_DATA_PATH # "your data path" +TOKENIZER_PATH=$AIS_TOKENIZER_MODEL # "your tokenizer path" +LOAD_CHECKPOINT_PATH=$AIS_CKPT_LOAD_DIR # "your model ckpt path" +LORA_CHECKPOINT=$AIS_CKPT_LOAD_DIR + +TP=1 +PP=8 + +DISTRIBUTED_ARGS=" + --nproc_per_node $GPUS_PER_NODE \ + --nnodes $NNODES \ + --node_rank $NODE_RANK \ + --master_addr $MASTER_ADDR \ + --master_port $MASTER_PORT +" + +GPT_ARGS=" + --tensor-model-parallel-size ${TP} \ + --pipeline-model-parallel-size ${PP} \ + --sequence-parallel \ + --num-layers 40 \ + --hidden-size 5120 \ + --ffn-hidden-size 13824 \ + --num-attention-heads 40 \ + --tokenizer-type PretrainedFromHF \ + --tokenizer-name-or-path ${TOKENIZER_PATH} \ + --tokenizer-not-use-fast \ + --seq-length 2048 \ + --max-position-embeddings 2048 \ + --micro-batch-size 1 \ + --global-batch-size 128 \ + --make-vocab-size-divisible-by 1 \ + --lr 1.0e-6 \ + --train-iters ${AIS_TRAIN_ITERS} \ + --lr-decay-style cosine \ + --untie-embeddings-and-output-weights \ + --disable-bias-linear \ + --attention-dropout 0.0 \ + --init-method-std 0.01 \ + --hidden-dropout 0.0 \ + --position-embedding-type rope \ + --normalization RMSNorm \ + --use-fused-rmsnorm \ + --swiglu \ + --use-flash-attn \ + --no-masked-softmax-fusion \ + --attention-softmax-in-fp32 \ + --min-lr 1.0e-7 \ + --weight-decay 1e-1 \ + --lr-warmup-fraction 0.01 \ + --clip-grad 1.0 \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --initial-loss-scale 65536 \ + --no-gradient-accumulation-fusion \ + --load ${LOAD_CHECKPOINT_PATH} \ + --lora-load ${LORA_CHECKPOINT} \ + --no-load-optim \ + --no-load-rng \ + --finetune \ + --is-instruction-dataset \ + --lora-r 16 \ + --lora-alpha 32 \ + --lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \ + --bf16 +" + +DATA_ARGS=" + --data-path $DATA_PATH \ + --split 100,0,0 +" + +OUTPUT_ARGS=" + --log-interval 1 \ + --save-interval 10000 \ + --eval-interval 1000 \ + --eval-iters 10 \ +" + +torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \ + $GPT_ARGS \ + $DATA_ARGS \ + $OUTPUT_ARGS \ + --distributed-backend nccl \ + --save ${SAVE_CHECKPOINT_PATH} \ No newline at end of file diff --git a/huawei/pytorch/modellink/models/llama2_70b/README.md b/huawei/pytorch/modellink/models/llama2_70b/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..fb65d55b34340a6a879adf4715e12e4aaa5329b3 100644 --- a/huawei/pytorch/modellink/models/llama2_70b/README.md +++ b/huawei/pytorch/modellink/models/llama2_70b/README.md @@ -0,0 +1,2 @@ +# llama2 70b 训练负载包使用指南 +等待后续支持 \ No newline at end of file diff --git a/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh b/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh index 
2b713c2f01b81fc2666513bb9686ab10df6c713c..d63fd32023bb38a0f2fec28c4ca1aa9917a24ccc 100644
--- a/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh
+++ b/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh
@@ -1,4 +1,5 @@
#!/bin/bash
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
@@ -81,7 +82,7 @@ OUTPUT_ARGS="
    --eval-iters 10 \
"
-torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
+torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
diff --git a/huawei/pytorch/modellink/models/llama2_7b/README.md b/huawei/pytorch/modellink/models/llama2_7b/README.md
index 1084ad40df285efe14023677ca4fd7fcd47f332d..a53cb5cdbd7a6215a7be5041b312f850f85b7b46 100644
--- a/huawei/pytorch/modellink/models/llama2_7b/README.md
+++ b/huawei/pytorch/modellink/models/llama2_7b/README.md
@@ -4,7 +4,7 @@
|名词| 定义|
| --- | ----------------------------------- |
|管理节点|运行大模型训练负载的环境,只有一个,执行ais-bench-stubs二进制的环境|
-|计算节点|执行训练任务的环境,可以有多个|
+|计算节点|执行训练任务的环境,可以有多个,管理节点也是计算节点之一|
## 查看llama2 7b 训练负载包目录结构,简单确认完整性
解压负载包"Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b-{ModelLink version}.tar.gz"(如果在包中看到本文档忽略此步)
```bash
@@ -29,6 +29,8 @@ tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b
└── STUBS_PACKAGE_INTRO.md
```
## ModelLink运行环境准备
+**注意**:请根据负载包名中的{ModelLink version},在[ModelLink负载主页](https://gitee.com/aisbench/training/blob/master/huawei/pytorch/modellink/)的“ModelLink训练负载包取包链接”章节获取ModelLink源码仓库对应的branch和commit id,后续提及的ModelLink源码仓库相关链接均需切换到对应的branch和commit id。
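+以v0负载包为例的示意命令(仅为获取源码的一种参考方式,实际branch与commit id请以负载主页发布信息为准):
+```bash
+# 示意:获取ModelLink源码并切换到v0负载包对应的commit id
+git clone https://gitee.com/ascend/ModelLink.git
+cd ModelLink
+git checkout cbf2db3ed9d27a2885558bf40e0957ab5cea2881
+```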
+ 请参考[ModelLink llama2 7b](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的"LLAMA2-7B"章节准备好运行环境、转换好的数据集、转换好的权重文件和词表文件。 ## AISBench负载运行环境准备 @@ -39,8 +41,8 @@ tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall ``` ### 多机运行 -多机运行需要保证**管理节点和所有计算节点**的python版本`>=3.7`。
-获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),在**管理节点和所有计算节点**上安装logging模块: +多机运行需要保证**所有计算节点(含管理节点)**的python版本`>=3.7`。
+获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),在**所有计算节点(含管理节点)**上安装logging模块:
```bash
pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall
```
@@ -49,19 +51,30 @@ pip install ais_bench_logging--py3-none-linux_.whl --force-reinst
pip install ais_bench_cluster--py3-none-linux_.whl --force-reinstall
```
+**管理节点上**使用cluster_tools需要自建集群节点配置文件node_file.json,格式请参考[AISBench分布式运行组件cluster_tools使用说明](https://gitee.com/aisbench/cluster_tools/)的“集群节点信息文件内容格式”章节。
+
+
## 如何确认训练任务是单机还是多机运行?
查看`code/register_task.sh`文件:
```bash
#!/bin/bash
# 单机运行的任务
SINGLE_NODE_LAUNCH=( \ # 单机执行的任务
-    "pretrain_llama2_7b_ptd"
+    "pretrain_llama2_7b_ptd" \
+    "tune_llama2_7b_ptd"
)
# 多机运行的任务
MULTI_NODES_LAUNCH=( \ # 多机执行的任务
    "test_distributed_run"
)
```
+查看`code/ModelLink/`文件夹中是否有`code/register_task.sh`中注册的任务所对应的shell脚本:
+```shell
+pretrain_llama2_7b_ptd.sh # 运行在单机8张64G显存加速卡的预训练启动脚本
+tune_llama2_7b_ptd.sh # 运行在单机8张64G显存加速卡的微调启动脚本
+
+test_distributed_run # 测试多机环境logging与cluster是否正确部署的脚本,与加速卡无关
+```

## 启动前配置
编辑`code/launch_config.sh`启动文件:
@@ -74,9 +87,18 @@ export AIS_CKPT_SAVE_DIR="" # 结果权重保存路径
export AIS_DATA_PATH="" # 数据集路径
export AIS_TOKENIZER_MODEL="" # tokenizer路径
export AIS_CKPT_LOAD_DIR="" # 加载的权重路径,预训练不需要,但是不能为空
-export AIS_TRAIN_ITERS=5000 # default 5000
-export AIS_NUM_LAYERS=32 # 7B:32 13B:40 70B:80
+export AIS_TRAIN_ITERS=5000 # 训练迭代次数,default 5000
+export AIS_NUM_LAYERS=32 # 调试使用,模型layer层数,7B:32 13B:40 70B:80
```
+**备注:**
+`code/launch_config.sh`的以下环境变量路径对应ModelLink启动脚本中如下变量:
+|AISBench路径配置|ModelLink启动脚本变量|
+| ---- | ---- |
+|AIS_CKPT_SAVE_DIR|CKPT_SAVE_DIR|
+|AIS_DATA_PATH|DATA_PATH|
+|AIS_TOKENIZER_MODEL|TOKENIZER_MODEL|
+|AIS_CKPT_LOAD_DIR|CKPT_LOAD_DIR|
+请参考[ModelLink llama2主页](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的“5.预训练”或“6.微调”章节配置具体路径。

## 启动测试
### 在线测试
diff --git a/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh b/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
index f04575cef2c9589682f9c3f1e088cc08d8627435..933d17f1c7b3005f4fc142ecf56d30a2193c1f3b 100644
--- a/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
+++ b/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
@@ -1,5 +1,5 @@
#!/bin/bash
-
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
@@ -12,11 +12,11 @@ NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
-CKPT_SAVE_DIR="your model save lora ckpt path"
-DATA_PATH="your data path"
-TOKENIZER_MODEL="your tokenizer path"
-CKPT_LOAD_DIR="your model ckpt path"
-LORA_CHECKPOINT="your lora ckpt path"
+CKPT_SAVE_DIR=$AIS_CKPT_SAVE_DIR # "your model save ckpt path"
+DATA_PATH=$AIS_DATA_PATH # "your data path"
+TOKENIZER_MODEL=$AIS_TOKENIZER_MODEL # "your tokenizer path"
+CKPT_LOAD_DIR=$AIS_CKPT_LOAD_DIR # "your model ckpt path"
+LORA_CHECKPOINT=$AIS_CKPT_LOAD_DIR
TP=8
PP=1
@@ -92,7 +92,7 @@ OUTPUT_ARGS="
    --eval-iters 0 \
"
-torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
+torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
diff --git a/huawei/pytorch/modellink/patch_files/v0/v0.patch b/huawei/pytorch/modellink/patch_files/v0/v0.patch
index 3dcef76f0033ae7dc14744afe689a6c9790f0287..cf94d4f4e093e517def1d78db1b504c19fe153af 100644
--- a/huawei/pytorch/modellink/patch_files/v0/v0.patch
+++ b/huawei/pytorch/modellink/patch_files/v0/v0.patch
@@ -1,7 +1,7 @@
diff -Nur '--exclude=*.git*' origin/megatron/training.py
code/megatron/training.py ---- origin/megatron/training.py 2024-03-18 17:56:18.900000000 +0800 -+++ code/megatron/training.py 2024-03-18 17:56:18.944000000 +0800 -@@ -48,6 +48,15 @@ +--- origin/megatron/training.py 2024-05-20 09:18:50.592000000 +0800 ++++ code/megatron/training.py 2024-05-20 09:18:50.640000000 +0800 +@@ -48,6 +48,11 @@ from megatron.core.pipeline_parallel import get_forward_backward_func from megatron.utils import report_memory from megatron.model.vision.knn_monitor import compute_feature_bank @@ -9,15 +9,11 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. +AISBENCH_LOGGING_EXIST = True +AISBENCH_RESULT_PATH = os.path.join(os.path.dirname(__file__), "../../result") +AISBENCH_TRAIN_TOTAL_TOKENS = 1 -+try: -+ import ais_bench.logging as aislog -+except Exception: # if import failed, just run as normal -+ AISBENCH_LOGGING_EXIST=False -+ print("[AISBench][WARNING] can not import AISBench logging module, Stubs won't be connected.") ++import ais_bench.logging as aislog def print_datetime(string): -@@ -118,9 +127,22 @@ +@@ -118,9 +123,22 @@ # Set pytorch JIT layer fusion options and warmup JIT functions. set_jit_fusion_options() @@ -31,43 +27,43 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. + global AISBENCH_TRAIN_TOTAL_TOKENS + # get all tokens per device + if args.train_samples: -+ AISBENCH_TRAIN_TOTAL_TOKENS = args.train_samples * args.seq_length ++ AISBENCH_TRAIN_TOTAL_TOKENS = int(args.train_samples * args.seq_length / int(os.environ['WORLD_SIZE'])) + else: -+ AISBENCH_TRAIN_TOTAL_TOKENS = args.train_iters * args.global_batch_size * args.seq_length ++ AISBENCH_TRAIN_TOTAL_TOKENS = int(args.train_iters * args.global_batch_size * args.seq_length / int(os.environ['WORLD_SIZE'])) + -+ if AISBENCH_LOGGING_EXIST: # start prepare -+ aislog.start("prepare", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # start prepare ++ aislog.start("prepare") global _TRAIN_START_TIME start_time_tensor = torch.cuda.DoubleTensor([_TRAIN_START_TIME]) torch.distributed.all_reduce(start_time_tensor, -@@ -129,6 +151,8 @@ +@@ -129,6 +147,8 @@ print_rank_0('time to initialize megatron (seconds): {:.3f}'.format( time.time() - _TRAIN_START_TIME)) print_datetime('after megatron is initialized') -+ if AISBENCH_LOGGING_EXIST: # end prepare -+ aislog.end("prepare", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # end prepare ++ aislog.end("prepare") args = get_args() timers = get_timers() -@@ -142,6 +166,8 @@ +@@ -142,6 +162,8 @@ 'scheduler are built') config = get_model_config(model[0]) -+ if AISBENCH_LOGGING_EXIST: # start load data -+ aislog.start("dataload", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # start load data ++ aislog.start("dataload") # Data stuff. timers('train/valid/test-data-iterators-setup', log_level=0).start( barrier=True) -@@ -165,6 +191,8 @@ +@@ -165,6 +187,8 @@ # Print setup timing. print_rank_0('done with setup ...') -+ if AISBENCH_LOGGING_EXIST: # end load data -+ aislog.end("dataload", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # end load data ++ aislog.end("dataload") timers.log(['model-and-optimizer-setup', 'train/valid/test-data-iterators-setup'], barrier=True) -@@ -204,6 +232,9 @@ +@@ -204,6 +228,9 @@ test_data_iterator, model, iteration, process_non_loss_data_func, config, verbose=True, write_to_tensorboard=not args.skip_train) @@ -77,25 +73,41 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. 
def update_train_iters(args): -@@ -760,6 +791,8 @@ +@@ -760,6 +787,8 @@ timers('interval-time', log_level=0).start(barrier=True) print_datetime('before the start of training step') -+ if AISBENCH_LOGGING_EXIST: # start train -+ aislog.start("train", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # start train ++ aislog.start("train", AISBENCH_TRAIN_TOTAL_TOKENS) report_memory_flag = True exit = False -@@ -880,6 +913,8 @@ +@@ -780,6 +809,7 @@ + + update_num_microbatches(args.consumed_train_samples) + args.curr_iteration = iteration ++ aislog.start("train_per_step", int(args.global_batch_size / int(os.environ['WORLD_SIZE']))) + loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \ + train_step(forward_step_func, + train_data_iterator, +@@ -787,6 +817,7 @@ + optimizer, + opt_param_scheduler, + config) ++ aislog.end("train_per_step", int(args.global_batch_size / int(os.environ['WORLD_SIZE']))) + iteration += 1 + args.consumed_train_samples += mpu.get_data_parallel_world_size() * \ + args.micro_batch_size * \ +@@ -880,6 +911,8 @@ if args.manual_gc_interval != 0 and iteration % args.manual_gc_interval == 0: gc.collect() -+ if AISBENCH_LOGGING_EXIST: # end train -+ aislog.end("train", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # end train ++ aislog.end("train", AISBENCH_TRAIN_TOTAL_TOKENS) # Flush TensorBoard and WandB writers. writer = get_tensorboard_writer() if writer: -@@ -1033,6 +1068,8 @@ +@@ -1033,6 +1066,8 @@ wandb_writer.log({ '{} validation'.format(key): total_loss_dict[key].item()}, iteration) @@ -105,8 +117,8 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. if process_non_loss_data_func is not None and writer and is_last_rank(): process_non_loss_data_func(collected_non_loss_data, iteration, writer) diff -Nur '--exclude=*.git*' origin/pretrain_gpt.py code/pretrain_gpt.py ---- origin/pretrain_gpt.py 2024-03-18 17:56:18.900000000 +0800 -+++ code/pretrain_gpt.py 2024-03-18 17:56:18.916000000 +0800 +--- origin/pretrain_gpt.py 2024-05-20 09:18:50.596000000 +0800 ++++ code/pretrain_gpt.py 2024-05-20 09:18:50.608000000 +0800 @@ -164,7 +164,7 @@ Args: loss_mask (Tensor): Used to mask out some portions of the loss