diff --git a/huawei/pytorch/modellink/README.md b/huawei/pytorch/modellink/README.md
index cac04d1fbf67b6c01684398a4b0eebe6d61f1299..6f1597510d273d7d5b57ad0b9387f13ef629977f 100644
--- a/huawei/pytorch/modellink/README.md
+++ b/huawei/pytorch/modellink/README.md
@@ -1,9 +1,12 @@
# ModelLink Workload Navigation
## ModelLink training workload package download links
### v0
+branch: master
+commit id: cbf2db3ed9d27a2885558bf40e0957ab5cea2881
|Model|Workload package links|
| ----- | ------------------------------- |
-|LLaMA2 7B|x86_64: xxxxxxxxxxxx aarch64: xxxxxxxxxxxx|
+|LLaMA2 7B|[x86_64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_7b-v0.tar.gz) [aarch64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-aarch64-2.0-training-ModelLink-llama2_7b-v0.tar.gz)|
+|LLaMA2 13B|[x86_64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_13b-v0.tar.gz) [aarch64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-aarch64-2.0-training-ModelLink-llama2_13b-v0.tar.gz)|
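+
+As a quick illustration, downloading and extracting a package might look like the minimal sketch below (the URL is the x86_64 LLaMA2 13B v0 link from the table above; adjust it to the architecture and model you need):
+```bash
+wget https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_13b-v0.tar.gz
+tar xzf Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_13b-v0.tar.gz
+```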
## Contribution guide
### Building a workload package with build.sh
diff --git a/huawei/pytorch/modellink/models/llama2_13b/README.md b/huawei/pytorch/modellink/models/llama2_13b/README.md
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..31443d6ad4de8589f5f797dc6b1c809cd5c769f9 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/README.md
+++ b/huawei/pytorch/modellink/models/llama2_13b/README.md
@@ -0,0 +1,112 @@
+# llama2 13b training workload package user guide
+This document describes how to run a server performance test with the AISBench workload package "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz", which is built from the ModelLink LLaMA2 large-model training code.
+## Terminology
+|Term|Definition|
+| --- | ----------------------------------- |
+|Management node|The environment that runs the ais-bench-stubs binary and drives the training workload; there is exactly one.|
+|Compute node|An environment that executes the training task; there can be several, and the management node is also one of the compute nodes.|
+## Inspect the llama2 13b workload package directory structure to verify its integrity
+Extract the workload package "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz" (skip this step if you are reading this document from inside the package):
+```bash
+tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz"
+```
+View the directory structure:
+```bash
+├── ais-bench-stubs # binary that starts the test
+├── code/
+│   ├── benchmark.sh
+│   ├── launch_config.sh
+│   ├── ModelLink # ModelLink code with the logging instrumentation hooks embedded
+│   ├── multi_nodes_run.sh
+│   ├── registed_tasks.sh # registers the available ModelLink tasks
+│   └── single_node_run.sh
+├── config/
+│   ├── config.json
+│   └── system.json
+├── log/
+├── result/
+├── README.md # this document
+└── STUBS_PACKAGE_INTRO.md
+```
+## Preparing the ModelLink runtime environment
+**Note**: In the "ModelLink training workload package download links" section of the [ModelLink workload homepage](https://gitee.com/aisbench/training/blob/master/huawei/pytorch/modellink/), use the {ModelLink version} in the package name to look up the branch and commit id of the ModelLink source repository. Every ModelLink source link mentioned below must be switched to that branch and commit id.
+
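+The following is a minimal sketch of how one might fetch the matching ModelLink sources, assuming the branch and commit id listed for v0 on the workload homepage (branch master, commit cbf2db3ed9d27a2885558bf40e0957ab5cea2881); adjust both to the {ModelLink version} of your package:
+```bash
+git clone https://gitee.com/ascend/ModelLink.git
+cd ModelLink
+git checkout cbf2db3ed9d27a2885558bf40e0957ab5cea2881
+```
+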
+Follow the "LLAMA2-13B" chapter of [ModelLink llama2 13b](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md) to prepare the runtime environment, the converted dataset, the converted weight files, and the tokenizer/vocabulary files.
+
+## Preparing the AISBench workload runtime environment
+### Single-node run
+A single-node run requires the Python version of the runtime environment to be `>=3.7`.
+Get the [latest release](https://gitee.com/aisbench/logging/releases) of the logging instrumentation module and install it:
+```bash
+pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall
+```
+### Multi-node run
+A multi-node run requires the Python version on **all compute nodes (including the management node)** to be `>=3.7`.
+Get the [latest release](https://gitee.com/aisbench/logging/releases) of the logging instrumentation module and install it on **all compute nodes (including the management node)**:
+```bash
+pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall
+```
+Get the [latest alpha release](https://gitee.com/aisbench/cluster_tools/releases) of the distributed-run component cluster_tools and install it on the management node:
+```bash
+pip install ais_bench_cluster--py3-none-linux_.whl --force-reinstall
+```
+
+Using cluster_tools on the **management node** requires a cluster node configuration file node_file.json that you create yourself; its format is described in the "Cluster node information file format" section of the [AISBench cluster_tools user guide](https://gitee.com/aisbench/cluster_tools/).
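+
+For illustration only, a node_file.json could look like the sketch below; the field names are hypothetical, and the authoritative schema is the "Cluster node information file format" section of the cluster_tools documentation:
+```bash
+# Hypothetical sketch -- field names are NOT authoritative; follow the cluster_tools docs.
+cat > node_file.json << 'EOF'
+{
+    "nodes": [
+        {"ip": "192.168.1.10", "user": "root", "ssh_key_path": "/root/.ssh/id_rsa"},
+        {"ip": "192.168.1.11", "user": "root", "ssh_key_path": "/root/.ssh/id_rsa"}
+    ]
+}
+EOF
+```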
+
+
+## How do I tell whether a training task runs on a single node or on multiple nodes?
+Open the `code/registed_tasks.sh` file:
+```bash
+#!/bin/bash
+# tasks that run on a single node
+SINGLE_NODE_LAUNCH=( \
+    "pretrain_llama2_13b_ptd_8p" \
+    "tune_llama2_13b_ptd"
+)
+# tasks that run on multiple nodes
+MULTI_NODES_LAUNCH=( \
+    "test_distributed_run"
+)
+```
+Check whether the `code/ModelLink/` folder contains the shell scripts that correspond to the tasks registered in `code/registed_tasks.sh`:
+```shell
+pretrain_llama2_13b_ptd_8p.sh # pretraining launch script for a single node with 8 accelerator cards of 64 GB memory each
+tune_llama2_13b_ptd.sh # fine-tuning launch script for a single node with 8 accelerator cards of 64 GB memory each
+
+test_distributed_run # script that verifies logging and cluster_tools are correctly deployed in the multi-node environment; independent of the accelerator cards
+```
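+
+For example, a quick way to list the launch scripts shipped in the package (illustrative command, run from the package root):
+```bash
+ls code/ModelLink/*.sh
+```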
+
+## Pre-launch configuration
+Edit the `code/launch_config.sh` launch file:
+```bash
+#!/bin/bash
+export AIS_PYTHON=python3 # python interpreter to use
+export AIS_NODE_FILE_PATH=/home/xx/xx/xx/node_file.json # file with node info and ssh key paths required by cluster_tools for distributed runs; leave unset for single-node training
+export AIS_TRAIN_TASK="pretrain_llama2_13b_ptd_8p" # pick one of the tasks registered in code/registed_tasks.sh
+export AIS_CKPT_SAVE_DIR="" # path where the resulting checkpoints are saved
+export AIS_DATA_PATH="" # dataset path
+export AIS_TOKENIZER_MODEL="" # tokenizer path
+export AIS_CKPT_LOAD_DIR="" # path of the checkpoint to load; not needed for pretraining, but must not be empty
+export AIS_TRAIN_ITERS=5000 # number of training iterations, default 5000
+export AIS_NUM_LAYERS=32 # for debugging only, number of model layers, 7B:32 13B:40 70B:80
+```
+**Note:**
+The following environment-variable paths in `code/launch_config.sh` map to these variables in the ModelLink launch scripts:
+|AISBench path variable|ModelLink launch script variable|
+| ---- | ---- |
+|AIS_CKPT_SAVE_DIR|SAVE_CHECKPOINT_PATH|
+|AIS_DATA_PATH|DATA_PATH|
+|AIS_TOKENIZER_MODEL|TOKENIZER_PATH|
+|AIS_CKPT_LOAD_DIR|LOAD_CHECKPOINT_PATH|
+
+See the "5. Pre-training" or "6. Fine-tuning" chapter of the [ModelLink llama2 homepage](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md) for how to set the concrete paths.
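+
+As a concrete illustration, a filled-in `code/launch_config.sh` for single-node pretraining might look like the sketch below (all paths are hypothetical placeholders; substitute the real locations of your dataset, tokenizer, and checkpoints):
+```bash
+export AIS_PYTHON=python3
+export AIS_TRAIN_TASK="pretrain_llama2_13b_ptd_8p"
+export AIS_CKPT_SAVE_DIR="/data/llama2_13b/ckpt_save"                 # hypothetical path
+export AIS_DATA_PATH="/data/llama2_13b/dataset/alpaca_text_document"  # hypothetical path
+export AIS_TOKENIZER_MODEL="/data/llama2_13b/tokenizer"               # hypothetical path
+export AIS_CKPT_LOAD_DIR="/data/llama2_13b/ckpt_load"                 # hypothetical path; unused for pretraining but must not be empty
+export AIS_TRAIN_ITERS=5000
+export AIS_NUM_LAYERS=40
+```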
+
+## Launching the test
+### Online test
+For the prerequisites of an online test, see the `STUBS_PACKAGE_INTRO.md` document. Launch command:
+```bash
+./ais-bench-stubs
+```
+### Lightweight offline test
+Launch command:
+```bash
+./ais-bench-stubs test
+```
+
diff --git a/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh b/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
index 22d13be89eb04e1f779f3a79ef5f3487ed7900a5..5abf9e8058f3f0e758bb01cf220a42c769cb7550 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
+++ b/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
@@ -1,11 +1,10 @@
#!/bin/bash
# tasks that run on a single node
SINGLE_NODE_LAUNCH=( \
- "pretrain_llama2_7b_ptd" \
- "pretrain_llama2_13b_ptd_8p"
+ "pretrain_llama2_13b_ptd_8p",
+ "tune_llama2_13b_ptd"
)
# tasks that run on multiple nodes
MULTI_NODES_LAUNCH=( \
- "pretrain_llama2_70b_ptd" \
"test_distributed_run"
)
\ No newline at end of file
diff --git a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
index 58421561120ada4b9a97b48100362c2ffe1d003c..4a26d5054ae9d0390bc40c0a3aeb3cdd71b1af95 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
+++ b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
@@ -1,5 +1,5 @@
#!/bin/bash
-
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
@@ -81,7 +81,7 @@ OUTPUT_ARGS="
--eval-iters 10 \
"
-python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
+python -m torch.distributed.launch $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
diff --git a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh
new file mode 100644
index 0000000000000000000000000000000000000000..9ed65b26b5418d328e8f4505e4cd58ea95184f0e
--- /dev/null
+++ b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh
@@ -0,0 +1,98 @@
+#!/bin/bash
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
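+# LoRA fine-tuning launch script for llama2 13B. The AIS_* variables referenced
+# below are expected to be exported by code/launch_config.sh before this script runs.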
+export CUDA_DEVICE_MAX_CONNECTIONS=1
+export NPU_ASD_ENABLE=0
+
+GPUS_PER_NODE=8
+MASTER_ADDR=localhost
+MASTER_PORT=6001
+NNODES=1
+NODE_RANK=0
+WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
+
+SAVE_CHECKPOINT_PATH=$AIS_CKPT_SAVE_DIR # "your model save ckpt path"
+DATA_PATH=$AIS_DATA_PATH # "your data path"
+TOKENIZER_PATH=$AIS_TOKENIZER_MODEL # "your tokenizer path"
+LOAD_CHECKPOINT_PATH=$AIS_CKPT_LOAD_DIR # "your model ckpt path"
+LORA_CHECKPOINT=$AIS_CKPT_LOAD_DIR
+
+TP=1
+PP=8
+
+DISTRIBUTED_ARGS="
+ --nproc_per_node $GPUS_PER_NODE \
+ --nnodes $NNODES \
+ --node_rank $NODE_RANK \
+ --master_addr $MASTER_ADDR \
+ --master_port $MASTER_PORT
+"
+
+GPT_ARGS="
+ --tensor-model-parallel-size ${TP} \
+ --pipeline-model-parallel-size ${PP} \
+ --sequence-parallel \
+ --num-layers 40 \
+ --hidden-size 5120 \
+ --ffn-hidden-size 13824 \
+ --num-attention-heads 40 \
+ --tokenizer-type PretrainedFromHF \
+ --tokenizer-name-or-path ${TOKENIZER_PATH} \
+ --tokenizer-not-use-fast \
+ --seq-length 2048 \
+ --max-position-embeddings 2048 \
+ --micro-batch-size 1 \
+ --global-batch-size 128 \
+ --make-vocab-size-divisible-by 1 \
+ --lr 1.0e-6 \
+ --train-iters ${AIS_TRAIN_ITERS} \
+ --lr-decay-style cosine \
+ --untie-embeddings-and-output-weights \
+ --disable-bias-linear \
+ --attention-dropout 0.0 \
+ --init-method-std 0.01 \
+ --hidden-dropout 0.0 \
+ --position-embedding-type rope \
+ --normalization RMSNorm \
+ --use-fused-rmsnorm \
+ --swiglu \
+ --use-flash-attn \
+ --no-masked-softmax-fusion \
+ --attention-softmax-in-fp32 \
+ --min-lr 1.0e-7 \
+ --weight-decay 1e-1 \
+ --lr-warmup-fraction 0.01 \
+ --clip-grad 1.0 \
+ --adam-beta1 0.9 \
+ --adam-beta2 0.95 \
+ --initial-loss-scale 65536 \
+ --no-gradient-accumulation-fusion \
+ --load ${LOAD_CHECKPOINT_PATH} \
+ --lora-load ${LORA_CHECKPOINT} \
+ --no-load-optim \
+ --no-load-rng \
+ --finetune \
+ --is-instruction-dataset \
+ --lora-r 16 \
+ --lora-alpha 32 \
+ --lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
+ --bf16
+"
+
+DATA_ARGS="
+ --data-path $DATA_PATH \
+ --split 100,0,0
+"
+
+OUTPUT_ARGS="
+ --log-interval 1 \
+ --save-interval 10000 \
+ --eval-interval 1000 \
+ --eval-iters 10 \
+"
+
+torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
+ $GPT_ARGS \
+ $DATA_ARGS \
+ $OUTPUT_ARGS \
+ --distributed-backend nccl \
+ --save ${SAVE_CHECKPOINT_PATH}
\ No newline at end of file
diff --git a/huawei/pytorch/modellink/models/llama2_70b/README.md b/huawei/pytorch/modellink/models/llama2_70b/README.md
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..fb65d55b34340a6a879adf4715e12e4aaa5329b3 100644
--- a/huawei/pytorch/modellink/models/llama2_70b/README.md
+++ b/huawei/pytorch/modellink/models/llama2_70b/README.md
@@ -0,0 +1,2 @@
+# llama2 70b training workload package user guide
+Support will be added in a later release.
\ No newline at end of file
diff --git a/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh b/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh
index 2b713c2f01b81fc2666513bb9686ab10df6c713c..d63fd32023bb38a0f2fec28c4ca1aa9917a24ccc 100644
--- a/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh
+++ b/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh
@@ -1,4 +1,5 @@
#!/bin/bash
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
@@ -81,7 +82,7 @@ OUTPUT_ARGS="
--eval-iters 10 \
"
-torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
+torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
diff --git a/huawei/pytorch/modellink/models/llama2_7b/README.md b/huawei/pytorch/modellink/models/llama2_7b/README.md
index 1084ad40df285efe14023677ca4fd7fcd47f332d..a53cb5cdbd7a6215a7be5041b312f850f85b7b46 100644
--- a/huawei/pytorch/modellink/models/llama2_7b/README.md
+++ b/huawei/pytorch/modellink/models/llama2_7b/README.md
@@ -4,7 +4,7 @@
|Term|Definition|
| --- | ----------------------------------- |
|Management node|The environment that runs the ais-bench-stubs binary and drives the training workload; there is exactly one.|
-|Compute node|An environment that executes the training task; there can be several.|
+|Compute node|An environment that executes the training task; there can be several, and the management node is also one of the compute nodes.|
## Inspect the llama2 7b workload package directory structure to verify its integrity
Extract the workload package "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b-{ModelLink version}.tar.gz" (skip this step if you are reading this document from inside the package):
```bash
@@ -29,6 +29,8 @@ tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b
└── STUBS_PACKAGE_INTRO.md
```
## Preparing the ModelLink runtime environment
+**Note**: In the "ModelLink training workload package download links" section of the [ModelLink workload homepage](https://gitee.com/aisbench/training/blob/master/huawei/pytorch/modellink/), use the {ModelLink version} in the package name to look up the branch and commit id of the ModelLink source repository. Every ModelLink source link mentioned below must be switched to that branch and commit id.
+
Follow the "LLAMA2-7B" chapter of [ModelLink llama2 7b](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md) to prepare the runtime environment, the converted dataset, the converted weight files, and the tokenizer/vocabulary files.
## Preparing the AISBench workload runtime environment
@@ -39,8 +41,8 @@ tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b
pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall
```
### Multi-node run
-A multi-node run requires the Python version on **the management node and all compute nodes** to be `>=3.7`.
-Get the [latest release](https://gitee.com/aisbench/logging/releases) of the logging instrumentation module and install it on **the management node and all compute nodes**:
+A multi-node run requires the Python version on **all compute nodes (including the management node)** to be `>=3.7`.
+Get the [latest release](https://gitee.com/aisbench/logging/releases) of the logging instrumentation module and install it on **all compute nodes (including the management node)**:
```bash
pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall
```
@@ -49,19 +51,30 @@ pip install ais_bench_logging--py3-none-linux_.whl --force-reinst
pip install ais_bench_cluster--py3-none-linux_.whl --force-reinstall
```
+Using cluster_tools on the **management node** requires a cluster node configuration file node_file.json that you create yourself; its format is described in the "Cluster node information file format" section of the [AISBench cluster_tools user guide](https://gitee.com/aisbench/cluster_tools/).
+
+
## How do I tell whether a training task runs on a single node or on multiple nodes?
Open the `code/registed_tasks.sh` file:
```bash
#!/bin/bash
# tasks that run on a single node
SINGLE_NODE_LAUNCH=( \
- "pretrain_llama2_7b_ptd"
+ "pretrain_llama2_7b_ptd" \
+ "tune_llama2_7b_ptd"
)
# tasks that run on multiple nodes
MULTI_NODES_LAUNCH=( \
"test_distributed_run"
)
```
+Check whether the `code/ModelLink/` folder contains the shell scripts that correspond to the tasks registered in `code/registed_tasks.sh`:
+```shell
+pretrain_llama2_7b_ptd.sh # pretraining launch script for a single node with 8 accelerator cards of 64 GB memory each
+tune_llama2_7b_ptd.sh # fine-tuning launch script for a single node with 8 accelerator cards of 64 GB memory each
+
+test_distributed_run # script that verifies logging and cluster_tools are correctly deployed in the multi-node environment; independent of the accelerator cards
+```
## Pre-launch configuration
Edit the `code/launch_config.sh` launch file:
@@ -74,9 +87,18 @@ export AIS_CKPT_SAVE_DIR="" # 结果权重保存路径
export AIS_DATA_PATH="" # dataset path
export AIS_TOKENIZER_MODEL="" # tokenizer path
export AIS_CKPT_LOAD_DIR="" # path of the checkpoint to load; not needed for pretraining, but must not be empty
-export AIS_TRAIN_ITERS=5000 # default 5000
-export AIS_NUM_LAYERS=32 # 7B:32 13B:40 70B:80
+export AIS_TRAIN_ITERS=5000 # number of training iterations, default 5000
+export AIS_NUM_LAYERS=32 # for debugging only, number of model layers, 7B:32 13B:40 70B:80
```
+**Note:**
+The following environment-variable paths in `code/launch_config.sh` map to these variables in the ModelLink launch scripts:
+|AISBench path variable|ModelLink launch script variable|
+| ---- | ---- |
+|AIS_CKPT_SAVE_DIR|CKPT_SAVE_DIR|
+|AIS_DATA_PATH|DATA_PATH|
+|AIS_TOKENIZER_MODEL|TOKENIZER_MODEL|
+|AIS_CKPT_LOAD_DIR|CKPT_LOAD_DIR|
+
+See the "5. Pre-training" or "6. Fine-tuning" chapter of the [ModelLink llama2 homepage](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md) for how to set the concrete paths.
## Launching the test
### Online test
diff --git a/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh b/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
index f04575cef2c9589682f9c3f1e088cc08d8627435..933d17f1c7b3005f4fc142ecf56d30a2193c1f3b 100644
--- a/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
+++ b/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
@@ -1,5 +1,5 @@
#!/bin/bash
-
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
@@ -12,11 +12,11 @@ NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
-CKPT_SAVE_DIR="your model save lora ckpt path"
-DATA_PATH="your data path"
-TOKENIZER_MODEL="your tokenizer path"
-CKPT_LOAD_DIR="your model ckpt path"
-LORA_CHECKPOINT="your lora ckpt path"
+CKPT_SAVE_DIR=$AIS_CKPT_SAVE_DIR # "your model save ckpt path"
+DATA_PATH=$AIS_DATA_PATH # "your data path"
+TOKENIZER_MODEL=$AIS_TOKENIZER_MODEL # "your tokenizer path"
+CKPT_LOAD_DIR=$AIS_CKPT_LOAD_DIR # "your model ckpt path"
+LORA_CHECKPOINT=$AIS_CKPT_LOAD_DIR
TP=8
PP=1
@@ -92,7 +92,7 @@ OUTPUT_ARGS="
--eval-iters 0 \
"
-torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
+torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
diff --git a/huawei/pytorch/modellink/patch_files/v0/v0.patch b/huawei/pytorch/modellink/patch_files/v0/v0.patch
index 3dcef76f0033ae7dc14744afe689a6c9790f0287..cf94d4f4e093e517def1d78db1b504c19fe153af 100644
--- a/huawei/pytorch/modellink/patch_files/v0/v0.patch
+++ b/huawei/pytorch/modellink/patch_files/v0/v0.patch
@@ -1,7 +1,7 @@
diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training.py
---- origin/megatron/training.py 2024-03-18 17:56:18.900000000 +0800
-+++ code/megatron/training.py 2024-03-18 17:56:18.944000000 +0800
-@@ -48,6 +48,15 @@
+--- origin/megatron/training.py 2024-05-20 09:18:50.592000000 +0800
++++ code/megatron/training.py 2024-05-20 09:18:50.640000000 +0800
+@@ -48,6 +48,11 @@
from megatron.core.pipeline_parallel import get_forward_backward_func
from megatron.utils import report_memory
from megatron.model.vision.knn_monitor import compute_feature_bank
@@ -9,15 +9,11 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training.
+AISBENCH_LOGGING_EXIST = True
+AISBENCH_RESULT_PATH = os.path.join(os.path.dirname(__file__), "../../result")
+AISBENCH_TRAIN_TOTAL_TOKENS = 1
-+try:
-+ import ais_bench.logging as aislog
-+except Exception: # if import failed, just run as normal
-+ AISBENCH_LOGGING_EXIST=False
-+ print("[AISBench][WARNING] can not import AISBench logging module, Stubs won't be connected.")
++import ais_bench.logging as aislog
def print_datetime(string):
-@@ -118,9 +127,22 @@
+@@ -118,9 +123,22 @@
# Set pytorch JIT layer fusion options and warmup JIT functions.
set_jit_fusion_options()
@@ -31,43 +27,43 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training.
+ global AISBENCH_TRAIN_TOTAL_TOKENS
+ # get all tokens per device
+ if args.train_samples:
-+ AISBENCH_TRAIN_TOTAL_TOKENS = args.train_samples * args.seq_length
++ AISBENCH_TRAIN_TOTAL_TOKENS = int(args.train_samples * args.seq_length / int(os.environ['WORLD_SIZE']))
+ else:
-+ AISBENCH_TRAIN_TOTAL_TOKENS = args.train_iters * args.global_batch_size * args.seq_length
++ AISBENCH_TRAIN_TOTAL_TOKENS = int(args.train_iters * args.global_batch_size * args.seq_length / int(os.environ['WORLD_SIZE']))
+
-+ if AISBENCH_LOGGING_EXIST: # start prepare
-+ aislog.start("prepare", AISBENCH_TRAIN_TOTAL_TOKENS)
++ # start prepare
++ aislog.start("prepare")
global _TRAIN_START_TIME
start_time_tensor = torch.cuda.DoubleTensor([_TRAIN_START_TIME])
torch.distributed.all_reduce(start_time_tensor,
-@@ -129,6 +151,8 @@
+@@ -129,6 +147,8 @@
print_rank_0('time to initialize megatron (seconds): {:.3f}'.format(
time.time() - _TRAIN_START_TIME))
print_datetime('after megatron is initialized')
-+ if AISBENCH_LOGGING_EXIST: # end prepare
-+ aislog.end("prepare", AISBENCH_TRAIN_TOTAL_TOKENS)
++ # end prepare
++ aislog.end("prepare")
args = get_args()
timers = get_timers()
-@@ -142,6 +166,8 @@
+@@ -142,6 +162,8 @@
'scheduler are built')
config = get_model_config(model[0])
-+ if AISBENCH_LOGGING_EXIST: # start load data
-+ aislog.start("dataload", AISBENCH_TRAIN_TOTAL_TOKENS)
++ # start load data
++ aislog.start("dataload")
# Data stuff.
timers('train/valid/test-data-iterators-setup', log_level=0).start(
barrier=True)
-@@ -165,6 +191,8 @@
+@@ -165,6 +187,8 @@
# Print setup timing.
print_rank_0('done with setup ...')
-+ if AISBENCH_LOGGING_EXIST: # end load data
-+ aislog.end("dataload", AISBENCH_TRAIN_TOTAL_TOKENS)
++ # end load data
++ aislog.end("dataload")
timers.log(['model-and-optimizer-setup',
'train/valid/test-data-iterators-setup'], barrier=True)
-@@ -204,6 +232,9 @@
+@@ -204,6 +228,9 @@
test_data_iterator, model,
iteration, process_non_loss_data_func, config,
verbose=True, write_to_tensorboard=not args.skip_train)
@@ -77,25 +73,41 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training.
def update_train_iters(args):
-@@ -760,6 +791,8 @@
+@@ -760,6 +787,8 @@
timers('interval-time', log_level=0).start(barrier=True)
print_datetime('before the start of training step')
-+ if AISBENCH_LOGGING_EXIST: # start train
-+ aislog.start("train", AISBENCH_TRAIN_TOTAL_TOKENS)
++ # start train
++ aislog.start("train", AISBENCH_TRAIN_TOTAL_TOKENS)
report_memory_flag = True
exit = False
-@@ -880,6 +913,8 @@
+@@ -780,6 +809,7 @@
+
+ update_num_microbatches(args.consumed_train_samples)
+ args.curr_iteration = iteration
++ aislog.start("train_per_step", int(args.global_batch_size / int(os.environ['WORLD_SIZE'])))
+ loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \
+ train_step(forward_step_func,
+ train_data_iterator,
+@@ -787,6 +817,7 @@
+ optimizer,
+ opt_param_scheduler,
+ config)
++ aislog.end("train_per_step", int(args.global_batch_size / int(os.environ['WORLD_SIZE'])))
+ iteration += 1
+ args.consumed_train_samples += mpu.get_data_parallel_world_size() * \
+ args.micro_batch_size * \
+@@ -880,6 +911,8 @@
if args.manual_gc_interval != 0 and iteration % args.manual_gc_interval == 0:
gc.collect()
-+ if AISBENCH_LOGGING_EXIST: # end train
-+ aislog.end("train", AISBENCH_TRAIN_TOTAL_TOKENS)
++ # end train
++ aislog.end("train", AISBENCH_TRAIN_TOTAL_TOKENS)
# Flush TensorBoard and WandB writers.
writer = get_tensorboard_writer()
if writer:
-@@ -1033,6 +1068,8 @@
+@@ -1033,6 +1066,8 @@
wandb_writer.log({
'{} validation'.format(key): total_loss_dict[key].item()},
iteration)
@@ -105,8 +117,8 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training.
if process_non_loss_data_func is not None and writer and is_last_rank():
process_non_loss_data_func(collected_non_loss_data, iteration, writer)
diff -Nur '--exclude=*.git*' origin/pretrain_gpt.py code/pretrain_gpt.py
---- origin/pretrain_gpt.py 2024-03-18 17:56:18.900000000 +0800
-+++ code/pretrain_gpt.py 2024-03-18 17:56:18.916000000 +0800
+--- origin/pretrain_gpt.py 2024-05-20 09:18:50.596000000 +0800
++++ code/pretrain_gpt.py 2024-05-20 09:18:50.608000000 +0800
@@ -164,7 +164,7 @@
Args:
loss_mask (Tensor): Used to mask out some portions of the loss