diff --git a/huawei/pytorch/modellink/README.md b/huawei/pytorch/modellink/README.md index cac04d1fbf67b6c01684398a4b0eebe6d61f1299..6f1597510d273d7d5b57ad0b9387f13ef629977f 100644 --- a/huawei/pytorch/modellink/README.md +++ b/huawei/pytorch/modellink/README.md @@ -1,9 +1,12 @@ # ModelLink 负载导航 ## ModelLink训练负载包取包链接 ### v0 版本 +branch: master +commit id: cbf2db3ed9d27a2885558bf40e0957ab5cea2881 |模型|负载包链接| | ----- | ------------------------------- | -|LLaMA2 7B|x86_64: xxxxxxxxxxxx
aarch64: xxxxxxxxxxxx| +|LLaMA2 7B|[x86_64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_7b-v0.tar.gz)
[aarch64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-aarch64-2.0-training-ModelLink-llama2_7b-v0.tar.gz)| +|LLaMA2 13B|[x86_64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-x86_64-2.0-training-ModelLink-llama2_13b-v0.tar.gz)
[aarch64](https://aisbench.obs.cn-north-4.myhuaweicloud.com/workload_packages/train/ModelLink/Ais-Benchmark-Stubs-aarch64-2.0-training-ModelLink-llama2_13b-v0.tar.gz)|
## 贡献指南
### 使用build.sh出负载包
diff --git a/huawei/pytorch/modellink/models/llama2_13b/README.md b/huawei/pytorch/modellink/models/llama2_13b/README.md
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..31443d6ad4de8589f5f797dc6b1c809cd5c769f9 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/README.md
+++ b/huawei/pytorch/modellink/models/llama2_13b/README.md
@@ -0,0 +1,112 @@
+# llama2 13b 训练负载包使用指南
+本文主要介绍使用基于ModelLink LLaMA2大模型训练业务代码构建的AISBench负载包"Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz"进行服务器性能测试的流程。
+## 名词定义
+|名词| 定义|
+| --- | ----------------------------------- |
+|管理节点|运行大模型训练负载的环境,只有一个,执行ais-bench-stubs二进制的环境|
+|计算节点|执行训练任务的环境,可以有多个,管理节点也是计算节点之一|
+## 查看llama2 13b 训练负载包目录结构,简单确认完整性
+解压负载包"Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz"(如果在包中看到本文档忽略此步)
+```bash
+tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_13b-{ModelLink version}.tar.gz"
+```
+查看目录结构
+```bash
+├── ais-bench-stubs # 启动测试的二进制文件
+├── code/
+│   ├── benchmark.sh
+│   ├── launch_config.sh
+│   ├── ModelLink # 嵌入了logging打点接口的ModelLink代码
+│   ├── multi_nodes_run.sh
+│   ├── registed_tasks.sh # 注册了可用的ModelLink脚本
+│   └── single_node_run.sh
+├── config/
+│   ├── config.json
+│   └── system.json
+├── log/
+├── result/
+├── README.md # 本文档
+└── STUBS_PACKAGE_INTRO.md
+```
+## ModelLink运行环境准备
+**注意**:请根据负载包名中的{ModelLink version},在[ModelLink负载主页](https://gitee.com/aisbench/training/blob/master/huawei/pytorch/modellink/)的“ModelLink训练负载包取包链接”章节获取ModelLink源码仓库对应的branch和commit id,后续提及的ModelLink源码仓库相关链接均需切换到对应的branch和commit id。
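+以v0负载包为例的示意命令(仅为获取源码的一种参考方式,实际branch与commit id请以负载主页发布信息为准):
+```bash
+# 示意:获取ModelLink源码并切换到v0负载包对应的commit id
+git clone https://gitee.com/ascend/ModelLink.git
+cd ModelLink
+git checkout cbf2db3ed9d27a2885558bf40e0957ab5cea2881
+```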
+ +请参考[ModelLink llama2 13b](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的"LLAMA2-13B"章节准备好运行环境、转换好的数据集、转换好的权重文件和词表文件。 + +## AISBench负载运行环境准备 +### 单机运行 +单机运行需要保证运行环境的python版本`>=3.7`。
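+可先确认当前解释器版本满足要求(示意命令,假设使用python3,与launch_config.sh中AIS_PYTHON的默认值一致):
+```bash
+python3 --version        # 应输出3.7及以上版本
+python3 -m pip --version # 确认pip可用,用于后续安装logging模块
+```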
+获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),安装logging模块:
+```bash
+pip install ais_bench_logging-{version}-py3-none-linux_{arch}.whl --force-reinstall
+```
+### 多机运行
+多机运行需要保证**所有计算节点(含管理节点)**的python版本`>=3.7`。
+获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),在**所有计算节点(含管理节点)**上安装logging模块:
+```bash
+pip install ais_bench_logging-{version}-py3-none-linux_{arch}.whl --force-reinstall
+```
+获取分布式运行组件cluster_tools的[最新发行版alpha版本](https://gitee.com/aisbench/cluster_tools/releases),在管理节点上安装cluster_tools工具:
+```bash
+pip install ais_bench_cluster-{version}-py3-none-linux_{arch}.whl --force-reinstall
+```
+
+**管理节点上**使用cluster_tools需要自建集群节点配置文件node_file.json,格式请参考[AISBench分布式运行组件cluster_tools使用说明](https://gitee.com/aisbench/cluster_tools/)的“集群节点信息文件内容格式”章节。
+
+
+## 如何确认训练任务是单机还是多机运行?
+查看`code/registed_tasks.sh`文件:
+```bash
+#!/bin/bash
+# 单机运行的任务
+SINGLE_NODE_LAUNCH=( \ # 单机执行的任务
+    "pretrain_llama2_13b_ptd_8p" \
+    "tune_llama2_13b_ptd"
+)
+# 多机运行的任务
+MULTI_NODES_LAUNCH=( \ # 多机执行的任务
+    "test_distributed_run"
+)
+```
+查看`code/ModelLink/`文件夹中是否有`code/registed_tasks.sh`中注册的任务所对应的shell脚本:
+```shell
+pretrain_llama2_13b_ptd_8p.sh # 运行在单机8张64G显存加速卡的预训练启动脚本
+tune_llama2_13b_ptd.sh # 运行在单机8张64G显存加速卡的微调启动脚本
+
+test_distributed_run # 测试多机环境logging与cluster是否正确部署的脚本,与加速卡无关
+```
+
+## 启动前配置
+编辑`code/launch_config.sh`启动文件:
+```bash
+#!/bin/bash
+export AIS_PYTHON=python3 # 使用的python解释器
+export AIS_NODE_FILE_PATH=/home/xx/xx/xx/node_file.json # 分布式运行使用cluster_tools所需包含节点信息和ssh key路径的文件,单机训练不用填
+export AIS_TRAIN_TASK="pretrain_llama2_13b_ptd_8p" # 请从code/registed_tasks.sh中注册的任务中选择一个填入
+export AIS_CKPT_SAVE_DIR="" # 结果权重保存路径
+export AIS_DATA_PATH="" # 数据集路径
+export AIS_TOKENIZER_MODEL="" # tokenizer路径
+export AIS_CKPT_LOAD_DIR="" # 加载的权重路径,预训练不需要,但是不能为空
+export AIS_TRAIN_ITERS=5000 # 训练迭代次数,default 5000
+export AIS_NUM_LAYERS=40 # 调试使用,模型layer层数,7B:32 13B:40 70B:80
+```
+**备注:**
+`code/launch_config.sh`的以下环境变量路径对应ModelLink启动脚本中如下变量:
+|AISBench路径配置|ModelLink启动脚本变量|
+| ---- | ---- |
+|AIS_CKPT_SAVE_DIR|SAVE_CHECKPOINT_PATH|
+|AIS_DATA_PATH|DATA_PATH|
+|AIS_TOKENIZER_MODEL|TOKENIZER_PATH|
+|AIS_CKPT_LOAD_DIR|LOAD_CHECKPOINT_PATH|
+请参考[ModelLink llama2主页](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的“5.预训练”或“6.微调”章节配置具体路径。
+
+## 启动测试
+### 在线测试
+在线测试的前置准备请参考`STUBS_PACKAGE_INTRO.md`文档。启动命令:
+```bash
+./ais-bench-stubs
+```
+### 轻量化离线测试
+启动命令:
+```bash
+./ais-bench-stubs test
+```
+
diff --git a/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh b/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
index 22d13be89eb04e1f779f3a79ef5f3487ed7900a5..5abf9e8058f3f0e758bb01cf220a42c769cb7550 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
+++ b/huawei/pytorch/modellink/models/llama2_13b/registed_tasks.sh
@@ -1,11 +1,10 @@
#!/bin/bash
# 单机运行的任务
SINGLE_NODE_LAUNCH=( \
-    "pretrain_llama2_7b_ptd" \
-    "pretrain_llama2_13b_ptd_8p"
+    "pretrain_llama2_13b_ptd_8p" \
+    "tune_llama2_13b_ptd"
)
# 多机运行的任务
MULTI_NODES_LAUNCH=( \
-    "pretrain_llama2_70b_ptd" \
    "test_distributed_run"
)
\ No newline at end of file
diff --git a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
index 58421561120ada4b9a97b48100362c2ffe1d003c..4a26d5054ae9d0390bc40c0a3aeb3cdd71b1af95 100644
--- a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
+++ b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/pretrain_llama2_13b_ptd_8p.sh
@@ -1,5 +1,5 @@
#!/bin/bash
-
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
@@ -81,7 +81,7 @@ OUTPUT_ARGS="
    --eval-iters 10 \
"
-python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
+python -m torch.distributed.launch 
$DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \ $GPT_ARGS \ $DATA_ARGS \ $OUTPUT_ARGS \ diff --git a/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh new file mode 100644 index 0000000000000000000000000000000000000000..9ed65b26b5418d328e8f4505e4cd58ea95184f0e --- /dev/null +++ b/huawei/pytorch/modellink/models/llama2_13b/train_scripts/tune_llama2_13b_ptd.sh @@ -0,0 +1,98 @@ +#!/bin/bash +CUR_DIR=$(cd "$(dirname "$0")";pwd) +export CUDA_DEVICE_MAX_CONNECTIONS=1 +export NPU_ASD_ENABLE=0 + +GPUS_PER_NODE=8 +MASTER_ADDR=localhost +MASTER_PORT=6001 +NNODES=1 +NODE_RANK=0 +WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES)) + +SAVE_CHECKPOINT_PATH=$AIS_CKPT_SAVE_DIR # "your model save ckpt path" +DATA_PATH=$AIS_DATA_PATH # "your data path" +TOKENIZER_PATH=$AIS_TOKENIZER_MODEL # "your tokenizer path" +LOAD_CHECKPOINT_PATH=$AIS_CKPT_LOAD_DIR # "your model ckpt path" +LORA_CHECKPOINT=$AIS_CKPT_LOAD_DIR + +TP=1 +PP=8 + +DISTRIBUTED_ARGS=" + --nproc_per_node $GPUS_PER_NODE \ + --nnodes $NNODES \ + --node_rank $NODE_RANK \ + --master_addr $MASTER_ADDR \ + --master_port $MASTER_PORT +" + +GPT_ARGS=" + --tensor-model-parallel-size ${TP} \ + --pipeline-model-parallel-size ${PP} \ + --sequence-parallel \ + --num-layers 40 \ + --hidden-size 5120 \ + --ffn-hidden-size 13824 \ + --num-attention-heads 40 \ + --tokenizer-type PretrainedFromHF \ + --tokenizer-name-or-path ${TOKENIZER_PATH} \ + --tokenizer-not-use-fast \ + --seq-length 2048 \ + --max-position-embeddings 2048 \ + --micro-batch-size 1 \ + --global-batch-size 128 \ + --make-vocab-size-divisible-by 1 \ + --lr 1.0e-6 \ + --train-iters ${AIS_TRAIN_ITERS} \ + --lr-decay-style cosine \ + --untie-embeddings-and-output-weights \ + --disable-bias-linear \ + --attention-dropout 0.0 \ + --init-method-std 0.01 \ + --hidden-dropout 0.0 \ + --position-embedding-type rope \ + --normalization RMSNorm \ + --use-fused-rmsnorm \ + --swiglu \ + --use-flash-attn \ + --no-masked-softmax-fusion \ + --attention-softmax-in-fp32 \ + --min-lr 1.0e-7 \ + --weight-decay 1e-1 \ + --lr-warmup-fraction 0.01 \ + --clip-grad 1.0 \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --initial-loss-scale 65536 \ + --no-gradient-accumulation-fusion \ + --load ${LOAD_CHECKPOINT_PATH} \ + --lora-load ${LORA_CHECKPOINT} \ + --no-load-optim \ + --no-load-rng \ + --finetune \ + --is-instruction-dataset \ + --lora-r 16 \ + --lora-alpha 32 \ + --lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \ + --bf16 +" + +DATA_ARGS=" + --data-path $DATA_PATH \ + --split 100,0,0 +" + +OUTPUT_ARGS=" + --log-interval 1 \ + --save-interval 10000 \ + --eval-interval 1000 \ + --eval-iters 10 \ +" + +torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \ + $GPT_ARGS \ + $DATA_ARGS \ + $OUTPUT_ARGS \ + --distributed-backend nccl \ + --save ${SAVE_CHECKPOINT_PATH} \ No newline at end of file diff --git a/huawei/pytorch/modellink/models/llama2_70b/README.md b/huawei/pytorch/modellink/models/llama2_70b/README.md index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..fb65d55b34340a6a879adf4715e12e4aaa5329b3 100644 --- a/huawei/pytorch/modellink/models/llama2_70b/README.md +++ b/huawei/pytorch/modellink/models/llama2_70b/README.md @@ -0,0 +1,2 @@ +# llama2 70b 训练负载包使用指南 +等待后续支持 \ No newline at end of file diff --git a/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh b/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh index 
2b713c2f01b81fc2666513bb9686ab10df6c713c..d63fd32023bb38a0f2fec28c4ca1aa9917a24ccc 100644
--- a/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh
+++ b/huawei/pytorch/modellink/models/llama2_70b/train_scripts/pretrain_llama2_70b_ptd.sh
@@ -1,4 +1,5 @@
#!/bin/bash
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
@@ -81,7 +82,7 @@ OUTPUT_ARGS="
    --eval-iters 10 \
"
-torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
+torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
diff --git a/huawei/pytorch/modellink/models/llama2_7b/README.md b/huawei/pytorch/modellink/models/llama2_7b/README.md
index 1084ad40df285efe14023677ca4fd7fcd47f332d..a53cb5cdbd7a6215a7be5041b312f850f85b7b46 100644
--- a/huawei/pytorch/modellink/models/llama2_7b/README.md
+++ b/huawei/pytorch/modellink/models/llama2_7b/README.md
@@ -4,7 +4,7 @@
|名词| 定义|
| --- | ----------------------------------- |
|管理节点|运行大模型训练负载的环境,只有一个,执行ais-bench-stubs二进制的环境|
-|计算节点|执行训练任务的环境,可以有多个|
+|计算节点|执行训练任务的环境,可以有多个,管理节点也是计算节点之一|
## 查看llama2 7b 训练负载包目录结构,简单确认完整性
解压负载包"Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b-{ModelLink version}.tar.gz"(如果在包中看到本文档忽略此步)
```bash
@@ -29,6 +29,8 @@ tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b
└── STUBS_PACKAGE_INTRO.md
```
## ModelLink运行环境准备
+**注意**:请根据负载包名中的{ModelLink version},在[ModelLink负载主页](https://gitee.com/aisbench/training/blob/master/huawei/pytorch/modellink/)的“ModelLink训练负载包取包链接”章节获取ModelLink源码仓库对应的branch和commit id,后续提及的ModelLink源码仓库相关链接均需切换到对应的branch和commit id。
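+以v0负载包为例的示意命令(仅为获取源码的一种参考方式,实际branch与commit id请以负载主页发布信息为准):
+```bash
+# 示意:获取ModelLink源码并切换到v0负载包对应的commit id
+git clone https://gitee.com/ascend/ModelLink.git
+cd ModelLink
+git checkout cbf2db3ed9d27a2885558bf40e0957ab5cea2881
+```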
+ 请参考[ModelLink llama2 7b](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的"LLAMA2-7B"章节准备好运行环境、转换好的数据集、转换好的权重文件和词表文件。 ## AISBench负载运行环境准备 @@ -39,8 +41,8 @@ tar xzf "Ais-Benchmark-Stubs-{arch}-{Stubs version}-training-ModelLink-llama2_7b pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall ``` ### 多机运行 -多机运行需要保证**管理节点和所有计算节点**的python版本`>=3.7`。
-获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),在**管理节点和所有计算节点**上安装logging模块: +多机运行需要保证**所有计算节点(含管理节点)**的python版本`>=3.7`。
+获取打点模块logging的[最新发行版](https://gitee.com/aisbench/logging/releases),在**所有计算节点(含管理节点)**上安装logging模块:
```bash
pip install ais_bench_logging--py3-none-linux_.whl --force-reinstall
```
@@ -49,19 +51,30 @@ pip install ais_bench_logging--py3-none-linux_.whl --force-reinst
pip install ais_bench_cluster--py3-none-linux_.whl --force-reinstall
```
+**管理节点上**使用cluster_tools需要自建集群节点配置文件node_file.json,格式请参考[AISBench分布式运行组件cluster_tools使用说明](https://gitee.com/aisbench/cluster_tools/)的“集群节点信息文件内容格式”章节。
+
+
## 如何确认训练任务是单机还是多机运行?
查看`code/register_task.sh`文件:
```bash
#!/bin/bash
# 单机运行的任务
SINGLE_NODE_LAUNCH=( \ # 单机执行的任务
-    "pretrain_llama2_7b_ptd"
+    "pretrain_llama2_7b_ptd" \
+    "tune_llama2_7b_ptd"
)
# 多机运行的任务
MULTI_NODES_LAUNCH=( \ # 多机执行的任务
    "test_distributed_run"
)
```
+查看`code/ModelLink/`文件夹中是否有`code/register_task.sh`中注册的任务所对应的shell脚本:
+```shell
+pretrain_llama2_7b_ptd.sh # 运行在单机8张64G显存加速卡的预训练启动脚本
+tune_llama2_7b_ptd.sh # 运行在单机8张64G显存加速卡的微调启动脚本
+
+test_distributed_run # 测试多机环境logging与cluster是否正确部署的脚本,与加速卡无关
+```

## 启动前配置
编辑`code/launch_config.sh`启动文件:
@@ -74,9 +87,18 @@ export AIS_CKPT_SAVE_DIR="" # 结果权重保存路径
export AIS_DATA_PATH="" # 数据集路径
export AIS_TOKENIZER_MODEL="" # tokenizer路径
export AIS_CKPT_LOAD_DIR="" # 加载的权重路径,预训练不需要,但是不能为空
-export AIS_TRAIN_ITERS=5000 # default 5000
-export AIS_NUM_LAYERS=32 # 7B:32 13B:40 70B:80
+export AIS_TRAIN_ITERS=5000 # 训练迭代次数,default 5000
+export AIS_NUM_LAYERS=32 # 调试使用,模型layer层数,7B:32 13B:40 70B:80
```
+**备注:**
+`code/launch_config.sh`的以下环境变量路径对应ModelLink启动脚本中如下变量:
+|AISBench路径配置|ModelLink启动脚本变量|
+| ---- | ---- |
+|AIS_CKPT_SAVE_DIR|CKPT_SAVE_DIR|
+|AIS_DATA_PATH|DATA_PATH|
+|AIS_TOKENIZER_MODEL|TOKENIZER_MODEL|
+|AIS_CKPT_LOAD_DIR|CKPT_LOAD_DIR|
+请参考[ModelLink llama2主页](https://gitee.com/ascend/ModelLink/blob/master/examples/llama2/README.md)的“5.预训练”或“6.微调”章节配置具体路径。

## 启动测试
### 在线测试
diff --git a/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh b/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
index f04575cef2c9589682f9c3f1e088cc08d8627435..933d17f1c7b3005f4fc142ecf56d30a2193c1f3b 100644
--- a/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
+++ b/huawei/pytorch/modellink/models/llama2_7b/train_scripts/tune_llama2_7b_ptd.sh
@@ -1,5 +1,5 @@
#!/bin/bash
-
+CUR_DIR=$(cd "$(dirname "$0")";pwd)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
@@ -12,11 +12,11 @@ NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
-CKPT_SAVE_DIR="your model save lora ckpt path"
-DATA_PATH="your data path"
-TOKENIZER_MODEL="your tokenizer path"
-CKPT_LOAD_DIR="your model ckpt path"
-LORA_CHECKPOINT="your lora ckpt path"
+CKPT_SAVE_DIR=$AIS_CKPT_SAVE_DIR # "your model save ckpt path"
+DATA_PATH=$AIS_DATA_PATH # "your data path"
+TOKENIZER_MODEL=$AIS_TOKENIZER_MODEL # "your tokenizer path"
+CKPT_LOAD_DIR=$AIS_CKPT_LOAD_DIR # "your model ckpt path"
+LORA_CHECKPOINT=$AIS_CKPT_LOAD_DIR
TP=8
PP=1
@@ -92,7 +92,7 @@ OUTPUT_ARGS="
    --eval-iters 0 \
"
-torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
+torchrun $DISTRIBUTED_ARGS $CUR_DIR/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
diff --git a/huawei/pytorch/modellink/patch_files/v0/v0.patch b/huawei/pytorch/modellink/patch_files/v0/v0.patch
index 3dcef76f0033ae7dc14744afe689a6c9790f0287..cf94d4f4e093e517def1d78db1b504c19fe153af 100644
--- a/huawei/pytorch/modellink/patch_files/v0/v0.patch
+++ b/huawei/pytorch/modellink/patch_files/v0/v0.patch
@@ -1,7 +1,7 @@
diff -Nur '--exclude=*.git*' origin/megatron/training.py
code/megatron/training.py ---- origin/megatron/training.py 2024-03-18 17:56:18.900000000 +0800 -+++ code/megatron/training.py 2024-03-18 17:56:18.944000000 +0800 -@@ -48,6 +48,15 @@ +--- origin/megatron/training.py 2024-05-20 09:18:50.592000000 +0800 ++++ code/megatron/training.py 2024-05-20 09:18:50.640000000 +0800 +@@ -48,6 +48,11 @@ from megatron.core.pipeline_parallel import get_forward_backward_func from megatron.utils import report_memory from megatron.model.vision.knn_monitor import compute_feature_bank @@ -9,15 +9,11 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. +AISBENCH_LOGGING_EXIST = True +AISBENCH_RESULT_PATH = os.path.join(os.path.dirname(__file__), "../../result") +AISBENCH_TRAIN_TOTAL_TOKENS = 1 -+try: -+ import ais_bench.logging as aislog -+except Exception: # if import failed, just run as normal -+ AISBENCH_LOGGING_EXIST=False -+ print("[AISBench][WARNING] can not import AISBench logging module, Stubs won't be connected.") ++import ais_bench.logging as aislog def print_datetime(string): -@@ -118,9 +127,22 @@ +@@ -118,9 +123,22 @@ # Set pytorch JIT layer fusion options and warmup JIT functions. set_jit_fusion_options() @@ -31,43 +27,43 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. + global AISBENCH_TRAIN_TOTAL_TOKENS + # get all tokens per device + if args.train_samples: -+ AISBENCH_TRAIN_TOTAL_TOKENS = args.train_samples * args.seq_length ++ AISBENCH_TRAIN_TOTAL_TOKENS = int(args.train_samples * args.seq_length / int(os.environ['WORLD_SIZE'])) + else: -+ AISBENCH_TRAIN_TOTAL_TOKENS = args.train_iters * args.global_batch_size * args.seq_length ++ AISBENCH_TRAIN_TOTAL_TOKENS = int(args.train_iters * args.global_batch_size * args.seq_length / int(os.environ['WORLD_SIZE'])) + -+ if AISBENCH_LOGGING_EXIST: # start prepare -+ aislog.start("prepare", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # start prepare ++ aislog.start("prepare") global _TRAIN_START_TIME start_time_tensor = torch.cuda.DoubleTensor([_TRAIN_START_TIME]) torch.distributed.all_reduce(start_time_tensor, -@@ -129,6 +151,8 @@ +@@ -129,6 +147,8 @@ print_rank_0('time to initialize megatron (seconds): {:.3f}'.format( time.time() - _TRAIN_START_TIME)) print_datetime('after megatron is initialized') -+ if AISBENCH_LOGGING_EXIST: # end prepare -+ aislog.end("prepare", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # end prepare ++ aislog.end("prepare") args = get_args() timers = get_timers() -@@ -142,6 +166,8 @@ +@@ -142,6 +162,8 @@ 'scheduler are built') config = get_model_config(model[0]) -+ if AISBENCH_LOGGING_EXIST: # start load data -+ aislog.start("dataload", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # start load data ++ aislog.start("dataload") # Data stuff. timers('train/valid/test-data-iterators-setup', log_level=0).start( barrier=True) -@@ -165,6 +191,8 @@ +@@ -165,6 +187,8 @@ # Print setup timing. print_rank_0('done with setup ...') -+ if AISBENCH_LOGGING_EXIST: # end load data -+ aislog.end("dataload", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # end load data ++ aislog.end("dataload") timers.log(['model-and-optimizer-setup', 'train/valid/test-data-iterators-setup'], barrier=True) -@@ -204,6 +232,9 @@ +@@ -204,6 +228,9 @@ test_data_iterator, model, iteration, process_non_loss_data_func, config, verbose=True, write_to_tensorboard=not args.skip_train) @@ -77,25 +73,41 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. 
def update_train_iters(args): -@@ -760,6 +791,8 @@ +@@ -760,6 +787,8 @@ timers('interval-time', log_level=0).start(barrier=True) print_datetime('before the start of training step') -+ if AISBENCH_LOGGING_EXIST: # start train -+ aislog.start("train", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # start train ++ aislog.start("train", AISBENCH_TRAIN_TOTAL_TOKENS) report_memory_flag = True exit = False -@@ -880,6 +913,8 @@ +@@ -780,6 +809,7 @@ + + update_num_microbatches(args.consumed_train_samples) + args.curr_iteration = iteration ++ aislog.start("train_per_step", int(args.global_batch_size / int(os.environ['WORLD_SIZE']))) + loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \ + train_step(forward_step_func, + train_data_iterator, +@@ -787,6 +817,7 @@ + optimizer, + opt_param_scheduler, + config) ++ aislog.end("train_per_step", int(args.global_batch_size / int(os.environ['WORLD_SIZE']))) + iteration += 1 + args.consumed_train_samples += mpu.get_data_parallel_world_size() * \ + args.micro_batch_size * \ +@@ -880,6 +911,8 @@ if args.manual_gc_interval != 0 and iteration % args.manual_gc_interval == 0: gc.collect() -+ if AISBENCH_LOGGING_EXIST: # end train -+ aislog.end("train", AISBENCH_TRAIN_TOTAL_TOKENS) ++ # end train ++ aislog.end("train", AISBENCH_TRAIN_TOTAL_TOKENS) # Flush TensorBoard and WandB writers. writer = get_tensorboard_writer() if writer: -@@ -1033,6 +1068,8 @@ +@@ -1033,6 +1066,8 @@ wandb_writer.log({ '{} validation'.format(key): total_loss_dict[key].item()}, iteration) @@ -105,8 +117,8 @@ diff -Nur '--exclude=*.git*' origin/megatron/training.py code/megatron/training. if process_non_loss_data_func is not None and writer and is_last_rank(): process_non_loss_data_func(collected_non_loss_data, iteration, writer) diff -Nur '--exclude=*.git*' origin/pretrain_gpt.py code/pretrain_gpt.py ---- origin/pretrain_gpt.py 2024-03-18 17:56:18.900000000 +0800 -+++ code/pretrain_gpt.py 2024-03-18 17:56:18.916000000 +0800 +--- origin/pretrain_gpt.py 2024-05-20 09:18:50.596000000 +0800 ++++ code/pretrain_gpt.py 2024-05-20 09:18:50.608000000 +0800 @@ -164,7 +164,7 @@ Args: loss_mask (Tensor): Used to mask out some portions of the loss