# ERNIE-Gram

**Repository Path**: baidu/ERNIE-Gram

## Basic Information

- **Project Name**: ERNIE-Gram
- **Description**: Multi-granularity linguistic knowledge model
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-02-20
- **Last Updated**: 2024-02-20

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## _ERNIE-Gram_: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

- [Proposed Methods](#proposed-methods)
- [Pre-trained Models](#pre-trained-models)
- [Fine-tuning on Downstream Tasks](#fine-tuning-on-downstream-tasks)
  * [GLUE](#glue-benchmark)
  * [SQuAD](#squad-benchmark)
- [Usage](#usage)
  * [Install PaddlePaddle](#install-paddlepaddle)
  * [Fine-tuning](#fine-tuning)
  * [Employ Dynamic Computation Graph](#employ-dynamic-computation-graph)
- [Citation](#citation)
- [Communication](#communication)

For a technical description of the algorithm, please see our paper:

>[_**ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding**_](https://www.aclweb.org/anthology/2021.naacl-main.136/)
>
>Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
>
>Accepted by **NAACL-HLT 2021**

![ERNIE-Gram](https://img.shields.io/badge/Pretraining-Language%20Understanding-green) ![GLUE](https://img.shields.io/badge/GLUE-The%20General%20Language%20Understanding%20Evaluation-yellow) ![SQuAD](https://img.shields.io/badge/SQuAD-The%20Stanford%20Question%20Answering-blue) ![RACE](https://img.shields.io/badge/RACE-The%20ReAding%20Comprehension%20from%20Examinations-green)

---

**[ERNIE-Gram](https://www.aclweb.org/anthology/2021.naacl-main.136/)** is an **explicit** n-gram masking and prediction method that overcomes the limitations of previous contiguous masking strategies and fully incorporates coarse-grained linguistic information into pre-training. To model the intra-dependencies and inter-relations of coarse-grained linguistic information, n-grams are masked and predicted directly using explicit n-gram identities rather than contiguous sequences of n tokens. Furthermore, ERNIE-Gram employs a generator model to sample plausible n-gram identities as optional n-gram masks, and predicts them in both coarse-grained and fine-grained manners to enable comprehensive n-gram prediction and relation modeling.

## Proposed Methods

We propose three novel methods to model the intra-dependencies and inter-relations of coarse-grained linguistic information:

- **Explicitly N-gram Masked Language Modeling**: n-grams are masked with single [MASK] symbols and predicted directly using explicit n-gram identities rather than sequences of tokens.
- **Comprehensive N-gram Prediction**: masked n-grams are simultaneously predicted in coarse-grained (explicit n-gram identities) and fine-grained (contained token identities) manners.
- **Enhanced N-gram Relation Modeling**: n-grams are masked with plausible n-gram identities sampled from a generator model, and then recovered to the original n-grams.

![ernie-gram](.meta/ernie-gram.png)
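To make the first of these methods concrete, here is a minimal, hypothetical sketch of explicit n-gram masking: each matched n-gram is replaced by a single `[MASK]` symbol, and its prediction target is an explicit n-gram identity from an n-gram lexicon rather than the n individual tokens. The lexicon, sampling rule, and function names below are invented for illustration and are not the repository's actual pre-training code.

```python
import random

MASK = "[MASK]"

# Toy n-gram lexicon mapping surface bigrams to explicit n-gram identities.
NGRAM_VOCAB = {
    ("new", "york"): 0,
    ("machine", "learning"): 1,
    ("question", "answering"): 2,
}

def mask_ngrams(tokens, mask_prob=0.15, rng=random):
    """Greedily match known n-grams and mask each one with a single [MASK] symbol."""
    masked, targets = [], []
    i = 0
    while i < len(tokens):
        cand = tuple(tokens[i:i + 2])  # only bigrams, for brevity
        if cand in NGRAM_VOCAB and rng.random() < mask_prob * len(cand):
            masked.append(MASK)  # one symbol covers the whole n-gram
            targets.append((len(masked) - 1, NGRAM_VOCAB[cand]))  # (position, n-gram id)
            i += len(cand)
        else:
            masked.append(tokens[i])
            i += 1
    return masked, targets

tokens = "ernie gram improves question answering in new york".split()
masked, targets = mask_ngrams(tokens, mask_prob=0.5, rng=random.Random(0))
print(masked)   # ['ernie', 'gram', 'improves', '[MASK]', 'in', '[MASK]']
print(targets)  # [(3, 2), (5, 0)]
```

In ERNIE-Gram's comprehensive n-gram prediction, the tokens contained in each masked n-gram are additionally predicted at the same position; the sketch omits that fine-grained target for brevity.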
## Pre-trained Models

We release checkpoints for the **ERNIE-Gram _16G_** and **ERNIE-Gram _160G_** models, which are pre-trained on base-scale corpora (16GB of text, as for BERT) and large-scale corpora (160GB of text, as for RoBERTa), respectively.

- [**ERNIE-Gram _16G_**](https://ernie-github.cdn.bcebos.com/model-ernie-gram-en-16g.tar.gz) (_lowercased | 12-layer, 768-hidden, 12-heads, 110M parameters_)
- [**ERNIE-Gram _160G_**](https://ernie-github.cdn.bcebos.com/model-ernie-gram-en-160g.tar.gz) (_lowercased | 12-layer, 768-hidden, 12-heads, 110M parameters_)

## Fine-tuning on Downstream Tasks

We compare the performance of [ERNIE-Gram](https://www.aclweb.org/anthology/2021.naacl-main.136/) with existing SOTA pre-training models for natural language understanding ([MPNet](https://arxiv.org/abs/2004.09297), [UniLMv2](https://arxiv.org/abs/2002.12804), [ELECTRA](https://arxiv.org/abs/2003.10555), [RoBERTa](https://arxiv.org/abs/1907.11692) and [XLNet](https://arxiv.org/abs/1906.08237)) on several language understanding tasks, including the [GLUE benchmark](https://openreview.net/pdf?id=rJ4km2R5t7) (General Language Understanding Evaluation) and [SQuAD](https://arxiv.org/abs/1606.05250) (Stanford Question Answering).

### GLUE benchmark

The General Language Understanding Evaluation ([GLUE](https://openreview.net/pdf?id=rJ4km2R5t7)) is a multi-task benchmark of diverse NLU tasks, covering 1) pairwise classification tasks such as natural language inference ([MNLI](https://www.aclweb.org/anthology/N18-1101), [RTE](http://dx.doi.org/10.1007/11736790_9)), question answering (QNLI) and paraphrase detection (QQP, [MRPC](https://www.aclweb.org/anthology/I05-5002)); 2) single-sentence classification tasks such as linguistic acceptability ([CoLA](https://www.aclweb.org/anthology/Q19-1040)) and sentiment analysis ([SST-2](https://www.aclweb.org/anthology/D13-1170)); and 3) a text similarity task ([STS-B](https://www.aclweb.org/anthology/S17-2001)). The results on GLUE are presented as follows:

|Tasks| MNLI | QNLI | QQP | SST-2 | CoLA | MRPC | RTE | STS-B | AVG |
| :--------| :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
|Metrics| ACC | ACC | ACC | ACC | MCC | ACC | ACC | PCC | AVG |
| XLNet |86.8|91.7|91.4|94.7|60.2|88.2|74.0|89.5|84.5|
| RoBERTa |87.6|92.8|91.9|94.8|63.6|90.2|78.7|91.2|86.4|
| ELECTRA |88.8|93.2|91.5|95.2|67.7|89.5|82.7|91.2|87.5|
| UniLMv2 |88.5|**93.5**|91.7|95.1|65.2|**91.8**|81.3|91.0|87.3|
| MPNet |88.5|93.3|91.9|95.4|65.0|91.5|**85.2**|90.9|87.7|
| **ERNIE-Gram** |**89.1**|93.2|**92.2**|**95.6**|**68.6**|90.7|83.8|**91.3**|**88.1**|

Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to a directory `${TASK_DATA_PATH}`. After the dataset is downloaded, run `sh ./utils/glue_data_process.sh $TASK_DATA_PATH` to convert the data format for training. If everything goes well, a folder named `data` will be created containing all the converted data.

### SQuAD benchmark

The Stanford Question Answering (SQuAD) tasks are designed to extract the answer span from a given passage conditioned on the question. We conduct experiments on [SQuAD1.1](https://www.aclweb.org/anthology/D16-1264) and [SQuAD2.0](https://www.aclweb.org/anthology/P18-2124) by adding a classification layer on the sequence outputs of ERNIE-Gram and predicting whether each token is the start or end position of the answer span.
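The setup above (a classification layer over the sequence outputs that scores each token as a possible start or end of the answer span) can be sketched roughly as follows. This is a hedged illustration only, not the repository's fine-tuning code: the encoder call is replaced by a random tensor, and the hidden size of 768 simply matches the released base-size checkpoints.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class SpanHead(nn.Layer):
    """Start/end span-prediction head placed on top of encoder sequence outputs."""

    def __init__(self, hidden_size=768):
        super().__init__()
        # Two logits per token: "is this token the start?" / "is it the end?"
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output):                    # [batch, seq_len, hidden]
        logits = self.classifier(sequence_output)          # [batch, seq_len, 2]
        start_logits, end_logits = paddle.split(logits, 2, axis=-1)
        return (paddle.squeeze(start_logits, axis=-1),     # [batch, seq_len]
                paddle.squeeze(end_logits, axis=-1))

# Random features stand in for ERNIE-Gram sequence outputs in this sketch.
sequence_output = paddle.randn([2, 128, 768])
start_logits, end_logits = SpanHead()(sequence_output)

# Fine-tuning would average two token-level cross-entropy losses against the
# gold start/end positions of the answer span (toy labels below).
start_positions = paddle.to_tensor([5, 17])
end_positions = paddle.to_tensor([9, 20])
loss = (F.cross_entropy(start_logits, start_positions)
        + F.cross_entropy(end_logits, end_positions)) / 2
```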
The results on SQuAD are presented as follows:

| Tasks | SQuADv1 | SQuADv2 |
| :-------------------------------------------------------- | :----------------------------: | :----------------------: |
| Metrics | EM / F1 | EM / F1 |
| RoBERTa |84.6 / 91.5|80.5 / 83.7|
| XLNet |- / - | 80.2 / -|
| ELECTRA |86.8 / - | 80.5 / -|
| MPNet |86.8 / 92.5 | 82.8 / 85.6|
| UniLMv2 |87.1 / 93.1 | 83.3 / 86.1|
| **ERNIE-Gram** |**87.2** / **93.2** | **84.1** / **87.1**|

The preprocessed data for SQuAD can be downloaded from [SQuADv1](https://ernie-github.cdn.bcebos.com/data-SQuADv1.tar.gz) and [SQuADv2](https://ernie-github.cdn.bcebos.com/data-SQuADv2.tar.gz). Please unpack them to `./data`.

The preprocessed data for tasks involving long text can be downloaded from [RACE](https://ernie-github.cdn.bcebos.com/data-RACE.tar.gz), [IMDB](https://ernie-github.cdn.bcebos.com/data-IMDB.tar.gz) and [AG News](https://ernie-github.cdn.bcebos.com/data-AG.tar.gz). Please unpack them to `./data`.

## Usage

### Install PaddlePaddle

This code base has been tested with PaddlePaddle 2.0.0+. You can install PaddlePaddle by following the instructions on [this site](https://www.paddlepaddle.org.cn/install/quick).

### Fine-tuning

Please update `LD_LIBRARY_PATH` with the paths to CUDA, cuDNN and NCCL2 before running ERNIE-Gram. The parameter configurations for the fine-tuning tasks are provided in `./task_conf`, so you can easily run fine-tuning through these configuration files. For example, you can fine-tune the ERNIE-Gram model on RTE with:

```script
TASK="RTE"                      # MNLI, SST-2, CoLA, SQuADv1..., please see ./task_conf
MODEL_PATH="./ernie-gram-160g"  # path to the pre-trained model
sh run.sh ${TASK} ${MODEL_PATH}
```

The training log and evaluation results are written to `log/*job.log.0`. To fine-tune on your own task data, you can refer to the data format we provide when processing your data.

### Employ Dynamic Computation Graph

The ERNIE-Gram-zh code using the dynamic computation graph is more concise and flexible; please refer to [ERNIE-Gram Dygraph](https://github.com/PaddlePaddle/ERNIE/tree/develop/ernie-gram) for usage details.

## Citation

You can cite the paper as follows:

```
@article{xiao2021ernie-gram,
  title={ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding},
  author={Xiao, Dongling and Li, Yukun and Zhang, Han and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2010.12148},
  year={2021}
}
```

## Communication

- [ERNIE homepage](https://wenxin.baidu.com/)
- [GitHub Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, installation issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- QQ discussion group: 958422639 (ERNIE discussion group v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.