modelee/bert-small-japanese

language: ja
license: cc-by-sa-4.0
datasets: wikipedia
widget text: 東京大学で[MASK]の研究をしています。

BERT small Japanese

This is a BERT model pretrained on texts in the Japanese language.

The pretraining code is available at retarfi/language-pretraining (https://github.com/retarfi/language-pretraining/tree/v1.0).

Model architecture

The model architecture is the same as BERT small in the original ELECTRA paper (https://arxiv.org/abs/2003.10555): 12 layers, 256-dimensional hidden states, and 4 attention heads.
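
As a rough sanity check on these figures, the sketch below derives the per-head dimension and an approximate encoder parameter count. The layer count, hidden size, head count, and vocabulary size come from this README. The feed-forward width (4 × hidden), the 512-position limit, the 2 segment types, and an embedding width equal to the hidden size are assumptions borrowed from common BERT conventions, so the total is only an estimate and may not match the released checkpoint.

```python
# Figures stated in this README.
NUM_LAYERS = 12
HIDDEN_SIZE = 256
NUM_HEADS = 4
VOCAB_SIZE = 32768

# Assumptions (NOT stated in the README): common BERT conventions.
INTERMEDIATE_SIZE = 4 * HIDDEN_SIZE  # feed-forward width
MAX_POSITIONS = 512                  # learned position embeddings
TYPE_VOCAB_SIZE = 2                  # segment (token type) embeddings


def head_dim(hidden: int, heads: int) -> int:
    """Dimension handled by each attention head."""
    if hidden % heads != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    return hidden // heads


def layer_params(hidden: int, intermediate: int) -> int:
    """Weights and biases of one Transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # scale and bias for two LayerNorms
    return attention + feed_forward + layer_norms


def embedding_params(vocab: int, positions: int, types: int, hidden: int) -> int:
    """Token, position and segment tables plus the embedding LayerNorm."""
    return (vocab + positions + types) * hidden + 2 * hidden


def encoder_params() -> int:
    """Embeddings plus all encoder layers (pooler and MLM head excluded)."""
    emb = embedding_params(VOCAB_SIZE, MAX_POSITIONS, TYPE_VOCAB_SIZE, HIDDEN_SIZE)
    return emb + NUM_LAYERS * layer_params(HIDDEN_SIZE, INTERMEDIATE_SIZE)


if __name__ == "__main__":
    print("head dimension:", head_dim(HIDDEN_SIZE, NUM_HEADS))
    print("approx. encoder parameters:", encoder_params())
```

Under these assumptions the embedding tables account for nearly half of the estimate, which is typical for small models with a large vocabulary.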

Training Data

The models are trained on the Japanese version of Wikipedia.

The training corpus is generated from the Wikipedia dump file as of June 1, 2021.

The corpus file is 2.9 GB and consists of approximately 20M sentences.

Tokenization

The texts are first tokenized by MeCab with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32768.
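
The second stage of this pipeline can be illustrated with a minimal sketch of greedy longest-match-first WordPiece splitting. It assumes the MeCab step has already produced a word list; the segmentation in `WORDS` and the tiny `TOY_VOCAB` are hypothetical stand-ins, not the model's real 32768-entry vocabulary, and the function names are illustrative only.

```python
from typing import Iterable, List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subwords.

    Non-initial pieces carry the "##" continuation prefix. If any part of
    the word cannot be matched, the whole word maps to the unknown token.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(words: Iterable[str], vocab: Set[str]) -> List[str]:
    """Apply WordPiece to each pre-segmented word and flatten the result."""
    tokens: List[str] = []
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Hypothetical word segmentation standing in for MeCab output.
WORDS = ["東京", "大学", "で", "研究", "を", "し", "て", "い", "ます", "。"]
# Hypothetical toy vocabulary; "大学" is deliberately absent as a whole word.
TOY_VOCAB = {"東京", "大", "##学", "で", "研究", "を", "し", "て", "い", "ます", "。"}

if __name__ == "__main__":
    print(tokenize(WORDS, TOY_VOCAB))
```

Because the word-level segmentation happens first, subword pieces never cross word boundaries: "大学" is split into "大" and "##学" on its own, independent of the neighboring words.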

Training

The models are trained with the same configuration as BERT small in the original ELECTRA paper (https://arxiv.org/abs/2003.10555): 128 tokens per instance, 128 instances per batch, and 1.45M training steps.
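
To put the training configuration in perspective, the short calculation below works out how many instances and token positions are processed over the full run. It uses only the three figures stated above; the token count is an upper bound, since it assumes every instance is filled to the full 128 tokens with no padding.

```python
# Training configuration stated in this README.
TOKENS_PER_INSTANCE = 128
INSTANCES_PER_BATCH = 128
TRAINING_STEPS = 1_450_000


def tokens_per_batch() -> int:
    """Token positions in one batch."""
    return TOKENS_PER_INSTANCE * INSTANCES_PER_BATCH


def instances_seen() -> int:
    """Training instances processed over the whole run."""
    return INSTANCES_PER_BATCH * TRAINING_STEPS


def max_tokens_seen() -> int:
    """Upper bound on token positions processed (assumes no padding)."""
    return tokens_per_batch() * TRAINING_STEPS


if __name__ == "__main__":
    print(f"tokens per batch: {tokens_per_batch():,}")
    print(f"instances seen:   {instances_seen():,}")
    print(f"max tokens seen:  {max_tokens_seen():,}")
```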

Citation

@article{Suzuki-etal-2023-ipm,
  title = {Constructing and analyzing domain-specific language model for financial text mining},
  author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
  journal = {Information Processing \& Management},
  volume = {60},
  number = {2},
  pages = {103194},
  year = {2023},
  doi = {10.1016/j.ipm.2022.103194}
}

Licenses

The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number JP21K12010.

