---
language: ja
license: cc-by-sa-4.0
datasets:
widget:
---
This is a BERT model pretrained on texts in the Japanese language.
The code for pretraining is available at retarfi/language-pretraining.
The model architecture is the same as that of BERT small in the original ELECTRA paper: 12 layers, 256-dimensional hidden states, and 4 attention heads.
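For reference, these hyperparameters correspond roughly to the following `transformers.BertConfig`. This is a minimal sketch for illustration only: the intermediate (feed-forward) size of 1024 is assumed from the ELECTRA small configuration and is not stated in this card.

```python
from transformers import BertConfig

# Sketch of the architecture described above (BERT small from the ELECTRA paper).
# intermediate_size=1024 is an assumption based on the ELECTRA small setup,
# not a value stated in this card; vocab_size matches the size reported below.
config = BertConfig(
    vocab_size=32768,
    hidden_size=256,         # 256-dimensional hidden states
    num_hidden_layers=12,    # 12 layers
    num_attention_heads=4,   # 4 attention heads
    intermediate_size=1024,  # assumed (ELECTRA small convention)
)
```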
The models are trained on the Japanese version of Wikipedia.
The training corpus is generated from the Japanese version of Wikipedia, using the Wikipedia dump file as of June 1, 2021.
The corpus file is 2.9GB, consisting of approximately 20M sentences.
The texts are first tokenized by MeCab with the IPA dictionary and then split into subwords by the WordPiece algorithm.
The vocabulary size is 32768.
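As a hedged sketch of this tokenization pipeline, `transformers.BertJapaneseTokenizer` can combine MeCab word segmentation with WordPiece subword splitting. The repository id below is a placeholder, and `fugashi` plus `ipadic` are assumed to be installed so that MeCab can use the IPA dictionary.

```python
from transformers import BertJapaneseTokenizer

# Placeholder repository id; substitute the actual model repository.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "your-org/bert-small-japanese",        # hypothetical id
    word_tokenizer_type="mecab",           # word-level segmentation with MeCab
    subword_tokenizer_type="wordpiece",    # subword split with WordPiece
    mecab_kwargs={"mecab_dic": "ipadic"},  # IPA dictionary
)

print(tokenizer.tokenize("日本語のテキストをトークン化する。"))
print(tokenizer.vocab_size)  # expected: 32768
```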
The models are trained with the same configuration as BERT small in the original ELECTRA paper: 128 tokens per instance, 128 instances per batch, and 1.45M training steps.
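A minimal masked-language-modeling usage sketch follows, assuming the pretrained weights are published on the Hugging Face Hub; the repository id is again a placeholder rather than the official one.

```python
from transformers import pipeline

# Placeholder repository id; substitute the actual model repository.
fill_mask = pipeline("fill-mask", model="your-org/bert-small-japanese")

# Predict candidates for the masked token in a Japanese sentence.
for candidate in fill_mask("東京大学で自然言語処理を[MASK]している。"):
    print(candidate["token_str"], round(candidate["score"], 3))
```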
@article{Suzuki-etal-2023-ipm,
  title = {Constructing and analyzing domain-specific language model for financial text mining},
  author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
  journal = {Information Processing \& Management},
  volume = {60},
  number = {2},
  pages = {103194},
  year = {2023},
  doi = {10.1016/j.ipm.2022.103194}
}
The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License.
This work was supported by JSPS KAKENHI Grant Number JP21K12010.