japanese-gpt2-medium-unidic

This is a medium-sized Japanese GPT-2 model that uses a BERT-like tokenizer.

A reversed version is published here.

How to use

The model depends on PyTorch, fugashi with unidic-lite, and Hugging Face Transformers.

pip install torch torchvision torchaudio
pip install "fugashi[unidic-lite]"
pip install transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

text = '今日はいい天気なので、'

bos = tokenizer.convert_tokens_to_ids(['[BOS]']) # [32768]
input_ids = bos + tokenizer.encode(text)[1:-1] # [CLS] and [SEP] generated by BERT Tokenizer are removed
input_ids = torch.tensor(input_ids).unsqueeze(0)
output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=30,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    num_return_sequences=1,
    pad_token_id=0,
    eos_token_id=32769,
)[0]

print(tokenizer.decode(output))
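
The ID handling in the example above can be isolated into two small helpers. This is a minimal sketch, not part of the model's API: the names `build_prompt_ids` and `strip_generated` are hypothetical, and the IDs 32768 ([BOS]) and 32769 (EOS) are taken from the example above.

```python
BOS_ID = 32768  # id of [BOS], as noted in the example above
EOS_ID = 32769  # eos_token_id passed to generate() in the example above


def build_prompt_ids(encoded, bos_id=BOS_ID):
    """Drop the first and last ids ([CLS] and [SEP]) and prepend [BOS]."""
    return [bos_id] + list(encoded[1:-1])


def strip_generated(ids, bos_id=BOS_ID, eos_id=EOS_ID):
    """Remove a leading [BOS] and everything from the first EOS onward."""
    ids = list(ids)
    if ids and ids[0] == bos_id:
        ids = ids[1:]
    if eos_id in ids:
        ids = ids[:ids.index(eos_id)]
    return ids


# Made-up ids for illustration: 2 and 3 stand in for [CLS] and [SEP].
prompt_ids = build_prompt_ids([2, 100, 200, 3])
clean_ids = strip_generated([32768, 100, 200, 300, 32769, 0, 0])
print(prompt_ids, clean_ids)
```

With the real tokenizer, `tokenizer.decode(strip_generated(output.tolist()))` would decode only the text between [BOS] and the first EOS.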

Model architecture

Transformer-based Language Model

Layers: 24
Heads: 16
Dimensions of hidden states: 1024
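
As a rough cross-check of these figures, the number of weights can be estimated from the layer count, hidden size, and vocabulary size. This is a sketch under stated assumptions: the standard GPT-2 block layout (4x MLP width, biases on all linear layers, two layer norms per block), tied input and output embeddings, and a context length of 1024, which this card does not state. The result is an estimate and need not match a reported parameter count exactly.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions=1024):
    """Estimate the weight count of a standard GPT-2 style model.

    n_positions=1024 is an assumption; the card does not give the context length.
    """
    # Attention: fused QKV projection plus output projection (weights + biases).
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4*d_model, then project back (weights + biases).
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two layer norms per block, each with a scale and a bias vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attn + mlp + layer_norms
    # Token and position embeddings; the output layer is assumed tied to the token embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    final_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_norm


head_dim = 1024 // 16  # hidden size divided by the number of heads
total = gpt2_param_count(n_layer=24, d_model=1024, vocab_size=32771)
print(head_dim, total)  # 64 336917504
```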

Training

We used a codebase provided by rinna Co., Ltd. for training.

The model was trained on Japanese CC-100 and Japanese Wikipedia (2022/01/31). Training took 17 days on 8 A100 GPUs. The perplexity on the validation set is 9.80.
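
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes the reported value is a per-token perplexity computed with natural logarithms, which the card does not specify; the function name `perplexity` is illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform4 = [math.log(0.25)] * 8
ppl = perplexity(uniform4)

# Under the assumption above, a perplexity of 9.80 corresponds to a
# mean cross-entropy loss of ln(9.80), roughly 2.28 nats per token.
loss = math.log(9.80)
print(ppl, loss)
```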

Tokenization

Our tokenizer is based on the one provided by Tohoku NLP Group. Text is first segmented into words by MeCab and then split into subwords by WordPiece.
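
The second stage of this pipeline can be illustrated with a toy greedy longest-match WordPiece splitter. This is a sketch only: the vocabulary below is made up, the word list stands in for MeCab's output, and the real tokenizer should be loaded through `AutoTokenizer` as shown in the usage section.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary and pre-segmented words (standing in for MeCab output).
toy_vocab = {"今日", "は", "いい", "天気", "な", "##ので"}
words = ["今日", "は", "いい", "天気", "なので"]
tokens = [piece for w in words for piece in wordpiece(w, toy_vocab)]
print(tokens)  # ['今日', 'は', 'いい', '天気', 'な', '##ので']
```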

The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token).

License

Creative Commons Attribution-ShareAlike 4.0


Model size (Safetensors): 0.4B params
Tensor type: F32
