CV/NLP Pitfalls Memo

Mostly personal notes on things that went wrong in my image recognition and natural language processing research, and how I worked around them.

Reloading a model trained on custom data for a second training run [Flair]

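I got stuck when running two or more rounds of training on my own data, so here is a summary.

While training an NER model on my own data, I wanted the second training run to start from the model produced by the first.

However, I could not find how to do this in the official documentation, so I worked it out myself.

In short: load the first run's model with SequenceTagger.load() and pass it to the trainer as the tagger.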
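The same decision (start fresh on the first run, reload on later runs) can be wrapped in one small function. This is only a sketch: `load_or_create` is a hypothetical helper name, not part of Flair, and the loader and factory are passed in as arguments so the example runs without Flair installed.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_create(checkpoint: Path, load: Callable[[str], T], create: Callable[[], T]) -> T:
    """Return load(checkpoint) if the checkpoint file exists, otherwise create()."""
    if checkpoint.is_file():
        # A previous run left a model behind: continue from it.
        return load(str(checkpoint))
    # No checkpoint yet: this is the first run, so build a fresh model.
    return create()


if __name__ == "__main__":
    # Stand-in loader and factory so the sketch runs without Flair installed.
    model = load_or_create(
        Path("no-such-dir/best-model.pt"),
        load=lambda path: f"loaded from {path}",
        create=lambda: "fresh model",
    )
    print(model)
```

With Flair, `load` would be `SequenceTagger.load` and `create` would be a function that builds the `SequenceTagger` the way step 5 of the first run does.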

The code below carries the model trained in the first run over into the second.

from pathlib import Path
from typing import List

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import CharacterEmbeddings, StackedEmbeddings, TokenEmbeddings
from flair.models import SequenceTagger

def loadCorpus(data_folder):
    # define columns
    columns = {0: "text", 1: "ner"}

    # init a corpus using column format, data folder and the names of the train, dev and test files
    corpus: Corpus = ColumnCorpus(
        data_folder,
        columns,
        train_file="train.tsv",
        test_file="test.tsv",
        dev_file="devel.tsv",
    )
    return corpus

# First training run

datapath = "/root/data/input/my-ner-data/"

# 1. get the corpus
corpus: Corpus = loadCorpus(datapath)
print(corpus)

# 2. what tag do we want to predict?
tag_type = "ner"

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

# 4. initialize embeddings
embedding_objects: List[TokenEmbeddings] = []

embedding_objects.append(CharacterEmbeddings())
    
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_objects)

# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
    use_crf=True,
)

# 6. initialize trainer
from flair.trainers import ModelTrainer

resultpath = "/root/output/flair_test"

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
resultpath = Path(resultpath) / "tagger_results" / "char"
trainer.train(
    str(resultpath),
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=5,
    patience=5 
)

# Second training run

best_model = "/root/output/flair_test/tagger_results/char/best-model.pt"

# 1. get the corpus
corpus: Corpus = loadCorpus(datapath)
print(corpus)

# 2. what tag do we want to predict?
tag_type = "ner"

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

# 4. initialize embeddings
embedding_objects: List[TokenEmbeddings] = []

embedding_objects.append(CharacterEmbeddings())
    
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_objects)

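# 5. initialize sequence tagger
# In the first run the tagger was built from scratch like this:
# tagger: SequenceTagger = SequenceTagger(
#     hidden_size=256,
#     embeddings=embeddings,
#     tag_dictionary=tag_dictionary,
#     tag_type=tag_type,
#     use_crf=True,
# )

# For the second run, use the model loaded with SequenceTagger.load as the tagger.
# The loaded model carries its own embeddings and tag dictionary, so the objects
# built in steps 3 and 4 above are not used here.
tagger = SequenceTagger.load(best_model)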

# 6. initialize trainer
from flair.trainers import ModelTrainer

resultpath = "/root/output/flair_test"

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

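# 7. start training
# Note: this is the same output directory as the first run, so the first run's
# best-model.pt can be overwritten; use a different directory to keep both.
resultpath = Path(resultpath) / "tagger_results" / "char"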
trainer.train(
    str(resultpath),
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=5,
    patience=5 
)
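Both runs above write to the same directory, so the second run can replace the checkpoint it started from. One way to keep them apart is a separate output directory per run. The sketch below is only an illustration: `run_output_dir` and the `run<N>` naming scheme are assumptions of mine, not part of the original script or of Flair.

```python
from pathlib import Path


def run_output_dir(base: str, run: int) -> Path:
    """Build a per-run output directory such as <base>/tagger_results/char/run2."""
    if run < 1:
        raise ValueError("run must be >= 1")
    return Path(base) / "tagger_results" / "char" / f"run{run}"


first_dir = run_output_dir("/root/output/flair_test", 1)
second_dir = run_output_dir("/root/output/flair_test", 2)

# The second run would load its starting model from the first run's directory
# and write its own results into second_dir.
print(first_dir / "best-model.pt")
print(second_dir)
```

In the script above, this would mean passing `str(first_dir)` to the first `trainer.train(...)` call, loading `first_dir / "best-model.pt"` with `SequenceTagger.load`, and passing `str(second_dir)` to the second call.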