megatron.tokenizer.gpt2_tokenization

Description

Tokenization classes for OpenAI GPT.

Classes

GPT2Tokenizer(vocab_file, merges_file[, ...])

GPT-2 BPE tokenizer. Peculiarities: byte-level BPE.

Functions

bytes_to_unicode()

Returns a list of UTF-8 bytes and a corresponding list of Unicode strings.
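
As an illustration of the byte-to-Unicode pairing described above, here is a minimal sketch modeled on the widely used GPT-2 reference approach. The name `byte_to_unicode_table` is hypothetical, and returning the two parallel lists zipped into a dict is an assumption made here for convenience; the module's actual function may differ in detail.

```python
def byte_to_unicode_table():
    """Hypothetical stand-in: map each byte value (0-255) to a one-character string."""
    # Bytes that are already printable, non-whitespace characters keep their own code point.
    printable = (
        list(range(0x21, 0x7F))     # '!' through '~'
        + list(range(0xA1, 0xAD))   # U+00A1 through U+00AC
        + list(range(0xAE, 0x100))  # U+00AE through U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, etc.) are moved to
            # code points from 256 upward so every byte has a visible symbol.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(cp) for cp in code_points)))


table = byte_to_unicode_table()
```

In this sketch every one of the 256 byte values gets a distinct character, so the mapping is reversible; printable ASCII such as `A` maps to itself, while the space byte (32) is remapped to `chr(288)`.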

get_pairs(word)

Returns the set of adjacent symbol pairs in a word.
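
To make "symbol pairs" concrete, the following is a minimal sketch assuming a word is represented as a tuple of symbols (variable-length strings) and that a pair means two adjacent symbols. The name `adjacent_symbol_pairs` is hypothetical and is not the module's own function.

```python
def adjacent_symbol_pairs(word):
    """Hypothetical stand-in: collect every pair of neighbouring symbols in `word`."""
    # zip stops at the shorter input, so words with fewer than two symbols yield an empty set.
    return set(zip(word, word[1:]))


pairs = adjacent_symbol_pairs(("l", "o", "w", "er"))
```

A BPE merge loop would typically rank such pairs against the merges table and join the best-ranked pair first; that step is outside the scope of this sketch.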