megatron.tokenizer.gpt2_tokenization.GPT2Tokenizer

class megatron.tokenizer.gpt2_tokenization.GPT2Tokenizer(vocab_file, merges_file, errors='replace', special_tokens=None, max_len=None)

Bases: object

GPT-2 BPE tokenizer. Peculiarities:
  • Byte-level BPE: input text is encoded to UTF-8 bytes before BPE merges are applied, so any string can be tokenized without unknown-token fallbacks (see the construction sketch below).
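A minimal construction sketch, assuming a GPT-2 style vocab.json / merges.txt pair is already on disk; the file paths below are illustrative, not files this class ships with:

```python
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer

# Illustrative paths; point these at a GPT-2 style vocabulary and
# merges file obtained separately.
tokenizer = GPT2Tokenizer(
    vocab_file="gpt2-vocab.json",
    merges_file="gpt2-merges.txt",
    errors="replace",  # how UTF-8 decoding errors are handled when decoding bytes
    max_len=1024,      # optional cap on sequence length
)
```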

convert_ids_to_tokens(ids, skip_special_tokens=False)

Converts a sequence of ids into BPE tokens using the vocab.
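A short sketch of the id-to-token direction; the ids below are illustrative placeholders:

```python
# Map ids (e.g. from convert_tokens_to_ids or model output) back to
# BPE token strings. With skip_special_tokens=True, ids registered
# via set_special_tokens are dropped from the result.
tokens = tokenizer.convert_ids_to_tokens([15496, 995], skip_special_tokens=True)
```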

convert_tokens_to_ids(tokens)

Converts a sequence of tokens into ids using the vocab.
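The reverse direction, sketched under the same assumptions:

```python
# Tokenize first, then look each BPE token up in the vocab.
tokens = tokenizer.tokenize("Hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)  # list of ints, one per token
```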

classmethod from_pretrained(pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs)

Instantiate a GPT2Tokenizer from pre-trained vocabulary files. Download and cache the vocabulary files if needed.
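A typical invocation, assuming 'gpt2' is among the shortcut names this class can resolve and download; otherwise pass a path to a local directory containing the vocabulary files:

```python
tokenizer = GPT2Tokenizer.from_pretrained(
    "gpt2",                       # shortcut name or local directory path
    cache_dir="/tmp/gpt2_cache",  # optional download cache location
)
```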

save_vocabulary(vocab_path)

Save the tokenizer vocabulary and merge files to a directory.
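A sketch of saving to a directory; treating the return value as the written file paths is an assumption here:

```python
import os

out_dir = "./my_tokenizer"
os.makedirs(out_dir, exist_ok=True)  # target must be an existing directory
# Presumed to return the paths of the written vocabulary and merges files.
saved = tokenizer.save_vocabulary(out_dir)
```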

set_special_tokens(special_tokens)

Add a list of additional tokens to the encoder. Each new token is assigned an id starting from the current vocabulary size, in the order of the special_tokens list.
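A brief sketch; the token strings below are hypothetical additions, not part of the base GPT-2 vocabulary:

```python
# New tokens receive ids len(vocab), len(vocab) + 1, ... in list order.
tokenizer.set_special_tokens(["<pad>", "<sep>"])
pad_id = tokenizer.convert_tokens_to_ids(["<pad>"])[0]
```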

tokenize(text)

Tokenize a string.
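A final sketch; the exact token strings produced depend on the loaded merges file:

```python
tokens = tokenizer.tokenize("Megatron-LM trains large language models.")
# Byte-level BPE yields sub-word strings; a leading space is folded
# into the following token (commonly rendered with a 'Ġ' prefix).
ids = tokenizer.convert_tokens_to_ids(tokens)
```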