megatron.tokenizer.gpt2_tokenization.GPT2Tokenizer
- class megatron.tokenizer.gpt2_tokenization.GPT2Tokenizer(vocab_file, merges_file, errors='replace', special_tokens=None, max_len=None)
Bases: object
- GPT-2 BPE tokenizer. Peculiarity: byte-level BPE (the base vocabulary covers all 256 byte values, so any input string can be encoded without unknown tokens).
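A minimal construction sketch, assuming GPT-2's vocab.json and merges.txt are already on disk (the paths below are illustrative):

```python
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer

# Illustrative local paths: vocab.json maps token strings to ids,
# merges.txt lists the learned BPE merge rules.
tokenizer = GPT2Tokenizer(
    vocab_file="/path/to/vocab.json",
    merges_file="/path/to/merges.txt",
    errors="replace",  # how to handle bytes that fail UTF-8 decoding
)
```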
- convert_ids_to_tokens(ids, skip_special_tokens=False)
Converts a sequence of ids into BPE tokens using the vocab.
- convert_tokens_to_ids(tokens)
Converts a sequence of tokens into ids using the vocab.
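A round-trip sketch for the two conversion methods above. The tokenizer setup and the token strings are illustrative; in byte-level BPE a leading space is encoded as the printable marker 'Ġ':

```python
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # assumes the 'gpt2' shortcut name resolves

tokens = ["Hello", "Ġworld"]                   # illustrative BPE token strings
ids = tokenizer.convert_tokens_to_ids(tokens)  # e.g. [15496, 995]
assert tokenizer.convert_ids_to_tokens(ids) == tokens
```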
- classmethod from_pretrained(pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs)
Instantiate a GPT2Tokenizer from pre-trained vocabulary files. Download and cache the files if needed.
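A usage sketch; it assumes the 'gpt2' shortcut name is present in the pretrained archive map, and cache_dir is optional:

```python
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer

# Downloads the vocabulary and merges files on first use, then
# reuses the cached copies (the cache_dir path is illustrative).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="/tmp/gpt2_cache")
```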
- save_vocabulary(vocab_path)
Save the tokenizer vocabulary and merge files to a directory.
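A sketch of saving the tokenizer for later reuse; the target directory is illustrative, and the exact file names written are an assumption based on the standard GPT-2 layout:

```python
import os
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # assumed shortcut name

save_dir = "/tmp/gpt2_tokenizer"  # illustrative path
os.makedirs(save_dir, exist_ok=True)
# Writes the vocabulary, merges, and any special-token files into save_dir.
tokenizer.save_vocabulary(save_dir)
```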
- set_special_tokens(special_tokens)
Add a list of additional tokens to the encoder. The additional tokens are assigned ids immediately after the current vocabulary, in the order of the special_tokens list.
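A sketch using hypothetical special tokens; the new ids start right after the existing vocabulary (50257 entries for standard GPT-2), in list order:

```python
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # assumed shortcut name

# '<SEP>' and '<PAD>' are hypothetical tokens; with the standard
# 50257-entry vocabulary they would receive ids 50257 and 50258.
tokenizer.set_special_tokens(["<SEP>", "<PAD>"])
```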
- tokenize(text)
Tokenize a string.
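A tokenization sketch; note that tokenize returns byte-level BPE token strings, not ids (a leading space appears as the printable marker 'Ġ'):

```python
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # assumed shortcut name

tokens = tokenizer.tokenize("Hello world")  # e.g. ['Hello', 'Ġworld']
```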