megatron.tokenizer.bert_tokenization.WordpieceTokenizer#

class megatron.tokenizer.bert_tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)#

Bases: object

Runs WordPiece tokenization.
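
A minimal construction sketch, assuming the vocabulary only needs to support membership tests; the toy vocabulary below is illustrative, not a real BERT vocab file:

from megatron.tokenizer.bert_tokenization import WordpieceTokenizer

# Toy vocabulary (illustrative only); in practice this would be the full
# BERT wordpiece vocabulary loaded from a vocab file.
toy_vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}

# Default arguments written out explicitly; they match the documented signature.
tokenizer = WordpieceTokenizer(vocab=toy_vocab, unk_token="[UNK]",
                               max_input_chars_per_word=200)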

tokenize(text)#

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:

input = "unaffable"
output = ["un", "##aff", "##able"]

Parameters:

text – A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.

Returns:

A list of wordpiece tokens.
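
A usage sketch for tokenize, continuing the toy vocabulary from the construction example above; the fallback to the unknown token for words that cannot be covered by the vocabulary is an assumption inferred from the unk_token parameter, not spelled out in this docstring:

# Greedy longest-match-first splitting against the toy vocabulary.
print(tokenizer.tokenize("unaffable"))  # ['un', '##aff', '##able']

# A word with no matching pieces is assumed to map to the unknown token.
print(tokenizer.tokenize("xyzzy"))      # ['[UNK]']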