megatron.tokenizer.bert_tokenization.WordpieceTokenizer#

class megatron.tokenizer.bert_tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)#

Bases: object

Runs WordPiece tokenization.
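
A minimal construction sketch, assuming the vocabulary only needs to support membership tests; the toy vocabulary below is illustrative, not a real BERT vocab file:

from megatron.tokenizer.bert_tokenization import WordpieceTokenizer

# Toy vocabulary (illustrative only); in practice this would be the full
# BERT wordpiece vocabulary loaded from a vocab file.
toy_vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}

# Default arguments written out explicitly; they match the documented signature.
tokenizer = WordpieceTokenizer(vocab=toy_vocab, unk_token="[UNK]",
                               max_input_chars_per_word=200)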

tokenize(text)#

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:

input = "unaffable"
output = ["un", "##aff", "##able"]

Parameters:

text – A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.

Returns:

A list of wordpiece tokens.
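
A usage sketch for tokenize, continuing the toy vocabulary from the construction example above; the fallback to the unknown token for words that cannot be covered by the vocabulary is an assumption inferred from the unk_token parameter, not spelled out in this docstring:

# Greedy longest-match-first splitting against the toy vocabulary.
print(tokenizer.tokenize("unaffable"))  # ['un', '##aff', '##able']

# A word with no matching pieces is assumed to map to the unknown token.
print(tokenizer.tokenize("xyzzy"))      # ['[UNK]']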