megatron.tokenizer.bert_tokenization.WordpieceTokenizer#
- class megatron.tokenizer.bert_tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)#
Bases:
object
Runs WordPiece tokenization.
- tokenize(text)#
Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary (a simplified sketch of this matching loop is shown after this entry).
- For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
- Parameters:
text – A single token or whitespace-separated tokens. This should have already been passed through `BasicTokenizer`.
- Returns:
A list of wordpiece tokens.
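A minimal usage sketch, assuming the Megatron-LM package is importable and that `vocab` may be any container supporting membership tests (e.g. a set or a token-to-id dict); the toy vocabulary below is purely illustrative:

```python
from megatron.tokenizer.bert_tokenization import WordpieceTokenizer

# Hypothetical toy vocabulary for illustration only.
vocab = {"[UNK]", "un", "##aff", "##able"}
tokenizer = WordpieceTokenizer(vocab=vocab)

print(tokenizer.tokenize("unaffable"))  # ['un', '##aff', '##able']
print(tokenizer.tokenize("xyzzy"))      # ['[UNK]'] -- no piece matches, falls back to unk_token
```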
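The greedy longest-match-first behaviour referenced above can be illustrated with the simplified sketch below. This is not the library's exact implementation; it only assumes `vocab` supports `in` lookups and that continuation pieces carry the `##` prefix:

```python
def greedy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Illustrative greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        # Scan from the longest remaining substring down to a single character.
        end = len(word)
        cur_piece = None
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # continuation pieces carry the ## prefix
            if substr in vocab:
                cur_piece = substr
                break
            end -= 1
        if cur_piece is None:
            return [unk_token]  # no piece matched: emit the unknown token
        pieces.append(cur_piece)
        start = end
    return pieces
```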