megatron.tokenizer.bert_tokenization#
Description
Tokenization classes.
Classes
BasicTokenizer | Runs basic tokenization (punctuation splitting, lower casing, etc.).
FullTokenizer | Runs end-to-end tokenization.
WordpieceTokenizer | Runs WordPiece tokenization.
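The WordPiece tokenization these classes perform is a greedy longest-match-first lookup against the vocabulary. Below is a minimal standalone sketch of that algorithm, not the module's actual code; the function name `wordpiece_tokenize`, the `toy_vocab`, and the `max_chars` limit are illustrative assumptions.

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]", max_chars=200):
    """Greedy longest-match-first WordPiece sketch: repeatedly take the
    longest vocab entry that prefixes the remaining characters, marking
    non-initial pieces with a '##' continuation prefix."""
    if len(token) > max_chars:
        return [unk_token]
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        cur = None
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub  # continuation piece
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:  # no piece matched: the whole token is unknown
            return [unk_token]
        pieces.append(cur)
        start = end
    return pieces

# Illustrative vocabulary; a real vocab would come from a vocab file.
toy_vocab = {"un", "##aff", "##able", "aff"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
```

Greedy longest-match means "unaffable" splits into `un / ##aff / ##able` rather than many single characters, and any token with no matching piece collapses to the unknown marker.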
Functions
convert_by_vocab | Converts a sequence of [tokens|ids] using the vocab.
convert_ids_to_tokens | Converts a sequence of ids to tokens using the inverse vocab.
convert_to_unicode | Converts text to Unicode (if it's not already), assuming utf-8 input.
convert_tokens_to_ids | Converts a sequence of tokens to ids using the vocab.
load_vocab | Loads a vocabulary file into a dictionary.
printable_text | Returns text encoded in a way suitable for print or tf.logging.
validate_case_matches_checkpoint | Checks whether the casing config is consistent with the checkpoint name.
whitespace_tokenize | Runs basic whitespace cleaning and splitting on a piece of text.
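The vocabulary-loading and conversion helpers above form the plumbing behind the tokenizer classes. The following is a hedged sketch of how they typically behave, not the module's exact implementation; the `_sketch` suffixes and the temporary sample vocab file are assumptions made for a self-contained example.

```python
import collections
import os
import tempfile

def whitespace_tokenize_sketch(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    return text.split()

def load_vocab_sketch(vocab_file):
    """Loads a vocabulary file (one token per line) into an ordered
    dictionary mapping token -> index."""
    vocab = collections.OrderedDict()
    with open(vocab_file, encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab

def convert_by_vocab_sketch(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab: pass the
    token->id mapping to get ids, or its inverse to get tokens back."""
    return [vocab[item] for item in items]

# Illustrative usage with a throwaway vocab file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("[PAD]\n[UNK]\nhello\nworld\n")
    path = f.name
vocab = load_vocab_sketch(path)
tokens = whitespace_tokenize_sketch("  hello world ")
ids = convert_by_vocab_sketch(vocab, tokens)
print(ids)  # [2, 3]
os.remove(path)
```

Because `convert_by_vocab` only indexes into the mapping it is given, the same routine serves both directions, which is why `convert_tokens_to_ids` and `convert_ids_to_tokens` can be thin wrappers around it.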