megatron.tokenizer.bert_tokenization

Description

Tokenization classes.

Classes

BasicTokenizer([do_lower_case])

Runs basic tokenization (punctuation splitting, lower casing, etc.).

FullTokenizer(vocab_file[, do_lower_case])

Runs end-to-end tokenization; a usage sketch follows this list.

WordpieceTokenizer(vocab[, unk_token, ...])

Runs WordPiece tokenization.
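
A minimal usage sketch for these classes, assuming a standard BERT WordPiece vocabulary file at the placeholder path vocab.txt (not shipped with this module):

    from megatron.tokenizer.bert_tokenization import (
        BasicTokenizer,
        FullTokenizer,
        WordpieceTokenizer,
        load_vocab,
    )

    vocab_file = "vocab.txt"  # placeholder path to a BERT WordPiece vocabulary file

    # FullTokenizer runs basic tokenization followed by WordPiece end to end.
    tokenizer = FullTokenizer(vocab_file, do_lower_case=True)
    tokens = tokenizer.tokenize("Megatron tokenizes text.")
    ids = tokenizer.convert_tokens_to_ids(tokens)

    # The two stages can also be driven separately.
    basic = BasicTokenizer(do_lower_case=True)
    words = basic.tokenize("Punctuation, gets split!")  # cleanup, lower casing, punctuation splitting

    vocab = load_vocab(vocab_file)  # OrderedDict mapping token -> id
    wordpiece = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]")
    subwords = [piece for word in words for piece in wordpiece.tokenize(word)]

    print(tokens, ids, subwords)

FullTokenizer is the usual entry point; BasicTokenizer and WordpieceTokenizer are mainly useful when only one of the two stages is needed.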

Functions

convert_by_vocab(vocab, items)

Converts a sequence of tokens or ids using the given vocab mapping; a combined sketch of these helpers follows this list.

convert_ids_to_tokens(inv_vocab, ids)

Converts a sequence of ids to tokens using the inverse vocab mapping.

convert_to_unicode(text)

Converts text to Unicode (if it's not already), assuming utf-8 input.

convert_tokens_to_ids(vocab, tokens)

Converts a sequence of tokens to ids using the vocab.

load_vocab(vocab_file)

Loads a vocabulary file into a dictionary.

printable_text(text)

Returns text encoded in a way suitable for print or tf.logging.

validate_case_matches_checkpoint(...)

Checks whether the casing config is consistent with the checkpoint name.

whitespace_tokenize(text)

Runs basic whitespace cleaning and splitting on a piece of text.
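
A minimal sketch combining the standalone helpers; the vocabulary path and checkpoint name below are placeholders, not files shipped with the module:

    from megatron.tokenizer.bert_tokenization import (
        convert_by_vocab,
        convert_to_unicode,
        load_vocab,
        printable_text,
        validate_case_matches_checkpoint,
        whitespace_tokenize,
    )

    vocab_file = "vocab.txt"         # placeholder path to a BERT vocabulary file
    vocab = load_vocab(vocab_file)   # OrderedDict mapping token -> id, in file order
    inv_vocab = {i: t for t, i in vocab.items()}

    text = convert_to_unicode(b"hello world")  # bytes or str -> unicode str (utf-8 assumed)
    tokens = whitespace_tokenize(text)         # ["hello", "world"]

    ids = convert_by_vocab(vocab, tokens)      # token -> id lookup
    back = convert_by_vocab(inv_vocab, ids)    # id -> token lookup
    print(printable_text(text), ids, back)

    # Checks that do_lower_case agrees with the (hypothetical) checkpoint name;
    # raises ValueError on a mismatch, e.g. a cased model with do_lower_case=True.
    validate_case_matches_checkpoint(
        do_lower_case=True,
        init_checkpoint="uncased_L-12_H-768_A-12/bert_model.ckpt",
    )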