megatron.data.dataset_utils

Description

Classes

MaskedLmInstance(index, label)

Functions

build_train_valid_test_datasets(data_prefix, ...)

compile_helper()

Compile helper function at runtime.

create_masked_lm_predictions(tokens, ...[, ...])

Creates the predictions for the masked LM objective.

create_tokens_and_tokentypes(tokens_a, ...)

Merge segments A and B, add [CLS] and [SEP] and build tokentypes.
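As a rough illustration of the merge described above, the sketch below assumes integer token ids. The helper name `merge_segments_sketch` and its four-argument signature are hypothetical; the real function's remaining parameters are elided in this listing and may differ.

```python
def merge_segments_sketch(tokens_a, tokens_b, cls_id, sep_id):
    """Illustrative only: build [CLS] A [SEP] B [SEP] and matching token types."""
    # Segment A, wrapped in [CLS] ... [SEP], gets token type 0.
    tokens = [cls_id] + list(tokens_a) + [sep_id]
    tokentypes = [0] * len(tokens)
    # Segment B and its closing [SEP] get token type 1 (assumed convention).
    if tokens_b:
        tokens += list(tokens_b) + [sep_id]
        tokentypes += [1] * (len(tokens_b) + 1)
    return tokens, tokentypes


tokens, tokentypes = merge_segments_sketch([5, 6], [7], cls_id=101, sep_id=102)
print(tokens)      # [101, 5, 6, 102, 7, 102]
print(tokentypes)  # [0, 0, 0, 0, 1, 1]
```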

get_a_and_b_segments(sample, np_rng)

Divide sample into a and b segments.
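One plausible way to divide a sample is sketched below, assuming that a sample is a list of sentences and that each sentence is a list of token ids. The helper name `split_a_b_sketch`, the use of the standard-library `random.Random` in place of `np_rng`, and the random swap of the two segments are all assumptions made for illustration, not the documented behavior.

```python
import random


def split_a_b_sketch(sample, rng):
    """Illustrative only: split a list of sentences into segments A and B."""
    n_sentences = len(sample)
    if n_sentences < 2:
        raise ValueError("need at least two sentences to form two segments")
    # Pick a split point so that both segments receive at least one sentence.
    a_end = rng.randint(1, n_sentences - 1)
    tokens_a = [tok for sentence in sample[:a_end] for tok in sentence]
    tokens_b = [tok for sentence in sample[a_end:] for tok in sentence]
    # Assumed: swap the segments half of the time and report whether we did.
    swapped = rng.random() < 0.5
    if swapped:
        tokens_a, tokens_b = tokens_b, tokens_a
    return tokens_a, tokens_b, swapped


sample = [[1, 2], [3], [4, 5, 6]]
tokens_a, tokens_b, swapped = split_a_b_sketch(sample, random.Random(0))
print(tokens_a, tokens_b, swapped)
```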

get_datasets_weights_and_num_samples(...)

get_indexed_dataset_(data_prefix, data_impl, ...)

get_samples_mapping(indexed_dataset, ...)

Get a list that maps a sample index to a starting sentence index, an end sentence index, and a length.

get_train_valid_test_split_(splits_string, size)

Get dataset splits from comma or '/' separated string list.
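To make the split-string format concrete, the sketch below turns a string such as `"8,1,1"` or `"90/10"` into cumulative index boundaries over `size` items. The helper name `split_boundaries_sketch`, the padding to three splits, and the way rounding drift is corrected are assumptions; the actual function may handle edge cases differently.

```python
def split_boundaries_sketch(splits_string, size):
    """Illustrative only: parse split weights and return index boundaries."""
    # Accept either ',' or '/' as the separator.
    separator = "," if "," in splits_string else "/"
    weights = [float(part) for part in splits_string.split(separator)]
    # Assumed: always produce three splits (train, valid, test), padding with 0.
    weights = (weights + [0.0, 0.0, 0.0])[:3]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("split weights must sum to a positive number")
    bounds = [0]
    for weight in weights:
        bounds.append(bounds[-1] + int(round(weight / total * size)))
    # Shift the boundaries so the last one lands exactly on `size`.
    drift = bounds[-1] - size
    bounds[1:] = [bound - drift for bound in bounds[1:]]
    return bounds


print(split_boundaries_sketch("8,1,1", 100))  # [0, 80, 90, 100]
print(split_boundaries_sketch("90/10", 10))   # [0, 9, 10, 10]
```

Each consecutive pair of boundaries then delimits one split, for example indices 0 to 79 for training in the first call.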

is_start_piece(piece)

Check if the current word piece is the starting piece (BERT).
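In BERT's WordPiece vocabulary, pieces that continue a word carry a `##` prefix, so a starting piece is one without that prefix. A minimal sketch of such a check, using the hypothetical name `is_start_piece_sketch`, might look like this:

```python
def is_start_piece_sketch(piece):
    """Illustrative only: True if a WordPiece token begins a new word."""
    # Continuation pieces are written with a leading '##' in BERT vocabularies.
    return not piece.startswith("##")


pieces = ["un", "##believ", "##able", "story"]
print([is_start_piece_sketch(piece) for piece in pieces])
# [True, False, False, True]
```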

pad_and_convert_to_numpy(tokens, tokentypes, ...)

Pad sequences and convert them to numpy.
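The padding step can be pictured with the simplified sketch below. It covers only tokens and token types and also returns a padding mask; the helper name `pad_to_numpy_sketch`, its reduced parameter list, and the `int64` dtype are assumptions, since the remaining parameters of the real function are elided in this listing.

```python
import numpy as np


def pad_to_numpy_sketch(tokens, tokentypes, pad_id, max_seq_length):
    """Illustrative only: right-pad two parallel lists and return arrays."""
    num_tokens = len(tokens)
    if len(tokentypes) != num_tokens:
        raise ValueError("tokens and tokentypes must have the same length")
    padding_length = max_seq_length - num_tokens
    if padding_length < 0:
        raise ValueError("sequence is longer than max_seq_length")
    filler = [pad_id] * padding_length
    tokens_np = np.array(list(tokens) + filler, dtype=np.int64)
    tokentypes_np = np.array(list(tokentypes) + filler, dtype=np.int64)
    # 1 marks a real token, 0 marks padding.
    padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, dtype=np.int64)
    return tokens_np, tokentypes_np, padding_mask_np


tokens_np, tokentypes_np, mask_np = pad_to_numpy_sketch(
    [101, 5, 102], [0, 0, 0], pad_id=0, max_seq_length=5
)
print(tokens_np.tolist())  # [101, 5, 102, 0, 0]
print(mask_np.tolist())    # [1, 1, 1, 0, 0]
```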

truncate_segments(tokens_a, tokens_b, len_a, ...)

Truncates a pair of sequences to a maximum sequence length.
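A common way to truncate a pair of sequences is to repeatedly drop one token from whichever sequence is currently longer, choosing at random between its front and its back. The sketch below follows that idea under stated assumptions: the name `truncate_pair_sketch` is hypothetical, lengths are taken from the lists rather than passed as `len_a` and `len_b`, and the standard-library `random.Random` stands in for whatever random generator the real function expects.

```python
import random


def truncate_pair_sketch(tokens_a, tokens_b, max_num_tokens, rng):
    """Illustrative only: shrink two lists in place to fit a total length."""
    if max_num_tokens < 0:
        raise ValueError("max_num_tokens must be non-negative")
    truncated = False
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always trim the longer sequence so the pair stays balanced.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        # Remove from the front or the back with equal probability.
        if rng.random() < 0.5:
            del longer[0]
        else:
            longer.pop()
        truncated = True
    return truncated


tokens_a, tokens_b = list(range(10)), list(range(4))
print(truncate_pair_sketch(tokens_a, tokens_b, 8, random.Random(0)))  # True
print(len(tokens_a), len(tokens_b))  # 4 4
```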