megatron.data.realm_dataset_utils.get_block_samples_mapping

megatron.data.realm_dataset_utils.get_block_samples_mapping(block_dataset, title_dataset, data_prefix, num_epochs, max_num_samples, max_seq_length, seed, name, use_one_sent_docs=False)

Get the samples mapping for a dataset over fixed-size blocks. This function also requires a dataset of the source documents' titles, since the title lengths must be taken into account.

Returns:

samples_mapping (BlockSamplesMapping)
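
To illustrate why title lengths matter for a mapping over fixed-size blocks, the following is a minimal pure-Python sketch. It is not Megatron's implementation: the function name `toy_block_mapping`, the greedy packing rule, the tuple layout `(start, end, doc_index, block_index)`, and the default of three reserved special tokens are all assumptions made for this example. The idea shown is only that each block's token budget shrinks by the length of its document's title.

```python
from typing import List, Tuple


def toy_block_mapping(
    doc_sentence_lengths: List[List[int]],
    title_lengths: List[int],
    max_seq_length: int,
    num_special_tokens: int = 3,  # assumed reserve for special tokens
) -> List[Tuple[int, int, int, int]]:
    """Greedily pack consecutive sentences of each document into blocks.

    Returns (start, end, doc_index, block_index) tuples, where start/end are
    global sentence indices and end is exclusive. Hypothetical layout.
    """
    mapping = []
    sentence_offset = 0  # global index of the current document's first sentence
    block_index = 0
    for doc_index, (sent_lens, title_len) in enumerate(
        zip(doc_sentence_lengths, title_lengths)
    ):
        # The title is counted against every block of its document.
        budget = max_seq_length - title_len - num_special_tokens
        start, used = 0, 0
        for i, length in enumerate(sent_lens):
            # Close the current block if this sentence would overflow it.
            # A single oversized sentence still gets a block of its own.
            if used > 0 and used + length > budget:
                mapping.append(
                    (sentence_offset + start, sentence_offset + i, doc_index, block_index)
                )
                block_index += 1
                start, used = i, 0
            used += length
        if used > 0:  # flush the last, possibly partial, block
            mapping.append(
                (sentence_offset + start, sentence_offset + len(sent_lens),
                 doc_index, block_index)
            )
            block_index += 1
        sentence_offset += len(sent_lens)
    return mapping


# Two documents with made-up sentence and title lengths (in tokens).
docs = [[10, 12, 8], [20, 5]]
print(toy_block_mapping(docs, title_lengths=[4, 6], max_seq_length=32))
print(toy_block_mapping(docs, title_lengths=[0, 0], max_seq_length=32))
```

In this toy data, the second document fits in a single block when its title is ignored, but splits into two blocks once the six title tokens are counted, which is the effect the description above refers to.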