megatron.data.biencoder_dataset_utils.get_block_samples_mapping
- megatron.data.biencoder_dataset_utils.get_block_samples_mapping(block_dataset, title_dataset, data_prefix, num_epochs, max_num_samples, max_seq_length, seed, name, use_one_sent_docs=False)
Get the samples mapping for a dataset divided into fixed-size blocks. This function also requires a dataset of the source documents' titles, because title lengths must be taken into account when sizing each block.
- Returns:
samples_mapping (BlockSamplesMapping)
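
To illustrate why title lengths matter when forming fixed-size blocks, here is a minimal pure-Python sketch. It is not the library's implementation: the function `toy_block_mapping`, its inputs, and the `(start_idx, end_idx, doc_idx, block_idx)` tuple layout are hypothetical names chosen for this example, and the greedy packing rule is an assumption. It only shows the general idea of packing consecutive sentences into a block so that the block plus its document title fits within `max_seq_length`.

```python
from typing import List, Tuple


def toy_block_mapping(
    doc_sentence_lengths: List[List[int]],
    title_lengths: List[int],
    max_seq_length: int,
    use_one_sent_docs: bool = False,
) -> List[Tuple[int, int, int, int]]:
    """Hypothetical sketch: greedily pack sentences into blocks.

    Returns (start_idx, end_idx, doc_idx, block_idx) tuples, where
    start_idx/end_idx are global sentence indices and end_idx is exclusive.
    """
    mapping: List[Tuple[int, int, int, int]] = []
    sent_offset = 0  # global index of the first sentence of the current document
    min_sents = 1 if use_one_sent_docs else 2  # assumption: skip one-sentence docs by default

    for doc_idx, (sent_lens, title_len) in enumerate(
        zip(doc_sentence_lengths, title_lengths)
    ):
        # The title shares the sequence with the block, so it reduces the budget.
        budget = max_seq_length - title_len
        if len(sent_lens) >= min_sents:
            start = sent_offset
            used = 0
            for i, n_tokens in enumerate(sent_lens):
                idx = sent_offset + i
                # Close the current block if adding this sentence would overflow.
                if used > 0 and used + n_tokens > budget:
                    mapping.append((start, idx, doc_idx, len(mapping)))
                    start = idx
                    used = 0
                used += n_tokens
            # Emit the final (possibly partial) block of the document.
            if used > 0:
                mapping.append(
                    (start, sent_offset + len(sent_lens), doc_idx, len(mapping))
                )
        sent_offset += len(sent_lens)
    return mapping


# Two documents: three 4-token sentences (title of 2 tokens),
# and a single 5-token sentence (title of 1 token).
docs = [[4, 4, 4], [5]]
titles = [2, 1]
mapping = toy_block_mapping(docs, titles, max_seq_length=10)
print(mapping)
```

In this toy run the first document has a budget of 8 tokens per block after its title is subtracted, so its three sentences split into two blocks; the second document is skipped unless `use_one_sent_docs=True` is passed.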