megatron.data.biencoder_dataset_utils.BlockSampleData

class megatron.data.biencoder_dataset_utils.BlockSampleData(start_idx, end_idx, doc_idx, block_idx)

Bases: object

A struct that fully describes a fixed-size block of data, as used in REALM.

Parameters:
  • start_idx – index of the first sentence of the block

  • end_idx – index for the last sentence of the block (which may be partially truncated during sample construction)

  • doc_idx – the index of the document from which the block comes in the original indexed dataset

  • block_idx – a unique integer identifier given to every block.
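
As an illustration of how the four fields fit together, here is a minimal sketch. It does not import Megatron: `BlockSampleSketch` and `sentences_for` are hypothetical names for a stand-in that only mirrors the documented constructor arguments. The sketch also assumes that `end_idx` acts as an exclusive bound when slicing, which this page does not state, so check the actual dataset code before relying on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockSampleSketch:
    """Hypothetical stand-in holding the four documented fields."""
    start_idx: int  # index of the first sentence of the block
    end_idx: int    # index for the last sentence of the block
    doc_idx: int    # index of the source document in the indexed dataset
    block_idx: int  # unique integer identifier of the block


def sentences_for(block: BlockSampleSketch, sentences: List[str]) -> List[str]:
    # Assumption: end_idx is treated as an exclusive bound in this sketch.
    return sentences[block.start_idx:block.end_idx]


# Toy sentence store standing in for an indexed dataset.
sentences = ["s0", "s1", "s2", "s3", "s4"]
block = BlockSampleSketch(start_idx=1, end_idx=4, doc_idx=0, block_idx=7)
selected = sentences_for(block, sentences)
```

Keeping `doc_idx` and `block_idx` next to the sentence range means a block can be traced back to its document and referenced by a stable identifier without storing the text itself.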