megatron.data.ict_dataset.ICTDataset

class megatron.data.ict_dataset.ICTDataset(name, block_dataset, title_dataset, data_prefix, num_epochs, max_num_samples, max_seq_length, query_in_block_prob, seed, use_titles=True, use_one_sent_docs=False, binary_head=False)

Bases: Dataset

Dataset of sentences and the blocks that contain them, used for the inverse cloze task (ICT).

__getitem__(idx)

Get an ICT example: a pseudo-query and the block of text from which it was extracted.
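The idea behind an ICT example can be illustrated with a minimal, standalone sketch. It is not the class's actual implementation: the function name `make_ict_example`, its arguments, and the use of plain Python lists are assumptions made for illustration. One sentence of a block is chosen at random as the pseudo-query, and it is kept in the block with probability `query_in_block_prob` or removed otherwise.

```python
import random


def make_ict_example(sentences, query_in_block_prob, rng):
    """Hypothetical helper: split a block of tokenized sentences
    into a pseudo-query and a flattened context block."""
    # Copy so the caller's sentences are never modified.
    block = [list(sentence) for sentence in sentences]

    # Choose one sentence of the block as the pseudo-query.
    idx = rng.randrange(len(block))

    if rng.random() < query_in_block_prob:
        # Keep the query sentence inside the context.
        query = list(block[idx])
    else:
        # Remove the query sentence from the context.
        query = block.pop(idx)

    # Flatten the remaining sentences into one token sequence.
    context = [token for sentence in block for token in sentence]
    return query, context


sentences = [[1, 2], [3, 4, 5], [6]]
query, context = make_ict_example(sentences, 0.1, random.Random(1234))
```

With `query_in_block_prob=1.0` the context always contains the query sentence; with `0.0` it never does, which forces a retriever to match on meaning rather than on exact token overlap.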

concat_and_pad_tokens(tokens, title=None)

Concatenate the tokens (and the title, if given) with special tokens, then pad the sequence to self.max_seq_length.
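As a rough sketch of what this kind of concatenate-and-pad step looks like, the standalone function below assumes a BERT-style layout (`[CLS] title [SEP] tokens [SEP]`) and also returns a padding mask. The function name `concat_and_pad`, the token IDs 101, 102 and 0, the exact layout, and the mask are illustrative assumptions, not values or behavior confirmed by this page.

```python
def concat_and_pad(tokens, max_seq_length, title=None,
                   cls_id=101, sep_id=102, pad_id=0):
    """Hypothetical helper: add special tokens, then pad to a fixed length.

    The special-token IDs are placeholders; a real tokenizer supplies its own.
    """
    tokens = list(tokens)
    if title is None:
        seq = [cls_id] + tokens + [sep_id]
    else:
        seq = [cls_id] + list(title) + [sep_id] + tokens + [sep_id]

    if len(seq) > max_seq_length:
        raise ValueError("sequence is longer than max_seq_length")

    num_pad = max_seq_length - len(seq)
    # 1 marks a real token, 0 marks padding.
    pad_mask = [1] * len(seq) + [0] * num_pad
    return seq + [pad_id] * num_pad, pad_mask


padded, mask = concat_and_pad([7, 8], max_seq_length=6)
# padded == [101, 7, 8, 102, 0, 0]
# mask   == [1, 1, 1, 1, 0, 0]
```

In this sketch the caller must truncate `tokens` beforehand so that the special tokens still fit: two extra positions without a title, and three plus the title length with one.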

get_block(start_idx, end_idx, doc_idx)

Get the IDs for an evidence block, plus the title of the corresponding document.

get_null_block()

Get an empty block and title. Used in REALM pretraining.