megatron.data.ict_dataset.ICTDataset
- class megatron.data.ict_dataset.ICTDataset(name, block_dataset, title_dataset, data_prefix, num_epochs, max_num_samples, max_seq_length, query_in_block_prob, seed, use_titles=True, use_one_sent_docs=False, binary_head=False)
Bases: Dataset
Dataset containing sentences and their blocks for an inverse cloze task.
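To make the inverse cloze task concrete, here is a minimal sketch of how one example could be formed from a block of sentences. It is not the library's code: `make_ict_example` is a hypothetical helper, and it assumes that `query_in_block_prob` is the probability that the sampled query sentence is left inside its block, as the parameter name suggests.

```python
import random


def make_ict_example(sentences, query_in_block_prob, rng):
    """Split a block of sentences into a pseudo-query and a context block.

    Hypothetical helper for illustration only; not part of megatron.
    `sentences` is a list of token-ID lists, `rng` a random.Random instance.
    """
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to form a query and a block")

    # Pick one sentence at random to act as the pseudo-query.
    query_idx = rng.randrange(len(sentences))
    query = list(sentences[query_idx])

    # Assumption: with probability query_in_block_prob the query sentence
    # stays in the block; otherwise it is removed from the block.
    if rng.random() < query_in_block_prob:
        block = [list(s) for s in sentences]
    else:
        block = [list(s) for i, s in enumerate(sentences) if i != query_idx]

    return query, block
```

Calling `make_ict_example(sentences, 0.1, random.Random(seed))` would, under these assumptions, remove the query sentence from the block in roughly nine out of ten draws, so a retriever cannot rely only on exact token overlap.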
- __getitem__(idx)
Get an ICT example: a pseudo-query and the block of text from which it was extracted.
- concat_and_pad_tokens(tokens, title=None)
Concatenate the tokens with special tokens and pad the sequence to self.max_seq_length.
- get_block(start_idx, end_idx, doc_idx)
Get the IDs for an evidence block plus the title of the corresponding document.
- get_null_block()
Get an empty block and title; used in REALM pretraining.
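The concatenate-and-pad step described for `concat_and_pad_tokens` can be sketched as a standalone function. This is an illustration under stated assumptions, not the method's actual body: the `[CLS] title [SEP] tokens [SEP]` layout, the returned padding mask, and the explicit `cls_id`, `sep_id`, and `pad_id` parameters are all assumptions introduced here.

```python
def concat_and_pad(tokens, max_seq_length, cls_id, sep_id, pad_id, title=None):
    """Wrap token IDs in special tokens and pad them to max_seq_length.

    Hypothetical standalone sketch; the layout and the mask are assumptions.
    Returns (ids, pad_mask), where pad_mask is 1 for real tokens and 0 for padding.
    """
    ids = [cls_id]
    if title is not None:
        # Assumed layout: the title comes first, closed by its own separator.
        ids += list(title) + [sep_id]
    ids += list(tokens) + [sep_id]

    if len(ids) > max_seq_length:
        raise ValueError("sequence with special tokens exceeds max_seq_length")

    num_pad = max_seq_length - len(ids)
    pad_mask = [1] * len(ids) + [0] * num_pad
    return ids + [pad_id] * num_pad, pad_mask
```

For instance, with the illustrative IDs `cls_id=101`, `sep_id=102`, `pad_id=0`, the call `concat_and_pad([7, 8], 8, 101, 102, 0, title=[5])` yields the IDs `[101, 5, 102, 7, 8, 102, 0, 0]` and the mask `[1, 1, 1, 1, 1, 1, 0, 0]`.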