megatron.data.t5_dataset.build_training_sample#
- megatron.data.t5_dataset.build_training_sample(sample, target_seq_length, max_seq_length, max_seq_length_dec, vocab_id_list, vocab_id_to_token_dict, cls_id, sep_id, mask_id, pad_id, masked_lm_prob, np_rng, bos_id=None, eos_id=None, sentinel_tokens=None)#
- Build training sample. 
- Parameters:
- sample – A list of sentences in which each sentence is a list of token ids. 
- target_seq_length – Desired sequence length. 
- max_seq_length – Maximum length of the sequence. All values are padded to this length. 
- max_seq_length_dec – Maximum length of the decoder sequence. Decoder values are padded to this length. 
- vocab_id_list – List of vocabulary ids. Used to pick a random id. 
- vocab_id_to_token_dict – A dictionary from vocab ids to text tokens. 
- cls_id – Start of example id. 
- sep_id – Separator id. 
- mask_id – Mask token id. 
- pad_id – Padding token id. 
- masked_lm_prob – Probability to mask tokens. 
- np_rng – Random number generator. Note that this rng state should be numpy and not Python, since Python's randint is inclusive of the upper bound whereas numpy's is exclusive. 
- bos_id – Start of decoder example id. 
- eos_id – End of generation id. 
- sentinel_tokens – Unique value to be substituted for every replaced span. 
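The note on np_rng is worth seeing concretely. A minimal sketch (standalone, not using megatron itself) contrasting the two randint conventions the docstring warns about:

```python
import random
import numpy as np

# Python's random.randint(a, b) includes b; numpy's randint(a, b) excludes b.
py_rng = random.Random(0)
np_rng = np.random.RandomState(seed=0)

# Draw many samples from each generator over the "same" range [0, 4].
py_draws = {py_rng.randint(0, 4) for _ in range(1000)}
np_draws = {int(np_rng.randint(0, 4)) for _ in range(1000)}

print(max(py_draws))  # 4 — upper bound included
print(max(np_draws))  # 3 — upper bound excluded
```

This is why passing a Python rng where a numpy one is expected (or vice versa) silently shifts the sampling range by one, e.g. when picking a random vocab id from vocab_id_list.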