megatron.data.t5_dataset.build_training_sample#

megatron.data.t5_dataset.build_training_sample(sample, target_seq_length, max_seq_length, max_seq_length_dec, vocab_id_list, vocab_id_to_token_dict, cls_id, sep_id, mask_id, pad_id, masked_lm_prob, np_rng, bos_id=None, eos_id=None, sentinel_tokens=None)#

Build a single training sample (encoder and decoder token sequences) for T5 pretraining.

Parameters:
  • sample – A list of sentences in which each sentence is a list of token ids.

  • target_seq_length – Desired sequence length.

  • max_seq_length – Maximum length of the encoder sequence. All values are padded to this length.

  • max_seq_length_dec – Maximum length of the decoder sequence. Decoder values are padded to this length.

  • vocab_id_list – List of vocabulary ids. Used to pick a random id.

  • vocab_id_to_token_dict – A dictionary from vocab ids to text tokens.

  • cls_id – Start of example id.

  • sep_id – Separator id.

  • mask_id – Mask token id.

  • pad_id – Padding token id.

  • masked_lm_prob – Probability to mask tokens.

  • np_rng – Random number generator. Note that this rng state should be numpy and not python, since python's randint is inclusive of the upper bound whereas numpy's is exclusive.

  • bos_id – Start of decoder example id.

  • eos_id – End of generation id.

  • sentinel_tokens – Sentinel token ids; a unique sentinel is substituted for each replaced span.
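The parameters above can be illustrated with a minimal sketch of T5-style span corruption. This is a hypothetical simplification, not the actual Megatron implementation (the helper name `build_training_sample_sketch` and the exact masking strategy are assumptions); it shows how masked spans are replaced by unique sentinels on the encoder side while the decoder reconstructs them:

```python
import numpy as np

def build_training_sample_sketch(sample, max_seq_length, max_seq_length_dec,
                                 pad_id, masked_lm_prob, sentinel_tokens,
                                 np_rng, cls_id, sep_id, bos_id, eos_id):
    """Hypothetical sketch of T5 span corruption (not Megatron's code)."""
    # Flatten the sentences into one token stream, leaving room for the
    # cls/sep tokens on the encoder side.
    tokens = [t for sentence in sample for t in sentence]
    tokens = tokens[: max_seq_length - 2]

    # Choose positions to mask. np_rng is a NumPy RNG: unlike Python's
    # random.randint, NumPy treats upper bounds as exclusive.
    n_to_mask = max(1, int(round(len(tokens) * masked_lm_prob)))
    masked = set(np_rng.choice(len(tokens), n_to_mask, replace=False))

    enc, dec_in, dec_out = [cls_id], [bos_id], []
    sentinels = iter(sentinel_tokens)
    i = 0
    while i < len(tokens):
        if i in masked:
            # Each contiguous masked span becomes one unique sentinel on
            # the encoder side; the decoder emits the sentinel followed
            # by the original span tokens.
            sentinel = next(sentinels)
            enc.append(sentinel)
            dec_in.append(sentinel)
            dec_out.append(sentinel)
            while i < len(tokens) and i in masked:
                dec_in.append(tokens[i])
                dec_out.append(tokens[i])
                i += 1
        else:
            enc.append(tokens[i])
            i += 1
    enc.append(sep_id)
    dec_out.append(eos_id)  # decoder labels end with the end-of-generation id

    # Pad all sequences to their fixed lengths.
    enc += [pad_id] * (max_seq_length - len(enc))
    dec_in += [pad_id] * (max_seq_length_dec - len(dec_in))
    dec_out += [pad_id] * (max_seq_length_dec - len(dec_out))
    return np.array(enc), np.array(dec_in), np.array(dec_out)

# Example (all ids are made up): enc has length max_seq_length, and
# dec_in/dec_out have length max_seq_length_dec, shifted by one token
# for teacher forcing.
rng = np.random.RandomState(1234)
enc, dec_in, dec_out = build_training_sample_sketch(
    sample=[[10, 11, 12], [13, 14, 15]],
    max_seq_length=12, max_seq_length_dec=8,
    pad_id=0, masked_lm_prob=0.3,
    sentinel_tokens=[100, 101, 102], np_rng=rng,
    cls_id=1, sep_id=2, bos_id=3, eos_id=4)
```

Note how `dec_in` starts with `bos_id` and `dec_out` ends with `eos_id`, giving the usual one-token shift between decoder inputs and labels.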