megatron.data.bert_dataset.build_training_sample

megatron.data.bert_dataset.build_training_sample(sample, target_seq_length, max_seq_length, vocab_id_list, vocab_id_to_token_dict, cls_id, sep_id, mask_id, pad_id, masked_lm_prob, np_rng, binary_head)

Build a training sample.

Parameters:

sample – A list of sentences in which each sentence is a list of token ids.
target_seq_length – Desired sequence length.
max_seq_length – Maximum length of the sequence. All values are padded to this length.
vocab_id_list – List of vocabulary ids. Used to pick a random id.
vocab_id_to_token_dict – A dictionary from vocab ids to text tokens.
cls_id – Start of example id.
sep_id – Separator id.
mask_id – Mask token id.
pad_id – Padding token id.
masked_lm_prob – Probability of masking a token.
np_rng – Random number generator. Note that this must be a NumPy RNG and not Python's, since Python's randint is inclusive of the upper bound whereas NumPy's is exclusive.
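
The upper-bound difference noted for np_rng can be checked with a minimal sketch. It is independent of Megatron: the seed, the bounds, and the names `py_rng`, `low`, `high`, and `n_draws` are illustrative assumptions, and `np.random.RandomState` is used here only as one common way to create a NumPy generator, not necessarily the one the caller passes in.

```python
import random

import numpy as np

# Seeded generators so the demonstration is reproducible.
# The seed value is arbitrary (an assumption for this sketch).
np_rng = np.random.RandomState(seed=1234)
py_rng = random.Random(1234)

low, high = 0, 3
n_draws = 10_000

# NumPy: randint(low, high) samples from [low, high), so `high` is never drawn.
np_draws = np_rng.randint(low, high, size=n_draws)

# Python: randint(low, high) samples from [low, high], so `high` can be drawn.
py_draws = [py_rng.randint(low, high) for _ in range(n_draws)]

# Collect the distinct values each generator produced.
np_values = sorted(set(np_draws.tolist()))
py_values = sorted(set(py_draws))

print("numpy  randint values:", np_values)
print("python randint values:", py_values)
```

Code that indexes into a list with `rng.randint(0, len(items))` is therefore safe with a NumPy generator but can go out of range with Python's `random` module, which is why the two are not interchangeable here.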