How to tokenize a dataset?

Step 1: Get the right JSON format

The training data requires preprocessing. First, place your training data in a loose JSON format, with one JSON object per line, each containing a text sample. For example:

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

The name of the text field in the JSON can be changed with the --json_keys flag of preprocess_data.py. The other metadata fields are optional and are not used in training.
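
If your raw data is not already in this format, a few lines of Python are enough to produce it. The following is a minimal sketch, assuming the documents are plain-text strings held in a list; the extra "id" field is optional metadata, as noted above.

import json

# Hypothetical in-memory corpus; replace with however you load your raw documents.
documents = [
    "The quick brown fox",
    "jumps over the lazy dog",
]

# Write one JSON object per line (loose JSON / JSON Lines).
with open("train.json", "w", encoding="utf-8") as f:
    for i, text in enumerate(documents):
        record = {"text": text, "id": str(i)}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")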

Step 2: Tokenize

The loose JSON is then processed into a binary format for training. To convert the JSON into the mmap, cached index file, or lazy loader format, use preprocess_data.py. Set the --dataset_impl flag to mmap, cached, or lazy, respectively (the default is mmap). An example script to prepare data for Falcon training is:

python3 tools/preprocess_data.py --input /scratch/dummy-data/train.json \
    --output_prefix wiki-train \
    --dataset_impl mmap \
    --tokenizer_type FalconTokenizer \
    --workers 2 \
    --chunk_size 32 \
    --append_eod

The output will be two files named, in this case, wiki-train_text_document.bin and wiki-train_text_document.idx. The --data_path specified in later training is the full path and new filename, but without the file extension.
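
As a quick sanity check, you can confirm that both files were written and derive the value to pass as --data_path. The sketch below is illustrative and assumes the wiki-train_text_document prefix produced by the example above.

from pathlib import Path

# Prefix written by the example above: <output_prefix>_<json_key>_document
prefix = Path("wiki-train_text_document")

# Both the .bin (token data) and the .idx (index) file must exist.
for suffix in (".bin", ".idx"):
    path = prefix.with_suffix(suffix)
    assert path.exists(), f"missing {path}"
    print(path, path.stat().st_size, "bytes")

# --data_path takes the shared prefix, without any file extension.
print("--data_path", prefix.resolve())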

Other options of preprocess_data.py:

input data:
  --input INPUT         Path to input JSON
  --json_keys JSON_KEYS [JSON_KEYS ...]
                        space separated list of keys to extract from json
  --split_sentences     Split documents into sentences.
  --keep_newlines       Keep newlines between sentences when splitting.

tokenizer:
  --tokenizer_type {BertWordPieceLowerCase,BertWordPieceCase,GPT2BPETokenizer,SentencePieceTokenizer,FalconTokenizer}
                        What type of tokenizer to use.
  --vocab_file VOCAB_FILE
                        Path to the vocab file
  --merge_file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append_eod          Append an <eod> token to the end of a document.
  --lang LANG           Language to use for NLTK-powered sentence splitting.

output data:
  --output_prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset_impl {lazy,cached,mmap}

runtime:
  --workers WORKERS     Number of worker processes to launch
  --chunk_size CHUNK_SIZE
                        Chunk size assigned to each worker process
  --log_interval LOG_INTERVAL
                        Interval between progress updates
  --vocab_extra_ids VOCAB_EXTRA_IDS
  --vocab_extra_ids_list VOCAB_EXTRA_IDS_LIST
                        comma separated list of special vocab ids to add to the tokenizer
  --no_new_tokens       Whether to add special tokens (e.g. CLS, MASK, etc.) in the sentencepiece tokenizer or not
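
If you extract several fields by passing multiple keys to --json_keys, every line of the input file should contain each of those keys. A small pre-flight check along these lines (the file name and key list are placeholders) can catch malformed lines before a long preprocessing run:

import json

# Placeholders: point these at your input file and the keys passed via --json_keys.
input_path = "/scratch/dummy-data/train.json"
json_keys = ["text"]

with open(input_path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)  # raises an error on malformed JSON
        missing = [k for k in json_keys if k not in doc]
        assert not missing, f"line {lineno} is missing keys: {missing}"

print("all lines parse and contain the requested keys")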

If you want to tokenize using the llama tokenizer:

python tools/preprocess_data.py \
        --input=/path/to/data.json \
        --output_prefix=wiki-train \
        --dataset_impl=mmap \
        --tokenizer_type=SentencePieceTokenizer \
        --vocab_file=/path/to/tokenizer.model \
        --workers=2 \
        --chunk_size=32
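
Before launching a long preprocessing job, it can be worth checking that the file passed as --vocab_file is a valid SentencePiece model. A minimal sketch using the sentencepiece package (the model path is a placeholder):

import sentencepiece as spm

# Placeholder path: the same file passed as --vocab_file above.
sp = spm.SentencePieceProcessor(model_file="/path/to/tokenizer.model")

print("vocab size:", sp.get_piece_size())
print("bos/eos ids:", sp.bos_id(), sp.eos_id())

# Round-trip a short sample to confirm the model encodes and decodes text.
ids = sp.encode("The quick brown fox", out_type=int)
print("token ids:", ids)
print("decoded:", sp.decode(ids))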