# How to tokenize a dataset?

## Step 1: get the right json format

The training data requires preprocessing. First, place your training data in a loose json format, with one json containing a text sample per line. For example:
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"} {"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}The name of the `text` field of the json can be changed by using the `--json-key` flag in `preprocess_data.py`. The other metadata are optional and are not used in training. ## Step 2: Tokenize The loose json is then processed into a binary format for training. To convert the json into mmap, cached index file, or the lazy loader format use `preprocess_data.py`. Set the `--dataset_impl` flag to `mmap`, `cached`, or `lazy`, respectively (default is `mmap`). An example script to prepare data for Falcon training is:
## Step 2: Tokenize

The loose json is then processed into a binary format for training. To convert the json into the mmap, cached index, or lazy loader format, use `preprocess_data.py`. Set the `--dataset_impl` flag to `mmap`, `cached`, or `lazy`, respectively (default is `mmap`). An example script to prepare data for Falcon training is:

```
python3 tools/preprocess_data.py \
    --input /scratch/dummy-data/train.json \
    --output_prefix wiki-train \
    --dataset_impl mmap \
    --tokenizer_type FalconTokenizer \
    --workers 2 \
    --chunk_size 32 \
    --append_eod
```

The output will be two files named, in this case, `wiki-train_text_document.bin` and `wiki-train_text_document.idx`. The `--data_path` specified in later training is the full path and new filename, but without the file extension.

Other options of `preprocess_data.py`:

```
input data:
  --input INPUT         Path to input JSON
  --json_keys JSON_KEYS [JSON_KEYS ...]
                        space separated list of keys to extract from json
  --split_sentences     Split documents into sentences.
  --keep_newlines       Keep newlines between sentences when splitting.

tokenizer:
  --tokenizer_type {BertWordPieceLowerCase,BertWordPieceCase,GPT2BPETokenizer,SentencePieceTokenizer,FalconTokenizer}
                        What type of tokenizer to use.
  --vocab_file VOCAB_FILE
                        Path to the vocab file
  --merge_file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append_eod          Append an <eod> token to the end of a document.
```
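The same step works with the other tokenizer types listed in the help above. As an illustrative sketch only (the input and model paths are placeholders, and it assumes the SentencePiece model file is what `--vocab_file` expects when `--tokenizer_type` is `SentencePieceTokenizer`), a SentencePiece-based run would look like:

```
python3 tools/preprocess_data.py \
    --input /scratch/dummy-data/train.json \
    --output_prefix wiki-train-sp \
    --dataset_impl mmap \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file /path/to/tokenizer.model \
    --workers 2 \
    --chunk_size 32 \
    --append_eod
```

The output naming follows the same rule as above, so this would produce `wiki-train-sp_text_document.bin` and `wiki-train-sp_text_document.idx`, with the corresponding `--data_path` being that path without the extension.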