# How we used ShareGPT to create our benchmark dataset ## sg_90k_part1_html_cleaned.json ### Download ShareGPT dataset ``` https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json ``` ### Install Fastchat ``` pip install fschat ``` ### Clean data: ``` pip install polyglot pyicu pycld2 python -m fastchat.data.optional_clean --in sg_90k_part1_html_cleaned.json --out sg_90k_part1_html_cleaned_lang.json --keep-lang en ``` ### Extract first prompt ``` python extract_first.py --in-file sg_90k_part1_html_cleaned_lang.json --out-file sg_90k_part1_html_cleaned_lang_first.json ``` ### Sample data ``` python -m fastchat.data.sample --in sg_90k_part1_html_cleaned_lang_first.json --out sg_90k_part1_html_cleaned_lang_first_sampled.json --end 10000 --max-length 10000 ``` ### Sorted data We sort the requests by sequence length, placing the longest sequences first. This approach minimizes the amount of padding required and allows for early detection of out-of-memory. ``` python sort.py --data-dir sg_90k_part1_html_cleaned_lang_first_sampled.json --out-file sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json ``` ## ShareGPT_V3_filtered.json ### Download ShareGPT dataset ``` https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json ``` ### Install Transformers ``` pip install transformers ``` ### Filter conversations with too long prompts/responses, extract first turn, and randomly sample 500 prompts ``` python filter_dataset.py ``` ### Compare the response length distribution of sampled dataset with respect to initial dataset ``` pip install matplotlib numpy python compare_distributions.py ```