Here, we provide our pipeline for generating the KodCode dataset. Please ensure you are in the pipeline
folder when running the commands.
To generate synthetic questions, we first need to put seed questions/snippets/docs in the ../seeds folder.
Then, we can run the following command to generate questions. Available modes are leetcode, algorithm, data_structure, package, apps, codeforces, code_contests, taco, docs, and prefill.
python step1.1_gen_questions.py --total_prompts [total_prompts] --mode [mode]
Example: We take leetcode as an example and generate 100 questions:
python step1.1_gen_questions.py --total_prompts 100 --mode leetcode
You can now find the generated questions in the ../demo/KodCode_leetcode_100_1741214688 folder, where leetcode is the mode, 100 is the number of questions generated, and 1741214688 is the timestamp that distinguishes different runs.
The file name in this example is KodCode_seeds2questions_leetcode_100_1741214688.jsonl. The seeds2questions part indicates that this file is used for generating questions from seeds.
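The naming scheme described above can be taken apart programmatically, which is handy when scripting over several runs. The following is a minimal sketch, not part of the pipeline; the helper name parse_run_name is hypothetical, and it assumes the folder name always ends in two numeric fields (question count, then timestamp).

```python
import re

# Pattern for run folder names of the form KodCode_<mode>_<count>_<timestamp>.
# The mode itself may contain underscores (e.g. data_structure), so the two
# trailing numeric fields are anchored at the end of the name.
RUN_NAME = re.compile(r"^KodCode_(?P<mode>.+)_(?P<total>\d+)_(?P<timestamp>\d+)$")


def parse_run_name(name: str) -> dict:
    """Split a run folder name into mode, question count, and timestamp."""
    match = RUN_NAME.match(name)
    if match is None:
        raise ValueError(f"not a KodCode run folder name: {name!r}")
    return {
        "mode": match["mode"],
        "total_prompts": int(match["total"]),
        "timestamp": int(match["timestamp"]),
    }


if __name__ == "__main__":
    print(parse_run_name("KodCode_leetcode_100_1741214688"))
    # {'mode': 'leetcode', 'total_prompts': 100, 'timestamp': 1741214688}
```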
We then call LLMs to generate questions from the seeds. You can choose to use either GPT-4o (set llm to gpt) or open models (set llm to open_model). Currently, we support the vllm, huggingface, and together engines for open models.
bash step1.2_completion.sh [input_file] [llm] [model_name(optional)]
Example: We use open models to generate questions from the seeds:
bash step1.2_completion.sh ../demo/KodCode_leetcode_100_1741214688/KodCode_seeds2questions_leetcode_100_1741214688.jsonl open_model
You can now find the output file, KodCode_seeds2questions_leetcode_100_1741214688_results.jsonl, in the input folder. The file name is the same as the input file with _results appended, indicating that the file contains the completion results of the LLM generation.
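Before moving on, it can be useful to confirm that the results file contains the expected number of records. The sketch below only relies on the file being JSONL (one JSON value per line); it makes no assumption about the field names the pipeline writes and simply reports whichever keys it finds. The function name summarize_jsonl is hypothetical.

```python
import json


def summarize_jsonl(path):
    """Return (number of records, sorted list of keys seen) for a JSONL file."""
    count = 0
    keys = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record)
    return count, sorted(keys)


# Usage (path taken from the example above):
# summarize_jsonl("../demo/KodCode_leetcode_100_1741214688/"
#                 "KodCode_seeds2questions_leetcode_100_1741214688_results.jsonl")
```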
To process and sanitize the generated questions, simply run the following command.
python step1.3_proccess_and_sanitize.py --input_file [file_name]
Example: We filter and deduplicate the questions generated in Step 1.2.
python step1.3_proccess_and_sanitize.py --input_file ../demo/KodCode_leetcode_100_1741214688/KodCode_seeds2questions_leetcode_100_1741214688_results.jsonl
You can now find the output file, KodCode_questions2sv_leetcode_100_1741214688_sanitized_prepared.jsonl, in the input folder. We use _prepared to indicate that this file is ready for LLM completion. The questions2sv part indicates that this file is used for generating solutions and tests from questions.
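To give a feel for what deduplication means at this stage, here is a minimal sketch of exact-match deduplication after whitespace and case normalization. This is an illustration only and not necessarily the method step1.3 uses; the record field name "question" is a hypothetical placeholder.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(records, key="question"):
    """Keep the first record for each normalized value of `key`."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = normalize(record[key])
        if fingerprint in seen:
            continue  # a normalized duplicate was already kept
        seen.add(fingerprint)
        unique.append(record)
    return unique
```

Exact matching is cheap but misses paraphrases; a near-duplicate filter (for example, based on embeddings or n-gram overlap) would catch more at a higher cost.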
After you get the filtered instructions, you can run the following command to generate solutions and tests. Here, num_trials is the number of attempts for each question.
bash step2.1_completion.sh [file_name] [num_trials] [llm] [model_name(optional)]
Example: We use GPT-4o to generate solutions and tests. We set num_trials as 3.
bash step2.1_completion.sh ../demo/KodCode_leetcode_100_1741214688/KodCode_questions2sv_leetcode_100_1741214688_sanitized_prepared.jsonl 3 gpt
You can now find three new files, KodCode_questions2sv_leetcode_100_1741214688_sanitized_prepared_results{0,1,2}.jsonl, in the input folder.
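The per-trial file names follow directly from the input file name and num_trials. As a small sketch (the helper trial_result_files is hypothetical, and it assumes trial indices start at 0 as in the example above), the expected paths can be listed and checked like this:

```python
from pathlib import Path


def trial_result_files(prepared_file, num_trials):
    """List the expected result files, one per trial, next to the input file."""
    path = Path(prepared_file)
    return [
        path.with_name(f"{path.stem}_results{i}{path.suffix}")
        for i in range(num_trials)
    ]


def missing_trials(prepared_file, num_trials):
    """Return the expected result files that do not exist on disk yet."""
    return [p for p in trial_result_files(prepared_file, num_trials) if not p.exists()]
```

Calling missing_trials before Step 2.2 gives an early warning if one of the completion runs did not finish.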
This step will generate unit tests for each solution. The input folder should contain the trials of solutions and tests; the script automatically finds all files from Step 2.1.
python step2.2_gen_unit_tests.py --input_folder [folder_name]
A folder whose name starts with unit_test_ will be generated, which contains the executable unit tests for each solution.
Example: We use the folder generated from Step 2.1.
python step2.2_gen_unit_tests.py --input_folder ../demo/KodCode_leetcode_100_1741214688
This step will run all the tests and generate the results. By default, we use parallel to run the tests.
Option 1: local environment
bash step2.3_run_all_tests.sh [unit_test_folder_name]
Option 2: docker environment (recommended)
docker run --gpus all -it --rm \
--entrypoint bash \
-v $(pwd)/..:/app \
-w /app/pipeline \
zcxu/kodcode-test-environment:python3.10-cuda12.4-v0.1 \
-c "bash step2.3_run_all_tests.sh [unit_test_folder_name]"
Example: We use the unit test folder generated from Step 2.2 and run all tests using the local environment.
bash step2.3_run_all_tests.sh ../demo/KodCode_leetcode_100_1741214688/self_verification_KodCode_leetcode_100_1741214688
This step will generate verified triplets for each solution.
python step2.4_gen_verified_triplets.py --unit_test_folder [unit_test_folder_name]
Example: We use the unit test folder generated from Step 2.3.
python step2.4_gen_verified_triplets.py --test_folder ../demo/KodCode_leetcode_100_1741214688/self_verification_KodCode_leetcode_100_1741214688
After this step, you will get the verified question-solution-test triplets. In this example, we will get the file Verified_KodCode_leetcode_100_1741214688.json in the demo folder.
To convert the style of the generated data from instruct to complete, we can run the following command. This will generate a new folder in the ../demo folder, and the name of the folder is Complete_<input_file_name>.
python step3.1.1_style_converter.py --input_file [file_name]
Example: We use the file generated from Step 2.4.
python step3.1.1_style_converter.py --input_file ../demo/Verified_KodCode_leetcode_100_1741214688.json
You can now find the output file, Verified_KodCode_leetcode_100_1741214688_i2c_prepared.json, in the ../demo/Complete_Verified_KodCode_leetcode_100_1741214688 folder.
We can now use GPT-4o to generate the completion questions.
bash step3.1.2_completion.sh [file_name] [llm]
Example: We use GPT-4o to generate the completion questions.
bash step3.1.2_completion.sh ../demo/Complete_Verified_KodCode_leetcode_100_1741214688/Complete_KodCode_leetcode_100_1741214688_i2c_prepared.json gpt
You can now find the output file, Verified_KodCode_leetcode_100_1741214688_i2c_prepared_results.jsonl, in the ../demo/Complete_Verified_KodCode_leetcode_100_1741214688 folder.
This step will organize the data into the format of the post-training data by extracting the completion from the GPT-4o completion results.
python step3.1.3_gen_complete_triplets.py --input_file [file_name]
Example: We use the file generated from Step 3.1.2.
python step3.1.3_gen_complete_triplets.py --input_file ../demo/Complete_Verified_KodCode_leetcode_100_1741214688/Complete_KodCode_leetcode_100_1741214688_i2c_prepared_results.jsonl
You can now find the output file, Verified_KodCode_leetcode_100_1741214688_Complete.json, in the ../demo/Complete_Verified_KodCode_leetcode_100_1741214688 folder.
This step will combine the instruct and complete triplets into a single file. Note that kodcode_exp_name is the name of the experiment, which is used to distinguish different runs.
python step3.2_combine_kodcode.py --kodcode_exp_name [kodcode_exp_name]
Example: We use the files generated from Step 2.4 and Step 3.1.3. Here, the kodcode_exp_name is KodCode_leetcode_100_1741214688.
python step3.2_combine_kodcode.py --kodcode_exp_name KodCode_leetcode_100_1741214688
You can now find the output file, KodCode_leetcode_100_1741214688.json, in the ../demo folder.
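Since every step derives its file names from the experiment name, the main artifacts can be located from kodcode_exp_name alone. The sketch below collects the paths quoted in this README into one mapping; the helper expected_artifacts is hypothetical, and the layout is inferred from the leetcode example rather than read from the scripts.

```python
from pathlib import Path


def expected_artifacts(exp_name, demo_dir="../demo"):
    """Map each pipeline product to the path quoted in this README."""
    demo = Path(demo_dir)
    return {
        # Step 2.4: verified question-solution-test triplets
        "verified": demo / f"Verified_{exp_name}.json",
        # Step 3.1.3: complete-style triplets
        "complete": demo / f"Complete_Verified_{exp_name}" / f"Verified_{exp_name}_Complete.json",
        # Step 3.2: combined instruct + complete file
        "combined": demo / f"{exp_name}.json",
        # Step 3.3.5: SFT data with correctness labels
        "sft": demo / f"SFT_{exp_name}.json",
    }


if __name__ == "__main__":
    for name, path in expected_artifacts("KodCode_leetcode_100_1741214688").items():
        print(f"{name:9s} {path} exists={path.exists()}")
```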
Now we generate the SFT data. We first need to process the data for the completion pipeline. This step will generate a new folder in the ../demo folder, and the name of the folder is SFT_<input_file_name>.
python step3.3.1_proccess_sft.py --input_file [file_name]
Example: We use the file generated from Step 3.2.
python step3.3.1_proccess_sft.py --input_file ../demo/KodCode_leetcode_100_1741214688.json
You can now find the output file, SFT_KodCode_leetcode_100_1741214688/KodCode_leetcode_100_1741214688_prepared.jsonl, in the ../demo folder.
This step will generate the SFT data. Since we are using DeepSeek-R1 with the together engine, we need to set the Together API key in the environment variable TOGETHER_API_KEY. Do not prefix the value with Bearer; provide just the key.
export TOGETHER_API_KEY=[your_together_api_key]
bash step3.3.2_completion_sft.sh [file_name] [num_trials] [llm] [model_name(optional)] [engine(optional)] [max_tokens(optional)]
Example: The following command will generate the SFT data using DeepSeek-R1 with 3 trials.
bash step3.3.2_completion_sft.sh ../demo/SFT_KodCode_leetcode_100_1741214688/SFT_KodCode_leetcode_100_1741214688_prepared.jsonl 3 open_model deepseek-ai/DeepSeek-R1 together 16384
You can now find three files, SFT_KodCode_leetcode_100_1741214688_results{0,1,2}.jsonl, in the ../demo/SFT_KodCode_leetcode_100_1741214688 folder, representing the results of the three trials.
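A misconfigured key is an easy way to lose a long completion run, so it may be worth validating TOGETHER_API_KEY before launching Step 3.3.2. The following is a minimal sketch of such a pre-flight check; the function name check_together_key is hypothetical, and it only checks the two conditions stated above (the variable is set, and it does not start with Bearer).

```python
import os


def check_together_key(env=os.environ):
    """Return the API key, or raise if it is missing or carries a Bearer prefix."""
    key = env.get("TOGETHER_API_KEY", "").strip()
    if not key:
        raise RuntimeError("TOGETHER_API_KEY is not set")
    if key.lower().startswith("bearer"):
        raise RuntimeError("TOGETHER_API_KEY must be the raw key, without a 'Bearer' prefix")
    return key


if __name__ == "__main__":
    check_together_key()
    print("TOGETHER_API_KEY looks well-formed")
```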
This step will generate the unit tests for the SFT data.
python step3.3.3_gen_unit_tests_sft.py --input_folder [folder_name]
Example: We use the folder that contains the SFT data generated from Step 3.3.2 with 3 trials. Since we are using DeepSeek-R1, we set model_nickname to r1 to distinguish the test folder names.
python step3.3.3_gen_unit_tests_sft.py --input_folder ../demo/SFT_KodCode_leetcode_100_1741214688 --model_nickname r1
You can now find the unit test folder, cross_verification_KodCode_leetcode_100_1741214688, in the ../demo/SFT_KodCode_leetcode_100_1741214688 folder.
This step will run all the tests for the SFT data.
bash step3.3.4_run_all_tests_sft.sh [unit_test_folder_name]
Example: We use the unit test folder generated from Step 3.3.3.
bash step3.3.4_run_all_tests_sft.sh ../demo/SFT_KodCode_leetcode_100_1741214688/cross_verification_KodCode_leetcode_100_1741214688
This step will generate the SFT data with correctness labels.
python step3.3.5_gen_sft_datasets.py --input_folder [folder_name]
Example: We use the unit test folder executed in Step 3.3.4.
python step3.3.5_gen_sft_datasets.py --input_folder ../demo/SFT_KodCode_leetcode_100_1741214688/cross_verification_KodCode_leetcode_100_1741214688
The final output file should now appear as SFT_KodCode_leetcode_100_1741214688.json in the ../demo folder. For each question, there is a correctness label named r1_correctness. If the response is correct, the label is True; otherwise, it is False.
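A quick way to sanity-check the final file is to tally the correctness labels. The sketch below is an illustration under stated assumptions: the file is assumed to hold a top-level JSON list of records, and the label is assumed to be either a JSON boolean or the string "True"/"False". Only the label name r1_correctness comes from this README; the function names are hypothetical.

```python
import json


def is_true(value):
    """Accept a JSON boolean or a 'True'/'False' string as the label value."""
    return value is True or str(value).strip().lower() == "true"


def correctness_summary(records, label="r1_correctness"):
    """Count correct, incorrect, and unlabeled records and compute the pass rate."""
    correct = incorrect = unlabeled = 0
    for record in records:
        if label not in record:
            unlabeled += 1
        elif is_true(record[label]):
            correct += 1
        else:
            incorrect += 1
    labeled = correct + incorrect
    return {
        "correct": correct,
        "incorrect": incorrect,
        "unlabeled": unlabeled,
        "pass_rate": correct / labeled if labeled else None,
    }


# Usage, assuming the file is a JSON list of records:
# with open("../demo/SFT_KodCode_leetcode_100_1741214688.json", encoding="utf-8") as fh:
#     print(correctness_summary(json.load(fh)))
```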