Chameleon Plus Fine-Tuning Pipeline

[🏠 Homepage] [🤗 AA-Chameleon-7B-Base Model] [🤗 AA-Chameleon-7B-Plus Model]

Highlights

  • We train the original Chameleon-7B model using a 4.6k subset from laion-art, and acquire the AA-Chameleon-7B-Base model (AA refers to Align-Anything) with image generation capabilities.

  • We then use our text-image interleaved input & output dataset and Align-Anything framework and algorithm to train this model, acquiring a AA-Chameleon-7B-Plus model, which is much better in text-image interleaved i/o task.

Pipeline

Environment setup

Currently, the official Transformer repo does not support Chameleon model with image output (see this PR for more details), so we rely on a certain fork of the repo.

After installing Align-Anything and correctly set up the environment, you can install the forked stable version of the repo by running:

pip install git+https://github.com/htlou/transformers.git@hantao_stable_cham

Pre-Tokenization

As chameleon expands the <image> in the input into image placeholders (<image> * 1024) in the processor, but add the real image tokens inside the forward function, we need to pretokenize the input and merge the real image tokens to the labels before doing the training.

Pretokenization for SFT

Set the template and the input key name correctly in pre_tokenize_example.py

and run:

python pre_tokenize_example.py --model_path $MODEL_PATH --input_path $INPUT_PATH --output_path $OUTPUT_PATH

Replace $MODEL_PATH, $INPUT_PATH and $OUTPUT_PATH with the correct paths.

Pretokenization for SFT, Accelerated

If you have multiple GPUs, you can use the accelerated version of the scriptto speed up the process.

Set the template, the input key name, and the number of processes and GPUs correctly in pre_tokenize_parallel_example.py

and run:

python pre_tokenize_parallel_example.py --model_path $MODEL_PATH --input_path $INPUT_PATH --output_path $OUTPUT_PATH --cache_dir $CACHE_DIR

Replace $MODEL_PATH, $INPUT_PATH, $OUTPUT_PATH and $CACHE_DIR with the correct paths.

Note

  • Despite utilizing GPUs, the primary bottleneck in pre-tokenization is the CPU’s capacity. It is crucial to adjust the number of processes and GPUs based on your CPU’s capabilities to ensure efficient performance.

  • The parallel pre-tokenization process incorporates a caching mechanism to reduce memory load. You should specify a directory for the cache using the --cache_dir option, where the cache data will be stored.

Pretokenization for DPO or RM

If you are dealing with preference dataset (for DPO or RM), set the template, the input key name, and the number of processes and GPUs correctly in preference_tokenize_example.py and run:

python preference_tokenize_example.py --model_path $MODEL_PATH --input_path $INPUT_PATH --output_path $OUTPUT_PATH --cache_dir $CACHE_DIR

Replace $MODEL_PATH, $INPUT_PATH, $OUTPUT_PATH and $CACHE_DIR with the correct paths.

Note

  • By default, we use parallel pre-tokenization to speed up the process here. If you want to fall back to sequential pre-tokenization, you can set the num_processes and num_gpus to 1.

Pretokenization for PPO

If you are dealing with prompt only dataset (for PPO), set the template, the input key name, and the number of processes and GPUs correctly in prompt_only_tokenize_example.py and run:

python prompt_only_tokenize_example.py --model_path $MODEL_PATH --input_path $INPUT_PATH --output_path $OUTPUT_PATH --cache_dir $CACHE_DIR

Replace $MODEL_PATH, $INPUT_PATH, $OUTPUT_PATH and $CACHE_DIR with the correct paths.

Note

  • By default, we use parallel pre-tokenization to speed up the process here. If you want to fall back to sequential pre-tokenization, you can set the num_processes and num_gpus to 1.

Training Model

After pre-tokenizing the dataset, you can start training the model using the following scripts.

Supervised Fine-Tuning

Add a script named sft_text_image_to_text_image.sh under the scripts file like this:

MODEL_NAME_OR_PATH="PKU-Alignment/AA-chameleon-7b-base"
TRAIN_DATASETS=""
PT_NAME=""
OUTPUT_DIR="../outputs/sft_text_image_to_text_image"
export WANDB_API_KEY=""
source ./setup.sh

deepspeed \
    --master_port ${MASTER_PORT} \
    --module align_anything.trainers.text_image_to_text_image.sft \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_datasets ${TRAIN_DATASETS} \
    --train_data_files ${PT_NAME} \
    --output_dir ${OUTPUT_DIR} \
    --train_template ANYTHING_TI2TI \
    --train_split 'train' \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --save_interval 500 \
    --learning_rate 5e-5 \
    --epochs 3 \
    --lr_scheduler_type constant

and set up the correct model path and dataset path, then run:

bash scripts/sft_text_image_to_text_image.sh

Note

  • Supposed your pre-tokenized dataset is stored in /path/to/dataset/dataset_file_name.pt, then the TRAIN_DATASETS should be /path/to/dataset and the PT_NAME should be dataset_file_name.pt.

Direct Preference Optimization

Add a script named dpo_text_image_to_text_image.sh under the scripts file like this:

MODEL_NAME_OR_PATH="PKU-Alignment/AA-chameleon-7b-base"
TRAIN_DATASETS=""
PT_NAME=""
OUTPUT_DIR="../outputs/dpo_text_image_to_text_image"
export WANDB_API_KEY=""
source ./setup.sh

deepspeed \
    --master_port ${MASTER_PORT} \
    --module align_anything.trainers.text_image_to_text_image.dpo \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_datasets ${TRAIN_DATASETS} \
    --train_data_files ${PT_NAME} \
    --output_dir ${OUTPUT_DIR} \
    --train_template ANYTHING_TI2TI \
    --train_split 'train' \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --save_interval 2500 \
    --learning_rate 5e-7 \
    --epochs 3 \
    --lr_scheduler_type cosine

and set up the correct model path and dataset path, then run:

bash scripts/sft_text_image_to_text_image.sh

Note

  • Supposed your pre-tokenized dataset is stored in /path/to/dataset/dataset_file_name.pt, then the TRAIN_DATASETS should be /path/to/dataset and the PT_NAME should be dataset_file_name.pt.

Reward Model

Add a script named rm_text_image_to_text_image.sh under the scripts file like this:

MODEL_NAME_OR_PATH="PKU-Alignment/AA-chameleon-7b-base"
TRAIN_DATASETS=""
TRAIN_PT_NAME=""
EVAL_DATASETS=""
EVAL_PT_NAME=""
OUTPUT_DIR="../outputs/rm_text_image_to_text_image"
export WANDB_API_KEY=""
source ./setup.sh

deepspeed \
    --master_port ${MASTER_PORT} \
    --module align_anything.trainers.text_image_to_text_image.rm \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_datasets ${TRAIN_DATASETS} \
    --output_dir ${OUTPUT_DIR} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --train_template ANYTHING_TI2TI \
    --train_split train \
    --train_data_files ${TRAIN_PT_NAME} \
    --eval_datasets ${EVAL_DATASETS} \
    --eval_data_files ${EVAL_PT_NAME} \
    --eval_template ANYTHING_TI2TI \
    --learning_rate 5e-6 \
    --epochs 3 \
    --lr_scheduler_type cosine \
    --save_interval 2500

and set up the correct model path and dataset path, then run:

bash scripts/rm_text_image_to_text_image.sh

Note

  • Supposed your pre-tokenized dataset is stored in /path/to/dataset/dataset_file_name.pt, then the TRAIN_DATASETS should be /path/to/dataset and the PT_NAME should be dataset_file_name.pt. Same for EVAL_DATASETS and EVAL_PT_NAME.

Proximal Policy Optimization

Add a script named ppo_text_image_to_text_image.sh under the scripts file like this:

ACTOR_MODEL_NAME_OR_PATH="PKU-Alignment/AA-chameleon-7b-base"
CRITIC_MODEL_NAME_OR_PATH=""
REWARD_MODEL_NAME_OR_PATH=""
TRAIN_DATASETS=""
TRAIN_PT_NAME=""
PTX_DATASETS=""
PTX_PT_NAME=""
OUTPUT_DIR="../outputs/ppo_text_image_to_text_image"

source ./setup.sh

deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text_image.ppo \
--actor_model_name_or_path ${ACTOR_MODEL_NAME_OR_PATH} \
--reward_model_name_or_path ${REWARD_MODEL_NAME_OR_PATH} \
--reward_critic_model_name_or_path ${CRITIC_MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--train_template ANYTHING_TI2TI \
--train_data_files ${TRAIN_PT_NAME} \
--ptx_datasets ${PTX_DATASETS} \
--ptx_data_files ${PTX_PT_NAME} \
--ptx_template Llava \
--output_dir ${OUTPUT_DIR}

and set up the correct model path and dataset path, then run:

bash scripts/ppo_text_image_to_text_image.sh

Note

  • The CRITIC_MODEL_NAME_OR_PATH and REWARD_MODEL_NAME_OR_PATH should be the path to your reward model.

  • Supposed your pre-tokenized dataset is stored in /path/to/dataset/dataset_file_name.pt, then the TRAIN_DATASETS should be /path/to/dataset and the TRAIN_PT_NAME should be dataset_file_name.pt. Same for PTX_DATASETS and PTX_PT_NAME.

Model Evaluation

Batch Inference

Currently the batch inference of Chameleon is not integrated into the Align-Anything repo, so we need to use another repo. Here’s a forked (and revised to make it stable) version:

git clone https://github.com/htlou/mmsg_chameleon.git
cd mmsg_chameleon

Then set up the environment using

pip install -e .

After setting up the envrioment, set up the correct paths in scripts/interleaved_gen.sh and then run

bash scripts/interleaved_gen.sh

to do batch inference.

GPT-based Evaluation

Currently the GPT-based evaluation of text-image interleaved messages is not integrated into the Align-Anything repo, so we need to use another repo:

git clone https://github.com/htlou/gpt4_eval.git
cd gpt4_eval

You can set the INPUT_TYPE in the script.sh to interleaved-compare and run:

bash script.sh

to do the evaluation.

Model Performance

We trained the lm_head of Chameleon-7B model using the full data from laisonart, and acquired a AA-Chameleon-7B-base model (AA refers to Align-Anything) with image generation capabilities. Then, we used our text-image interleaved input & output dataset to finetune this model, acquiring a AA-Chameleon-7B-plus, which is much better in text-image interleaved i/o task. The model is trained for 3 epochs with a batch size of 2 and a learning rate of 5e-05 (similar to meta’s setting in their paper).

GPT-4o evaluation

We compared the performance of AA-Chameleon-7B+ with Anole-7b model on a text-image interleaved i/o dataset (with a length of 500). Here’s the results:

win

lose

equal

win rate (not counting equal)

240

175

85

57.8%

Generation Examples

Example #1

Input:

Input text:

Add a vintage filter to this bathroom image to give it a retro look and describe the steps involved in applying the filter.

Input image:

image

Output (before alignment):

Output text (before alignment):

Vintage look. Filters like this are designed to give old, vintage items a new life by giving them a distressed, nostalgic look. This process involves removing small particles from the material to create a rough, textured effect.

Output image (before alignment):

None

Output (after alignment):

Output text (after alignment):

I have applied a vintage filter to the bathroom image to give it a retro look. The process involved adjusting the color temperature to add a slight cool tone, reducing the saturation to give it a more muted appearance, and increasing the contrast to enhance the vintage feel.

Output image (after alignment):

image1

Example #2

Input:

Input text:

Create a detailed recipe for making the pastry shown in the image, including ingredients, preparation steps, and baking instructions.

Input image:

image2

Output (before alignment):

Output text (before alignment):

Absolutely! Here is a detailed recipe for making the pastry in the image:\n\nIngredients:\n\n* 1 cup all-purpose flour\n* 1/2 cup cold unsalted butter, cut into small pieces\n* 1/4 cup ice-cold water\n* 1 egg, lightly beaten\n* 1 tablespoon honey\n* 1/4 teaspoon salt\n

Output image (before alignment):

None

Output (after alignment):

Output text (after alignment):

<image>\n\nCooking Instructions:\n1. Preheat Oven: Preheat your oven to 350\u00b0F (175\u00b0C).\n2. Prepare Filling: In a small bowl, mix well the flour, sugar, eggs, and any other desired ingredients. To make it easier to handle, you can place the bowl of the mixture on a baking tray lined with parchment paper.\n3. Bake Donut: Place the prepared donut on the preheated baking tray. Bake for 8-10 minutes until the donut is golden brown and crispy on top.\n4. Cool and Serve: Allow the donut to cool slightly before serving. You can garnish it with a drizzle of chocolate sauce or caramel sauce if desired.\n\nEnjoy your homemade chocolate-filled donut!

Output image (after alignment): image3

Example #3

Input:

Input text:

Enhance the presentation of these pizzas with gourmet toppings and plating, and describe the steps involved.

Input image: image4

Output (before alignment):

Output text (before alignment):

The steps involved in presenting these homemade pizzas with gourmet toppings and plating them are as follows:nn1. Gather Ingredients:n - Use high-quality, fresh ingredients: chees, vegetables, fruits, and meats.n - Avoid using pre-concealed toppings: Instead, let your creations shine.nn2. Mix Ingredients:n - Mix all ingredients thoroughly but avoid mishandling them.n - Cover the mixing bowl with cling wrap to prevent cross-contamination.nn3. Set Up the Presentation:n - Transfer the mixture to a clean, oven-proof dish and preheat the oven to 400u00b0F (200u00b0C).n - Cut the pizza into appropriate portions and allow them to cool a few minutes.n - Plate the pizzas on a clean, clear surface to showcase their presentation.nn4. Garnish and Presentation:n - Add desired garnishes and toppings to add visual appeal and flavor.n - Secure the plating board with a clear, decorative overlay to protect the food from elements like moisture.nn5. Serve and Enjoy:n - Ensure the pizzas are hot and fresh to avoid any bacterial risks.n - Smooth any uneven edges and serve with a complementary dipping sauce or condiments.nnI’ve described the steps involved in preparing these homemade pizzas and their plating process.<image>

Output image (before alignment): image6

Output (after alignment):

Output text (after alignment):

I have enhanced the presentation of the pizzas by adding gourmet toppings. The pizzas now feature fresh basil leaves, prosciutto, cherry tomatoes, shaved Parmesan cheese, and a drizzle of balsamic glaze. The toppings are arranged artistically on each pizza to create a visually appealing and gourmet presentation.

Output image (after alignment):

image5