
[Feature]: Support HF-style chat template for multi-modal data in offline chat #17551


Open · 1 task done · Tracked by #4194
DarkLight1337 opened this issue May 1, 2025 · 2 comments · May be fixed by #17919
Labels: feature request (New feature or request), good first issue (Good for newcomers)

Comments

DarkLight1337 (Member) commented May 1, 2025

🚀 The feature, motivation and pitch

Currently, we expect image_url, audio_url, etc. to be inside the messages that are passed to the chat template. We would like to extend this to support image, audio, etc. inputs, just like in HuggingFace Transformers:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]
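(For reference, a minimal sketch of the URL-based format that offline chat accepts today, following the OpenAI Chat Completions convention; the file name example.jpg and the manual base64 step are illustrative:)

import base64

# Encode a local image as a base64 data URL, as currently required.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
            },
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]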

To avoid having to pass multi-modal inputs separately, we propose the following extension:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

This lets us pass multi-modal data such as PIL images to LLM.chat directly without having to encode them into base64 URLs.
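As a rough sketch of what the proposed end-to-end usage could look like (the model name is illustrative, and the {"type": "image", "image": ...} content part is the extension proposed here, not an existing API):

from PIL import Image

from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # illustrative multi-modal model
image = Image.open("example.jpg")  # any PIL image

messages = [
    {
        "role": "user",
        "content": [
            # Proposed: pass the PIL image directly, no base64 encoding.
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

outputs = llm.chat(messages)
print(outputs[0].outputs[0].text)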

Alternatives

No response

Additional context

cc @ywang96 @Isotr0py @hmellor

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
sfeng33 (Contributor) commented May 1, 2025

I can help work on it this week

ywang96 (Member) commented May 1, 2025

@sfeng33 Thank you for the interest!

@DarkLight1337 We should think about how to separate the code paths between the OpenAI-compatible format and the HF-compatible format.

In particular, LLM.chat currently uses the same payload format as online OpenAI-compatible inference, so IMO we should create a new protocol for this particular purpose (maybe HFChatCompletionMessageParam, though keep in mind that this new format supporting PIL images isn't directly compatible with HF either) so that it does not accidentally affect other code paths.

ChatCompletionMessageParam = Union[OpenAIChatCompletionMessageParam,
                                   CustomChatCompletionMessageParam]

Doing so also allows us to support any other message patterns that Hugging Face may introduce later.
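A hypothetical sketch of what such a protocol could look like (HFChatCompletionMessageParam and the content-part TypedDicts are names suggested in this thread, not existing vLLM types; the import assumes the existing params stay in vllm.entrypoints.chat_utils):

from typing import List, Literal, TypedDict, Union

from PIL import Image

from vllm.entrypoints.chat_utils import (CustomChatCompletionMessageParam,
                                         OpenAIChatCompletionMessageParam)

class HFImageContentPart(TypedDict):
    # HF-style image part carrying the data inline instead of via a URL.
    type: Literal["image"]
    image: Image.Image

class HFTextContentPart(TypedDict):
    type: Literal["text"]
    text: str

class HFChatCompletionMessageParam(TypedDict):
    role: str
    content: List[Union[HFImageContentPart, HFTextContentPart]]

# The existing union would then gain a third member:
ChatCompletionMessageParam = Union[OpenAIChatCompletionMessageParam,
                                   CustomChatCompletionMessageParam,
                                   HFChatCompletionMessageParam]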

@DarkLight1337 DarkLight1337 moved this from Todo to In Progress in Multi-modality Core May 2, 2025
@ywang96 ywang96 linked a pull request May 9, 2025 that will close this issue