Multimodal inputs let a model read more than plain text. NagaAI supports images, files, and audio across multiple generation APIs, but each public surface represents those inputs differently.

Support Matrix

| API | Images | Files / PDFs | Audio input | Notes |
| --- | --- | --- | --- | --- |
| Responses | input_image parts | input_file parts | input_audio parts | Best starting point for new multimodal LLM work |
| Chat Completions | image_url blocks | file blocks | input_audio blocks on supported models | Use for existing OpenAI-style chat clients |
| Messages | Anthropic image blocks | Anthropic document blocks | Model- and provider-dependent | Use for Anthropic-style content-block tooling |
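As a concrete instance of the first row, a Responses-style user message carries typed parts inside a message item's content array. The sketch below builds that payload shape only; the image URL is a placeholder and actual support depends on the model you select.

```python
# Responses-style message: typed parts inside a "message" item's content[].
# The URL below is a placeholder, not a working asset.
message = {
    "type": "message",
    "role": "user",
    "content": [
        {"type": "input_text", "text": "Describe this chart."},
        {"type": "input_image", "image_url": "https://example.com/chart.png"},
    ],
}

# With a valid key this payload would be sent as:
# client.responses.create(model="gpt-4.1", input=[message])
```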

Shape At A Glance

Responses
  input[]
    message
      content[]
        input_text
        input_image
        input_file
        input_audio

Chat Completions
  messages[]
    content[]
      text
      image_url
      file
      input_audio

Messages
  messages[]
    content[]
      text
      image
      document
The main difference is not just which input types are supported, but how each API nests those inputs inside the request payload.
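To make that nesting difference concrete, the sketch below expresses the same user turn (text plus one image) as a Chat Completions message and as an Anthropic-style Messages turn. The URL and base64 stub are placeholders, not working assets.

```python
# Chat Completions: an "image_url" block, with the URL nested one level deeper.
chat_turn = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this receipt total to?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://example.com/receipt.png"},
        },
    ],
}

# Messages (Anthropic-style): an "image" block with an explicit source object.
messages_turn = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this receipt total to?"},
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": "<base64-encoded image bytes>",
            },
        },
    ],
}
```

Same intent, two shapes: Chat Completions wraps the URL in an object under `image_url`, while Messages requires a `source` object that spells out the encoding and media type.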

When To Use It

  • image or PDF analysis inside an LLM workflow
  • ask-and-answer over screenshots, receipts, or scanned documents
  • audio understanding inside a conversational model flow
The Responses example below pairs an input_text instruction with an input_file part that points at a PDF by URL:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.naga.ac/v1",
    api_key="YOUR_API_KEY",
)

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "Summarize the key obligations in this policy PDF.",
                },
                {
                    "type": "input_file",
                    "file_url": "https://example.com/policy.pdf",
                },
            ],
        }
    ],
)

print(response.output_text)
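Audio input follows the same pattern. The sketch below builds a Chat Completions message with an input_audio block carrying base64-encoded audio; the audio bytes are a stub, and the commented model name is an assumption since audio input is only available on supported models.

```python
import base64

# In real use you would read and encode a local file, e.g.:
# audio_b64 = base64.b64encode(open("meeting.wav", "rb").read()).decode()
audio_b64 = base64.b64encode(b"<fake wav bytes for illustration>").decode()

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize what is said in this clip."},
        {
            "type": "input_audio",
            "input_audio": {"data": audio_b64, "format": "wav"},
        },
    ],
}

# With a valid key and an audio-capable model this would be sent as:
# client.chat.completions.create(model="<audio-capable model>", messages=[message])
```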

Important Boundaries

  • Responses accepts typed multimodal parts, but support still depends on the selected model
  • use the direct Images API for image generation and image edits
  • use the direct Audio API for transcription, translation, and text-to-speech

Common Pitfalls

  • assuming endpoint support automatically means model support
  • using a direct generation API when you actually need multimodal understanding inside an LLM turn
  • relying on opaque file references when a page documents inline URLs or data payloads instead