Multimodal inputs let a model read more than plain text. NagaAI supports images, files, and audio across multiple generation APIs, but each public surface represents those inputs differently.

Support Matrix

| API | Images | Files / PDFs | Audio input | Notes |
| --- | --- | --- | --- | --- |
| Responses | input_image parts | input_file parts | input_audio parts | Best starting point for new multimodal LLM work |
| Chat Completions | image_url blocks | file blocks | input_audio blocks on supported models | Use for existing OpenAI-style chat clients |
| Messages | Anthropic image blocks | Anthropic document blocks | Model- and provider-dependent | Use for Anthropic-style content-block tooling |
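As a concrete instance of the first row, a Responses-style user message carries typed parts inside a message item's content array. The sketch below builds that payload shape only; the image URL is a placeholder and actual support depends on the model you select.

```python
# Responses-style message: typed parts inside a "message" item's content[].
# The URL below is a placeholder, not a working asset.
message = {
    "type": "message",
    "role": "user",
    "content": [
        {"type": "input_text", "text": "Describe this chart."},
        {"type": "input_image", "image_url": "https://example.com/chart.png"},
    ],
}

# With a valid key this payload would be sent as:
# client.responses.create(model="gpt-4.1", input=[message])
```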

Shape At A Glance

Responses
  input[]
    message
      content[]
        input_text
        input_image
        input_file
        input_audio

Chat Completions
  messages[]
    content[]
      text
      image_url
      file
      input_audio

Messages
  messages[]
    content[]
      text
      image
      document
The main difference is not just which input types are supported, but how each API nests those inputs inside the request payload.
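To make that nesting difference concrete, the sketch below expresses the same user turn (text plus one image) as a Chat Completions message and as an Anthropic-style Messages turn. The URL and base64 stub are placeholders, not working assets.

```python
# Chat Completions: an "image_url" block, with the URL nested one level deeper.
chat_turn = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this receipt total to?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://example.com/receipt.png"},
        },
    ],
}

# Messages (Anthropic-style): an "image" block with an explicit source object.
messages_turn = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this receipt total to?"},
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": "<base64-encoded image bytes>",
            },
        },
    ],
}
```

Same intent, two shapes: Chat Completions wraps the URL in an object under `image_url`, while Messages requires a `source` object that spells out the encoding and media type.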

When To Use It

  • image or PDF analysis inside an LLM workflow
  • ask-and-answer over screenshots, receipts, or scanned documents
  • audio understanding inside a conversational model flow
The Responses example below pairs an input_text instruction with an input_file part that points at a PDF by URL:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.naga.ac/v1",
    api_key="YOUR_API_KEY",
)

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "Summarize the key obligations in this policy PDF.",
                },
                {
                    "type": "input_file",
                    "file_url": "https://example.com/policy.pdf",
                },
            ],
        }
    ],
)

print(response.output_text)
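Audio input follows the same pattern. The sketch below builds a Chat Completions message with an input_audio block carrying base64-encoded audio; the audio bytes are a stub, and the commented model name is an assumption since audio input is only available on supported models.

```python
import base64

# In real use you would read and encode a local file, e.g.:
# audio_b64 = base64.b64encode(open("meeting.wav", "rb").read()).decode()
audio_b64 = base64.b64encode(b"<fake wav bytes for illustration>").decode()

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize what is said in this clip."},
        {
            "type": "input_audio",
            "input_audio": {"data": audio_b64, "format": "wav"},
        },
    ],
}

# With a valid key and an audio-capable model this would be sent as:
# client.chat.completions.create(model="<audio-capable model>", messages=[message])
```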

Important Boundaries

  • Responses accepts typed multimodal parts, but support still depends on the selected model
  • use the direct Images API for image generation and image edits
  • use the direct Audio API for transcription, translation, and text-to-speech

Common Pitfalls

  • assuming endpoint support automatically means model support
  • using a direct generation API when you actually need multimodal understanding inside an LLM turn
  • relying on opaque file references when a page documents inline URLs or data payloads instead