Chat Completions

CometAPI routes Chat Completions to multiple providers — including OpenAI, Claude, and Gemini — through a single OpenAI-compatible interface. Switch between models by changing the model parameter; most OpenAI-compatible SDKs work by setting base_url to https://api.cometapi.com/v1.
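As a rough sketch of what such a request looks like on the wire, the following builds (but does not send) a Chat Completions request using only the Python standard library. The `/chat/completions` path appended to the base URL follows the OpenAI convention and is an assumption here, as is the helper name `build_chat_request`.

```python
import json
import urllib.request

BASE_URL = "https://api.cometapi.com/v1"  # base URL given in the text above


def build_chat_request(api_key, model, messages):
    """Build an unsent POST request for the Chat Completions endpoint.

    Assumption: the path is BASE_URL + "/chat/completions", as in the
    OpenAI API that this interface is compatible with.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_chat_request(
    api_key="<COMETAPI_KEY>",
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Hello!"}],
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Under these assumptions, switching providers only means passing a different `model` value; the URL and headers stay the same.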

Request parameters and response fields can vary significantly between model providers. Check the official documentation for the provider behind the model you use whenever you need the complete parameter list or provider-specific behavior. For example, reasoning_effort only applies to reasoning models (o-series, GPT-5.1+), and some models do not support logprobs or n > 1.

For OpenAI Pro models, o-series reasoning models, and Codex models, use the Responses endpoint instead. These model families have more complete support on the Responses API.

Message roles

Role	Description
`system`	Sets the assistant’s behavior and personality. Placed at the start of the conversation.
`developer`	Replaces `system` for newer models (o1+). Provides instructions the model should follow regardless of user input.
`user`	Messages from the end user.
`assistant`	Previous model responses, used to maintain conversation history.
`tool`	Results from tool/function calls. Must include `tool_call_id` matching the original tool call.

For newer models (GPT-4.1, GPT-5 series, o-series), prefer developer over system for instruction messages. Both work, but developer provides stronger instruction-following behavior.
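The role rules above can be checked before a request is sent. The following is a minimal sketch of such a check; `validate_messages` is a hypothetical helper, not part of any SDK, and it only covers the two rules stated in the table (known roles, and `tool_call_id` on tool messages).

```python
ALLOWED_ROLES = {"system", "developer", "user", "assistant", "tool"}


def validate_messages(messages):
    """Return a list of problems found in a messages array (empty if none)."""
    problems = []
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        # Tool results must point back at the tool call they answer.
        if role == "tool" and not message.get("tool_call_id"):
            problems.append(f"message {index}: tool message lacks tool_call_id")
    return problems


messages = [
    {"role": "developer", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "tool", "content": '{"temp_c": 18}'},  # missing tool_call_id
]
problems = validate_messages(messages)
```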

Send multimodal input

Many models support images and audio alongside text. To send multimodal messages, use the array format for content:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this image"},
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.png",
        "detail": "high"
      }
    }
  ]
}

The detail parameter controls image analysis depth:

low — faster, uses fewer tokens (fixed cost)
high — detailed analysis, more tokens consumed
auto — the model decides (default)
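A small helper can build this message shape and reject unknown `detail` values early. This is a sketch only; `image_message` is a hypothetical name, and the set of accepted values is taken from the list above.

```python
DETAIL_LEVELS = {"low", "high", "auto"}


def image_message(text, image_url, detail="auto"):
    """Build a user message that pairs a text prompt with one image URL."""
    if detail not in DETAIL_LEVELS:
        raise ValueError(f"detail must be one of {sorted(DETAIL_LEVELS)}")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": image_url, "detail": detail},
            },
        ],
    }


message = image_message(
    "Describe this image",
    "https://example.com/image.png",
    detail="low",  # cheaper and faster when fine detail is not needed
)
```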

Stream responses

To receive incremental output, set stream to true. The response is delivered as Server-Sent Events (SSE), where each event contains a chat.completion.chunk object:

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

To include token usage statistics in streaming responses, set stream_options.include_usage to true. The usage data appears in the final chunk before [DONE].
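To illustrate how a client might consume this stream, the sketch below reassembles the text from raw SSE lines. It assumes the lines have already been read from the HTTP response; `collect_stream_text` is a hypothetical helper, and it tolerates chunks whose `choices` array is empty or missing.

```python
import json


def collect_stream_text(sse_lines):
    """Concatenate delta.content from SSE lines until the [DONE] sentinel."""
    parts = []
    finish_reason = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


# The sample events shown above, one SSE line per list entry.
sample = [
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
text, reason = collect_stream_text(sample)
```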

Request structured output

To force the model to return valid JSON matching a specific schema, use response_format:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "result",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "answer": {"type": "string"},
          "confidence": {"type": "number"}
        },
        "required": ["answer", "confidence"],
        "additionalProperties": false
      }
    }
  }
}

JSON Schema mode (json_schema) guarantees the output matches your schema exactly. JSON Object mode (json_object) only guarantees valid JSON — the structure is not enforced.
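On the client side, the returned message content is still a string that has to be parsed. The following sketch parses it and re-checks the shape of the example schema above, which is mainly useful with `json_object` mode, where the structure is not enforced. `parse_result` is a hypothetical helper written for that one schema.

```python
import json


def parse_result(content):
    """Parse message content and check it against the example schema.

    Raises ValueError on invalid JSON (json.JSONDecodeError is a subclass)
    or on a shape that does not match {"answer": str, "confidence": number}.
    """
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"answer", "confidence"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if not isinstance(data["answer"], str):
        raise ValueError("answer must be a string")
    confidence = data["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    return data


result = parse_result('{"answer": "Paris", "confidence": 0.93}')
```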

Call tools and functions

To enable the model to call external functions, provide tool definitions:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

When the model decides to call a tool, the response has finish_reason: "tool_calls", and the message.tool_calls array lists each call's ID, function name, and arguments. Execute the function yourself, then send the result back as a tool message with the matching tool_call_id.
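The round trip can be sketched as follows. The shape of `tool_calls` used here (an `id`, plus a `function` object whose `arguments` field is a JSON-encoded string) follows the OpenAI format and is an assumption for other providers; `get_weather` is a stand-in that returns canned data, and `run_tool_calls` is a hypothetical helper.

```python
import json


def get_weather(location):
    """Hypothetical local implementation; returns canned data."""
    return {"location": location, "temp_c": 18}


# Maps tool names from the request's tool definitions to local callables.
LOCAL_TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message):
    """Execute each requested tool and build the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        # Assumption: arguments arrive as a JSON string, not a dict.
        arguments = json.loads(function["arguments"])
        output = LOCAL_TOOLS[function["name"]](**arguments)
        results.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],  # must match the original call
                "content": json.dumps(output),
            }
        )
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Paris"}',
            },
        }
    ],
}
tool_messages = run_tool_calls(assistant_message)
```

The next request would then contain the earlier messages, the assistant message with its tool calls, and these tool messages, in that order.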

Cross-provider notes

Parameter support across providers

Parameter	OpenAI GPT	Claude (via compat)	Gemini (via compat)
`temperature`	0–2	0–1	0–2
`top_p`	0–1	0–1	0–1
`n`	1–128	1 only	1–8
`stop`	Up to 4	Up to 4	Up to 5
`tools`	✅	✅	✅
`response_format`	✅	✅ (json_schema)	✅
`logprobs`	✅	❌	❌
`reasoning_effort`	o-series, GPT-5.1+	❌	❌ (use `thinking` for Gemini native)
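One way to use this table is to adapt a shared parameter set before routing to a given provider. The sketch below encodes the ranges from the table as plain data; `adapt_params` and the provider keys are hypothetical, clamping and truncating are just one possible policy, and the check does not look at the model name (so it cannot tell whether `reasoning_effort` applies to a specific OpenAI model).

```python
# Ranges copied from the table above: (min, max) for numeric parameters.
PROVIDER_LIMITS = {
    "openai": {"temperature": (0, 2), "n": (1, 128), "max_stop": 4},
    "claude": {"temperature": (0, 1), "n": (1, 1), "max_stop": 4},
    "gemini": {"temperature": (0, 2), "n": (1, 8), "max_stop": 5},
}
UNSUPPORTED = {
    "openai": set(),
    "claude": {"logprobs", "reasoning_effort"},
    "gemini": {"logprobs", "reasoning_effort"},
}


def adapt_params(provider, params):
    """Return a copy of params with unsupported keys dropped and ranges clamped."""
    limits = PROVIDER_LIMITS[provider]
    adapted = {k: v for k, v in params.items() if k not in UNSUPPORTED[provider]}
    for name in ("temperature", "n"):
        if name in adapted:
            low, high = limits[name]
            adapted[name] = min(max(adapted[name], low), high)
    stop = adapted.get("stop")
    if isinstance(stop, list):
        adapted["stop"] = stop[: limits["max_stop"]]  # keep the first N sequences
    return adapted


shared = {"temperature": 1.5, "n": 3, "logprobs": True}
for_claude = adapt_params("claude", shared)
```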

max_tokens vs max_completion_tokens

max_tokens — The legacy parameter. Works with most models but is deprecated for newer OpenAI models.
max_completion_tokens: The recommended parameter for GPT-4.1, GPT-5 series, and o-series models. Required for reasoning models, because the limit covers both visible output tokens and reasoning tokens.

CometAPI automatically handles the mapping when routing to different providers.

system vs developer role

system — The traditional instruction role. Works with all models.
developer — Introduced with o1 models. Provides stronger instruction-following for newer models. Falls back to system behavior on older models.

Use developer for new projects targeting GPT-4.1+ or o-series models.

FAQ

How to handle rate limits?

When a request fails with 429 Too Many Requests, retry it with exponential backoff and random jitter:

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.cometapi.com/v1",
    api_key="<COMETAPI_KEY>",
)

def chat_with_retry(messages, max_retries=3):
    """Call Chat Completions, retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-5.4",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Wait about 1s, 2s, 4s, ... plus up to 1s of random jitter.
            time.sleep((2 ** attempt) + random.random())

How to maintain conversation context?

Include the full conversation history in the messages array:

messages = [
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level programming language..."},
    {"role": "user", "content": "What are its main advantages?"},
]
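Because the API is stateless, the client has to append each reply to the history before the next request. A minimal sketch of that loop follows; `ask` is a hypothetical helper, and `complete` stands for any callable that sends the messages and returns the assistant's text (here replaced by a stub so the example runs offline).

```python
def ask(history, user_text, complete):
    """Append the user turn, call the model, and record its reply.

    `complete` takes the full messages list and returns the assistant's
    text, e.g. a thin wrapper around client.chat.completions.create.
    """
    history.append({"role": "user", "content": user_text})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_complete(messages):
    """Stub standing in for a real API call."""
    return f"(reply to: {messages[-1]['content']})"


history = [{"role": "developer", "content": "You are a helpful assistant."}]
ask(history, "What is Python?", fake_complete)
ask(history, "What are its main advantages?", fake_complete)
```

After the two calls, `history` holds five messages, and the second request saw the first exchange in full.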

What does `finish_reason` mean?

Value	Meaning
`stop`	Natural completion or hit a stop sequence.
`length`	Reached `max_tokens` or `max_completion_tokens` limit.
`tool_calls`	The model invoked one or more tool/function calls.
`content_filter`	Output was filtered due to content policy.
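In client code, these values usually decide what happens next. The sketch below maps one entry of the `choices` array to a status and a payload; `handle_choice` and the status strings are hypothetical, and the input is assumed to be a plain dict in the non-streaming response shape.

```python
def handle_choice(choice):
    """Map a completion choice to a (status, payload) pair for the caller."""
    reason = choice.get("finish_reason")
    message = choice.get("message") or {}
    if reason == "stop":
        return "done", message.get("content")
    if reason == "length":
        # Output was cut off; the caller may raise the token limit and retry.
        return "truncated", message.get("content")
    if reason == "tool_calls":
        return "run_tools", message.get("tool_calls") or []
    if reason == "content_filter":
        return "filtered", None
    return "unknown", reason


status, payload = handle_choice(
    {
        "index": 0,
        "message": {"role": "assistant", "content": "Python is a high-level"},
        "finish_reason": "length",
    }
)
```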

How to control costs?

Use max_completion_tokens to cap output length.
Choose cost-effective models (e.g., gpt-5.4-mini or gpt-5.4-nano for simpler tasks).
Keep prompts concise — avoid redundant context.
Monitor token usage in the usage response field.
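For the last point, token counts can be accumulated across requests. This is a sketch that assumes the `usage` object carries `prompt_tokens` and `completion_tokens`, as in the OpenAI response format; `UsageTracker` is a hypothetical class, and the numbers are illustrative rather than real measurements.

```python
class UsageTracker:
    """Accumulate token counts from the usage field of each response."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.requests = 0

    def record(self, usage):
        """Add one response's usage dict to the running totals."""
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.requests += 1

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens


tracker = UsageTracker()
# Illustrative usage dicts, not real API output.
tracker.record({"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42})
tracker.record({"prompt_tokens": 50, "completion_tokens": 8, "total_tokens": 58})
```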

Authorizations

Authorization

string

header

required

Bearer token authentication. Use your CometAPI key.

Body

application/json

model

string

default:gpt-5.4

required

Model ID to use for this request. See the Models page for current options.

Example:

"gpt-4.1"

messages

object[]

required

A list of messages forming the conversation. Each message has a role (system, developer, user, assistant, or tool) and content (a text string or a multimodal content array).

stream

boolean

If true, partial response tokens are delivered incrementally via server-sent events (SSE). The stream ends with a data: [DONE] message.

temperature

number

default:1

Sampling temperature between 0 and 2. Higher values (e.g., 0.8) produce more random output; lower values (e.g., 0.2) make output more focused and deterministic. Adjust either this or top_p, not both.

Required range: 0 <= x <= 2

top_p

number

default:1

Nucleus sampling parameter. The model samples only from the most likely tokens whose cumulative probability reaches top_p. For example, 0.1 means only the tokens that make up the top 10% of probability mass are considered. Adjust either this or temperature, not both.

Required range: 0 <= x <= 1

n

integer

default:1

Number of completion choices to generate for each input message. Defaults to 1.

stop

string

Up to 4 sequences where the API will stop generating further tokens. Can be a string or an array of strings.

max_tokens

integer

Maximum number of tokens to generate in the completion. The total of input + output tokens is capped by the model's context length.

presence_penalty

number

default:0

Number between -2.0 and 2.0. Positive values penalize tokens based on whether they have already appeared, encouraging the model to explore new topics.

Required range: -2 <= x <= 2

frequency_penalty

number

default:0

Number between -2.0 and 2.0. Positive values penalize tokens proportionally to how often they have appeared, reducing verbatim repetition.

Required range: -2 <= x <= 2

logit_bias

object

A JSON object mapping token IDs to bias values from -100 to 100. The bias is added to the model's logits before sampling. Values between -1 and 1 subtly adjust likelihood; -100 or 100 effectively ban or force selection of a token.

user

string

A unique identifier for your end-user. Helps with abuse detection and monitoring.

max_completion_tokens

integer

An upper bound for the number of tokens to generate, including visible output tokens and reasoning tokens. Use this instead of max_tokens for GPT-4.1+, GPT-5 series, and o-series models.

response_format

object

Specifies the output format. Use {"type": "json_object"} for JSON mode, or {"type": "json_schema", "json_schema": {...}} for strict structured output.

tools

object[]

A list of tools the model may call. Currently supports function type tools.

tool_choice

default:auto

Controls how the model selects tools. auto (default): model decides. none: no tools. required: must call a tool.

logprobs

boolean

default:false

Whether to return log probabilities of the output tokens.

top_logprobs

integer

Number of most likely tokens to return at each position (0-20). Requires logprobs to be true.

Required range: 0 <= x <= 20

reasoning_effort

enum<string>

Controls the reasoning effort for o-series and GPT-5.1+ models.

Available options:

low,

medium,

high

stream_options

object

Options for streaming. Only valid when stream is true.

service_tier

enum<string>

Specifies the processing tier.

Available options:

auto,

default,

flex,

priority

Response

Successful chat completion response.

id

string

Unique completion identifier.

Example:

"chatcmpl-abc123"

object

enum<string>

Available options:

chat.completion

Example:

"chat.completion"

created

integer

Unix timestamp of creation.

Example:

1774412483

model

string

The model used (may include version suffix).

Example:

"gpt-5.4-2025-07-16"

choices

object[]

Array of completion choices.

usage

object

service_tier

string

Example:

"default"

system_fingerprint

string | null

Example:

"fp_490a4ad033"
