FreeLLMAPI – Unified OpenAI‑compatible Endpoint for 12 Free LLM Providers

Overview

FreeLLMAPI offers a single, OpenAI-compatible endpoint that aggregates the free tiers from a dozen AI providers. By routing requests through one /v1/chat/completions interface, it gives access to a combined pool of inference capacity, reportedly exceeding one billion tokens per month when the free tiers are stacked. The proxy keeps provider keys encrypted, and a smart router selects the best available model for each request, falling back to the next option when a provider is rate-limited or temporarily unavailable. The system tracks usage per key to help you stay under each provider’s free-tier caps, and it supports both streaming and non-streaming responses. The project emphasizes privacy, resilience, and a developer-centric experience, with an admin dashboard, analytics, and health checks to help you manage keys, chain priorities, and performance.

Why this exists

In today’s AI landscape, almost every serious lab and service offers a free tier with thousands of requests or millions of tokens per month. Taken individually, these tiers amount to a toy sandbox rather than a serious inference backbone. When you combine them, you unlock substantial capacity that can power experiments, prototypes, and early-stage apps. But stacking free tiers by hand is a logistics headache: multiple SDKs, disparate rate limits, fragile fault tolerance, and the risk of hitting quotas mid-conversation. FreeLLMAPI solves this by presenting a single, OpenAI-compatible endpoint that transparently distributes requests across all added providers. Point any OpenAI client library at the local server, and you gain a robust, route-aware proxy that maximizes uptime and preserves a coherent conversation flow.

Supported providers

FreeLLMAPI’s catalog spans a wide set of providers, each bringing unique strengths and model families. Here are the key players included in the current free-tier aggregation, with representative models or families to give you a sense of the coverage:

Google Gemini: Gemini 2.5 Flash with 3.x previews
Link: https://ai.google.dev
Groq: Llama 3.3, Llama 4, GPT-OSS, Qwen3
Link: https://groq.com
Cerebras: Qwen3 235B
Link: https://cerebras.ai
SambaNova: DeepSeek V3.x, Llama 4, Gemma 3
Link: https://cloud.sambanova.ai
Mistral: Large 3, Medium 3.5, Codestral, Devstral
Link: https://mistral.ai
OpenRouter: 21 free-tier models
Link: https://openrouter.ai
GitHub Models: GPT-4.1, GPT-4o
Link: https://github.com/marketplace/models
Cloudflare: Kimi K2, GLM-4.7, GPT-OSS, Granite 4
Link: https://developers.cloudflare.com/workers-ai
Cohere: Command R+, Command-A (trial)
Link: https://cohere.com
Z.ai (Zhipu): GLM-4.5, GLM-4.7 Flash
Link: https://docs.z.ai
NVIDIA: NIM (disabled by default in the catalog)
Link: https://build.nvidia.com
HuggingFace: Router → DeepSeek V4, Kimi K2.6, Qwen3
Link: https://huggingface.co/docs/inference-providers

Note: The platform’s catalog evolves, and provider availability or free-tier terms may change. The core value is the unified, OpenAI-compatible surface that orchestrates diverse free tiers behind a single endpoint.

Key features

OpenAI-compatible

POST and GET behavior matches OpenAI’s /v1/chat/completions and /v1/models endpoints. You can use the official client libraries or LangChain, LlamaIndex, Continue, Hermes, and other interoperable clients by simply changing the base_url.

Streaming and non-streaming

The system supports both streaming (Server-Sent Events) and non-streaming responses. Each provider adapter implements the same interface to keep consistency across the chain.

Tool calling

Beyond plain chat completions, FreeLLMAPI passes through OpenAI-style tools and tool_choice parameters. Assistant tool_calls and tool messages traverse the multi-provider path, enabling multi-step tool-assisted conversations across the available backends.

Automatic failover

If the chosen provider returns 429 (rate limit), 5xx (server error), or times out, the router automatically marks that provider as temporarily unavailable for the current key and retries with the next model in your configured fallback chain. This can involve up to 20 attempts per request.
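
The retry behavior described here could be sketched roughly as follows. This is an illustration only, not the project's router code: `ProviderError`, `is_retryable`, and `call_with_fallback` are hypothetical names, and the chain is assumed to be an ordered list of (model name, callable) pairs. Only the retry conditions (429, 5xx, timeout) and the 20-attempt ceiling come from the description above.

```python
from typing import Callable, Optional, Sequence, Tuple

MAX_ATTEMPTS = 20  # ceiling mentioned in the text


class ProviderError(Exception):
    """Hypothetical error raised by a provider call; status None means a timeout."""

    def __init__(self, status: Optional[int]):
        super().__init__(f"provider failed with status={status}")
        self.status = status


def is_retryable(status: Optional[int]) -> bool:
    # Timeout (None), rate limit (429), or any 5xx server error.
    return status is None or status == 429 or 500 <= status <= 599


def call_with_fallback(
    chain: Sequence[Tuple[str, Callable[[dict], dict]]],
    request: dict,
) -> Tuple[str, dict]:
    """Try each (model, call) entry in priority order until one succeeds."""
    last_error = None
    for attempt, (model, call) in enumerate(chain):
        if attempt >= MAX_ATTEMPTS:
            break
        try:
            return model, call(request)
        except ProviderError as err:
            if not is_retryable(err.status):
                raise  # e.g. a 400 is a client-side problem, so do not retry
            last_error = err
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```

Non-retryable errors are re-raised immediately in this sketch so that a malformed request does not burn through the whole chain.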

Per-key rate tracking

Counters exist per (platform, model, key) for RPM (requests per minute), RPD (requests per day), TPM (tokens per minute), and TPD (tokens per day). The router uses these to select a key that stays under its free-tier caps, preserving a smoother, long-running experience.
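
To make the counter idea concrete, here is a minimal sliding-window sketch for the request-based limits (RPM and RPD). It is an assumption-laden illustration: `RequestLedger` and its methods are hypothetical names, the daily window is modeled as a rolling 24 hours (providers may instead reset at a fixed time), and token counters (TPM, TPD) would follow the same pattern with token amounts instead of single events.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple

KeyId = Tuple[str, str, str]  # (platform, model, key label)


class RequestLedger:
    """Sliding-window request counter per (platform, model, key)."""

    def __init__(self, rpm_limit: int, rpd_limit: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rpm_limit = rpm_limit
        self.rpd_limit = rpd_limit
        self.clock = clock  # injectable so the logic can be tested without waiting
        self._events: Dict[KeyId, Deque[float]] = defaultdict(deque)

    def has_headroom(self, key_id: KeyId) -> bool:
        """True if one more request would stay under both RPM and RPD."""
        now = self.clock()
        events = self._events[key_id]
        while events and events[0] <= now - 86_400:
            events.popleft()  # drop entries older than a day
        last_minute = sum(1 for t in events if t > now - 60)
        return last_minute < self.rpm_limit and len(events) < self.rpd_limit

    def record(self, key_id: KeyId) -> None:
        """Note that a request was just sent with this key."""
        self._events[key_id].append(self.clock())
```

A router built on this would call `has_headroom` for each candidate key and `record` once a request is actually dispatched.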

Sticky sessions

Conversations are “sticky” for about 30 minutes, meaning subsequent turns tend to stay within the same model. This reduces sudden shifts in model behavior and mitigates the hallucination spike that can accompany frequent mid-conversation provider switches.
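
One way to picture stickiness is a small map from a conversation identifier to the model that last served it, with a time-to-live. The sketch below assumes the caller can supply some stable session id; how FreeLLMAPI actually identifies a conversation is not stated here, and `StickySessions` is a hypothetical name. In this version the timer is not refreshed on reads, only when `pin` is called again.

```python
import time
from typing import Callable, Dict, Optional, Tuple

STICKY_TTL_S = 30 * 60  # "about 30 minutes" per the description


class StickySessions:
    """Remembers which model served a session, for a limited time."""

    def __init__(self, ttl_s: float = STICKY_TTL_S,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._pins: Dict[str, Tuple[str, float]] = {}  # session id -> (model, pinned at)

    def get(self, session_id: str) -> Optional[str]:
        """Return the pinned model, or None if there is no pin or it expired."""
        pin = self._pins.get(session_id)
        if pin is None:
            return None
        model, pinned_at = pin
        if self.clock() - pinned_at > self.ttl_s:
            del self._pins[session_id]  # expired: let the router choose again
            return None
        return model

    def pin(self, session_id: str, model: str) -> None:
        """Pin (or re-pin) a session to the model that just served it."""
        self._pins[session_id] = (model, self.clock())
```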

Encrypted key storage

Keys are AES-256-GCM encrypted at rest, with decryption happening in-memory just before a request, protecting credentials from exposure in storage.

Unified API key

Clients authenticate to the proxy using a single freellmapi-… bearer token. Upstream provider keys remain hidden from the client applications, simplifying security and governance.

Health checks

Periodic health probes classify keys and providers as healthy, rate_limited, invalid, or error. The router automatically skips dead options, improving resilience without manual intervention.
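
As a rough illustration of the four states, a probe result could be classified from its HTTP status like this. The mapping is an assumption made for the example (in particular, treating 401 and 403 as an invalid key); the source names the states but not the rules behind them, and `classify_probe` and `usable_keys` are hypothetical helpers.

```python
from typing import Dict, List, Optional


def classify_probe(status: Optional[int]) -> str:
    """Map a probe's HTTP status (None = timeout or network failure) to a state."""
    if status is None:
        return "error"
    if 200 <= status < 300:
        return "healthy"
    if status == 429:
        return "rate_limited"
    if status in (401, 403):
        return "invalid"  # assumed: authentication failures mean a bad key
    return "error"


def usable_keys(latest_probe: Dict[str, Optional[int]]) -> List[str]:
    """Return the key labels whose most recent probe classified as healthy."""
    return [label for label, status in latest_probe.items()
            if classify_probe(status) == "healthy"]
```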

Admin dashboard

A React + Vite admin UI enables you to manage keys, reorder the fallback chain, inspect analytics, and even run prompts in a playground. The dashboard includes a dark mode for comfortable long sessions.

Analytics

Per-request logging captures latency, token usage, success rates, and provider-level breakdowns. These insights help you understand how the proxy is performing and where bottlenecks might lie.

Runs anywhere with Node 20+

The solution is designed to run on common server environments (Windows, macOS, Linux) or compact systems such as a Raspberry Pi. The idle memory footprint is roughly 40 MB RSS when run under a supervisor such as PM2 or systemd.

Visuals and dashboards

The project comes with visual assets to illustrate the architecture and dashboards that help you manage the system:
Keys page: shows provider credentials, status indicators, and last health checks.
Playground: a playground within the dashboard to experiment with prompts and see which provider served the request, including model IDs and latency.
Analytics: charts and breakdowns of usage and latency across time windows.

Images referenced in the project

Keys page: repo-assets/keys.png
Playground: repo-assets/playground.png
Analytics: repo-assets/analytics.png
Fallback chain: repo-assets/fallback-chain.png

How it works

A simplified view helps illustrate the end-to-end flow:

Client interaction
You (the client) send a request to the unified endpoint via a standard OpenAI-compatible client, with a single API key (the freellmapi-… token).
The server receives the request and hands it to the router.
Per-request routing
The router consults the configured fallback chain, selects the highest-priority model that has a healthy key and is within rate limits, and decrypts the associated key to call the provider SDK.
Provider invocation
The chosen provider processes the request. If it returns a success, the response is streamed or delivered as a standard JSON payload, and the client sees a single, coherent response.
If the provider responds with rate-limiting, or if there is a timeout or 5xx error, the router marks the provider as temporarily unavailable for that request and retries with the next model in the chain. The key enters a short cooldown to avoid hammering.
Session continuity
For multi-turn conversations, the router keeps the session with the same model for roughly 30 minutes, reducing drift in the model’s behavior and mitigating long-term hallucinations caused by mid-conversation switches.
Observability
Each request is logged with latency, token counts, and a per-provider breakdown. Health statuses of keys and providers update in near real-time, ensuring the router can skip unhealthy options automatically.
Admin and control
The admin dashboard lets you reorder the fallback chain, rotate or revoke keys, and run prompts in a controlled playground to verify behavior before enabling keys in production-like flows.
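
The per-request routing step in the flow above (pick the highest-priority model that has a healthy key within its limits) can be sketched in a few lines. The data shapes here, `KeyState`, `ChainEntry`, and `select_route`, are hypothetical and chosen only for illustration; the real router also handles decryption, cooldowns, and session pinning.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class KeyState:
    label: str
    status: str          # "healthy", "rate_limited", "invalid", or "error"
    within_limits: bool  # outcome of the rate-limit check for this key


@dataclass
class ChainEntry:
    model: str
    keys: List[KeyState]


def select_route(chain: List[ChainEntry]) -> Optional[Tuple[str, str]]:
    """Return (model, key label) for the first usable entry, or None if none qualifies."""
    for entry in chain:  # the chain is assumed to be ordered by priority
        for key in entry.keys:
            if key.status == "healthy" and key.within_limits:
                return entry.model, key.label
    return None
```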

Not yet supported

FreeLLMAPI keeps a deliberately narrow scope so the project stays reliable. Features not yet implemented (or intentionally excluded) include:

Embeddings (/v1/embeddings)
Image generation (/v1/images/*)
Audio or speech processing
Vision or multimodal inputs (text-only message content)
Legacy completions (/v1/completions)
Moderation (/v1/moderations)
Multi-completion per request (n > 1)
Per-user billing or multi-tenant authentication
Any additional modules beyond the current scope

The team welcomes contributions for these areas, with guidance in the Contributing section.

Getting started: Quick start

Prerequisites

Node.js 20 or newer
npm

Steps

Clone the repository
git clone https://github.com/tashfeenahmed/freellmapi.git
Install dependencies
cd freellmapi
npm install
Generate an encryption key for at-rest key storage
cp .env.example .env
echo "ENCRYPTION_KEY=$(node -e "console.log(require('crypto').randomBytes(32).toString('hex'))")" >> .env
Start the server and the admin dashboard
npm run dev
Open http://localhost:5173 to access the Vite UI
On the Keys page, add provider keys, reorder the Fallback Chain as desired, and grab your unified API key from the Keys header
For production builds
npm run build
node server/dist/index.js

Using the API

Any OpenAI-compatible client works against FreeLLMAPI. Here are representative usage patterns:

Python (OpenAI-compatible client)

base_url should point to your local server, and api_key should be your unified key (freellmapi-…)
Example flow: a chat completion where the router selects the best available model automatically
The response includes routing details (e.g., X-Routed-Via header) so you can see which provider served the call
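
A minimal sketch of such a call, using only the Python standard library so it has no extra dependencies. It assumes the server listens on localhost:3001 (the port used in the curl example) and uses a placeholder token; `build_chat_request` and `chat` are hypothetical helper names. With the official OpenAI Python package, the equivalent is to pass the same base_url and api_key when constructing the client.

```python
import json
import urllib.request
from typing import Optional, Tuple

BASE_URL = "http://localhost:3001/v1"    # assumed local server address
API_KEY = "freellmapi-your-unified-key"  # placeholder, not a real key


def build_chat_request(prompt: str, model: str = "auto") -> urllib.request.Request:
    """Build a POST to /chat/completions; "auto" lets the router pick the model."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt: str) -> Tuple[str, Optional[str]]:
    """Send the request; returns (reply text, X-Routed-Via header value or None)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        payload = json.load(resp)
        routed_via = resp.headers.get("X-Routed-Via")
    return payload["choices"][0]["message"]["content"], routed_via
```

With the server running, `chat("hi")` would return the assistant's reply together with whatever the proxy reports in the X-Routed-Via header.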

Curl example

Basic request
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer freellmapi-your-unified-key" \
  -H "Content-Type: application/json" \
  -d '{ "model": "auto", "messages": [{"role": "user", "content": "hi"}] }'

Streaming example

Streaming tokens are delivered in chunks; the client can assemble the final text as it arrives
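
Assembling the text from a stream could look roughly like this. The sketch assumes the proxy emits OpenAI-style Server-Sent Events, that is, lines of the form `data: {json chunk}` ending with `data: [DONE]`, which is what an OpenAI-compatible endpoint is expected to produce; `iter_deltas` is a hypothetical helper that works on any iterable of already-decoded lines.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        fragment = choices[0].get("delta", {}).get("content")
        if fragment:
            yield fragment


def collect_text(lines: Iterable[str]) -> str:
    """Join all fragments into the final message text."""
    return "".join(iter_deltas(lines))
```

A request with `"stream": true` in the body would supply the lines; the parsing itself is independent of how they are fetched.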

Tool calling example

This demonstrates how OpenAI-style tool declarations can be used in a multi-provider flow
A model asks for a tool call; the tool is executed externally, and the result is fed back into the conversation
The final response includes the tool result and the assistant’s final content
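
The client-side half of that loop can be sketched as follows, assuming OpenAI-style message shapes. The `get_weather` tool, the `REGISTRY` mapping, and `run_tool_calls` are hypothetical and exist only for illustration. `TOOLS` is what would be sent in the request's tools field; the tool messages returned by `run_tool_calls` are appended to the conversation before asking the model for its final answer.

```python
import json
from typing import Callable, Dict, List

# Tool declaration in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str) -> dict:
    """Stub implementation; a real tool would query a weather service."""
    return {"city": city, "forecast": "sunny"}


REGISTRY: Dict[str, Callable[..., dict]] = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> List[dict]:
    """Execute each requested tool and build the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result back to the request
            "content": json.dumps(fn(**args)),
        })
    return results
```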

Notes on usage

The platform uses a single unified key to access all downstream providers, reducing the surface area for key leakage.
The health checks and cooldown mechanisms are designed to help you avoid persistent failures while maximizing token usage within free tiers.
The feature set is intentionally conservative to maintain reliability across a broad set of providers.

Screenshots and visuals

Keys page: See provider credentials, status indicators, and last health check timestamps
Image: repo-assets/keys.png
Playground: Interact with a chat completion and see which provider served the request, including model IDs and latency
Image: repo-assets/playground.png
Analytics: View request volumes, success rates, tokens in and out, and provider-level breakdowns
Image: repo-assets/analytics.png

How it’s built

The architecture emphasizes a clean separation of concerns and an extensible plugin-like approach to providers:

Router
Central decision-maker that selects the model and provider for each request
Rate-limit ledger
In-memory counters (RPM, RPD, TPM, TPD) backed by SQLite to track usage per key and provider
Provider adapters
One file per provider, implementing the common interface (chatCompletion and streamChatCompletion)
Health service
Periodically probes to keep each key’s status fresh (healthy, rate_limited, invalid, error)
Dashboard
React + Vite UI to manage keys, reorder fallback chains, view analytics, and try prompts in a playground
Storage
SQLite with AES-256-GCM envelope encryption for key storage

Limitations

Free-tier stacking comes with trade-offs. It’s essential to be honest about the constraints when designing applications:

No frontier models
The free-tier catalog tops out around Llama 3.3 70B, GLM-4.5, Qwen 3 Coder, and Gemini 2.5 Pro. Expect no GPT-5 or Claude Opus-level capabilities within the free pool.
Intelligence dips as the day progresses
The top-ranked free models have tight daily caps; as those limits are hit, the router slides down to smaller, less capable models.
Latency variability
Some providers (Cerebras, Groq) are very fast; others are slower. Overall latency depends on the current availability of healthy keys.
Free tiers can change without notice
Providers can tighten or remove free tiers, leading to 429s or auth errors until the catalog is updated. Re-seed scripts live in server/src/scripts/.
No SLA
This is a self-hosted, experimental proxy; it comes with no guaranteed service level.
Local-first approach
There is no multi-tenant authentication; the project is intended for personal or single-user use.
Some providers may require self-hosting adjustments
Providers’ terms and conditions apply to how you use their keys, even when traffic is proxied.

Contributing

Contributions are welcome. Some good starting PR ideas:

Add a provider
Copy an existing provider adapter (for example, openai-compat.ts) as a template, wire it into the provider index, seed the models, and add tests.
Add an endpoint
Extend the architecture to support embeddings, images, or moderations as new endpoints.
Improve the router
Implement cost-aware routing, latency-weighted prioritization, or regional pinning to further optimize performance and cost.
Dashboard polish
Add richer analytics visuals, improved key rotation UX, and batch key import features.
Documentation
Expand code examples and provide client-library snippets for additional languages.

Development loop

npm install
npm run dev (server on :3001, dashboard on :5173 with hot module reloading)
npm test (Vitest suite with tests across providers, routes, router, and rate limits)

Contributors

A vibrant set of contributors is listed, including those who maintain the keys, routes, dashboards, and tests.

Terms of Service review

A recent ToS review (May 2026) evaluated self-hosted, single-user setups against provider terms. Key takeaways:

Google Gemini: ⚠️ Caution — March 2026 ToS narrows use to professional or business purposes; self-hosted developer proxies may still be defensible, but proceed with caution.
Groq: ✅ Likely OK — GroqCloud Services Agreement allows customer applications integration.
Cerebras: ✅ Likely OK — Permitted; explicit prohibition on selling or transferring API keys; single-owner usage is allowed.
Mistral: ✅ Likely OK — APIs allowed for personal/internal business use.
OpenRouter: ✅ Likely OK — April 2026 ToS tightened on resale and competing services; private single-user proxy remains feasible.
SambaNova: ⚠️ Ambiguous — EULA suggests restrictions around resale and service bureau use; single-user with no third-party access is likely fine.
Cloudflare Workers AI: ⚠️ Ambiguous — No explicit anti-proxy clause; general Self-Serve Subscription terms may apply.
NVIDIA NIM: ⚠️ Caution — Evaluation-only terms; “production” deployment may be restricted; the catalog default disables this provider.
GitHub Models: ⚠️ Caution — Free tier framed for experimentation and prototyping; not intended for production-scale usage.
Cohere: ❌ Avoid — Terms explicitly restrict personal, family, or household purposes; avoid in a personal/prototype deployment.
Zhipu (Z.ai): ✅ Likely OK — Personal/non-commercial research carve-out exists in platform docs.
Zhipu (api.z.ai): ⚠️ Caution — Singapore entity with anti-redirect clauses; proxy usage could be read as disallowed in some scenarios.
Ollama Cloud: ✅ Likely OK — Free plan permits cloud-model access with some concurrency and session caps; no explicit anti-proxy found.

Bottom line: Most providers can be used in a personal, non-production context with a self-hosted proxy, but always verify the latest terms before deployment. This review is informational and not legal advice; each project participant should read the terms and make their own decision.

Disclaimer

This project is designed for personal experimentation and learning. Free tiers are intended for prototyping, not production-grade workloads. If you build a real product using FreeLLMAPI, plan to switch to paid APIs when you ship. Upstream provider terms apply, even when traffic is proxied through this project, and compliance is the responsibility of the user.

Star history

If you’re curious about project momentum, you can view star history illustrating the project’s adoption over time:
Star History Chart: https://api.star-history.com/chart?repos=tashfeenahmed/freellmapi&type=date&legend=top-left

License

Closing thoughts

FreeLLMAPI stitches together a diverse landscape of freely accessible LLMs, giving developers a pragmatic, single-point interface for experimenting with multiple model families. It lowers the barrier to trying new models, testing ideas, and prototyping conversational AI that benefits from the strengths of several providers. The free tiers bring constraints (caps, latency, and evolving terms), but the unified approach still provides a capable sandbox for AI research, education, and early-stage product exploration.
