Tokenizer Playground¶
Visualize how different tokenizers break text into tokens. Supports tiktoken encodings (GPT-4, GPT-3.5, GPT-2, o200k), plus character-level and whitespace tokenizers for comparison. Shows token count, character count, per-token text, and token IDs.
Overview¶
Data Source
tiktoken library — cl100k_base, o200k_base, gpt2, p50k_base, p50k_edit, r50k_base encodings.
Endpoints¶
POST /tokenizer/tokenize — Tokenize text¶
```bash
curl -X POST "https://atomgpt.org/tokenizer/tokenize" \
  -H "Authorization: Bearer sk-XYZ" \
  -H "Content-Type: application/json" \
  -d '{"text": "SrTiO3 is a perovskite oxide", "tokenizer": "cl100k_base"}'
```
| Field | Type | Default | Description |
|---|---|---|---|
| `text` | string | required | Input text (max 50,000 characters) |
| `tokenizer` | string | `cl100k_base` | Tokenizer name (see below) |
Available tokenizers:

| Name | Description |
|---|---|
| `gpt-4` / `gpt-3.5-turbo` | cl100k_base (GPT-4/3.5 family) |
| `gpt-4o` | o200k_base (GPT-4o optimized) |
| `gpt-2` | GPT-2 BPE tokenizer |
| `cl100k_base` | Direct encoding name |
| `o200k_base` | Direct encoding name |
| `p50k_base` | Codex-era encoding |
| `character` | Character-level (1 char = 1 token) |
| `whitespace` | Whitespace + punctuation split |
Response:

```json
{
  "tokenizer": "cl100k_base",
  "text": "SrTiO3 is a perovskite oxide",
  "n_tokens": 9,
  "n_chars": 28,
  "tokens": ["Sr", "Ti", "O", "3", " is", " a", " per", "ovsk", "ite oxide"],
  "token_ids": [21521, 46465, 46, 18, 374, 264, 824, 17487, 1029]
}
```
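The response is self-consistent: `n_tokens` equals the length of both `tokens` and `token_ids`, `n_chars` is the length of the echoed text, and for subword encodings the token strings concatenate back to the input. A quick client-side sanity check (field names taken from the response above; the helper name is our own):

```python
def check_response(data: dict) -> None:
    # Token list, ID list, and reported count must agree.
    assert data["n_tokens"] == len(data["tokens"]) == len(data["token_ids"])
    # Reported character count matches the echoed text.
    assert data["n_chars"] == len(data["text"])
    # Subword tokens concatenate back to the original text.
    assert "".join(data["tokens"]) == data["text"]

check_response({
    "tokenizer": "cl100k_base",
    "text": "SrTiO3 is a perovskite oxide",
    "n_tokens": 9,
    "n_chars": 28,
    "tokens": ["Sr", "Ti", "O", "3", " is", " a", " per", "ovsk", "ite oxide"],
    "token_ids": [21521, 46465, 46, 18, 374, 264, 824, 17487, 1029],
})
```

The concatenation check holds only for subword and character tokenizers; the `whitespace` tokenizer discards the whitespace between tokens.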
Python Examples¶
```python
import requests

response = requests.post(
    "https://atomgpt.org/tokenizer/tokenize",
    headers={
        "Authorization": "Bearer sk-XYZ",
        "Content-Type": "application/json",
    },
    json={"text": "SrTiO3 is a perovskite oxide", "tokenizer": "cl100k_base"},
)
data = response.json()
print(f"Tokens: {data['n_tokens']}, Chars: {data['n_chars']}")
for tok, tid in zip(data["tokens"], data["token_ids"]):
    print(f"  [{tid}] {repr(tok)}")
```
Compare several tokenizers on the same text:

```python
import requests

text = "La2CuO4 superconductor Tc=40K"
H = {"Authorization": "Bearer sk-XYZ", "Content-Type": "application/json"}
for tok in ["cl100k_base", "o200k_base", "gpt-2", "character"]:
    r = requests.post("https://atomgpt.org/tokenizer/tokenize",
                      headers=H, json={"text": text, "tokenizer": tok})
    d = r.json()
    print(f"  {tok}: {d['n_tokens']} tokens")
```
AGAPI Agent¶
```python
from agapi.agents import AGAPIAgent
import os

agent = AGAPIAgent(api_key=os.environ.get("AGAPI_KEY"))
response = agent.query_sync("Tokenize SrTiO3 with GPT-4 tokenizer")
print(response)
```