Specification — AI Native Data Format
If no human reads the data, there's no reason to write it in a human-readable format.
Principle
LLMs are Large Language Models — their strength is understanding intent, reasoning through nuance, and communicating in natural language. DCP respects this. It optimizes the input channel (structured data delivery), not the output. What comes out of the LLM is the LLM's domain.
The Problem
LLMs produce and consume text at extraordinary cost. Every token matters — in API billing, context window budget, and inference latency. Yet the data AI agents exchange with each other is overwhelmingly formatted for human readability: verbose JSON with repeated keys, natural language descriptions where structured data would suffice, self-documenting formats read by no one.
The question is simple: if only machines read this data, why are we formatting it for humans?
The industry approaches this as a JSON optimization problem — stripping syntax overhead while preserving key-value structure. DCP asks a different question: why have keys at all? If the consumer knows the schema, every key is a wasted token.
Core Idea
Data Cost Protocol (DCP) is a convention for delivering structured data to AI agents. The rules:
- Define a schema once — field names, order, and types declared in a header
- Write data by position — no keys, no labels, no repetition. The schema says what position 3 means
- Inline the schema with the data — no external documentation needed to interpret
This is not a new serialization format. It's a design discipline: strip everything the consumer doesn't need.
Before and After
Simple case
[
{ "id": 1, "name": "Alice", "score": 92 },
{ "id": 2, "name": "Bob", "score": 85 },
{ "id": 3, "name": "Charlie", "score": 88 }
]
With DCP:
[schema: id, name, score]
[[1,"Alice",92],[2,"Bob",85],[3,"Charlie",88]]
Real-world case: API monitoring data
A batch of API response metrics fed to an LLM for analysis:
[
{ "endpoint": "/v1/users", "method": "GET", "status": 200, "latency_ms": 42 },
{ "endpoint": "/v1/orders", "method": "POST", "status": 201, "latency_ms": 187 },
{ "endpoint": "/v1/auth", "method": "POST", "status": 200, "latency_ms": 95 },
{ "endpoint": "/v1/search", "method": "GET", "status": 200, "latency_ms": 312 }
]
With DCP:
["$S","api-response:v1",4,"endpoint","method","status","latency_ms"]
["/v1/users","GET",200,42]
["/v1/orders","POST",201,187]
["/v1/auth","POST",200,95]
["/v1/search","GET",200,312]
4 records: JSON repeats 4 key names × 4 rows = 16 keys. DCP states them once. ~50% metadata reduction. At scale (hundreds of records per analysis), the savings compound.
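The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not a reference encoder: the function name `to_dcp` is hypothetical, and the sketch assumes every record contains every schema field.

```python
import json

def to_dcp(schema_id, fields, records):
    """Return DCP lines: one $S header, then one positional row per record."""
    # Header layout follows the example above: marker, schema ID, field count, names
    header = ["$S", schema_id, len(fields), *fields]
    lines = [json.dumps(header, separators=(",", ":"))]
    for rec in records:
        # Emit values in schema order; a missing key raises KeyError
        row = [rec[name] for name in fields]
        lines.append(json.dumps(row, separators=(",", ":")))
    return lines

records = [
    {"endpoint": "/v1/users", "method": "GET", "status": 200, "latency_ms": 42},
    {"endpoint": "/v1/orders", "method": "POST", "status": 201, "latency_ms": 187},
    {"endpoint": "/v1/auth", "method": "POST", "status": 200, "latency_ms": 95},
    {"endpoint": "/v1/search", "method": "GET", "status": 200, "latency_ms": 312},
]
fields = ["endpoint", "method", "status", "latency_ms"]

dcp_lines = to_dcp("api-response:v1", fields, records)
print("\n".join(dcp_lines))
```

Key names appear only in the header line, so each additional record costs only its values.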
The $S Header — Schema-on-Wire
DCP data in the wild uses a compact header to declare which schema governs the rows that follow:
["$S", schema_id, field_count, ...field_names]
- $S: literal marker, signals "this is a schema declaration"
- schema_id: identifies the schema (e.g., "knowledge:v1", "hotmemo:v1")
- field_count: number of fields (allows O(1) validation without counting names)
- field_names: positional field names
Data rows follow immediately:
["$S","hotmemo:v1","layer","source","signal","detail"]
["quality","push","no-type-tag","auth jwt migration fix"]
["receptor","passive","suggest","engram_pull"]
When both producer and consumer already know the schema, the header can be abbreviated to just the schema ID:
["$S","hotmemo:v1"]
["quality","push","no-type-tag","auth jwt migration fix"]
The full header is presented on first contact. After that, high-capability agents can work from the abbreviated form; lightweight models (≤4B parameters) can interpret rows only when the field names are present in the header.
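A consumer that handles both header forms might look like the sketch below. The names `KNOWN_SCHEMAS`, `parse_header`, and `decode` are hypothetical, and the sketch makes one assumption the text does not settle: an integer directly after the schema ID is treated as the field count, so headers with and without a count are both accepted.

```python
import json

# Hypothetical local cache of schemas already seen: schema_id -> field names
KNOWN_SCHEMAS = {}

def parse_header(line):
    """Parse a $S header and return (schema_id, field_names)."""
    header = json.loads(line)
    if not header or header[0] != "$S":
        raise ValueError("not a $S header")
    schema_id, rest = header[1], header[2:]
    # Assumption: an integer right after the schema ID is the field count
    if rest and isinstance(rest[0], int):
        count, rest = rest[0], rest[1:]
        if rest and len(rest) != count:
            raise ValueError("field count does not match field names")
    if rest:
        # Full header: remember the field names for later abbreviated headers
        KNOWN_SCHEMAS[schema_id] = rest
    elif schema_id not in KNOWN_SCHEMAS:
        raise KeyError(f"abbreviated header for unknown schema {schema_id!r}")
    return schema_id, KNOWN_SCHEMAS[schema_id]

def decode(lines):
    """Turn a header line plus positional rows into a list of dicts."""
    _, fields = parse_header(lines[0])
    return [dict(zip(fields, json.loads(row))) for row in lines[1:]]

full = [
    '["$S","hotmemo:v1","layer","source","signal","detail"]',
    '["quality","push","no-type-tag","auth jwt migration fix"]',
]
short = [
    '["$S","hotmemo:v1"]',
    '["receptor","passive","suggest","engram_pull"]',
]

first = decode(full)    # first contact: full header populates the cache
later = decode(short)   # abbreviated header resolves from the cache
print(first, later)
```

An abbreviated header for a schema the consumer has never seen raises an error instead of guessing field names.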
Fixed-Length Principle
DCP arrays are fixed-length by design. Every record in a schema has the same number of fields, in the same order. This is what makes positional parsing, overlay, and cross-domain comparison work — index 4 always means the same thing.
Why fixed-length matters
- Parse cost: no key lookup, no field-count validation. Read position N, done.
- Overlay: stack arrays from different domains and compare by index. If lengths vary, alignment breaks.
- Schema as contract: the schema line declares the structure once. Every record honors it. No surprises.
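The properties above rely on every row honoring the declared length. As a rough sketch, a producer could check lengths once before sending so that consumers can read by position without further checks. Where the check runs is an assumption here, and `check_fixed_length` and `column` are hypothetical helper names.

```python
def check_fixed_length(rows, field_count):
    """Return the indexes of rows whose length differs from the schema's field count."""
    return [i for i, row in enumerate(rows) if len(row) != field_count]

def column(rows, index):
    """Read one position across all rows; only meaningful when rows are fixed-length."""
    return [row[index] for row in rows]

rows = [
    ["/v1/users", "GET", 200, 42],
    ["/v1/orders", "POST", 201, 187],
    ["/v1/auth", "POST", 200],  # malformed: latency_ms is missing
]

bad = check_fixed_length(rows, 4)
good_rows = [row for i, row in enumerate(rows) if i not in bad]
latencies = column(good_rows, 3)
print(bad, latencies)
```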
Last-field escape hatch
The final field may optionally carry a free-form value (object, array, null). Interior fields stay positional. A well-designed schema rarely needs this.
["$S","api-event:v1","ts","endpoint","status","meta"]
["2026-04-01T09:00:00Z","/v1/orders",201,{"user_id":"u_8821","region":"ap-1"}]
["2026-04-01T09:00:01Z","/v1/auth",401,{"reason":"token_expired","retry":false}]
Positions 0–2 are fixed and positionally addressable. Position 3 (meta) is free-form: its shape varies per record. The fixed fields remain cheap to process; the escape hatch absorbs the irregular remainder without breaking the schema contract.
Nested arrays — $N
When a field contains an array of structured objects, DCP uses the $N marker (Nested). The sub-schema is declared once in the schema definition; the output references it by ID.
["$S","order:v1","order_id","status","items"]
["o001","shipped",["$N","order.items:v1",["A001",2],["B002",1]]]
["o002","pending",["$N","order.items:v1",["C003",3]]]
["o003","cancelled",["$N","order.items:v1"]]
- ["$N", schema-id, row1, row2, ...]: rows present
- ["$N", schema-id]: empty array, schema ID preserved for type information
Interior fields stay positional. The sub-schema governs the nested rows the same way the parent schema governs the outer rows.
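Expanding a $N field on the consumer side could look like the following sketch. The sub-schema field names `sku` and `qty` are invented for illustration, since the source does not list the fields of order.items:v1, and `expand_nested` and `decode_row` are hypothetical helpers.

```python
# Hypothetical sub-schema table; "sku" and "qty" are assumed field names
SUB_SCHEMAS = {"order.items:v1": ["sku", "qty"]}

def expand_nested(value, sub_schemas):
    """Expand ["$N", schema_id, row1, ...] into a list of dicts; pass other values through."""
    if isinstance(value, list) and len(value) >= 2 and value[0] == "$N":
        fields = sub_schemas[value[1]]
        # No rows after the schema ID yields an empty list
        return [dict(zip(fields, row)) for row in value[2:]]
    return value

def decode_row(fields, row, sub_schemas):
    """Map a positional row to a dict, expanding any $N fields."""
    return {name: expand_nested(value, sub_schemas) for name, value in zip(fields, row)}

order_fields = ["order_id", "status", "items"]
shipped = decode_row(
    order_fields,
    ["o001", "shipped", ["$N", "order.items:v1", ["A001", 2], ["B002", 1]]],
    SUB_SCHEMAS,
)
cancelled = decode_row(
    order_fields,
    ["o003", "cancelled", ["$N", "order.items:v1"]],
    SUB_SCHEMAS,
)
print(shipped, cancelled)
```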
Schema Registry
Schemas are centralized as JSON definitions in a registry. Each schema declares its fields, types, enums, and examples:
{
"$dcp": "schema",
"id": "hotmemo:v1",
"fields": ["layer", "source", "signal", "detail"],
"fieldCount": 4,
"types": {
"layer": { "type": "string", "enum": ["quality", "session", "trend", "meta", "receptor", "subsystem", "pre-neuron"] },
"source": { "type": "string" },
"signal": { "type": "string" },
"detail": { "type": "string" }
}
}
The registry serves as the single source of truth. Schemas are available via API (GET /schemas, GET /schemas/:id) and embedded in tool descriptions.
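A registry entry like the one above carries enough information to validate a row. The sketch below checks field count, type, and enum membership against an in-memory copy of the hotmemo:v1 definition. `validate_row` is a hypothetical helper, and only the "string" type is mapped because it is the only type name the example shows.

```python
HOTMEMO_V1 = {
    "$dcp": "schema",
    "id": "hotmemo:v1",
    "fields": ["layer", "source", "signal", "detail"],
    "fieldCount": 4,
    "types": {
        "layer": {"type": "string", "enum": ["quality", "session", "trend", "meta",
                                             "receptor", "subsystem", "pre-neuron"]},
        "source": {"type": "string"},
        "signal": {"type": "string"},
        "detail": {"type": "string"},
    },
}

# Only "string" appears in the example schema; other type names are not assumed
PY_TYPES = {"string": str}

def validate_row(schema, row):
    """Return a list of problems; an empty list means the row conforms."""
    if len(row) != schema["fieldCount"]:
        return [f"expected {schema['fieldCount']} fields, got {len(row)}"]
    problems = []
    for name, value in zip(schema["fields"], row):
        spec = schema["types"][name]
        expected = PY_TYPES.get(spec["type"])
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} not in enum")
    return problems

ok = validate_row(HOTMEMO_V1, ["quality", "push", "no-type-tag", "auth jwt migration fix"])
bad = validate_row(HOTMEMO_V1, ["unknown-layer", "push", "no-type-tag", "x"])
print(ok, bad)
```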
Design Properties
Pre-agreed, not self-describing. JSON repeats keys per record for human browsability. DCP declares the schema once, like Protocol Buffers, but in text because LLMs consume text.
Position is meaning. The same convention as CSV, function arguments, and array indexing — applied to AI data delivery.
Schema travels with data. No external docs to drift out of sync. Read the header, parse the rows.
System → AI is the primary direction. DCP optimizes the input channel. For structured output (when needed), the shadow index can be re-presented as an output constraint — the AI responds within schema-defined ranges and choices. This is optional; most AI output is natural language and should remain so.
Normalize values for token cost. LLM tokenizers treat 0.36 (2 tokens) differently from 92 (1 token). Use the simplest representation: integers 0-100 over floats 0.00-1.00, seconds over milliseconds, 0/1 over true/false.
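The normalizations named above can be applied per field before a row is written. The helpers below are a hypothetical sketch; which fields hold ratios, durations, or flags is something the producer has to know from its own schema, and the millisecond conversion assumes whole-second precision is acceptable.

```python
def ratio_to_percent(value):
    """Map a float in 0.00-1.00 to an integer in 0-100."""
    return round(value * 100)

def ms_to_seconds(ms):
    """Convert milliseconds to whole seconds (sub-second precision is dropped)."""
    return round(ms / 1000)

def bool_to_bit(flag):
    """Represent true/false as 1/0."""
    return 1 if flag else 0

# Hypothetical record: a confidence ratio, a duration in ms, and a flag
raw = {"confidence": 0.36, "duration_ms": 187000, "retry": False}
row = [
    ratio_to_percent(raw["confidence"]),
    ms_to_seconds(raw["duration_ms"]),
    bool_to_bit(raw["retry"]),
]
print(row)
```

Actual token counts depend on the tokenizer in use, so the savings are worth measuring against the target model.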
Can LLMs Read DCP?
Yes. Format comparison testing (3 models × 4 tasks × 3 runs) shows DCP positional arrays match JSON accuracy at ≥2B parameters. When a model fails, it fails across all formats equally — format is not the bottleneck. See Format Comparison for full data.
Benchmark: DCP vs JSON vs Natural Language
Given that DCP is as readable as JSON, the remaining question is cost. We ran a reproducible benchmark comparing the same data in three formats across data size, parse speed, and LLM token cost.
Data Size (10,000 records)
| Format | bytes/record | vs DCP |
|---|---|---|
| DCP compact | 83 B | 1.00x |
| JSON (JSONL) | 182 B | 2.19x |
| Natural language | 223 B | 2.69x |
DCP is less than half the size of JSON and a little over a third the size of natural language (83 B versus 182 B and 223 B per record). The ratio is stable across scales (100 to 10,000 records).
Parse Speed (10,000 records)
| Format | Total | per record | vs DCP |
|---|---|---|---|
| DCP compact | 10.9 ms | 1.09 μs | 1.00x |
| JSON (JSONL) | 15.8 ms | 1.58 μs | 1.45x |
| Natural language | 26.6 ms | 2.66 μs | 2.44x |
The NL figure is regex parsing against a controlled template. Real-world natural language requires LLM inference — orders of magnitude slower.
Token Cost (LLM context consumption)
| Format | 10,000 records | vs DCP | at $3/1M tokens |
|---|---|---|---|
| DCP compact | ~207K tokens | 1.00x | $0.62 |
| JSON (JSONL) | ~455K tokens | 2.19x | $1.36 |
| Natural language | ~557K tokens | 2.69x | $1.67 |
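The cost column is straightforward arithmetic: tokens divided by one million, times the price per million. The sketch below recomputes it from the rounded token counts in the table, so results can differ from the table in the last digit because the table was presumably derived from unrounded counts.

```python
PRICE_PER_MILLION_USD = 3.00  # price used in the table header

# Approximate token counts copied from the table above
token_counts = {
    "DCP compact": 207_000,
    "JSON (JSONL)": 455_000,
    "Natural language": 557_000,
}

def cost_usd(tokens, price_per_million=PRICE_PER_MILLION_USD):
    """Cost of sending `tokens` input tokens at a flat per-million price."""
    return tokens / 1_000_000 * price_per_million

baseline = token_counts["DCP compact"]
for name, tokens in token_counts.items():
    ratio = tokens / baseline
    print(f"{name}: {ratio:.2f}x DCP, about ${cost_usd(tokens):.3f}")
```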
The Real Gap: Parsing Cost
DCP and JSON parse with zero LLM cost — string operations only. Natural language requires LLM inference to extract structured data:
1,000 records parsing cost:
DCP/JSON: $0.0000 (JSON.parse / array index)
NL: $0.2163 (Sonnet input + output tokens)
The most expensive thing about natural language as a data format isn't the bytes; it's that parsing requires inference.
You minify JavaScript before deploying to production. Why wouldn't you minify data before sending it to an AI?