Specification — AI-Native Data Format
If no human reads the data, there's no reason to write it in a human-readable format.
Principle
LLMs are Large Language Models — their strength is understanding intent, reasoning through nuance, and communicating in natural language. DCP respects this. It optimizes the input channel (structured data delivery), not the output. What comes out of the LLM is the LLM's domain.
The Problem
LLMs produce and consume text at extraordinary cost. Every token matters — in API billing, context window budget, and inference latency. Yet the data AI agents exchange with each other is overwhelmingly formatted for human readability: verbose JSON with repeated keys, natural language descriptions where structured data would suffice, self-documenting formats read by no one.
The question is simple: if only machines read this data, why are we formatting it for humans?
Core Idea
Data Cost Protocol (DCP) is a convention for delivering structured data to AI agents. The rules:
- Define a schema once — field names, order, and types declared in a header
- Write data by position — no keys, no labels, no repetition. The schema says what position 3 means
- Inline the schema with the data — no external documentation needed to interpret
This is not a new serialization format. It's a design discipline: strip everything the consumer doesn't need.
Before and After
Simple case
```json
[
  { "id": 1, "name": "Alice", "score": 92 },
  { "id": 2, "name": "Bob", "score": 85 },
  { "id": 3, "name": "Charlie", "score": 88 }
]
```

With DCP:

```
[schema: id, name, score]
[[1,"Alice",92],[2,"Bob",85],[3,"Charlie",88]]
```

Real-world case: API monitoring data
A batch of API response metrics fed to an LLM for analysis:
```json
[
  { "endpoint": "/v1/users", "method": "GET", "status": 200, "latency_ms": 42 },
  { "endpoint": "/v1/orders", "method": "POST", "status": 201, "latency_ms": 187 },
  { "endpoint": "/v1/auth", "method": "POST", "status": 200, "latency_ms": 95 },
  { "endpoint": "/v1/search", "method": "GET", "status": 200, "latency_ms": 312 }
]
```

With DCP:

```
["$S","api-response:v1",4,"endpoint","method","status","latency_ms"]
["/v1/users","GET",200,42]
["/v1/orders","POST",201,187]
["/v1/auth","POST",200,95]
["/v1/search","GET",200,312]
```

4 records: JSON repeats 4 key names × 4 rows = 16 keys. DCP states them once. ~50% metadata reduction. At scale (hundreds of records per analysis), the savings compound.
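The producer side of this transformation fits in a few lines. A minimal encoder sketch, assuming the header layout shown above (one `$S` line, then positional rows in schema order):

```python
import json

def to_dcp(records, schema_id, fields):
    """Encode a list of dicts as DCP: one $S header plus positional rows."""
    # Header: literal "$S" marker, schema id, field count, field names.
    lines = [json.dumps(["$S", schema_id, len(fields), *fields],
                        separators=(",", ":"))]
    for rec in records:
        # Positional row: values emitted in schema order, keys dropped.
        lines.append(json.dumps([rec[f] for f in fields], separators=(",", ":")))
    return "\n".join(lines)

metrics = [
    {"endpoint": "/v1/users", "method": "GET", "status": 200, "latency_ms": 42},
    {"endpoint": "/v1/orders", "method": "POST", "status": 201, "latency_ms": 187},
]
print(to_dcp(metrics, "api-response:v1",
             ["endpoint", "method", "status", "latency_ms"]))
```

`separators=(",", ":")` matters: the default `json.dumps` spacing would add a token-wasting space after every comma.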
The $S Header — Schema-on-Wire
DCP data in the wild uses a compact header to declare which schema governs the rows that follow:
```
["$S", schema_id, field_count, ...field_names]
```

- `$S` — literal marker, signals "this is a schema declaration"
- `schema_id` — identifies the schema (e.g., `"knowledge:v1"`, `"hotmemo:v1"`)
- `field_count` — how many fields per data row
- `field_names` — positional field names for human audit
Data rows follow immediately. The first element of each row is a record-type tag or is omitted if the schema has only one type.
```
["$S","hotmemo:v1",4,"layer","source","signal","detail"]
["quality","push","no-type-tag","auth jwt migration fix"]
["receptor","passive","suggest","engram_pull"]
```

When both producer and consumer already know the schema, the header can be abbreviated to just the schema ID:
```
["$S","hotmemo:v1"]
["quality","push","no-type-tag","auth jwt migration fix"]
```

The system selects verbosity by the consumer's capability and session state. The full header is for agents handling multiple schemas simultaneously; single-schema consumers need only the field names.
For lightweight models (≤4B), field names alone produce the best comprehension — protocol markers are noise at this size. See Shadow Index for the 5-level density spectrum and Lightweight LLM results for test data.
Fixed-Length Principle
DCP arrays are fixed-length by design. Every record in a schema has the same number of fields, in the same order. This is what makes positional parsing, overlay, and cross-domain comparison work — index 4 always means the same thing.
Why fixed-length matters
- Parse cost: no key lookup, no field-count validation. Read position N, done.
- Overlay: stack arrays from different domains and compare by index. If lengths vary, alignment breaks.
- Schema as contract: the schema line declares the structure once. Every record honors it. No surprises.
Last-field escape hatch
The final field may optionally carry a free-form value (object, array, null). Interior fields stay positional. A well-designed schema rarely needs this.
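The fixed-length contract and the last-field escape hatch can be enforced mechanically. A validation sketch (the function name and error strings are illustrative, not part of the spec):

```python
def validate_rows(rows, field_count):
    """Check rows against the schema's declared field count.

    Interior fields must stay positional scalars; only the final field
    may carry a free-form value (object, array, null) as the escape hatch.
    """
    errors = []
    for i, row in enumerate(rows):
        if len(row) != field_count:
            errors.append(f"row {i}: expected {field_count} fields, got {len(row)}")
            continue
        for j, value in enumerate(row[:-1]):  # all fields except the last
            if isinstance(value, (dict, list)):
                errors.append(f"row {i}, field {j}: interior field must be scalar")
    return errors

rows = [
    ["quality", "push", "no-type-tag", {"note": "free-form last field is allowed"}],
    ["receptor", "passive", "suggest"],  # too short: breaks positional alignment
]
print(validate_rows(rows, 4))
```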
Schema Registry
Schemas are centralized as JSON definitions in a registry. Each schema declares its fields, types, enums, and examples:
```json
{
  "$dcp": "schema",
  "id": "hotmemo:v1",
  "fields": ["layer", "source", "signal", "detail"],
  "fieldCount": 4,
  "types": {
    "layer": { "type": "string", "enum": ["quality", "session", "trend", "meta", "receptor", "subsystem", "pre-neuron"] },
    "source": { "type": "string" },
    "signal": { "type": "string" },
    "detail": { "type": "string" }
  }
}
```

The registry serves as the single source of truth. Schemas are available via API (`GET /schemas`, `GET /schemas/:id`), embedded in tool descriptions, and referenced by hash for cache validation.
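A registry definition like this is enough to validate rows on the consumer side. A sketch that checks field count and enum membership against the `hotmemo:v1` definition above (type checking is omitted for brevity; `check_row` is an illustrative name, not a registry API):

```python
def check_row(row, schema):
    """Validate one positional row against a registry schema definition."""
    if len(row) != schema["fieldCount"]:
        return f"expected {schema['fieldCount']} fields, got {len(row)}"
    for value, name in zip(row, schema["fields"]):
        allowed = schema["types"].get(name, {}).get("enum")
        if allowed and value not in allowed:
            return f"{name}: {value!r} not in enum {allowed}"
    return None  # row conforms

hotmemo = {
    "$dcp": "schema",
    "id": "hotmemo:v1",
    "fields": ["layer", "source", "signal", "detail"],
    "fieldCount": 4,
    "types": {
        "layer": {"type": "string", "enum": ["quality", "session", "trend", "meta",
                                             "receptor", "subsystem", "pre-neuron"]},
        "source": {"type": "string"},
        "signal": {"type": "string"},
        "detail": {"type": "string"},
    },
}
print(check_row(["quality", "push", "no-type-tag", "auth jwt migration fix"], hotmemo))
```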
Design Properties
Pre-agreed, not self-describing. JSON repeats keys per record for human browsability. DCP declares the schema once — like Protocol Buffers and MessagePack, but in text because LLMs consume text.
Position is meaning. The same convention as CSV, function arguments, and array indexing — applied to AI data delivery.
Schema travels with data. No external docs to drift out of sync. Read the header, parse the rows.
System → AI is the primary direction. LLMs cannot reliably generate positionally correct arrays (0% correct ordering at ≤3.8B). For the AI → system direction, the shadow index is re-presented as an output constraint — the same schema that delivered the input now constrains the output. Deviations are capped as a safety net. No separate output mechanism exists.
Normalize values for token cost. LLM tokenizers treat `0.36` (2 tokens) differently from `92` (1 token). Use the simplest representation: integers 0-100 over floats 0.00-1.00, seconds over milliseconds, `0`/`1` over `true`/`false`.
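These rules can run as a pre-delivery pass over each row. A sketch of the two mechanical cases (booleans and unit-interval floats; unit conversions like milliseconds to seconds are schema-specific and left out):

```python
def normalize(value):
    """Rewrite a value into its cheapest token representation."""
    if value is True or value is False:
        return int(value)              # true/false -> 1/0
    if isinstance(value, float) and 0.0 <= value <= 1.0:
        return round(value * 100)      # 0.00-1.00 float -> integer 0-100
    return value

row = [True, 0.36, "GET", 42]
print([normalize(v) for v in row])  # [1, 36, 'GET', 42]
```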
Benchmark: DCP vs JSON vs Natural Language
Claims need numbers. We ran a reproducible benchmark comparing the same data in three formats across data size, parse speed, and LLM token cost.
Data Size (10,000 records)
| Format | bytes/record | vs DCP |
|---|---|---|
| DCP compact | 83 B | 1.00x |
| JSON (JSONL) | 182 B | 2.19x |
| Natural language | 223 B | 2.69x |
DCP is less than half the size of JSON, roughly a third of natural language. The ratio is stable across scales (100 to 10,000 records).
Parse Speed (10,000 records)
| Format | Total | per record | vs DCP |
|---|---|---|---|
| DCP compact | 10.9 ms | 1.09 μs | 1.00x |
| JSON (JSONL) | 15.8 ms | 1.58 μs | 1.45x |
| Natural language | 26.6 ms | 2.66 μs | 2.44x |
The NL figure is regex parsing against a controlled template. Real-world natural language requires LLM inference — orders of magnitude slower.
Token Cost (LLM context consumption)
| Format | 10,000 records | vs DCP | at $3/1M tokens |
|---|---|---|---|
| DCP compact | ~207K tokens | 1.00x | $0.62 |
| JSON (JSONL) | ~455K tokens | 2.19x | $1.36 |
| Natural language | ~557K tokens | 2.69x | $1.67 |
The Real Gap: Parsing Cost
DCP and JSON parse with zero LLM cost — string operations only. Natural language requires LLM inference to extract structured data:
1,000 records parsing cost:

```
DCP/JSON: $0.0000 (JSON.parse / array index)
NL:       $0.2163 (Sonnet input + output tokens)
```

The most expensive thing about natural language as a data format isn't the bytes — it's that parsing requires inference.
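The size comparison is easy to reproduce in miniature. A sketch with synthetic records (absolute byte counts differ from the table above, which used the full benchmark dataset, but the ratio lands in the same neighborhood):

```python
import json

fields = ["endpoint", "method", "status", "latency_ms"]
records = [{"endpoint": f"/v1/item{i}", "method": "GET",
            "status": 200, "latency_ms": i % 500} for i in range(1000)]

# JSONL: keys repeated on every line.
jsonl = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# DCP: one $S header line, then positional rows.
header = json.dumps(["$S", "api-response:v1", len(fields), *fields],
                    separators=(",", ":"))
dcp = "\n".join([header] + [json.dumps([r[f] for f in fields],
                                       separators=(",", ":")) for r in records])

print(f"JSONL: {len(jsonl) / len(records):.1f} B/record")
print(f"DCP:   {len(dcp) / len(records):.1f} B/record")
print(f"ratio: {len(jsonl) / len(dcp):.2f}x")
```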
Why This Matters
The AI industry is approaching data exchange as a JSON optimization problem (TOON, compressed JSON variants). These strip syntax overhead — braces, quotes, colons — but preserve the key-value structure.
DCP asks a different question: why have keys at all? If the consumer knows the schema, every key is a wasted token. For N records with K fields, JSON repeats K key names N times. DCP states them once.
As AI agents consume more structured data — session state, knowledge graphs, behavioral signals, configuration — the volume of system-to-AI data delivery grows fast. Formatting that traffic for human readability is a cost no one will want to pay.
You minify JavaScript before deploying to production. Why wouldn't you minify data before sending it to an AI?