Your AI agent reads a table called orders. It doesn’t know the schema version. It doesn’t know the license. It doesn’t know whether that column was deprecated last quarter. It just consumes, reasons, and responds — with confidence, and without context.
That’s the problem the Open Knowledge Format (OKF) was designed to solve.
OKF is an open specification introduced by Google that formalizes knowledge as a directory of markdown files with YAML frontmatter. No new runtime. No proprietary SDK. Just a vendor-neutral, agent- and human-readable format for describing what your data is, where it lives, how it’s structured, and who owns it.
OKF turns blind data access into informed, auditable reasoning — the descriptor is the layer agents load before they query.
What OKF actually looks like
An OKF bundle is a directory. Inside it, each concept — a table, a metric, a dataset — gets its own markdown file. A structured YAML header captures the machine-readable facts; the markdown body holds the human-readable explanation.
---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, revenue]
timestamp: 2026-05-28T14:30:00Z
---
# Schema
| Column | Type | Description |
|--------|------|-------------|
| `order_id` | STRING | Globally unique order identifier. |
| `customer_id` | STRING | FK to customers.md. |
# Joins
Joined with customers on `customer_id`.
The format is minimal by design. Only type is required. Producers define their own schemas; consumers parse what they understand and skip what they don’t. The result is a knowledge bundle that lives in version control, renders on GitHub, and can be read directly by any LLM, with no translation layer required.
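To make "parse what you understand, skip what you don’t" concrete, here is a minimal consumer sketch in Python. It assumes the frontmatter only uses flat `key: value` pairs and inline lists, as in the example above. `parse_okf` is a hypothetical helper name, not part of any OKF tooling, and a real consumer would likely hand the header to a full YAML parser instead.

```python
def parse_okf(text: str) -> tuple[dict, str]:
    """Split an OKF-style file into (frontmatter dict, markdown body).

    Simplified on purpose: handles flat `key: value` pairs and inline
    lists like `[a, b]` only. This is NOT a general YAML parser.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("missing closing '---' frontmatter delimiter") from None

    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue  # skip anything this simplified parser does not understand
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value

    if "type" not in meta:
        raise ValueError("'type' is the only required field and it is missing")
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


SAMPLE = """---
type: BigQuery Table
title: Orders
tags: [sales, revenue]
timestamp: 2026-05-28T14:30:00Z
---
# Schema
One row per completed customer order.
"""

meta, body = parse_okf(SAMPLE)
print(meta["type"], meta["tags"])
```

Unknown keys simply land in the dictionary and can be ignored by the caller, which mirrors the skip-what-you-don’t-understand rule.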
The primary angle: OKF as context packaging for agents
The most underappreciated consequence of OKF is what it does at the context-ingestion step of an agentic system.
Right now, an agent touching a data source has to infer what it’s working with from the data itself: column names, sample values, whatever was in the prompt. That inference is unreliable at best and hallucination-prone at worst. OKF gives the agent a structured descriptor it can load before it reasons:
- What is this? (type, title, description)
- Where does it live? (resource: a direct link to the BigQuery console, S3 path, or API endpoint)
- When was it last updated? (timestamp)
- What does it relate to? (markdown cross-links between concept files)
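As a rough illustration, the four questions above can be turned into a compact preamble for the agent. This sketch assumes the frontmatter has already been parsed into a dictionary and that cross-links are written as standard markdown links to `.md` files; `render_context` and `related_concepts` are hypothetical names chosen for this example.

```python
import re

# Assumption: cross-links look like [customers](customers.md).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+\.md)\)")


def related_concepts(body: str) -> list[str]:
    """Return the distinct .md files linked from a descriptor body."""
    return sorted(set(LINK_RE.findall(body)))


def render_context(meta: dict, body: str) -> str:
    """Build a short what / where / when / related summary for the prompt."""
    what = " | ".join(str(meta[k]) for k in ("type", "title", "description") if k in meta)
    lines = [f"What: {what}"]
    if "resource" in meta:
        lines.append(f"Where: {meta['resource']}")
    if "timestamp" in meta:
        lines.append(f"Last updated: {meta['timestamp']}")
    links = related_concepts(body)
    if links:
        lines.append("Related: " + ", ".join(links))
    return "\n".join(lines)


meta = {
    "type": "BigQuery Table",
    "title": "Orders",
    "description": "One row per completed customer order.",
    "resource": "https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders",
    "timestamp": "2026-05-28T14:30:00Z",
}
body = "Joined with [customers](customers.md) on `customer_id`."

summary = render_context(meta, body)
print(summary)
```

Fields the descriptor does not provide are left out rather than guessed, so the preamble never claims more than the producer wrote down.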
In an MCP context, an OKF bundle is a natural fit for a resource exposed by an MCP server. The agent fetches the descriptor for the table it’s about to query, loads the schema and join relationships, then executes. The agent then works from declared column types and join keys instead of inferring them from names and sample values.
This also solves the provenance problem for audit. When an agent logs “I queried orders at 2026-06-26T10:00Z,” you want to know which schema version it saw. An OKF descriptor with a timestamp field gives you that anchor. Combined with an MCP gateway that logs every tool call, you get a complete chain: which agent, which descriptor version, which query, which result.
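One way to picture that chain is a small audit record written next to each query. The sketch below is an assumption-heavy illustration: the field names are made up for this example, and the SHA-256 content hash is an addition of this sketch (not an OKF field) that pins the exact descriptor text alongside its timestamp.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(agent_id, descriptor_text, descriptor_timestamp, query, queried_at=None):
    """Tie one query to the exact descriptor version the agent loaded."""
    queried_at = queried_at or datetime.now(timezone.utc)
    return {
        "agent": agent_id,
        # Hash of the descriptor file content: changes whenever the file changes.
        "descriptor_sha256": hashlib.sha256(descriptor_text.encode("utf-8")).hexdigest(),
        # The descriptor's own frontmatter timestamp, copied verbatim.
        "descriptor_timestamp": descriptor_timestamp,
        "query": query,
        "queried_at": queried_at.isoformat(),
    }


record = audit_record(
    agent_id="reporting-agent",  # hypothetical agent name
    descriptor_text="---\ntype: BigQuery Table\ntitle: Orders\n---\n",
    descriptor_timestamp="2026-05-28T14:30:00Z",
    query="SELECT COUNT(*) FROM sales.orders",
    queried_at=datetime(2026, 6, 26, 10, 0, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

If the descriptor lives in version control, a commit hash could serve the same purpose as the content hash.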
If you’re thinking about context engineering, OKF is the layer that packages knowledge before it enters the window. The descriptor is structured, typed, versioned, and compact — exactly the properties that matter when every token is a budget decision.
The data infrastructure angle: OKF in AI training pipelines
OKF also addresses a chronic pain in AI data pipelines: undocumented dataset dependencies.
Training and fine-tuning pipelines reference datasets that change without warning — schema drift, license changes, coverage gaps. OKF makes these dependencies explicit. A dataset descriptor in your pipeline repository captures the schema at training time, the source license, and any known quality caveats. When something breaks downstream, you have a versioned record of what the model was trained on.
The format’s tags field is particularly useful here. You can tag datasets by sensitivity level, compliance scope, or data domain, then build policy checks that gate which tagged datasets a given pipeline can access. This is the data-contract pattern applied to AI pipelines — and it’s composable with any existing data catalog because OKF is just files.
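A policy check of this kind can be very small. The sketch below assumes each dataset's tags have already been read from its descriptor and that a pipeline declares a set of tags it must never touch; the tag names `pii` and `restricted` and the function name `blocked_datasets` are hypothetical.

```python
def blocked_datasets(datasets: dict[str, list[str]], forbidden: set[str]) -> dict[str, list[str]]:
    """Return each dataset that carries a forbidden tag, with the offending tags."""
    blocked = {}
    for name, tags in datasets.items():
        hits = sorted(forbidden.intersection(tags))
        if hits:
            blocked[name] = hits
    return blocked


# Tags as they might appear in each descriptor's frontmatter.
datasets = {
    "orders": ["sales", "revenue"],
    "customers": ["sales", "pii"],
}

violations = blocked_datasets(datasets, forbidden={"pii", "restricted"})
print(violations)
```

A pipeline could run this as a pre-flight step and refuse to start when the result is non-empty.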
The platform engineering angle: OKF as an IDP primitive
If you run an internal developer platform — Backstage, SwaggerHub, a custom service catalog — OKF is the data-contract layer you’ve been missing.
Today, most IDPs model services, APIs, and infrastructure well. They model data poorly. The result: a team wants to know what tables their service can access, and the answer is “ask the data team.” OKF closes that gap. A data team publishes an OKF bundle for their domain; the IDP ingests it; service owners can discover, link to, and reference it in their service descriptors.
In regulated industries such as aviation, finance, and healthcare, this has compliance implications. Because producers define their own fields, a descriptor can carry a license field alongside its timestamp, which creates an auditable record of what data a service was consuming and under what terms. That’s the kind of paper trail an audit requires and most platforms can’t currently produce.
Getting started: add an OKF descriptor to a dataset your agent already uses
The fastest path to value is narrow. Pick one dataset your agent already queries. Create a single markdown file next to your agent code:
knowledge/
└── tables/
    └── orders.md
Add the YAML frontmatter — type, title, description, resource, timestamp. Write a one-paragraph markdown body explaining the schema and any join relationships. Load it into your agent’s context before the query step.
That’s it. You’ve just given your agent structured knowledge about one data source. Now repeat for the next one.
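The loading step above can be sketched in a few lines of Python. This is one possible approach under simple assumptions: descriptors are plain `.md` files under a `knowledge/` directory, and they are prepended verbatim to the question. `load_descriptors` and `build_prompt` are hypothetical helpers, and the temporary directory only exists to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def load_descriptors(root: Path) -> str:
    """Concatenate every descriptor under root, labeled by relative path."""
    parts = []
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root).as_posix()
        text = path.read_text(encoding="utf-8").strip()
        parts.append(f"[descriptor: {rel}]\n{text}")
    return "\n\n".join(parts)


def build_prompt(root: Path, question: str) -> str:
    """Place the descriptors ahead of the question, before the query step."""
    return f"{load_descriptors(root)}\n\nQuestion: {question}"


# Demo: recreate the layout shown above in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "knowledge"
    (root / "tables").mkdir(parents=True)
    (root / "tables" / "orders.md").write_text(
        "---\ntype: BigQuery Table\ntitle: Orders\n---\n# Schema\n",
        encoding="utf-8",
    )
    prompt = build_prompt(root, "How many orders were completed last week?")

print(prompt)
```

As the bundle grows, it may make sense to load only the descriptors for the tables a given query touches rather than the whole directory.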
The full OKF specification and reference implementations, including an enrichment agent that auto-generates descriptors and a static HTML visualizer, are linked from Google’s OKF announcement.