BEBO format specification v2.1
The on-disk format for Ceradela cold archives. This document is authoritative: any divergence between this spec and the reference decoder is a bug in one or the other, and we'll fix it.
Overview
BEBO is a columnar archive format. A file contains rows from one table grouped into row groups, each compressed independently with zstd. Per-column metadata lives in the header; per-column byte offsets enable O(1) column pruning on read.
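As a rough illustration of column pruning, the sketch below slices one column out of an already-decompressed row group payload using the per-column data offset and data length fields from the header. `ColumnMeta` and `slice_column` are hypothetical names, not part of the reference decoder, and zstd decompression is assumed to have happened beforehand.

```python
from typing import NamedTuple


class ColumnMeta(NamedTuple):
    """Hypothetical in-memory form of one per-column header entry."""
    data_offset: int  # where the column's bytes start in the row group payload
    data_length: int  # byte span of the column's data


def slice_column(payload: bytes, index: dict[str, ColumnMeta], name: str) -> bytes:
    """Return only the bytes of one column from a decompressed payload.

    The dict lookup plus a single slice is what makes pruning O(1) in the
    number of columns: no other column's bytes are scanned.
    """
    meta = index[name]
    end = meta.data_offset + meta.data_length
    if end > len(payload):
        raise ValueError(f"column {name!r} runs past the payload")
    return payload[meta.data_offset:end]
```

A reader that needs only two of fifty columns would call `slice_column` twice and never touch the other forty-eight.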
File extensions
| Extension | What it is |
|---|---|
| .cbebomth | Monthly archive: one month of one table's rows |
| .cbeboqtr | Quarterly bundle: 3 monthly .cbebomth files wrapped in a TOC |
| .cbeboyr | Yearly bundle: 4 quarterly .cbeboqtr files wrapped in a TOC |
| .cbeboddl | Raw DDL snapshot: plain UTF-8 CREATE statement |
Monthly file layout (.cbebomth)
offset bytes content
────── ───── ─────────────────────────────
0 4 "BEBO" magic
4 1 version (currently 0x02)
5 1 flags (reserved, 0x00)
6 4 header length (little-endian uint32)
10 N header (zstd-compressed)
10+N varies row groups (each: 4-byte compLen + compressed payload)
end-12 4 footer length (little-endian uint32)
end-8 4 "BEBO" magic (trailer sentinel)
end-4 4 CRC32 of everything from offset 0 to end-4
Header section (after zstd decompression)
"BL2P" magic (4 bytes) — inner format tag
row count (uint32 LE)
column count (uint16 LE)
Per column:
encoding tag (uint8)
name length (uint8)
name bytes (UTF-8)
data offset (uint32 LE) — where this column's bytes start in the row group payload
data length (uint32 LE) — byte span of this column's data
encoding-specific metadata (see encoding table)
After columns:
header_len (uint32 LE) — byte offset where data section starts
data section: column bytes, one block per column
Encoding tags
| Tag | Name | What it encodes |
|---|---|---|
| 0x01 | raw_int32 | Sequential int32 values (4 bytes each) |
| 0x02 | raw_int64 | Sequential int64 values (8 bytes each) |
| 0x03 | raw_uint16 | Small non-negative ints (≤ 65535) |
| 0x04 | delta_int32 | Monotonic sequences — stores v[i] - v[i-1] |
| 0x05 | dict_uint8 | ≤ 256 unique int values, one byte per row |
| 0x06 | dict_uint16 | ≤ 65536 unique int values, two bytes per row |
| 0x07 | enum_uint8 | ≤ 256 unique strings each ≤ 255 chars; one byte per row |
| 0x08 | epoch_delta | Timestamps as seconds since a base epoch stored in the header |
| 0x09 | bitpack_bool | 8 booleans per byte |
| 0x0A | string_raw | Length-prefixed (uint16) UTF-8 strings, up to 65535 chars |
| 0x0B | nullable_int32 | Null bitmap (1 bit/row) + dense non-null int32 values |
| 0x0C | nullable_int64 | Null bitmap + dense non-null int64 values |
| 0x0D | nullable_string | Null bitmap + dense string_raw values |
| 0x0E | nullable_time | Null bitmap + dense epoch_delta values |
| 0x10 | rle | Run-length: {value, count} pairs. Rarely auto-selected. |
| 0x11 | for | Frame-of-reference: base + uint16 offsets for clustered ints |
| 0x12 | rle_dict | RLE over dict_uint8 indices |
| 0x13 | byte_split_32 | Scatter int32 bytes into 4 streams for better zstd compression |
| 0x14 | byte_split_64 | Scatter int64 bytes into 8 streams |
| 0x15 | array_int64 | Null bitmap + per-row [len:u32] + flat int64 values (PG integer[]/bigint[]) |
| 0x16 | array_string | Null bitmap + per-row [len:u32] + per-item [len:u16 + bytes] (PG text[]) |
| 0x17 | jsonb_raw | Null bitmap + per-row [len:u32] + raw JSONB bytes |
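Two of the encodings above, sketched in Python to make the byte-level idea concrete. The table does not pin down every detail, so the following are assumptions: all values are little-endian, the first `delta_int32` entry is a delta from 0 (that is, the first value itself), and `byte_split_32` stores stream 0 (lowest byte of every value) first, followed by streams 1 to 3. The function names are illustrative only.

```python
import struct
from itertools import accumulate


def decode_delta_int32(data: bytes) -> list[int]:
    """Rebuild values from little-endian int32 deltas via a running sum."""
    count = len(data) // 4
    deltas = struct.unpack(f"<{count}i", data[:count * 4])
    return list(accumulate(deltas))


def encode_byte_split_32(values: list[int]) -> bytes:
    """Scatter each int32's 4 bytes into 4 streams, stored back to back."""
    raw = struct.pack(f"<{len(values)}i", *values)
    # raw[k::4] picks byte k of every value, which is stream k.
    return b"".join(raw[k::4] for k in range(4))


def decode_byte_split_32(data: bytes) -> list[int]:
    """Interleave the 4 streams back into little-endian int32 values."""
    count = len(data) // 4
    raw = bytearray(count * 4)
    for k in range(4):
        raw[k::4] = data[k * count:(k + 1) * count]
    return list(struct.unpack(f"<{count}i", raw))
```

Byte splitting helps zstd because the high-order bytes of clustered integers are nearly constant, so three of the four streams tend to compress to almost nothing.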
Encoding selection
The encoder picks automatically based on column content:
- If any value is null → use a nullable variant
- If all strings fit in 255 chars and ≤ 256 unique → enum_uint8
- Else for strings → string_raw or nullable_string
- For ints: delta_int32 for the id column when sorted, dict_uint8 / dict_uint16 for ≤ 65536 unique values, byte_split_32 / byte_split_64 otherwise
- For timestamps → epoch_delta or nullable_time
- For []byte / json.RawMessage → jsonb_raw
- For []int64 / []int32 → array_int64
- For []string → array_string
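The rules above can be read as a small decision function. The sketch below covers only the scalar cases (strings, ints, timestamps) and is one plausible reading, not the reference encoder's logic. The `kind` labels and the `sorted_id` flag are hypothetical, the order in which rules are tested is an assumption, and "255 chars" is checked with Python's `len`, which counts code points rather than bytes.

```python
def pick_encoding(kind: str, values: list, sorted_id: bool = False) -> str:
    """Pick an encoding name for one column.

    kind is a hypothetical type label: "string", "int32", "int64" or "time".
    sorted_id marks the sorted id column, which gets delta_int32.
    """
    has_null = any(v is None for v in values)
    present = [v for v in values if v is not None]

    if kind == "string":
        if has_null:
            return "nullable_string"
        if len(set(present)) <= 256 and all(len(v) <= 255 for v in present):
            return "enum_uint8"
        return "string_raw"
    if kind == "time":
        return "nullable_time" if has_null else "epoch_delta"
    if kind in ("int32", "int64"):
        if has_null:
            return f"nullable_{kind}"
        if kind == "int32" and sorted_id:
            return "delta_int32"
        unique = len(set(present))
        if unique <= 256:
            return "dict_uint8"
        if unique <= 65536:
            return "dict_uint16"
        return "byte_split_32" if kind == "int32" else "byte_split_64"
    raise ValueError(f"unsupported kind: {kind}")
```

For instance, `pick_encoding("int64", list(range(300)))` falls through to `dict_uint16` because 300 unique values no longer fit a one-byte index.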
Bundle layouts (.cbeboqtr, .cbeboyr)
[4] magic ("BEQR" for qtr, "BEYR" for yr)
[1] inner count (uint8, usually 3 or 4)
per inner:
[7] label (UTF-8, NUL-padded to 7 bytes)
[4] offset within payload (uint32 LE)
[4] length (uint32 LE)
[4] total payload size (uint32 LE)
[N] payload — concatenated inner blobs (already zstd-compressed internally)
[4] CRC32 of everything above
Bundles never double-compress. The inner .cbebomth / .cbeboqtr files are already zstd-compressed row-group-by-row-group, and re-compressing them at the bundle level added overhead for < 1% compression benefit.
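A minimal sketch of reading the bundle layout above: it checks the magic and CRC32, walks the TOC, and returns each inner blob by label. Two details are assumptions because the layout does not state them: the CRC32 is taken to be the common zlib/IEEE variant, and it is taken to be stored little-endian like the other integers. `parse_bundle` is a hypothetical name.

```python
import struct
import zlib

ENTRY = struct.Struct("<7sII")  # label, offset within payload, length (15 bytes)


def parse_bundle(blob: bytes) -> dict[str, bytes]:
    """Split a bundle into {label: inner blob}, checking magic and CRC32."""
    if blob[:4] not in (b"BEQR", b"BEYR"):
        raise ValueError("not a BEBO bundle")
    (crc,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(blob[:-4]) != crc:
        raise ValueError("bundle CRC32 mismatch")

    count = blob[4]  # inner count, uint8
    pos = 5
    entries = []
    for _ in range(count):
        label, offset, length = ENTRY.unpack_from(blob, pos)
        # Labels are NUL-padded to 7 bytes; strip the padding before decoding.
        entries.append((label.rstrip(b"\x00").decode("utf-8"), offset, length))
        pos += ENTRY.size

    (payload_size,) = struct.unpack_from("<I", blob, pos)
    payload = blob[pos + 4:pos + 4 + payload_size]
    if len(payload) != payload_size:
        raise ValueError("payload is truncated")
    # Offsets in the TOC are relative to the start of the payload.
    return {label: payload[off:off + ln] for label, off, ln in entries}
```

Because the inner blobs are stored uncompressed at the bundle level, extracting one month from a quarterly bundle is a plain byte slice with no decompression step.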
DDL snapshots (.cbeboddl)
Plain UTF-8. No magic, no versioning, no compression. Just the CREATE ... statement. The assumption is that DDL is tiny (a few KB) and lasts forever — optimizing for bytes here doesn't matter.
CRC32 placement
The trailing CRC32 covers the entire file (not including itself). A "BEBO" magic sentinel sits in the 4 bytes before it so decoders can verify they're reading the trailer, not a false match mid-file.
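The trailer check for a monthly file might look like the sketch below. It assumes the CRC32 is the zlib/IEEE variant stored little-endian, which the spec does not state explicitly, and it returns the footer length field without interpreting the footer itself. `check_trailer` is a hypothetical name.

```python
import struct
import zlib

TRAILER = struct.Struct("<I4sI")  # footer length, "BEBO" sentinel, CRC32
PREFIX_SIZE = 10                  # magic + version + flags + header length


def check_trailer(blob: bytes) -> int:
    """Verify the sentinel and CRC32 of a monthly file; return the footer length."""
    if len(blob) < PREFIX_SIZE + TRAILER.size:
        raise ValueError("file too short to hold a prefix and trailer")
    footer_len, sentinel, crc = TRAILER.unpack_from(blob, len(blob) - TRAILER.size)
    # Check the sentinel first: it is cheap and catches truncated files
    # before the CRC pass over the whole file.
    if sentinel != b"BEBO":
        raise ValueError("trailer sentinel missing")
    # The CRC covers everything except its own 4 bytes.
    if zlib.crc32(blob[:-4]) != crc:
        raise ValueError("CRC32 mismatch")
    return footer_len
```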
Versioning
Byte 4 of the file is the format version. Current version: 0x02.
Decoders accept older versions forever — breaking format changes
would require a new magic ("BEBO" → something else) and a six-month
migration window.
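A decoder's version gate could be sketched as follows, reading the 10-byte prefix from the monthly file layout. Accepting every version up to the current one follows the rule above; rejecting versions newer than the decoder knows about is an assumption, as is the name `read_prefix`.

```python
import struct

PREFIX = struct.Struct("<4sBBI")  # magic, version, flags, header length
CURRENT_VERSION = 0x02


def read_prefix(blob: bytes) -> tuple[int, int, int]:
    """Return (version, flags, header_length) from the first 10 bytes."""
    if len(blob) < PREFIX.size:
        raise ValueError("file shorter than the 10-byte prefix")
    magic, version, flags, header_len = PREFIX.unpack_from(blob, 0)
    if magic != b"BEBO":
        raise ValueError("bad magic")
    # Older versions stay readable forever; a newer one means this
    # decoder predates the file (assumed behaviour: refuse to guess).
    if version > CURRENT_VERSION:
        raise ValueError(f"unsupported format version 0x{version:02x}")
    return version, flags, header_len
```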
Reference implementation
The bebo CLI is the reference decoder:
github.com/ceradela/bebo-cli.
Any third-party implementation that passes bebo verify --deep on all test fixtures is considered spec-compliant.