BEBO format specification v2.1

The on-disk format for Ceradela cold archives. This document is authoritative: any divergence between this spec and the reference decoder is a bug in one or the other, and we will fix whichever is wrong.

Overview

BEBO is a columnar archive format. A file contains rows from one table grouped into row groups, each compressed independently with zstd. Per-column metadata lives in the header; per-column byte offsets enable O(1) column pruning on read.

File extensions

Extension	What it is
`.cbebomth`	Monthly archive — one month of one table's rows
`.cbeboqtr`	Quarterly bundle — 3 monthly .cbebomth files wrapped in a TOC
`.cbeboyr`	Yearly bundle — 4 quarterly .cbeboqtr files wrapped in a TOC
`.cbeboddl`	Raw DDL snapshot — plain UTF-8 CREATE statement

Monthly file layout (.cbebomth)

offset   bytes   content
──────   ─────   ─────────────────────────────
0        4       "BEBO" magic
4        1       version (currently 0x02)
5        1       flags (reserved, 0x00)
6        4       header length (little-endian uint32)
10       N       header (zstd-compressed)

10+N     varies  row groups (each: 4-byte compLen + compressed payload)

end-12   4       footer length (little-endian uint32)
end-8    4       "BEBO" magic (trailer sentinel)
end-4    4       CRC32 of everything from offset 0 to end-4
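
As an illustration of the layout above, here is a minimal Python sketch that reads the fixed 10-byte prefix and the 12-byte trailer. The function names `read_prefix` and `read_trailer` are hypothetical and not part of the reference decoder. The sketch does not decompress the header or interpret the footer, and it leaves the CRC as raw bytes.

```python
import struct

MAGIC = b"BEBO"
PREFIX_LEN = 10   # magic(4) + version(1) + flags(1) + header length(4)
TRAILER_LEN = 12  # footer length(4) + magic sentinel(4) + CRC32(4)


def read_prefix(buf: bytes) -> dict:
    """Parse the fixed prefix of a .cbebomth file held in memory."""
    if len(buf) < PREFIX_LEN + TRAILER_LEN:
        raise ValueError("too short to hold prefix and trailer")
    # "<4sBBI": 4-byte magic, version, flags, uint32 LE header length.
    magic, version, flags, header_len = struct.unpack_from("<4sBBI", buf, 0)
    if magic != MAGIC:
        raise ValueError("bad leading magic")
    header_end = PREFIX_LEN + header_len
    if header_end > len(buf) - TRAILER_LEN:
        raise ValueError("header length runs past the trailer")
    return {
        "version": version,
        "flags": flags,
        "header_start": PREFIX_LEN,
        "header_end": header_end,  # row groups begin at this offset
    }


def read_trailer(buf: bytes) -> dict:
    """Read the last 12 bytes: footer length, sentinel, raw CRC bytes."""
    footer_len, sentinel, crc_bytes = struct.unpack_from(
        "<I4s4s", buf, len(buf) - TRAILER_LEN
    )
    if sentinel != MAGIC:
        raise ValueError("trailer sentinel missing")
    return {"footer_len": footer_len, "crc_bytes": crc_bytes}
```

The header slice `buf[header_start:header_end]` is still zstd-compressed at this point and would need to be decompressed before parsing.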

Header section (after zstd decompression)

"BL2P" magic (4 bytes)          — inner format tag
row count (uint32 LE)
column count (uint16 LE)

Per column:
  encoding tag (uint8)
  name length (uint8)
  name bytes (UTF-8)
  data offset (uint32 LE) — where this column's bytes start in the row group payload
  data length (uint32 LE) — byte span of this column's data
  encoding-specific metadata (see encoding table)

After columns:
  header_len (uint32 LE)       — byte offset where data section starts
  data section                 — column bytes, one block per column
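
The fixed-width parts of the decompressed header can be sketched in Python as follows. The helper names are hypothetical. The sketch stops at the start of each column's encoding-specific metadata, because its size depends on the encoding tag and is not modelled here.

```python
import struct


def parse_header_preamble(hdr: bytes):
    """Return (row_count, column_count, offset of the first column entry)."""
    # "<4sIH": inner magic, uint32 LE row count, uint16 LE column count.
    magic, row_count, col_count = struct.unpack_from("<4sIH", hdr, 0)
    if magic != b"BL2P":
        raise ValueError("bad inner format tag")
    return row_count, col_count, 10


def parse_column_fixed(hdr: bytes, pos: int):
    """Parse one column's fixed fields starting at pos.

    Returns (column dict, offset where encoding-specific metadata begins).
    """
    tag, name_len = struct.unpack_from("<BB", hdr, pos)
    pos += 2
    name = hdr[pos:pos + name_len].decode("utf-8")
    pos += name_len
    data_offset, data_length = struct.unpack_from("<II", hdr, pos)
    pos += 8
    column = {
        "tag": tag,
        "name": name,
        "data_offset": data_offset,
        "data_length": data_length,
    }
    return column, pos
```

A full decoder would look up the tag in the encoding table, consume that encoding's metadata, and then continue with the next column.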

Encoding tags

Tag	Name	What it encodes
0x01	raw_int32	Sequential int32 values (4 bytes each)
0x02	raw_int64	Sequential int64 values (8 bytes each)
0x03	raw_uint16	Small non-negative ints (≤ 65535, 2 bytes each)
0x04	delta_int32	Monotonic sequences — stores `v[i] - v[i-1]`
0x05	dict_uint8	≤ 256 unique int values, one byte per row
0x06	dict_uint16	≤ 65536 unique int values, two bytes per row
0x07	enum_uint8	≤ 256 unique strings each ≤ 255 chars; one byte per row
0x08	epoch_delta	Timestamps as seconds since a base epoch stored in the header
0x09	bitpack_bool	8 booleans per byte
0x0A	string_raw	Length-prefixed (uint16) UTF-8 strings, up to 65535 chars
0x0B	nullable_int32	Null bitmap (1 bit/row) + dense non-null int32 values
0x0C	nullable_int64	Null bitmap + dense non-null int64 values
0x0D	nullable_string	Null bitmap + dense string_raw values
0x0E	nullable_time	Null bitmap + dense epoch_delta values
0x10	rle	Run-length: `{value, count}` pairs. Rarely auto-selected.
0x11	for	Frame-of-reference: base + uint16 offsets for clustered ints
0x12	rle_dict	RLE over dict_uint8 indices
0x13	byte_split_32	Scatter int32 bytes into 4 streams for better zstd compression
0x14	byte_split_64	Scatter int64 bytes into 8 streams
0x15	array_int64	Null bitmap + per-row [len:u32] + flat int64 values (PG integer[]/bigint[])
0x16	array_string	Null bitmap + per-row [len:u32] + per-item [len:u16 + bytes] (PG text[])
0x17	jsonb_raw	Null bitmap + per-row [len:u32] + raw JSONB bytes
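
To make two of the transforms in the table concrete, here is a Python sketch of delta_int32 and byte_split_32. It rests on assumptions the table does not state: the first delta is taken against 0 (so the first stored value is the value itself), values are serialized little-endian, and stream k holds byte k of every value. The function names are hypothetical.

```python
import struct


def delta_encode(values):
    """Store v[i] - v[i-1]; the first entry is v[0] - 0 (assumption)."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out


def byte_split_32(values):
    """Scatter int32 values into 4 byte streams, one per byte position."""
    raw = struct.pack(f"<{len(values)}i", *values)
    # raw[k::4] picks byte k of every 4-byte value.
    return [raw[k::4] for k in range(4)]


def byte_unsplit_32(streams):
    """Interleave the 4 streams back into int32 values."""
    n = len(streams[0])
    raw = bytearray(4 * n)
    for k, stream in enumerate(streams):
        raw[k::4] = stream
    return list(struct.unpack(f"<{n}i", bytes(raw)))
```

Byte splitting changes no information. It groups the mostly-zero high bytes of small integers into long runs, which is what lets zstd compress them better.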

Encoding selection

The encoder picks automatically based on column content:

If any value is null → use a nullable variant
If all strings fit in 255 chars and ≤ 256 unique → enum_uint8
Else for strings → string_raw or nullable_string
For ints: delta_int32 for the id column when sorted; otherwise dict_uint8 for ≤ 256 unique values, dict_uint16 for ≤ 65536 unique values, and byte_split_32/64 beyond that
For timestamps → epoch_delta or nullable_time
For []byte / json.RawMessage → jsonb_raw
For []int64/[]int32 → array_int64
For []string → array_string
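
The rules above for strings and ints could be sketched in Python roughly as below. This is an illustration, not the encoder's actual logic. It assumes the rules are applied in the order listed (so a column with any null takes the nullable variant before enum or dict encodings are considered), that the 32-bit versus 64-bit choice follows the value range, and that the caller says whether the column is the sorted id column. `choose_encoding` and `is_sorted_id` are hypothetical names.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def choose_encoding(values, is_sorted_id=False):
    """Pick an encoding name for a column of str or int values (None = null)."""
    has_null = any(v is None for v in values)
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("need at least one non-null value to infer the type")

    if all(isinstance(v, str) for v in present):
        if has_null:
            return "nullable_string"
        # len() counts characters here, matching the "255 chars" wording.
        if all(len(v) <= 255 for v in present) and len(set(present)) <= 256:
            return "enum_uint8"
        return "string_raw"

    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        fits32 = all(INT32_MIN <= v <= INT32_MAX for v in present)
        if has_null:
            return "nullable_int32" if fits32 else "nullable_int64"
        if is_sorted_id and fits32:
            return "delta_int32"
        unique = len(set(present))
        if unique <= 256:
            return "dict_uint8"
        if unique <= 65536:
            return "dict_uint16"
        return "byte_split_32" if fits32 else "byte_split_64"

    raise TypeError("this sketch only handles str and int columns")
```

For example, `choose_encoding(["a", "b", "a"])` selects enum_uint8, while the same column with one null falls back to nullable_string under the ordering assumed here.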

Bundle layouts (.cbeboqtr, .cbeboyr)

[4]  magic ("BEQR" for qtr, "BEYR" for yr)
[1]  inner count (uint8, usually 3 or 4)
per inner:
  [7] label (UTF-8, NUL-padded to 7 bytes)
  [4] offset within payload (uint32 LE)
  [4] length (uint32 LE)
[4]  total payload size (uint32 LE)
[N]  payload — concatenated inner blobs (already zstd-compressed internally)
[4]  CRC32 of everything above
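
A round trip through this layout might look like the Python sketch below. Several details are assumptions rather than statements of the spec: the CRC is computed with `zlib.crc32` (the common IEEE polynomial) and stored little-endian, and TOC offsets are relative to the first payload byte. The function names and the "YYYY-MM" labels in the usage note are hypothetical.

```python
import struct
import zlib


def build_bundle(magic: bytes, inners) -> bytes:
    """inners is a list of (label, blob); blobs are stored as-is."""
    toc = bytearray(magic) + bytes([len(inners)])
    payload = bytearray()
    for label, blob in inners:
        name = label.encode("utf-8")
        if len(name) > 7:
            raise ValueError("label longer than 7 bytes")
        toc += name.ljust(7, b"\x00")                       # NUL-padded label
        toc += struct.pack("<II", len(payload), len(blob))  # offset, length
        payload += blob
    body = bytes(toc) + struct.pack("<I", len(payload)) + bytes(payload)
    return body + struct.pack("<I", zlib.crc32(body))


def parse_bundle(buf: bytes):
    """Return (magic, {label: inner blob}) after checking the CRC."""
    body = buf[:-4]
    (stored_crc,) = struct.unpack("<I", buf[-4:])
    if zlib.crc32(body) != stored_crc:
        raise ValueError("CRC mismatch")
    magic, count, pos = body[:4], body[4], 5
    entries = []
    for _ in range(count):
        label = body[pos:pos + 7].rstrip(b"\x00").decode("utf-8")
        offset, length = struct.unpack_from("<II", body, pos + 7)
        entries.append((label, offset, length))
        pos += 15  # 7-byte label + two uint32 fields
    (payload_size,) = struct.unpack_from("<I", body, pos)
    payload = body[pos + 4:pos + 4 + payload_size]
    return magic, {lbl: payload[o:o + n] for lbl, o, n in entries}
```

Calling `build_bundle(b"BEQR", [("2024-01", m1), ("2024-02", m2), ("2024-03", m3)])` with three monthly blobs yields a 5-byte preamble, three 15-byte TOC entries, the 4-byte payload size, the payload, and the 4-byte CRC.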

Bundles never double-compress. The inner .cbebomth / .cbeboqtr files are already zstd-compressed row group by row group, and re-compressing them at the bundle level adds overhead for a compression benefit of less than 1%.

DDL snapshots (.cbeboddl)

Plain UTF-8. No magic, no versioning, no compression. Just the CREATE ... statement. The assumption is that DDL is tiny (a few KB) and lasts forever — optimizing for bytes here doesn't matter.

CRC32 placement

The trailing CRC32 covers the entire file (not including itself). A "BEBO" magic sentinel sits in the 4 bytes before it so decoders can verify they're reading the trailer, not a false match mid-file.
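
The trailer check described above can be sketched in Python as follows. The spec text here does not name the CRC32 polynomial or the byte order of the stored checksum, so this sketch assumes `zlib.crc32` and little-endian storage. `verify_trailer` and `seal` are hypothetical helper names.

```python
import struct
import zlib

SENTINEL = b"BEBO"


def verify_trailer(buf: bytes) -> bool:
    """Check the sentinel at end-8 and the CRC32 at end-4."""
    if len(buf) < 12 or buf[-8:-4] != SENTINEL:
        return False
    (stored,) = struct.unpack("<I", buf[-4:])
    # The checksum covers offset 0 up to, but not including, itself.
    return zlib.crc32(buf[:-4]) == stored


def seal(body: bytes) -> bytes:
    """Append sentinel and CRC to bytes that end with the footer-length field."""
    covered = body + SENTINEL
    return covered + struct.pack("<I", zlib.crc32(covered))
```

Checking the sentinel first is cheap and rejects truncated files before the full-file CRC pass runs.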

Versioning

The byte at offset 4 of the file is the format version. Current version: 0x02. Decoders accept older versions forever. A breaking format change would require a new magic ("BEBO" → something else) and a six-month migration window.

Reference implementation

The `bebo` CLI is the reference decoder: github.com/ceradela/bebo-cli. Any third-party implementation that passes `bebo verify --deep` on all test fixtures is considered spec-compliant.