BEBO format specification v2.1

The on-disk format for Ceradela cold archives. This document is authoritative: any divergence between this spec and the reference decoder is a bug in one or the other and will be fixed.

Overview

BEBO is a columnar archive format. A file contains rows from one table grouped into row groups, each compressed independently with zstd. Per-column metadata lives in the header; per-column byte offsets enable O(1) column pruning on read.

File extensions

Extension   What it is
─────────   ─────────────────────────────
.cbebomth   Monthly archive: one month of one table's rows
.cbeboqtr   Quarterly bundle: 3 monthly .cbebomth files wrapped in a TOC
.cbeboyr    Yearly bundle: 4 quarterly .cbeboqtr files wrapped in a TOC
.cbeboddl   Raw DDL snapshot: plain UTF-8 CREATE statement

Monthly file layout (.cbebomth)

offset   bytes   content
──────   ─────   ─────────────────────────────
0        4       "BEBO" magic
4        1       version (currently 0x02)
5        1       flags (reserved, 0x00)
6        4       header length (little-endian uint32)
10       N       header (zstd-compressed)

10+N     varies  row groups (each: 4-byte compLen + compressed payload)

end-12   4       footer length (little-endian uint32)
end-8    4       "BEBO" magic (trailer sentinel)
end-4    4       CRC32 of everything from offset 0 to end-4
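
A minimal Python sketch of how a reader might slice a monthly file using only the fixed offsets in the layout above. The name split_monthly and the returned keys are illustrative and not part of the reference decoder. It assumes the trailing CRC32 is stored little-endian like the other integers, which the layout does not state. It does not decompress the header, walk row groups, or interpret the footer length.

```python
import struct

MAGIC = b"BEBO"
MAX_VERSION = 0x02  # current version per the layout above


def split_monthly(blob: bytes) -> dict:
    """Split a .cbebomth blob into its fixed-position parts (no zstd, no CRC check)."""
    if len(blob) < 10 + 12:
        raise ValueError("too short to hold the 10-byte prefix and 12-byte trailer")

    # Prefix: magic (4), version (1), flags (1), header length (uint32 LE).
    magic, version, flags, header_len = struct.unpack_from("<4sBBI", blob, 0)
    if magic != MAGIC:
        raise ValueError("bad leading magic")
    if version > MAX_VERSION:
        raise ValueError(f"unsupported version {version:#04x}")

    header_end = 10 + header_len
    trailer_start = len(blob) - 12
    if header_end > trailer_start:
        raise ValueError("header length runs into the trailer")

    # Trailer: footer length, sentinel magic, CRC32 (little-endian CRC is an assumption).
    footer_len, sentinel, crc = struct.unpack_from("<I4sI", blob, trailer_start)
    if sentinel != MAGIC:
        raise ValueError("bad trailer sentinel")

    return {
        "version": version,
        "flags": flags,
        "header": blob[10:header_end],           # still zstd-compressed
        "body": blob[header_end:trailer_start],  # row groups, not parsed here
        "footer_len": footer_len,
        "crc": crc,
    }
```

The "header" value would still need to go through a zstd decoder before the inner header fields can be read.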

Header section (after zstd decompression)

"BL2P" magic (4 bytes)          — inner format tag
row count (uint32 LE)
column count (uint16 LE)

Per column:
  encoding tag (uint8)
  name length (uint8)
  name bytes (UTF-8)
  data offset (uint32 LE) — where this column's bytes start in the row group payload
  data length (uint32 LE) — byte span of this column's data
  encoding-specific metadata (see encoding table)

After columns:
  header_len (uint32 LE)       — byte offset where data section starts
  data section                 — column bytes, one block per column
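
The fixed fields of the decompressed header can be read with plain struct calls. The sketch below is illustrative only: parse_header_prefix and parse_column_fixed are hypothetical names, and the second function deliberately stops before the encoding-specific metadata, because its size depends on the encoding tag and is not given in this section.

```python
import struct


def parse_header_prefix(buf: bytes) -> tuple[int, int, int]:
    """Return (row_count, column_count, pos), where pos is the first column entry."""
    magic, rows, cols = struct.unpack_from("<4sIH", buf, 0)
    if magic != b"BL2P":
        raise ValueError("bad inner magic")
    return rows, cols, 10  # 4 (magic) + 4 (row count) + 2 (column count)


def parse_column_fixed(buf: bytes, pos: int) -> tuple[dict, int]:
    """Parse the fixed fields of one column entry.

    Returns the fields plus the position of the encoding-specific
    metadata, which the caller has to skip or decode per encoding tag.
    """
    tag, name_len = struct.unpack_from("<BB", buf, pos)
    pos += 2
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    offset, length = struct.unpack_from("<II", buf, pos)
    pos += 8
    return {"tag": tag, "name": name, "offset": offset, "length": length}, pos
```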

Encoding tags

Tag    Name              What it encodes
────   ───────────────   ─────────────────────────────
0x01   raw_int32         Sequential int32 values (4 bytes each)
0x02   raw_int64         Sequential int64 values (8 bytes each)
0x03   raw_uint16        Small non-negative ints (0 to 65535), 2 bytes each
0x04   delta_int32       Monotonic sequences; stores v[i] - v[i-1]
0x05   dict_uint8        ≤ 256 unique int values, one byte per row
0x06   dict_uint16       ≤ 65536 unique int values, two bytes per row
0x07   enum_uint8        ≤ 256 unique strings, each ≤ 255 bytes; one byte per row
0x08   epoch_delta       Timestamps as seconds since a base epoch stored in the header
0x09   bitpack_bool      8 booleans per byte
0x0A   string_raw        Length-prefixed (uint16) UTF-8 strings, up to 65535 bytes
0x0B   nullable_int32    Null bitmap (1 bit/row) + dense non-null int32 values
0x0C   nullable_int64    Null bitmap + dense non-null int64 values
0x0D   nullable_string   Null bitmap + dense string_raw values
0x0E   nullable_time     Null bitmap + dense epoch_delta values
0x10   rle               Run-length: {value, count} pairs. Rarely auto-selected.
0x11   for               Frame-of-reference: base + uint16 offsets for clustered ints
0x12   rle_dict          RLE over dict_uint8 indices
0x13   byte_split_32     Scatter int32 bytes into 4 streams for better zstd compression
0x14   byte_split_64     Scatter int64 bytes into 8 streams
0x15   array_int64       Null bitmap + per-row [len:u32] + flat int64 values (PG integer[]/bigint[])
0x16   array_string      Null bitmap + per-row [len:u32] + per-item [len:u16 + bytes] (PG text[])
0x17   jsonb_raw         Null bitmap + per-row [len:u32] + raw JSONB bytes
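
To make two of these encodings concrete, here is a hedged Python sketch of the ideas behind delta_int32 and byte_split_32. It is not the reference encoder. It assumes the first delta is taken against 0 (so the first value is stored verbatim), that int32 values are little-endian, and that the four byte streams are simply concatenated in order. None of these details are fixed by the table above, and all function names are illustrative.

```python
import struct


def delta_encode(values: list[int]) -> list[int]:
    """Store v[i] - v[i-1]; the first value is kept as-is (assumed v[-1] = 0)."""
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas: list[int]) -> list[int]:
    """Inverse of delta_encode: a running sum of the deltas."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out


def byte_split_32(values: list[int]) -> bytes:
    """Scatter little-endian int32 bytes into 4 streams: stream k holds byte k of every value."""
    raw = struct.pack(f"<{len(values)}i", *values)
    return b"".join(raw[k::4] for k in range(4))


def byte_unsplit_32(data: bytes) -> list[int]:
    """Inverse of byte_split_32: re-interleave the 4 streams and unpack int32 values."""
    n = len(data) // 4
    streams = [data[k * n:(k + 1) * n] for k in range(4)]
    raw = bytes(b for row in zip(*streams) for b in row)
    return list(struct.unpack(f"<{n}i", raw))
```

Byte splitting does not shrink the data by itself. It groups the mostly-zero high bytes of small integers into long runs, which is the pattern zstd compresses well.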

Encoding selection

The encoder picks automatically based on column content:

  1. If any value is null → use a nullable variant
  2. If all strings fit in 255 bytes and ≤ 256 unique → enum_uint8
  3. Else for strings → string_raw or nullable_string
  4. For ints: delta_int32 for the id column when sorted; dict_uint8 for ≤ 256 unique values, dict_uint16 for ≤ 65536; byte_split_32/64 otherwise
  5. For timestamps → epoch_delta or nullable_time
  6. For []byte / json.RawMessage → jsonb_raw
  7. For []int64 / []int32 → array_int64
  8. For []string → array_string
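
The string and integer rules above can be sketched as two small Python functions. This is an illustration under stated assumptions, not the encoder's actual logic: None stands in for a null, string length is measured in UTF-8 bytes, the nullable rule is checked first, and the sorted-id rule is applied only to 32-bit columns because only delta_int32 exists. Function names and parameters are hypothetical.

```python
def pick_string_encoding(values: list) -> str:
    """Rules 1-3 for a string column; None marks a null."""
    if any(v is None for v in values):
        return "nullable_string"
    fits = all(len(v.encode("utf-8")) <= 255 for v in values)
    if fits and len(set(values)) <= 256:
        return "enum_uint8"
    return "string_raw"


def pick_int_encoding(values: list, width: int, sorted_id: bool = False) -> str:
    """Rules 1 and 4 for an int column of the given bit width (32 or 64)."""
    if any(v is None for v in values):
        return f"nullable_int{width}"
    if sorted_id and width == 32:
        return "delta_int32"
    unique = len(set(values))
    if unique <= 256:
        return "dict_uint8"
    if unique <= 65536:
        return "dict_uint16"
    return f"byte_split_{width}"
```

For example, pick_int_encoding(list(range(300)), 32) falls through the sorted-id and 256-value checks and lands on dict_uint16.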

Bundle layouts (.cbeboqtr, .cbeboyr)

[4]  magic ("BEQR" for qtr, "BEYR" for yr)
[1]  inner count (uint8, usually 3 or 4)
per inner:
  [7] label (UTF-8, NUL-padded to 7 bytes)
  [4] offset within payload (uint32 LE)
  [4] length (uint32 LE)
[4]  total payload size (uint32 LE)
[N]  payload — concatenated inner blobs (already zstd-compressed internally)
[4]  CRC32 of everything above

Bundles never double-compress. The inner .cbebomth files are already zstd-compressed row group by row group, and the .cbeboqtr bundles inside a yearly file only wrap them, so re-compressing at the bundle level adds overhead for a compression benefit of less than 1%.
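
A short Python sketch of writing and reading the bundle layout, to show how the TOC offsets relate to the payload. It follows the field order given above but is not the reference implementation. It assumes the CRC32 is the zlib/IEEE variant stored little-endian, and the labels in the usage note ("2024-01" and so on) are made-up examples. build_bundle and read_bundle are hypothetical names.

```python
import struct
import zlib


def build_bundle(magic: bytes, inners: list[tuple[str, bytes]]) -> bytes:
    """Concatenate inner blobs uncompressed and prepend a TOC of (label, offset, length)."""
    toc = b""
    payload = b""
    for label, blob in inners:
        name = label.encode("utf-8")
        if len(name) > 7:
            raise ValueError("label longer than 7 bytes")
        # Offset is relative to the start of the payload, not the file.
        toc += name.ljust(7, b"\x00") + struct.pack("<II", len(payload), len(blob))
        payload += blob
    body = magic + bytes([len(inners)]) + toc + struct.pack("<I", len(payload)) + payload
    return body + struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)


def read_bundle(data: bytes) -> dict[str, bytes]:
    """Verify the CRC32, then slice each inner blob out of the payload by its TOC entry."""
    body, (crc,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("CRC32 mismatch")
    if body[:4] not in (b"BEQR", b"BEYR"):
        raise ValueError("bad bundle magic")
    count = body[4]
    toc_end = 5 + 15 * count  # each TOC entry is 7 + 4 + 4 bytes
    (total,) = struct.unpack_from("<I", body, toc_end)
    payload = body[toc_end + 4:toc_end + 4 + total]
    out = {}
    for i in range(count):
        label, off, length = struct.unpack_from("<7sII", body, 5 + 15 * i)
        out[label.rstrip(b"\x00").decode("utf-8")] = payload[off:off + length]
    return out
```

Calling read_bundle(build_bundle(b"BEQR", [("2024-01", b"jan"), ("2024-02", b"feb")])) returns the two blobs keyed by label. Because the payload is stored as-is, extracting one month is a slice and needs no decompression at the bundle level.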

DDL snapshots (.cbeboddl)

Plain UTF-8. No magic, no versioning, no compression. Just the CREATE ... statement. The assumption is that DDL is tiny (a few KB) and lasts forever — optimizing for bytes here doesn't matter.

CRC32 placement

The trailing CRC32 covers the entire file (not including itself). A "BEBO" magic sentinel sits in the 4 bytes before it so decoders can verify they're reading the trailer, not a false match mid-file.
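
The trailer check could be sketched in Python as follows. The spec names neither the CRC polynomial nor the byte order of the stored checksum, so this assumes the common IEEE CRC-32 (as computed by zlib.crc32) stored little-endian. verify_trailer and seal are hypothetical helper names.

```python
import struct
import zlib


def verify_trailer(blob: bytes) -> bool:
    """Return True if the sentinel is in place and the stored CRC32 matches."""
    if len(blob) < 12:
        return False
    sentinel = blob[-8:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    # The checksum covers offset 0 up to (not including) the CRC field itself,
    # so the sentinel magic is part of the checked range.
    computed = zlib.crc32(blob[:-4]) & 0xFFFFFFFF
    return sentinel == b"BEBO" and stored == computed


def seal(body: bytes) -> bytes:
    """Append the sentinel and CRC32 to everything that precedes them."""
    with_sentinel = body + b"BEBO"
    return with_sentinel + struct.pack("<I", zlib.crc32(with_sentinel) & 0xFFFFFFFF)
```

Checking the sentinel first is cheap and lets a decoder reject a truncated file before hashing the whole thing.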

Versioning

The byte at offset 4 of the file is the format version. Current version: 0x02. Decoders accept older versions forever; a breaking format change would require a new magic ("BEBO" → something else) and a six-month migration window.

Reference implementation

The bebo CLI is the reference decoder: github.com/ceradela/bebo-cli. Any third-party implementation that passes bebo verify --deep on all test fixtures is considered spec-compliant.