Backstory

Your data exports, finally searchable — and entirely on your machine. A local-first explorer that turns Google Takeout and Telegram exports into one searchable timeline of your life, queryable from a CLI or any agent over MCP.

.NET 10 · Local-first · MCP · Zero cloud

Overview

Every service offers "download your data," but it comes back as an unbrowsable pile of JSON/CSV. Backstory ingests those exports through per-source adapters into one normalized Event/Entity model, indexes it for hybrid semantic + keyword search, and exposes it to agents over MCP. Because this is the most personal data you own, everything runs locally with no API calls.

100% ingestion coverage on supported data types

100% Recall@5 with the ONNX semantic embedder

Architecture

The mess of each export format is quarantined inside a per-source adapter; everything downstream works on one clean schema. Layers, colour-coded:

flowchart TD
    TG["Telegram result.json"]:::src
    GT["Google Takeout JSON / CSV"]:::src
    TG --> AD
    GT --> AD
    AD["Source adapters detect · parse · normalize"]:::ingest
    NR["Normalizer Event + Entity schema"]:::ingest
    ER["Entity resolution link people & places"]:::ingest
    AD --> NR --> ER
    ER --> FTS[("SQLite + FTS5 timeline · keyword")]:::store
    ER --> VEC[("Vector index cosine")]:::store
    ER --> RAW[("Raw blobs verbatim")]:::store
    FTS --> HQ
    VEC --> HQ
    HQ["Hybrid query semantic + keyword + filters"]:::query
    HQ --> CLI["CLI"]:::iface
    HQ --> MCP["MCP server → your agent"]:::iface
    classDef src fill:#faece7,stroke:#993c1d,color:#4a1b0c;
    classDef ingest fill:#eeedfe,stroke:#534ab7,color:#26215c;
    classDef store fill:#e1f5ee,stroke:#0f6e56,color:#04342c;
    classDef query fill:#e1f5ee,stroke:#0f6e56,color:#04342c;
    classDef iface fill:#f1efe8,stroke:#5f5e5a,color:#2c2c2a;

Projects

Project	Responsibility
`Backstory.Core`	Domain types + interfaces (`ISourceAdapter`, `IEmbeddingService`, `IVectorStore`, `IEventStore`, `IEntityStore`)
`Backstory.Adapters`	Telegram + Google Takeout parsers
`Backstory.Storage`	SQLite stores, FTS5, brute-force vectors
`Backstory.Embeddings`	Hashing + ONNX embedders, model downloader
`Backstory.Query`	Ingestion pipeline + hybrid search
`Backstory.Mcp`	MCP server + tools
`Backstory.Cli`	Command-line entrypoint
`Backstory.Eval`	Benchmark harness

Data flow: one search

A query fans out to the semantic and keyword retrievers in parallel, and the two result lists are merged with Reciprocal Rank Fusion. RRF ranks by position rather than raw score, so no score calibration between retrievers is needed.

sequenceDiagram
    participant A as Agent / CLI
    participant H as HybridSearch
    participant E as Embedder
    participant V as Vector index
    participant F as FTS5
    A->>H: search("dinner with sarah")
    H->>E: embed(query)
    E-->>H: query vector
    par semantic
        H->>V: nearest vectors
        V-->>H: semantic hits
    and keyword
        H->>F: keyword match (terms OR-ed)
        F-->>H: keyword hits
    end
    H->>H: fuse (RRF) + apply filters
    H-->>A: ranked events (cross-source)
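The keyword leg of the diagram ORs the query terms together before handing them to FTS5. A minimal sketch of how such a match expression could be built is shown here; the `FtsQuery` class and `OrTerms` method are hypothetical names, and whitespace tokenisation is an assumption. FTS5 treats a double-quoted token as a string literal, with embedded quotes doubled, which keeps user input from being read as query syntax.

```csharp
using System;
using System.Linq;

// Hypothetical helper: not taken from the Backstory source.
public static class FtsQuery
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Turns free text into an FTS5 MATCH expression whose terms are OR-ed.
    public static string OrTerms(string query)
    {
        var terms = query
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            // Quote each term and double any embedded quote (FTS5 string syntax).
            .Select(t => "\"" + t.Replace("\"", "\"\"") + "\"");
        return string.Join(" OR ", terms);
    }
}
```

The result would be bound as a parameter to a statement such as `SELECT id FROM events_fts WHERE events_fts MATCH $q`, so the text never gets concatenated into the SQL itself.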

Data model

Two normalized shapes. Originals are kept verbatim on disk and referenced via `RawRef`, so normalization never discards information.

record Event {
    string Id;                 // content hash — re-imports dedupe
    DateTimeOffset Timestamp;  // UTC
    string Source;             // "telegram" | "google_takeout"
    string SubType;            // "telegram_message" | "search_query" | ...
    string Text;               // searchable content
    double? Latitude, Longitude;
    IReadOnlyList<string> ActorIds;  // people involved (entity ids)
    string? PlaceId;
    RawRef Raw;                // pointer to the original record
    JsonElement Metadata;
}

record Entity {
    string Id;
    EntityKind Kind;           // Person | Place | Org
    string CanonicalName;
    IReadOnlyList<string> Aliases;
}
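Because `Event.Id` is a content hash, importing the same export twice yields the same ids and the duplicates collapse. The sketch below illustrates the idea only: which fields feed the hash, the separator, and the `ContentHashSketch` name are all assumptions, not the project's actual `Ids.ContentHash` implementation.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical illustration of a deterministic content-hash id.
public static class ContentHashSketch
{
    public static string Compute(
        string source, string subType, DateTimeOffset timestamp, string text)
    {
        // Join fields with a unit separator so ("ab", "c") and ("a", "bc") differ.
        // The timestamp is normalised to UTC so the same instant always hashes alike.
        string canonical = string.Join('\u001f',
            source, subType, timestamp.ToUniversalTime().ToString("O"), text);

        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

With an id like this, an `INSERT OR IGNORE` keyed on `events.id` would be one way to make re-imports idempotent.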

Adapters

An adapter encapsulates one export's quirks and emits normalized items. The interface is tiny:

interface ISourceAdapter {
    string Source { get; }
    bool CanHandle(string path);
    IAsyncEnumerable<ImportItem> ParseAsync(string path, CancellationToken ct);
}

Source	Data types (v1)	Notes
Telegram	messages, contacts, senders	Full + single-chat JSON; text may be a string or an array of parts
Google Takeout	Search, YouTube, Maps saves, Semantic Location History	Each type parsed defensively; a bad/absent file is skipped

Parsing is resilient by design: a malformed file is skipped rather than failing the whole import, and coordinate-format variance (geometry array vs. lat/lon fields) is handled inside the adapter.
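As an illustration of the kind of quirk an adapter absorbs, here is a minimal sketch of flattening a Telegram `text` value, which may be a plain string or an array mixing strings with objects that carry their own `text` field. The `TelegramText` class is a hypothetical name, and the real adapter may handle more cases.

```csharp
using System.Text;
using System.Text.Json;

// Hypothetical helper: not taken from the Backstory source.
public static class TelegramText
{
    // Returns plain searchable text for either shape of the "text" field.
    public static string Flatten(JsonElement text)
    {
        switch (text.ValueKind)
        {
            case JsonValueKind.String:
                return text.GetString() ?? string.Empty;

            case JsonValueKind.Array:
                var sb = new StringBuilder();
                foreach (JsonElement part in text.EnumerateArray())
                {
                    if (part.ValueKind == JsonValueKind.String)
                    {
                        sb.Append(part.GetString());
                    }
                    else if (part.ValueKind == JsonValueKind.Object
                             && part.TryGetProperty("text", out JsonElement inner)
                             && inner.ValueKind == JsonValueKind.String)
                    {
                        // Formatted parts (links, bold, mentions) keep their text here.
                        sb.Append(inner.GetString());
                    }
                }
                return sb.ToString();

            default:
                // Unexpected shape: contribute nothing rather than throw.
                return string.Empty;
        }
    }
}
```

Returning an empty string for unexpected shapes matches the defensive stance described above: one odd record should not abort the import.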

Storage

One SQLite file per vault. Three concerns, one store each.

events(id, ts, source, subtype, text, lat, lon, place_id, raw_file, raw_locator, metadata)
event_actors(event_id, entity_id)
entities(id, kind, canonical_name, metadata)
entity_aliases(kind, alias, entity_id)        -- case-insensitive lookup
embeddings(event_id, vector BLOB)             -- float32
events_fts USING fts5(id, text)               -- keyword index

Vectors are stored as float32 BLOBs and searched with brute-force cosine over in-memory vectors: milliseconds at personal scale (10k–1M vectors), and zero extra dependencies. The `IVectorStore` interface allows swapping in sqlite-vec or HNSW if needed.
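A minimal sketch of the brute-force approach follows, assuming vectors are already L2-normalised (so the dot product equals cosine similarity) and that BLOBs are read on a machine with the same endianness that wrote them. The `BruteForceVectors` class and its method names are hypothetical, not the project's `IVectorStore` implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

// Hypothetical illustration of float32 BLOB storage plus a linear scan.
public static class BruteForceVectors
{
    // Reinterprets the raw bytes of a float32 BLOB as a float array.
    public static float[] FromBlob(byte[] blob) =>
        MemoryMarshal.Cast<byte, float>(blob.AsSpan()).ToArray();

    // Serialises a vector to the bytes that would go in the BLOB column.
    public static byte[] ToBlob(float[] vector) =>
        MemoryMarshal.AsBytes(vector.AsSpan()).ToArray();

    // Dot product; equals cosine similarity for unit-length vectors.
    public static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Scores every stored vector against the query and keeps the k best.
    public static IReadOnlyList<(string Id, float Score)> TopK(
        IReadOnlyDictionary<string, float[]> index, float[] query, int k) =>
        index
            .Select(kv => (Id: kv.Key, Score: Dot(kv.Value, query)))
            .OrderByDescending(hit => hit.Score)
            .Take(k)
            .ToList();
}
```

The scan is O(n·d) per query, and the whole index fits in one dictionary, which is why an approximate index such as HNSW only becomes worth its complexity at much larger scale.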

Embeddings

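Two implementations behind one interface. Both produce 384-dimensional vectors, so either can be used without changing the storage layout or the vector index. Vectors from the two embedders live in different spaces, though, so a vault should be queried with the same embedder that indexed it.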

Embedder	How	Quality
Hashing (default)	Signed feature hashing of word + char n-grams, L2-normalised. Zero assets, deterministic, offline.	Lexical — matches shared words/chars
ONNX MiniLM	`all-MiniLM-L6-v2` via ONNX Runtime + BERT WordPiece tokenizer, mean-pooled, L2-normalised.	Semantic — matches meaning

Run `backstory model fetch` once (~90 MB) and the factory selects ONNX automatically. It's the difference between "japan vacation" finding nothing and finding "flight to Tokyo", and on the bundled benchmark it lifts Recall@5 from 87.5% to 100%.
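To make the hashing embedder concrete, here is a minimal sketch of signed feature hashing over word and character n-gram features, L2-normalised to 384 dimensions. Several details are assumptions made for illustration: FNV-1a as the hash, character trigrams with `^`/`$` word boundaries, space-only tokenisation, and the `HashingEmbedderSketch` name. A stable hash is used because `string.GetHashCode` is randomised per process and would break determinism.

```csharp
using System;
using System.Text;

// Hypothetical illustration of signed feature hashing; not the project's embedder.
public static class HashingEmbedderSketch
{
    public const int Dimensions = 384;

    public static float[] Embed(string text)
    {
        var vector = new float[Dimensions];
        foreach (string word in text.ToLowerInvariant()
                     .Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            Add(vector, "w:" + word);                       // whole-word feature
            string padded = "^" + word + "$";
            for (int i = 0; i + 3 <= padded.Length; i++)    // character trigrams
                Add(vector, "c:" + padded.Substring(i, 3));
        }

        // L2-normalise so a dot product between embeddings is a cosine similarity.
        double norm = 0;
        foreach (float v in vector) norm += v * v;
        norm = Math.Sqrt(norm);
        if (norm > 0)
            for (int i = 0; i < vector.Length; i++)
                vector[i] = (float)(vector[i] / norm);
        return vector;
    }

    // Hashes a feature to a bucket and adds +1 or -1 depending on the top hash bit.
    private static void Add(float[] vector, string feature)
    {
        uint h = Fnv1a(feature);
        int bucket = (int)(h % Dimensions);
        vector[bucket] += (h & 0x80000000u) == 0 ? 1f : -1f;
    }

    // 32-bit FNV-1a over the UTF-8 bytes: stable across runs and machines.
    private static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            hash ^= b;
            unchecked { hash *= 16777619; }
        }
        return hash;
    }
}
```

Texts that share words or character sequences land in the same buckets and score high, while "japan vacation" and "flight to Tokyo" share no features at all, which is the gap the ONNX model closes.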

Hybrid search

The query is embedded and run against the vector index; the same text runs against FTS5 with its terms OR-ed. The two ranked lists are fused with Reciprocal Rank Fusion:

score(doc) = Σ  1 / (k + rank_in_list)      // k = 60

Candidates are then filtered by date/source/subtype and the top results returned, each tagged semantic, keyword, or both.
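The fusion step can be sketched as follows, under a few assumptions: ranks are 1-based, each input list holds event ids in best-first order, and ties are broken by id. The `RankFusion` class and `Fuse` method are hypothetical names rather than the project's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of Reciprocal Rank Fusion over two ranked id lists.
public static class RankFusion
{
    public static IReadOnlyList<(string Id, double Score, string Origin)> Fuse(
        IReadOnlyList<string> semantic, IReadOnlyList<string> keyword, int k = 60)
    {
        var scores = new Dictionary<string, double>();

        // Adds 1 / (k + rank) for every id in a list, with rank starting at 1.
        void Accumulate(IReadOnlyList<string> ranked)
        {
            for (int i = 0; i < ranked.Count; i++)
                scores[ranked[i]] = scores.GetValueOrDefault(ranked[i]) + 1.0 / (k + i + 1);
        }

        Accumulate(semantic);
        Accumulate(keyword);

        var inSemantic = new HashSet<string>(semantic);
        var inKeyword = new HashSet<string>(keyword);

        return scores
            .Select(kv => (
                Id: kv.Key,
                Score: kv.Value,
                // Tag each hit with the retriever(s) that produced it.
                Origin: inSemantic.Contains(kv.Key) && inKeyword.Contains(kv.Key) ? "both"
                      : inSemantic.Contains(kv.Key) ? "semantic"
                      : "keyword"))
            .OrderByDescending(hit => hit.Score)
            .ThenBy(hit => hit.Id, StringComparer.Ordinal)
            .ToList();
    }
}
```

For example, fusing semantic hits `a, b, c` with keyword hits `b, d` ranks `b` first, since it collects 1/62 + 1/61 from the two lists, followed by `a` (1/61), `d` (1/62), and `c` (1/63).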

CLI

Command	Does
`import <path>`	Auto-detect adapter and ingest an export
`search "<query>"`	Hybrid search; `--from --to --source --limit`
`timeline`	Chronological events with filters
`entity "<name>"`	Look up a person/place
`stats`	Counts by source and type, active embedder
`serve`	Run the MCP server over stdio
`model fetch`	Download the semantic model (opt-in)
`eval`	Run the benchmark

Vault location: `$BACKSTORY_DB` if set, otherwise `~/.backstory/backstory.db`.

MCP server

Built on the official ModelContextProtocol C# SDK and served over stdio. Register it with any MCP client:

{
  "mcpServers": {
    "backstory": { "command": "backstory", "args": ["serve"] }
  }
}

Tool	Purpose
`search_timeline`	Natural-language search over the timeline
`get_events`	Full event records by id (incl. source pointer)
`lookup_entity`	Resolve a person/place by name
`summarize_period`	All events in a range for the agent to summarize
`list_sources`	Sources ingested and their event counts

Benchmark

Reproducible with `backstory eval`, which ingests the bundled fixtures and measures two things: ingestion coverage (events emitted vs. records present in the fixture, which surfaces silently dropped records) and Recall@5 over a hand-built question→gold set.

Embedder	Ingestion coverage	Recall@5
Hashing	100%	87.5%
ONNX MiniLM	100%	100%
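For readers unfamiliar with the metric, here is a minimal sketch of Recall@k, assuming the common definition: for each question, the fraction of its gold event ids that appear in the top k results, averaged over all questions. The eval harness may define it differently (for example as a per-question hit rate), and the `RecallMetric` name is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of Recall@k over a question -> gold set.
public static class RecallMetric
{
    public static double RecallAtK(
        IReadOnlyList<(string[] Ranked, string[] Gold)> cases, int k = 5)
    {
        double total = 0;
        int counted = 0;
        foreach (var (ranked, gold) in cases)
        {
            var goldSet = new HashSet<string>(gold);
            if (goldSet.Count == 0) continue;   // a question with no gold ids is skipped

            // How many gold ids show up among the first k distinct results.
            int found = ranked.Take(k).Distinct().Count(goldSet.Contains);
            total += (double)found / goldSet.Count;
            counted++;
        }
        return counted == 0 ? 0 : total / counted;
    }
}
```

A gold id sitting at position six contributes nothing at k = 5, so the metric rewards getting the right event onto the first screen of results, not merely somewhere in the list.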

Extending: a new adapter

1. Implement `ISourceAdapter` in `Backstory.Adapters`.
2. Emit `EntityItem`/`EventItem` with deterministic `Ids.ContentHash` ids.
3. Register it in the CLI's adapter list.
4. Add a fixture and a coverage assertion to the eval harness.

Nothing else changes — storage, search, and MCP are source-agnostic.
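The first two steps might look like the sketch below for an imaginary `notes.csv` export with `timestamp,text` rows. The `ISourceAdapter` interface is copied from the Adapters section; everything else is hypothetical, including the `ImportItem`/`EventItem` shapes (the real ones are not shown in this document), the `NotesCsvAdapter` class, and the inline SHA-256 standing in for `Ids.ContentHash`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Runtime.CompilerServices;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

// Interface as given in the Adapters section.
public interface ISourceAdapter
{
    string Source { get; }
    bool CanHandle(string path);
    IAsyncEnumerable<ImportItem> ParseAsync(string path, CancellationToken ct);
}

// Hypothetical stand-ins for the real item types.
public abstract record ImportItem;
public sealed record EventItem(
    string Id, DateTimeOffset Timestamp, string SubType, string Text) : ImportItem;

// Hypothetical adapter for an imaginary "timestamp,text" CSV export.
public sealed class NotesCsvAdapter : ISourceAdapter
{
    public string Source => "notes_csv";

    public bool CanHandle(string path) =>
        File.Exists(path) &&
        Path.GetFileName(path).Equals("notes.csv", StringComparison.OrdinalIgnoreCase);

    public async IAsyncEnumerable<ImportItem> ParseAsync(
        string path, [EnumeratorCancellation] CancellationToken ct)
    {
        using var reader = new StreamReader(path);
        while (await reader.ReadLineAsync(ct) is { } line)
        {
            // Split on the first comma only, so commas inside the note survive.
            string[] parts = line.Split(',', 2);
            if (parts.Length != 2 ||
                !DateTimeOffset.TryParse(parts[0], CultureInfo.InvariantCulture,
                    DateTimeStyles.AssumeUniversal, out DateTimeOffset ts))
            {
                continue; // malformed row: skip it instead of failing the import
            }

            ts = ts.ToUniversalTime();
            // Deterministic id derived from content (stand-in for Ids.ContentHash).
            string id = Convert.ToHexString(SHA256.HashData(
                Encoding.UTF8.GetBytes($"{Source}|{ts:O}|{parts[1]}")));
            yield return new EventItem(id, ts, "note", parts[1]);
        }
    }
}
```

Streaming rows through `IAsyncEnumerable` keeps memory flat for large exports, and skipping bad rows follows the same resilience rule the built-in adapters apply.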

Privacy

100% local, no telemetry. The only network access in the entire project is the opt-in, one-time embedding-model download triggered by `model fetch`; your data is never sent anywhere. The `.gitignore` also keeps vault databases, models, and exports from being committed.