Backstory

Your data exports, finally searchable — and entirely on your machine. A local-first explorer that turns Google Takeout and Telegram exports into one searchable timeline of your life, queryable from a CLI or any agent over MCP.

.NET 10 · Local-first · MCP · Zero cloud

Overview

Every service offers "download your data," but it comes back as an unbrowsable pile of JSON/CSV. Backstory ingests those exports through per-source adapters into one normalized Event/Entity model, indexes it for hybrid semantic + keyword search, and exposes it to agents over MCP. Because this is the most personal data you own, everything runs locally with no API calls.

100% ingestion coverage on supported data types
100% Recall@5 with the ONNX semantic embedder

Architecture

The mess of each export format is quarantined inside a per-source adapter; everything downstream works on one clean schema. Layers, colour-coded:

flowchart TD
    TG["Telegram<br/>result.json"]:::src
    GT["Google Takeout<br/>JSON / CSV"]:::src
    TG --> AD
    GT --> AD
    AD["Source adapters<br/>detect · parse · normalize"]:::ingest
    NR["Normalizer<br/>Event + Entity schema"]:::ingest
    ER["Entity resolution<br/>link people & places"]:::ingest
    AD --> NR --> ER
    ER --> FTS[("SQLite + FTS5<br/>timeline · keyword")]:::store
    ER --> VEC[("Vector index<br/>cosine")]:::store
    ER --> RAW[("Raw blobs<br/>verbatim")]:::store
    FTS --> HQ
    VEC --> HQ
    HQ["Hybrid query<br/>semantic + keyword + filters"]:::query
    HQ --> CLI["CLI"]:::iface
    HQ --> MCP["MCP server → your agent"]:::iface
    classDef src fill:#faece7,stroke:#993c1d,color:#4a1b0c;
    classDef ingest fill:#eeedfe,stroke:#534ab7,color:#26215c;
    classDef store fill:#e1f5ee,stroke:#0f6e56,color:#04342c;
    classDef query fill:#e1f5ee,stroke:#0f6e56,color:#04342c;
    classDef iface fill:#f1efe8,stroke:#5f5e5a,color:#2c2c2a;

Projects

| Project | Responsibility |
| --- | --- |
| Backstory.Core | Domain types + interfaces (ISourceAdapter, IEmbeddingService, IVectorStore, IEventStore, IEntityStore) |
| Backstory.Adapters | Telegram + Google Takeout parsers |
| Backstory.Storage | SQLite stores, FTS5, brute-force vectors |
| Backstory.Embeddings | Hashing + ONNX embedders, model downloader |
| Backstory.Query | Ingestion pipeline + hybrid search |
| Backstory.Mcp | MCP server + tools |
| Backstory.Cli | Command-line entrypoint |
| Backstory.Eval | Benchmark harness |

Data flow: one search

A query fans out to the semantic and keyword retrievers in parallel, and the two ranked result lists are merged with Reciprocal Rank Fusion (RRF). RRF scores each result by its rank position rather than its raw score, so no score calibration between retrievers is needed.

sequenceDiagram
    participant A as Agent / CLI
    participant H as HybridSearch
    participant E as Embedder
    participant V as Vector index
    participant F as FTS5
    A->>H: search("dinner with sarah")
    H->>E: embed(query)
    E-->>H: query vector
    par semantic
        H->>V: nearest vectors
        V-->>H: semantic hits
    and keyword
        H->>F: keyword match (terms OR-ed)
        F-->>H: keyword hits
    end
    H->>H: fuse (RRF) + apply filters
    H-->>A: ranked events (cross-source)
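
The keyword leg of the diagram matches the query with its terms OR-ed. A minimal sketch of how such an FTS5 MATCH expression might be built, assuming plain whitespace tokenisation; `FtsQuery.BuildOrQuery` is a hypothetical helper for illustration, not Backstory's actual code.

```csharp
using System;
using System.Linq;

// Hypothetical helper: turns free text into an FTS5 MATCH expression
// whose terms are OR-ed together.
public static class FtsQuery
{
    public static string BuildOrQuery(string text)
    {
        var terms = text
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // Quote each term so FTS5 reads it as a string literal rather than
            // query syntax; embedded double quotes are escaped by doubling them.
            .Select(t => "\"" + t.Replace("\"", "\"\"") + "\"");

        return string.Join(" OR ", terms);
    }
}
```

For the query in the diagram this yields `"dinner" OR "with" OR "sarah"`. An empty query yields an empty string, so a caller would probably skip the keyword leg in that case rather than pass it to MATCH.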

Data model

Two normalized shapes. Originals are kept verbatim on disk and referenced via RawRef, so normalization never discards information.

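using System;
using System.Collections.Generic;
using System.Text.Json;

// RawRef and EntityKind are further domain types, not shown here.
public record Event
{
    public required string Id { get; init; }                        // content hash, so re-imports dedupe
    public required DateTimeOffset Timestamp { get; init; }         // UTC
    public required string Source { get; init; }                    // "telegram" | "google_takeout"
    public required string SubType { get; init; }                   // "telegram_message" | "search_query" | ...
    public required string Text { get; init; }                      // searchable content
    public double? Latitude { get; init; }
    public double? Longitude { get; init; }
    public required IReadOnlyList<string> ActorIds { get; init; }   // people involved (entity ids)
    public string? PlaceId { get; init; }
    public required RawRef Raw { get; init; }                       // pointer to the original record
    public JsonElement Metadata { get; init; }
}

public record Entity
{
    public required string Id { get; init; }
    public required EntityKind Kind { get; init; }                  // Person | Place | Org
    public required string CanonicalName { get; init; }
    public required IReadOnlyList<string> Aliases { get; init; }
}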
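
The Id comment above says ids are content hashes so that re-imports dedupe. A minimal sketch of what such a deterministic id could look like, assuming SHA-256 over a few normalized fields; `ContentId.Compute` and the choice of fields are illustrative assumptions, and the project's own `Ids.ContentHash` may hash something different.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of a deterministic, content-derived event id.
public static class ContentId
{
    public static string Compute(string source, string subType, DateTimeOffset timestamp, string text)
    {
        // Normalise to UTC in a fixed format so the same instant always yields
        // the same bytes, whatever offset it was parsed with.
        string ts = timestamp.ToUniversalTime().ToString("O", CultureInfo.InvariantCulture);

        // A unit-separator character between fields keeps ("ab", "c") and
        // ("a", "bc") from producing the same input string.
        string canonical = string.Join('\u001f', source, subType, ts, text);

        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Because the id depends only on the record's content, importing the same export twice produces the same ids, and the store can treat the second insert as a no-op.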

Adapters

An adapter encapsulates one export's quirks and emits normalized items. The interface is tiny:

interface ISourceAdapter {
    string Source { get; }
    bool CanHandle(string path);
    IAsyncEnumerable<ImportItem> ParseAsync(string path, CancellationToken ct);
}

| Source | Data types (v1) | Notes |
| --- | --- | --- |
| Telegram | messages, contacts, senders | Full + single-chat JSON; text may be a string or an array of parts |
| Google Takeout | Search, YouTube, Maps saves, Semantic Location History | Each type parsed defensively; a bad/absent file is skipped |

Parsing is resilient by design: a malformed file is skipped rather than failing the whole import, and coordinate-format variance (geometry array vs. lat/lon fields) is handled inside the adapter.
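
As an illustration of quirk handling inside an adapter, here is a minimal sketch of flattening a Telegram `text` value that may be a string or an array of parts. It assumes each part is either a plain string or an object carrying a `text` property; `TelegramText.Flatten` is a hypothetical helper, not the project's actual parser.

```csharp
using System.Text;
using System.Text.Json;

// Hypothetical helper for the "string or array of parts" case.
public static class TelegramText
{
    public static string Flatten(JsonElement text)
    {
        switch (text.ValueKind)
        {
            case JsonValueKind.String:
                return text.GetString() ?? "";

            case JsonValueKind.Array:
                var sb = new StringBuilder();
                foreach (JsonElement part in text.EnumerateArray())
                {
                    if (part.ValueKind == JsonValueKind.String)
                    {
                        sb.Append(part.GetString());
                    }
                    else if (part.ValueKind == JsonValueKind.Object
                             && part.TryGetProperty("text", out JsonElement inner)
                             && inner.ValueKind == JsonValueKind.String)
                    {
                        sb.Append(inner.GetString()); // formatted span such as a link
                    }
                    // Any other part is ignored rather than failing the import.
                }
                return sb.ToString();

            default:
                return ""; // missing or unexpected shape: treat as empty text
        }
    }
}
```

Returning an empty string for unexpected shapes follows the same policy as the paragraph above: tolerate bad input locally instead of aborting the import.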

Storage

One SQLite file per vault. Three concerns, one store each.

events(id, ts, source, subtype, text, lat, lon, place_id, raw_file, raw_locator, metadata)
event_actors(event_id, entity_id)
entities(id, kind, canonical_name, metadata)
entity_aliases(kind, alias, entity_id)        -- case-insensitive lookup
embeddings(event_id, vector BLOB)             -- float32
events_fts USING fts5(id, text)               -- keyword index

Vectors are stored as float32 BLOBs and searched with brute-force cosine over in-memory vectors — milliseconds at personal scale (10k–1M vectors), and zero extra dependencies. The IVectorStore interface allows swapping in sqlite-vec or HNSW if needed.
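
The brute-force approach can be sketched in a few lines. This is an illustrative version only, assuming L2-normalised vectors (so a dot product equals cosine similarity) and an in-memory dictionary keyed by event id; `BruteForceVectors` and its members are hypothetical names, not the `IVectorStore` implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

// Hypothetical sketch: decode float32 BLOBs and rank them by cosine similarity.
public static class BruteForceVectors
{
    // Reinterpret a BLOB as float32 values (same byte order it was written in).
    public static float[] FromBlob(byte[] blob) =>
        MemoryMarshal.Cast<byte, float>(blob.AsSpan()).ToArray();

    public static byte[] ToBlob(float[] vector) =>
        MemoryMarshal.AsBytes(vector.AsSpan()).ToArray();

    // For L2-normalised vectors the dot product is the cosine similarity.
    public static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Score every stored vector against the query and keep the k best.
    public static List<(string EventId, float Score)> TopK(
        float[] query, IReadOnlyDictionary<string, float[]> index, int k) =>
        index.Select(kv => (EventId: kv.Key, Score: Dot(query, kv.Value)))
             .OrderByDescending(hit => hit.Score)
             .Take(k)
             .ToList();
}
```

The sort is O(n log n) for simplicity; a bounded heap would avoid sorting the full list, but a full scan like this is the whole idea of a dependency-free baseline.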

Embeddings

Two implementations behind one interface, both 384-dim, so they're interchangeable:

| Embedder | How | Quality |
| --- | --- | --- |
| Hashing (default) | Signed feature hashing of word + char n-grams, L2-normalised. Zero assets, deterministic, offline. | Lexical: matches shared words/chars |
| ONNX MiniLM | all-MiniLM-L6-v2 via ONNX Runtime + BERT WordPiece tokenizer, mean-pooled, L2-normalised. | Semantic: matches meaning |

Run backstory model fetch once (~90 MB) and the factory selects ONNX automatically. It's the difference between "japan vacation" finding nothing and finding "flight to Tokyo" — and it lifts Recall@5 from 87.5% to 100%.
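
To make the hashing embedder concrete, here is a minimal sketch of signed feature hashing into 384 dimensions. The feature set (whole words plus character trigrams) and the FNV-1a hash are assumptions chosen for illustration; the project's actual n-gram sizes and hash function are not specified here, and `HashingEmbedder` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch of a signed feature-hashing embedder.
public static class HashingEmbedder
{
    public const int Dimensions = 384;

    public static float[] Embed(string text)
    {
        var vector = new float[Dimensions];
        foreach (string feature in Features(text.ToLowerInvariant()))
        {
            uint h = Fnv1a(feature);
            int bucket = (int)(h % Dimensions);               // which dimension to touch
            float sign = (h & 0x8000_0000) == 0 ? 1f : -1f;   // top bit picks the sign
            vector[bucket] += sign;
        }

        // L2-normalise so a dot product between two embeddings is a cosine.
        double norm = 0;
        foreach (float v in vector) norm += v * v;
        norm = Math.Sqrt(norm);
        if (norm > 0)
            for (int i = 0; i < vector.Length; i++) vector[i] = (float)(vector[i] / norm);
        return vector;
    }

    // Whole words plus the character trigrams of each word.
    private static IEnumerable<string> Features(string text)
    {
        foreach (string word in text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return "w:" + word;
            for (int i = 0; i + 3 <= word.Length; i++)
                yield return "c:" + word.Substring(i, 3);
        }
    }

    // FNV-1a is stable across processes, unlike string.GetHashCode().
    private static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            hash ^= b;
            unchecked { hash *= 16777619; }
        }
        return hash;
    }
}
```

Two texts only score well against each other here when they share words or character sequences, which is why "japan vacation" cannot reach "flight to Tokyo" without the semantic model.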

Hybrid search

The query is embedded and searched against the vector index; the same text is run against FTS5 with its terms OR-ed. The two ranked lists are then fused with Reciprocal Rank Fusion:

score(doc) = Σ  1 / (k + rank_in_list)      // k = 60

Candidates are then filtered by date/source/subtype and the top results returned, each tagged semantic, keyword, or both.
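
The fusion and tagging steps can be sketched as follows. This is an illustrative version, assuming 1-based ranks and plain lists of event ids as input; `RankFusion.Fuse` is a hypothetical name, and the date/source/subtype filtering is left out because it needs the event records themselves.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of Reciprocal Rank Fusion over two ranked id lists.
public static class RankFusion
{
    public static List<(string Id, double Score, string Via)> Fuse(
        IReadOnlyList<string> semantic, IReadOnlyList<string> keyword, int k = 60)
    {
        var scores = new Dictionary<string, double>();

        void Accumulate(IReadOnlyList<string> ranked)
        {
            for (int i = 0; i < ranked.Count; i++)
            {
                double contribution = 1.0 / (k + i + 1); // i + 1 is the 1-based rank
                scores[ranked[i]] = scores.GetValueOrDefault(ranked[i]) + contribution;
            }
        }

        Accumulate(semantic);
        Accumulate(keyword);

        var inSemantic = new HashSet<string>(semantic);
        var inKeyword = new HashSet<string>(keyword);

        return scores
            .Select(kv => (
                Id: kv.Key,
                Score: kv.Value,
                Via: inSemantic.Contains(kv.Key) && inKeyword.Contains(kv.Key) ? "both"
                   : inSemantic.Contains(kv.Key) ? "semantic"
                   : "keyword"))
            .OrderByDescending(hit => hit.Score)
            .ThenBy(hit => hit.Id, StringComparer.Ordinal) // deterministic tie-break
            .ToList();
    }
}
```

With semantic hits `[a, b, c]` and keyword hits `[b, d]`, `b` scores 1/62 + 1/61 and ranks first, tagged `both`, ahead of `a` (1/61), `d` (1/62), and `c` (1/63). An item found by both retrievers beats an item ranked first by only one, which is the behaviour the formula is designed to produce.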

CLI

| Command | Does |
| --- | --- |
| import <path> | Auto-detect adapter and ingest an export |
| search "<query>" | Hybrid search; --from --to --source --limit |
| timeline | Chronological events with filters |
| entity "<name>" | Look up a person/place |
| stats | Counts by source and type, active embedder |
| serve | Run the MCP server over stdio |
| model fetch | Download the semantic model (opt-in) |
| eval | Run the benchmark |

Vault location: $BACKSTORY_DB or ~/.backstory/backstory.db.
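
The lookup order above could be implemented along these lines. This is a sketch, assuming an empty `BACKSTORY_DB` is treated the same as an unset one; `VaultLocation.Resolve` is a hypothetical helper.

```csharp
using System;
using System.IO;

// Hypothetical helper mirroring the documented lookup order.
public static class VaultLocation
{
    public static string Resolve()
    {
        // An explicit override wins.
        string? fromEnv = Environment.GetEnvironmentVariable("BACKSTORY_DB");
        if (!string.IsNullOrWhiteSpace(fromEnv))
            return fromEnv;

        // Otherwise fall back to ~/.backstory/backstory.db.
        string home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
        return Path.Combine(home, ".backstory", "backstory.db");
    }
}
```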

MCP server

Built on the official ModelContextProtocol C# SDK and served over stdio. Register it with any MCP client:

{
  "mcpServers": {
    "backstory": { "command": "backstory", "args": ["serve"] }
  }
}

| Tool | Purpose |
| --- | --- |
| search_timeline | Natural-language search over the timeline |
| get_events | Full event records by id (incl. source pointer) |
| lookup_entity | Resolve a person/place by name |
| summarize_period | All events in a range for the agent to summarize |
| list_sources | Sources ingested and their event counts |

Benchmark

Reproducible with backstory eval, which ingests bundled fixtures and measures two things: ingestion coverage (events emitted vs. records present in the fixture, which surfaces silent loss) and Recall@5 over a hand-built set of questions paired with gold answers.

| Embedder | Ingestion coverage | Recall@5 |
| --- | --- | --- |
| Hashing | 100% | 87.5% |
| ONNX MiniLM | 100% | 100% |
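
For readers unfamiliar with the metric, here is a minimal sketch of how Recall@k can be computed. It assumes each question has one or more gold event ids and averages per-question recall; `EvalCase` and `RecallMetric` are hypothetical names, and the harness's exact definition may differ.

```csharp
using System.Collections.Generic;
using System.Linq;

// One benchmark question: the ranked ids a search returned, and the ids it should find.
public sealed record EvalCase(IReadOnlyList<string> Ranked, IReadOnlyCollection<string> Gold);

public static class RecallMetric
{
    // Mean, over all questions, of the fraction of gold ids found in the top k.
    public static double RecallAtK(IReadOnlyList<EvalCase> cases, int k = 5)
    {
        if (cases.Count == 0) return 0.0;

        return cases.Average(c =>
        {
            if (c.Gold.Count == 0) return 1.0; // nothing to find
            var topK = new HashSet<string>(c.Ranked.Take(k));
            int found = c.Gold.Count(id => topK.Contains(id));
            return (double)found / c.Gold.Count;
        });
    }
}
```

With a single gold answer per question this reduces to a hit rate: a question counts only if its answer appears among the first five results.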

Extending: a new adapter

  1. Implement ISourceAdapter in Backstory.Adapters.
  2. Emit EntityItem/EventItem with deterministic Ids.ContentHash ids.
  3. Register it in the CLI's adapter list.
  4. Add a fixture + a coverage assertion to the eval harness.

Nothing else changes — storage, search, and MCP are source-agnostic.
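
Steps 1 and 2 might look like the sketch below, written for a made-up export format with one `ISO-timestamp<TAB>text` note per line. The `ImportItem` and `EventItem` records here are simplified stand-ins so the example compiles on its own; the real Backstory.Core types will have a different shape, and the inline SHA-256 id stands in for `Ids.ContentHash`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Runtime.CompilerServices;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

// Simplified stand-ins for the real Backstory.Core types (assumptions).
public abstract record ImportItem;
public sealed record EventItem(
    string Id, DateTimeOffset Timestamp, string Source, string SubType, string Text) : ImportItem;

public interface ISourceAdapter
{
    string Source { get; }
    bool CanHandle(string path);
    IAsyncEnumerable<ImportItem> ParseAsync(string path, CancellationToken ct);
}

// Hypothetical adapter for a made-up "*.notes.tsv" export.
public sealed class NotesTsvAdapter : ISourceAdapter
{
    public string Source => "notes_tsv";

    public bool CanHandle(string path) =>
        path.EndsWith(".notes.tsv", StringComparison.OrdinalIgnoreCase) && File.Exists(path);

    public async IAsyncEnumerable<ImportItem> ParseAsync(
        string path, [EnumeratorCancellation] CancellationToken ct)
    {
        await foreach (string line in File.ReadLinesAsync(path, ct))
        {
            string[] cols = line.Split('\t', 2);
            if (cols.Length != 2 ||
                !DateTimeOffset.TryParse(cols[0], CultureInfo.InvariantCulture,
                                         DateTimeStyles.AssumeUniversal, out DateTimeOffset ts))
            {
                continue; // malformed row: skip it rather than fail the import
            }

            ts = ts.ToUniversalTime();
            // Deterministic id derived from content, so re-imports dedupe.
            string canonical = $"{Source}\n{ts:O}\n{cols[1]}";
            string id = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonical)));
            yield return new EventItem(id, ts, Source, "note", cols[1]);
        }
    }
}
```

Streaming rows through `IAsyncEnumerable` keeps memory flat for large exports and lets the pipeline cancel mid-file through the token.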

Privacy

100% local, no telemetry. The only network access in the entire project is the opt-in, one-time embedding-model download triggered by model fetch — never your data. The .gitignore also prevents vault databases, models, and exports from ever being committed.