Backstory
Your data exports, finally searchable — and entirely on your machine. A local-first explorer that turns Google Takeout and Telegram exports into one searchable timeline of your life, queryable from a CLI or any agent over MCP.
.NET 10 · Local-first · MCP · Zero cloud
Overview
Every service offers "download your data," but it comes back as an unbrowsable pile of JSON/CSV.
Backstory ingests those exports through per-source adapters into one normalized
Event/Entity model, indexes it for hybrid semantic + keyword
search, and exposes it to agents over MCP. Because this is the most
personal data you own, everything runs locally with no API calls.
Architecture
The mess of each export format is quarantined inside a per-source adapter; everything downstream works on one clean schema.
[Diagram: architecture layers, colour-coded]
Projects
| Project | Responsibility |
|---|---|
| Backstory.Core | Domain types + interfaces (ISourceAdapter, IEmbeddingService, IVectorStore, IEventStore, IEntityStore) |
| Backstory.Adapters | Telegram + Google Takeout parsers |
| Backstory.Storage | SQLite stores, FTS5, brute-force vectors |
| Backstory.Embeddings | Hashing + ONNX embedders, model downloader |
| Backstory.Query | Ingestion pipeline + hybrid search |
| Backstory.Mcp | MCP server + tools |
| Backstory.Cli | Command-line entrypoint |
| Backstory.Eval | Benchmark harness |
Data flow: one search
A query fans out to the semantic and keyword retrievers in parallel, and their ranked results are fused by Reciprocal Rank Fusion. RRF ranks by position, so no score calibration between retrievers is needed.
Data model
Two normalized shapes. Originals are kept verbatim on disk and referenced via RawRef,
so normalization never loses information.
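public record Event
{
    public string Id { get; init; }                       // content hash; re-imports dedupe
    public DateTimeOffset Timestamp { get; init; }        // UTC
    public string Source { get; init; }                   // "telegram" | "google_takeout"
    public string SubType { get; init; }                  // "telegram_message" | "search_query" | ...
    public string Text { get; init; }                     // searchable content
    public double? Latitude { get; init; }
    public double? Longitude { get; init; }
    public IReadOnlyList<string> ActorIds { get; init; }  // people involved (entity ids)
    public string? PlaceId { get; init; }
    public RawRef Raw { get; init; }                      // pointer to the original record
    public JsonElement Metadata { get; init; }
}

public record Entity
{
    public string Id { get; init; }
    public EntityKind Kind { get; init; }                 // Person | Place | Org
    public string CanonicalName { get; init; }
    public IReadOnlyList<string> Aliases { get; init; }
}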
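Content-hash ids are what make re-imports idempotent: the same record always maps to the same id. As a rough sketch of the idea, the helper below hashes a few identifying fields with SHA-256. The class name, the choice of fields, and the separator are illustrative assumptions, not Backstory's actual Ids.ContentHash.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ContentHashSketch
{
    // Hypothetical helper: derives a stable id from fields that identify a record.
    public static string Compute(string source, string subType, DateTimeOffset timestamp, string text)
    {
        // Normalize the timestamp to UTC in round-trip ("O") format so the same
        // instant hashes identically regardless of the offset it was parsed with.
        // The unit separator (U+001F) keeps adjacent fields from running together.
        string canonical = string.Join('\u001f',
            source, subType, timestamp.ToUniversalTime().ToString("O"), text);

        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Because the id depends only on content, importing the same export twice yields the same ids, and the store can drop the duplicates.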
Adapters
An adapter encapsulates one export's quirks and emits normalized items. The interface is tiny:
interface ISourceAdapter
{
    string Source { get; }
    bool CanHandle(string path);
    IAsyncEnumerable<ImportItem> ParseAsync(string path, CancellationToken ct);
}
| Source | Data types (v1) | Notes |
|---|---|---|
| Telegram | messages, contacts, senders | Full + single-chat JSON; text may be a string or an array of parts |
| Google Takeout | Search, YouTube, Maps saves, Semantic Location History | Each type parsed defensively; a bad/absent file is skipped |
Parsing is resilient by design: a malformed file is skipped rather than failing the whole import, and coordinate-format variance (geometry array vs. lat/lon fields) is handled inside the adapter.
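One concrete quirk from the table above is Telegram's text field, which may be a plain string or an array of parts. A minimal sketch of flattening it with System.Text.Json follows. It assumes that array parts are either strings or objects carrying their own "text" property, and the TelegramTextSketch class is hypothetical rather than the adapter's real code.

```csharp
using System.Text;
using System.Text.Json;

public static class TelegramTextSketch
{
    // Flattens a "text" value that is either a string or an array mixing
    // strings and objects with a "text" property (assumed shape).
    public static string Flatten(JsonElement text)
    {
        switch (text.ValueKind)
        {
            case JsonValueKind.String:
                return text.GetString() ?? string.Empty;

            case JsonValueKind.Array:
                var sb = new StringBuilder();
                foreach (JsonElement part in text.EnumerateArray())
                {
                    if (part.ValueKind == JsonValueKind.String)
                    {
                        sb.Append(part.GetString());
                    }
                    else if (part.ValueKind == JsonValueKind.Object
                             && part.TryGetProperty("text", out JsonElement inner)
                             && inner.ValueKind == JsonValueKind.String)
                    {
                        sb.Append(inner.GetString());
                    }
                    // Any other shape is skipped rather than failing the import.
                }
                return sb.ToString();

            default:
                // Missing or unexpected value: treat as empty text.
                return string.Empty;
        }
    }
}
```

Unknown shapes fall through to an empty string instead of throwing, in keeping with the skip-don't-fail approach described above.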
Storage
One SQLite file per vault. Three concerns, one store each.
events(id, ts, source, subtype, text, lat, lon, place_id, raw_file, raw_locator, metadata)
event_actors(event_id, entity_id)
entities(id, kind, canonical_name, metadata)
entity_aliases(kind, alias, entity_id) -- case-insensitive lookup
embeddings(event_id, vector BLOB) -- float32
events_fts USING fts5(id, text) -- keyword index
Vectors are stored as float32 BLOBs and searched with brute-force cosine over in-memory vectors —
milliseconds at personal scale (10k–1M vectors), and zero extra dependencies. The
IVectorStore interface allows swapping in sqlite-vec or HNSW if needed.
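To illustrate why brute force is enough at this scale, here is a minimal sketch of cosine search over in-memory vectors. The BruteForceSearchSketch class and its method names are hypothetical, and the BLOB decoder assumes the float32 values were written in the machine's native byte order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

public static class BruteForceSearchSketch
{
    // Reinterprets a float32 BLOB as a float array (native byte order assumed).
    public static float[] Decode(byte[] blob) =>
        MemoryMarshal.Cast<byte, float>(blob.AsSpan()).ToArray();

    // Cosine similarity; returns 0 when either vector has zero norm.
    public static float Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Dimension mismatch.");
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0f || normB == 0f) return 0f;
        return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
    }

    // Scores every stored vector against the query and keeps the k best.
    public static List<(string EventId, float Score)> TopK(
        IReadOnlyDictionary<string, float[]> vectors, float[] query, int k)
    {
        return vectors
            .Select(kv => (EventId: kv.Key, Score: Cosine(kv.Value, query)))
            .OrderByDescending(hit => hit.Score)
            .Take(k)
            .ToList();
    }
}
```

The scan is linear in the number of vectors. If the stored vectors are already L2-normalised, the cosine reduces to a dot product.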
Embeddings
Two implementations behind one interface, both 384-dim, so they're interchangeable:
| Embedder | How | Quality |
|---|---|---|
| Hashing (default) | Signed feature hashing of word + char n-grams, L2-normalised. Zero assets, deterministic, offline. | Lexical: matches shared words/chars |
| ONNX MiniLM | all-MiniLM-L6-v2 via ONNX Runtime + BERT WordPiece tokenizer, mean-pooled, L2-normalised. | Semantic — matches meaning |
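The hashing row can be made concrete with a small sketch of signed feature hashing. Everything specific here is an assumption chosen for illustration: the FNV-1a hash, character trigrams with boundary markers, and whitespace tokenization. The real embedder may differ on any of these.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HashingEmbedderSketch
{
    public const int Dimensions = 384;

    // FNV-1a gives a stable hash; string.GetHashCode is randomized per process.
    private static uint Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in Encoding.UTF8.GetBytes(s))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Emits one feature per word plus its character trigrams (assumed n = 3).
    private static IEnumerable<string> Features(string text)
    {
        char[] separators = { ' ', '\t', '\n', '\r' };
        foreach (string word in text.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return "w:" + word;
            string padded = "^" + word + "$";
            for (int i = 0; i + 3 <= padded.Length; i++)
                yield return "c:" + padded.Substring(i, 3);
        }
    }

    public static float[] Embed(string text)
    {
        var vector = new float[Dimensions];
        foreach (string feature in Features(text))
        {
            uint h = Fnv1a(feature);
            int index = (int)(h % Dimensions);              // bucket for this feature
            float sign = (h & 0x80000000) == 0 ? 1f : -1f;  // top bit picks the sign
            vector[index] += sign;
        }

        // L2-normalise so cosine similarity reduces to a dot product.
        double sumSquares = 0;
        foreach (float v in vector) sumSquares += v * v;
        if (sumSquares == 0) return vector;
        float norm = (float)Math.Sqrt(sumSquares);
        for (int i = 0; i < vector.Length; i++) vector[i] /= norm;
        return vector;
    }
}
```

Two texts score as similar only when they share words or character fragments, which is why this embedder is lexical rather than semantic.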
Run backstory model fetch once (~90 MB) and the factory selects ONNX automatically.
The semantic embedder is the difference between "japan vacation" finding nothing and finding
"flight to Tokyo", and it lifts Recall@5 from 87.5% to 100% on the bundled benchmark.
Hybrid search
The query is embedded and run against the vector index; the same text runs against FTS5 with its terms OR-ed. The two ranked lists are fused with Reciprocal Rank Fusion:
score(doc) = Σ 1 / (k + rank_in_list) // k = 60
Candidates are then filtered by date/source/subtype and the top results returned, each tagged
semantic, keyword, or both.
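The fusion step can be sketched in a few lines. This is a minimal illustration, assuming 1-based ranks and ordinal tie-breaking on the id. The RrfSketch class and its tuple shape are hypothetical, and the date/source/subtype filtering is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RrfSketch
{
    // Fuses two ranked id lists (best first) with Reciprocal Rank Fusion.
    public static List<(string Id, double Score, string MatchedBy)> Fuse(
        IReadOnlyList<string> semantic, IReadOnlyList<string> keyword, int k = 60)
    {
        var scores = new Dictionary<string, double>();

        void Accumulate(IReadOnlyList<string> ranked)
        {
            for (int i = 0; i < ranked.Count; i++)
            {
                // Rank is 1-based, so the top hit contributes 1 / (k + 1).
                scores.TryGetValue(ranked[i], out double current);
                scores[ranked[i]] = current + 1.0 / (k + i + 1);
            }
        }

        Accumulate(semantic);
        Accumulate(keyword);

        var inSemantic = new HashSet<string>(semantic);
        var inKeyword = new HashSet<string>(keyword);

        return scores
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal) // deterministic tie-break
            .Select(p => (
                Id: p.Key,
                Score: p.Value,
                MatchedBy: inSemantic.Contains(p.Key)
                    ? (inKeyword.Contains(p.Key) ? "both" : "semantic")
                    : "keyword"))
            .ToList();
    }
}
```

A document found by both retrievers sums two reciprocal terms, so it tends to outrank one found by only a single retriever, and no raw scores ever need to be compared.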
CLI
| Command | Does |
|---|---|
| import <path> | Auto-detect adapter and ingest an export |
| search "<query>" | Hybrid search; --from --to --source --limit |
| timeline | Chronological events with filters |
| entity "<name>" | Look up a person/place |
| stats | Counts by source and type, active embedder |
| serve | Run the MCP server over stdio |
| model fetch | Download the semantic model (opt-in) |
| eval | Run the benchmark |
Vault location: $BACKSTORY_DB or ~/.backstory/backstory.db.
MCP server
Built on the official ModelContextProtocol C# SDK, spoken over stdio. Register it
with any MCP client:
{
  "mcpServers": {
    "backstory": { "command": "backstory", "args": ["serve"] }
  }
}
| Tool | Purpose |
|---|---|
| search_timeline | Natural-language search over the timeline |
| get_events | Full event records by id (incl. source pointer) |
| lookup_entity | Resolve a person/place by name |
| summarize_period | All events in a range for the agent to summarize |
| list_sources | Sources ingested and their event counts |
Benchmark
Reproducible with backstory eval — ingests bundled fixtures and measures coverage
(emitted vs. present, surfacing silent loss) and Recall@5 over a hand-built question→gold set.
| Embedder | Ingestion coverage | Recall@5 |
|---|---|---|
| Hashing | 100% | 87.5% |
| ONNX MiniLM | 100% | 100% |
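For reference, Recall@5 can be computed as the fraction of questions whose gold event appears in the first five results. The sketch below assumes one gold id per question. The RecallSketch class is hypothetical and may not match how the eval harness structures its data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RecallSketch
{
    // Fraction of questions whose gold id appears among the first k ranked ids.
    public static double RecallAtK(
        IEnumerable<(IReadOnlyList<string> RankedIds, string GoldId)> questions, int k = 5)
    {
        int total = 0, hits = 0;
        foreach (var (rankedIds, goldId) in questions)
        {
            total++;
            if (rankedIds.Take(k).Contains(goldId)) hits++;
        }
        // An empty question set is reported as 0 rather than dividing by zero.
        return total == 0 ? 0.0 : (double)hits / total;
    }
}
```

As a worked example of the arithmetic, 7 hits out of 8 questions gives 0.875, that is, 87.5%.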
Extending: a new adapter
- Implement ISourceAdapter in Backstory.Adapters.
- Emit EntityItem/EventItem with deterministic Ids.ContentHash ids.
- Register it in the CLI's adapter list.
- Add a fixture + a coverage assertion to the eval harness.
Nothing else changes — storage, search, and MCP are source-agnostic.
Privacy
100% local, no telemetry. The only network access in the entire project is the opt-in,
one-time embedding-model download triggered by model fetch — never your data.
The .gitignore also prevents vault databases, models, and exports from ever being committed.