Creating a Search Engine in 2025
March 21, 2025
Building Search Infrastructure from Scratch
When I decided to build Cluster, the first thing I had to figure out wasn't the search algorithm — it was where all the data was going to live. A search engine needs to crawl, store, and index a massive amount of the web. We're talking petabytes of raw content, metadata, and index structures. Cloud costs for that kind of storage are astronomical, so I went a different route: I built the infrastructure myself.
Servers in the Basement
I set up a server rack in my basement. The logic was simple — if you need to store and process data at this scale on a startup budget, you either figure out the hardware yourself or you don't build the thing at all. I sourced rack-mount servers, wired up networking, and built out the storage layer from the ground up.
There's something grounding about hearing your search engine physically humming in the next room. Here's the first rack up and running:
Running your own hardware means you own every layer of the stack. No cloud provider rate-limiting your IOPS, no surprise bills when your crawler spikes throughput at 3 AM. But it also means you're the one replacing a failed drive on a Sunday morning.
ScyllaDB for the First Iteration
For the database layer, I went with ScyllaDB. When you're building a search engine, your database needs to handle millions of writes per second during crawls and serve low-latency reads when users query the index. Traditional relational databases buckle under that kind of workload.
ScyllaDB is a C++ reimplementation of Cassandra designed for high-throughput, low-latency workloads on modern hardware. It lets you saturate your NVMe drives and network without the JVM overhead and GC pauses that plague Cassandra at scale. For the first iteration of Cluster's data layer, it was the right fit — its wide-column data model was flexible enough to iterate on quickly, and fast enough to keep up with an aggressive crawl schedule.
The write path was straightforward: the crawler feeds raw page content and extracted metadata into ScyllaDB, partitioned by domain. The read path serves the search index, pulling ranked results with single-digit millisecond latency even under concurrent query load.
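The layout above can be sketched with a minimal in-memory model — the table and column names are illustrative, not the actual schema, but they show why partitioning by domain keeps a site's pages together:

```python
# Minimal in-memory sketch of the write/read path described above.
# The layout mirrors a hypothetical CQL table (names are illustrative):
#
#   CREATE TABLE pages (
#       domain  text,
#       url     text,
#       title   text,
#       body    text,
#       PRIMARY KEY ((domain), url)
#   );
#
# Partitioning by domain means all of a site's pages hash to the same
# replica set, so a crawl of one domain is a stream of writes to one
# partition rather than scattered across the cluster.

from collections import defaultdict

# partition key (domain) -> clustering key (url) -> row
pages: dict[str, dict[str, dict]] = defaultdict(dict)

def store_page(domain: str, url: str, title: str, body: str) -> None:
    """Write path: upsert a crawled page into its domain partition."""
    pages[domain][url] = {"title": title, "body": body}

def read_domain(domain: str) -> list[dict]:
    """Read path: fetch every page stored under one domain partition."""
    return list(pages[domain].values())

store_page("example.com", "https://example.com/", "Example", "<html>...</html>")
store_page("example.com", "https://example.com/about", "About", "<html>...</html>")
```

In the real system the same access pattern goes through a CQL driver instead of a dict, but the shape of the keys — partition on domain, cluster on URL — is the part that determines write locality.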
MicroCeph for Object Storage
Raw HTML, screenshots, PDFs, cached assets — a search engine accumulates a lot of unstructured content that doesn't belong in a database. For that, I set up MicroCeph, a lightweight Ceph deployment that runs distributed object storage across the same physical servers.
Ceph gives you S3-compatible object storage that scales horizontally — add more drives, add more nodes, and the cluster rebalances automatically. MicroCeph strips away the operational complexity of a full Ceph deployment and makes it manageable for a small team. I configured it with erasure coding for storage efficiency — you get redundancy without tripling your disk usage.
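The efficiency argument is easy to make concrete. Erasure coding splits each object into k data chunks plus m parity chunks; any k of the k+m chunks can reconstruct the object, so the cluster tolerates m failed drives at a raw-storage cost of (k+m)/k. The profile below (k=4, m=2) is an assumed example, not necessarily the one Cluster runs:

```python
# Storage overhead of erasure coding vs. replication.
# An EC profile of k=4, m=2 (assumed values for illustration) splits each
# object into 4 data chunks and 2 parity chunks: any 4 of the 6 chunks
# reconstruct the object, so it survives any 2 drive failures.

def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte under k+m erasure coding."""
    return (k + m) / k

def replication_overhead(rf: int) -> float:
    """Raw bytes stored per logical byte under rf-way replication."""
    return float(rf)

print(ec_overhead(4, 2))        # 1.5x disk usage, tolerates 2 failures
print(replication_overhead(3))  # 3.0x disk usage for the same tolerance
```

That 1.5x versus 3.0x gap is the "redundancy without tripling your disk usage" trade: the price is extra CPU for parity math and slower recovery when a drive does fail.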
The combination of ScyllaDB for structured data and MicroCeph for object storage gave Cluster its first complete storage architecture. Fast key-value access for the search index, cheap bulk storage for everything else, all running on hardware I control in my basement.
What I Learned
Building infrastructure this way forces you to understand every trade-off. You learn why certain design decisions exist in distributed systems — not from a textbook, but from watching a replication factor of 2 lose data when two drives fail in the same week. You learn that network bandwidth between racks matters more than raw CPU for a crawler workload. You learn that the "boring" parts of infrastructure — monitoring, alerting, backups — are the parts that actually keep the system alive.
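The replication-factor lesson can be stated precisely: with RF=2, any partition whose two replicas happen to live on the two failed drives is gone; with RF=3, no double failure can lose data, because a third copy always survives. A small deterministic sketch (drive counts are illustrative):

```python
# Why RF=2 loses data on a double drive failure and RF=3 does not.
# Place one partition on every distinct set of `rf` drives, then count
# the partitions whose replicas ALL sit on the failed drives.

from itertools import combinations

def lost_partitions(num_drives: int, rf: int, failed: set[int]) -> int:
    """Count replica placements fully contained in the failed-drive set."""
    placements = combinations(range(num_drives), rf)
    return sum(1 for replicas in placements if set(replicas) <= failed)

failed = {0, 1}  # two drives die in the same week
print(lost_partitions(10, rf=2, failed=failed))  # one placement loses both copies
print(lost_partitions(10, rf=3, failed=failed))  # zero: a third replica survives
```

With RF=2 the failure isn't a question of if but of which partitions were unlucky; RF=3 turns a double failure from data loss into a re-replication chore.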
This was just the first iteration. The architecture has evolved since then, but every decision I've made since has been informed by what I learned running those first servers in my basement.