Chapter 10 of 15

Databases

There is no one true database

I used to be a DBA in a past career, so I really care about databases. But you are going to eventually need a database whether you care about them or not, so it’s a good idea to know what you’re talking about.

If you want to get really in depth, I once wrote a primer about what databases are and how they work. But you don’t need to read it to follow along.

The important thing to understand about databases is this: there is no database that is always good. They all have different trade-offs. They all suck under different conditions. There is no database you can pick, declare it your favorite, and use for every application. Learn about as many databases as you can, and think about the constraints you’re operating under before you pick one.

Memory

The first “database” to consider isn’t really a database at all — it’s just memory. You don’t always need a database. Sometimes you can just hold data in memory and leave it there. If your data doesn’t need to survive a restart, if it’s purely ephemeral, consider not writing it down at all.
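As a minimal sketch of "just memory" in Python: a plain dictionary wrapped in a small class, with an optional expiry per entry. The names here (`EphemeralStore`, `put`, `get`) are made up for illustration. Everything vanishes when the process exits, which is the point.

```python
import time


class EphemeralStore:
    """Hypothetical in-process store: nothing here survives a restart."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry is easy to test
        self._items = {}         # key -> (value, expires_at or None)

    def put(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._items[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily drop expired entries
            return default
        return value


store = EphemeralStore()
store.put("session:42", {"user": "ada"}, ttl_seconds=30)
print(store.get("session:42"))  # the dict we just stored
print(store.get("missing"))     # None
```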

IndexedDB

If your use case is “I’m writing a web app that needs to store data locally but doesn’t need it to persist on a server,” IndexedDB is excellent. It’s built into every modern browser. It’s like an object store: you put stuff in, you pull it back out, and you can query it. You can even access it from web workers. But the data lives in one browser on one device, so if you need data to follow a user from their phone to their laptop, this isn’t your solution.

Redis

Redis is like memory but accessible from multiple machines. It stores anything you want (objects, functions, whatever) and it can be accessed from a whole fleet of front-end servers talking to one central database. It also persists to disk, but here’s the critical detail: by default it does not write to disk on every change. It flushes periodically: about once a second if you turn on the append-only file and leave it at its default “appendfsync everysec” setting, and less often than that if you rely on snapshots alone.

If Redis loses power, anything that happened since the last flush is gone forever, even if Redis told you the write succeeded. For most web apps, this doesn’t matter. For a financial application where you can’t afford to forget that someone sent money somewhere, it’s a deal-breaker. For a game where that final second is when someone got a headshot that won the match, they’re going to be pissed.

Redis also has useful features like atomic operations, which make it great for queues and locks.
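A sketch of how those atomic operations turn into a lock and a queue, assuming the redis-py client. `SET` with `nx=True` and `ex=...`, `LPUSH`, and `RPOP` are real Redis commands exposed by redis-py; the function names, key names, and timeout are invented for this example. The functions take the client as an argument, so nothing here connects to a server on its own.

```python
import uuid


def try_acquire_lock(client, name, ttl_seconds=10):
    """Return a token if we got the lock, else None.

    SET with NX only writes if the key does not exist yet, and Redis runs
    each command atomically, so two callers cannot both win. EX makes the
    lock expire on its own if the holder crashes.
    """
    token = uuid.uuid4().hex
    if client.set(name, token, nx=True, ex=ttl_seconds):
        return token
    return None


def enqueue(client, queue, payload):
    """Push a job onto the left end of a Redis list."""
    client.lpush(queue, payload)


def dequeue(client, queue):
    """Pop a job off the right end; RPOP hands each item to one caller only."""
    return client.rpop(queue)
```

With a real server this would be called as something like `try_acquire_lock(redis.Redis(), "lock:report")`. Releasing the lock safely (only if you still hold the token) takes a little more care and is left out of this sketch.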

MongoDB

MongoDB is great for prototypes because it’s really simple to use and fast to get up and running. It claims to be a “schemaless” database, but it’s not really schemaless — the schema just lives in your code instead of the database. If you have two versions of your code running at the same time and they disagree about the schema, you suddenly have an irreconcilable data corruption problem.
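To make "the schema lives in your code" concrete, here is a sketch that uses plain Python dictionaries as stand-ins for documents; no MongoDB driver is involved. The field names and the `schema_version` convention are assumptions for the example. It is one common way to cope, not something the database does for you.

```python
def write_user_v1(name):
    # Old release: a single "name" field.
    return {"schema_version": 1, "name": name}


def write_user_v2(first_name, last_name):
    # New release: the same collection now holds a different shape.
    return {"schema_version": 2, "first_name": first_name, "last_name": last_name}


def read_user(doc):
    """Normalize whichever shape we find to the v2 layout."""
    version = doc.get("schema_version", 1)
    if version == 1:
        first, _, last = doc["name"].partition(" ")
        return {"first_name": first, "last_name": last}
    if version == 2:
        return {"first_name": doc["first_name"], "last_name": doc["last_name"]}
    raise ValueError(f"unknown schema_version: {version}")


# Both shapes sit side by side, and nothing in the store objects.
collection = [write_user_v1("Ada Lovelace"), write_user_v2("Grace", "Hopper")]
for doc in collection:
    print(read_user(doc))
```

Code that skips the version check and reads `doc["first_name"]` directly fails on every document the old release wrote, and the database never complained about either shape.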

I have never understood what people like about MongoDB, but it is undeniably popular, so I include it here.

MySQL / MariaDB

MySQL is what everybody thinks of when they think “relational database.” It’s what all other relational databases are measured against. It’s extremely easy to install and use, and it has most of the features you’ll need.

Fun fact about the name: MySQL is named after the creator’s daughter, My (a Swedish name, not the English possessive). He also has a daughter named Maria, hence MariaDB, the community fork created after Oracle acquired MySQL. And he has a son named Max, hence MaxDB. The guy just names databases after his children.

Postgres

Postgres is better than MySQL in every way except that it’s harder to install. It has more features, it’s more scalable, and it has JSON as a first-class data type (so if the thing you liked about MongoDB was storing objects and pulling them back out, Postgres can do that too, but with proper indexes and geospatial queries on top).

For some reason, the Postgres team has been writing Postgres for 25+ years and it still hasn’t occurred to them to make installation easy, and administration is still cryptic. So: get somebody else to install and administer Postgres for you, and then use it.

Oracle

Oracle is the granddaddy of databases. It does everything. It’s amazing. And it costs a minimum of six figures to get in the door.

But here’s the thing about working at a well-funded startup: sometimes you can be a hero by setting money on fire. Your team built this thing on Postgres and it’s too slow and you can’t make it scale? Spend a million bucks on Oracle and suddenly your web app is five times faster. Everyone calls you a hero. Sometimes that’s a real option.

Oracle also comes with a person — that’s part of what your six-figure investment buys. Someone shows up who knows how to make it work. That’s awesome. Expensive, but awesome.

Elasticsearch

Elasticsearch is not a general-purpose database — it’s a specialized search engine. It lets you do things like weighting headline matches over body text, and stemming so a search for “story” also finds “stories.” If you’re doing any kind of search, this is what you should use.
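As a sketch, both ideas fit in a few lines of Elasticsearch's JSON query DSL, built here as Python dictionaries. The `multi_match` query, the `^3` field boost syntax, and the built-in `english` analyzer (which stems) are part of Elasticsearch; the field names `headline` and `body` and the boost value are assumptions for the example. Nothing here talks to a cluster; it only builds the request bodies.

```python
import json


def article_index_settings():
    """Index body: analyze both text fields with the stemming `english` analyzer."""
    return {
        "mappings": {
            "properties": {
                "headline": {"type": "text", "analyzer": "english"},
                "body": {"type": "text", "analyzer": "english"},
            }
        }
    }


def article_search(text, headline_boost=3):
    """Search body: matches in the headline are boosted over matches in the body."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [f"headline^{headline_boost}", "body"],
            }
        }
    }


print(json.dumps(article_search("story"), indent=2))
```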

But never use Elasticsearch as your source of truth. Right there in the documentation, Elasticsearch tells you that occasionally it will freak out and drop everything. Keep the real data in a real database and use Elasticsearch as an incredibly flexible search cache.

The file system

Like memory, the file system isn’t technically a database, but it’s a perfectly good place to store data. It’s flexible, scalable, searchable, and easy to back up.

LinkedIn is a great example: when they needed to build the fastest possible message bus, they locked a bunch of engineers in a room and told them to build it. Those engineers could have invented a new database or a new protocol, but what they built — Kafka — just writes stuff to the file system and reads it back. Because file systems are incredibly good at caching, frequent reads, and all the things a message bus needs.
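The core idea, appending records to a file and reading them back by offset, can be sketched in a few lines. This is a toy and not how Kafka is actually implemented; the class name, the length-prefixed record format, and the file name are all made up for illustration.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy log: each record is a 4-byte big-endian length followed by the payload."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it is missing

    def append(self, payload: bytes) -> int:
        """Append a record and return the byte offset where it starts."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(struct.pack(">I", len(payload)) + payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to put it on disk
            return offset

    def read_from(self, offset=0):
        """Yield (next_offset, payload) for every record at or after `offset`."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            while header := f.read(4):
                (length,) = struct.unpack(">I", header)
                payload = f.read(length)
                yield f.tell(), payload


log_path = os.path.join(tempfile.mkdtemp(), "messages.log")
log = AppendOnlyLog(log_path)
log.append(b"hello")
second = log.append(b"world")
print([payload for _, payload in log.read_from(second)])
```

A consumer only has to remember the last offset it saw, and repeated reads of recent records tend to be served from the operating system's file cache.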

If you have a bunch of images or binary files, don’t put them in a database. Put them on the file system, or for durability, on something like Amazon S3 — which is slow but scales infinitely and has uptime guarantees that are borderline mythological.

Replication

Most of these databases support some form of replication. If your one database is handling 10,000 queries per second and it’s on fire, but you look and see that only 10 of those queries are writes and the other 9,990 are reads, you can set up replicas. Write to one primary database, replicate to four read replicas, send the reads to the replicas, and suddenly each one is only doing about 2,500 reads per second. Your read capacity just got four times bigger with barely any change to your application.
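A sketch of that split from the application's side: writes go to the primary, reads rotate across the replicas. The `ReplicaRouter` class and the connection names are hypothetical, and telling a read from a write by its first SQL keyword is a deliberate simplification.

```python
import itertools


class ReplicaRouter:
    """Hypothetical router: one primary for writes, round-robin replicas for reads."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._replicas = itertools.cycle(replicas or [primary])

    def pick(self, sql):
        # Simplification: treat anything that starts with SELECT as a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


router = ReplicaRouter("primary", ["replica-1", "replica-2", "replica-3", "replica-4"])
print(router.pick("INSERT INTO orders VALUES (1)"))  # primary
print(router.pick("SELECT * FROM orders"))           # replica-1
print(router.pick("SELECT * FROM orders"))           # replica-2

# The arithmetic from the paragraph above: 9,990 reads spread over 4 replicas.
reads_per_second = 10_000 - 10
print(reads_per_second / 4)                          # 2497.5
```

One caveat this sketch ignores: replicas usually lag slightly behind the primary, so a read issued right after a write may not see it yet.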

Backups

Replication is not a backup strategy. If somebody accidentally runs DROP TABLE on your primary, that drop gets replicated to every replica. Suddenly all the data is gone from all the copies.

You must:

Have backups — whatever database you use.
Restore your backups regularly. Don’t just make them; actually test that they work. GitLab had an outage in 2017 that lasted many hours and cost them around six hours of production data, because they had several forms of backup and none of them worked when they needed them.
Secure your backups — they should be somewhere that even you can’t easily delete in a fit of anger. A company called Code Spaces went out of business overnight because someone got their AWS credentials and deleted every instance, every volume, every backup. The company woke up the next morning and it was gone. Your backups need to be somewhere that an attacker — or an accident — can’t reach.
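The "restore your backups" step can be automated. A minimal sketch using only Python's standard library: archive a directory, record checksums at backup time, then restore into a scratch directory and compare. The function names and file names are assumptions; a real database backup would use that database's own dump and restore tools, with the same restore-then-verify shape.

```python
import hashlib
import os
import tarfile
import tempfile


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result = {}
    for folder, _, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            result[os.path.relpath(path, root)] = digest
    return result


def make_backup(data_dir, archive_path):
    """Write the archive and return the checksums the data had at backup time."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(data_dir, arcname=".")
    return checksums(data_dir)


def restore_and_verify(archive_path, expected):
    """Restore into a scratch directory and check it matches the recorded checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(scratch)
        return checksums(scratch) == expected


data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "users.db"), "wb") as f:
    f.write(b"pretend this is a database file")

archive = os.path.join(tempfile.mkdtemp(), "backup.tar.gz")
manifest = make_backup(data_dir, archive)
print(restore_and_verify(archive, manifest))  # True when the restore matches
```

Run something like this on a schedule and alert when it returns False, so you find out a backup is broken before the day you need it.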

And speaking of security, let’s speak about security.