
Can You Use Amazon S3 as a Database (storage backend)?

A freelance engineer with a combined 15 years of experience in software and cloud engineering, with many other skills picked up along the way.

I originally wrote this piece in 2024 as an answer on Stack Overflow for a user wondering if they could use Wasabi (an S3-compatible service without request fees) as a distributed key-value database that also supports efficient searching, asking about serialization strategies, how to achieve transactions, and the design considerations involved. I personally found the question a good exercise to answer, although the community decided it was not focused enough and it ended up deleted. I have edited the answer slightly to fit the blog format better.

At its core, Amazon S3 is a very primitive key-value store. In a single operation, you can generally only do three things:

  • List keys
  • Set the value for a key
  • Get the value for a key (possibly only a specific byte range)

Furthermore, the value itself has to be serialized as a string or bytes.
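To make this interface concrete, here is a minimal in-memory stand-in exposing exactly those three primitives, including byte-range reads. This is an illustrative sketch only; against real S3 you would make the equivalent calls through an SDK such as boto3.

```python
# In-memory stand-in for S3's primitive interface: list keys, put a value,
# and get a value (optionally just a byte range). Values are always bytes.

class ObjectStore:
    def __init__(self):
        self._objects = {}  # key -> bytes

    def put(self, key: str, value: bytes) -> None:
        self._objects[key] = value

    def get(self, key: str, byte_range=None) -> bytes:
        data = self._objects[key]
        if byte_range is not None:
            start, end = byte_range      # like HTTP "Range: bytes=start-end"
            return data[start:end + 1]   # Range headers are inclusive
        return data

    def list_keys(self, prefix: str = ""):
        return sorted(k for k in self._objects if k.startswith(prefix))

store = ObjectStore()
store.put("users/jane", b"jane@example.com")
print(store.get("users/jane", byte_range=(0, 3)))  # b'jane'
print(store.list_keys("users/"))                   # ['users/jane']
```

Everything discussed below is built by composing these few operations.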

With that in mind, you can absolutely use S3 (or S3-compatible APIs) as a database, a datastore, or a storage backend. However, as with any architecture, you have to weigh cost, performance, efficiency, ease of implementation, reliability, and quotas. It is crucial to understand your requirements and the scale you have in mind before treating an object store like a traditional database.

Cost and Efficiency at Scale

If your requirement is to read or write one key a day, almost any architecture will work fine. But if you need to read 1,000,000 small, distinct values a second, using S3 directly as a database will quickly become costly and inefficient compared to other approaches. Too many requests might even overwhelm the S3 or Wasabi endpoints, or whatever else sits in between.

Keep in mind that while Amazon S3 charges you per request, alternative providers like Wasabi might not. But that doesn't mean those requests are "free." An HTTPS request requires CPU, memory, and network bandwidth to prepare and send out—resources your application has to pay for. These requests also introduce latency, forcing your client-side application to wait while consuming memory. Any service such as Wasabi will also have limits and quotas; they just may not have shared them with you.

For designing any database, if one understands read and write patterns, they can make it as efficient as possible. For example, if users frequently request a specific group of keys together, it makes logical sense to store them together so they can be read in a single request. Things will never be perfect, but you can statistically optimize your storage so that the majority of requests are efficient.
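As a hypothetical example of matching storage layout to read patterns: if a user's profile fields are always rendered together, storing them as one JSON object turns four GET requests into one. The keys and fields below are made up for illustration.

```python
import json

# One-object-per-field layout: rendering a profile costs four GET requests.
per_field = {
    "users/jane/name": b"Jane",
    "users/jane/email": b"jane@example.com",
    "users/jane/fav_color": b"red",
    "users/jane/signup_year": b"2020",
}

# Grouped layout: the same data under one key, fetched in a single request.
grouped_key = "users/jane"
grouped_value = json.dumps({
    "name": "Jane",
    "email": "jane@example.com",
    "fav_color": "red",
    "signup_year": 2020,
}).encode("utf-8")

profile = json.loads(grouped_value)
print(profile["fav_color"])  # red
```

The flip side of grouping is that updating one field now means rewriting the whole object, which is exactly the kind of read/write trade-off you have to weigh statistically.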

The Problem with One-File-Per-Key

In the database world, you could design a system that uses one file per key/value pair. But modern databases don't do this because it's terribly inefficient for certain operations. Instead, traditional databases keep your data in larger files and structure the internal data for efficient access using indices, data ordering, and tree structures to quickly locate a value.

Such tricks can also be applied with S3 as a storage backend, with the understanding that requests to this storage backend are more expensive than in traditional databases/datastores. S3 becomes suitable once your read/write patterns match its pricing model.

A common S3 architecture that provides queryability, searchability, and transactions while remaining efficient for large amounts of data is the data lake architecture, using table formats such as Iceberg to organize the data. Data is partitioned and ordered according to read patterns, and structured so that the least amount of data needs to be read for each query. Writes can also be optimized depending on the patterns.

The original questions

These were the questions from the original Stack Overflow post, which I tried to answer directly after the above introduction.

1. Algorithms, Search, and Indexing

How do you implement a simple key search (e.g., search for a user starting with "j" or fav_color "red") using only basic GET and PUT interfaces?

To do this, you have to create sorted index files that map values back to their keys, much like a traditional database.

In typical data lake architectures, columnar storage formats like Parquet are incredibly popular for this. Parquet files contain metadata headers at the beginning of the files that define which value ranges live where. Within those sections, the column values are sorted. To search, your application makes one small request to read the initial metadata, determines exactly where the target data lives, and then makes one or two specific byte-range requests to fetch only the relevant section of the file.

Basically, the idea is to store an index or some metadata somewhere that hints at where we can find the answer to our query efficiently. This can live in S3 itself or somewhere else.
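A minimal sketch of that idea: keep a secondary index as its own sorted object mapping values back to primary keys, so a prefix search reads only the small index instead of listing and fetching every record. The data and key names here are hypothetical, and a real implementation would fetch the index object from S3 first.

```python
import bisect

# A secondary index stored as one small, sorted object:
# (indexed value, primary key) pairs, sorted by value.
index = sorted([
    ("alice", "users/1042"),
    ("james", "users/2001"),
    ("jane",  "users/0007"),
    ("joe",   "users/3310"),
    ("maria", "users/0815"),
])

def search_prefix(index, prefix):
    """Return primary keys whose indexed value starts with `prefix`.

    Because the index is sorted, binary search finds the matching
    slice without scanning every entry.
    """
    names = [value for value, _ in index]
    lo = bisect.bisect_left(names, prefix)
    hi = bisect.bisect_left(names, prefix + "\uffff")
    return [key for _, key in index[lo:hi]]

print(search_prefix(index, "j"))
# ['users/2001', 'users/0007', 'users/3310']
```

The same pattern works for any indexed attribute (e.g., a `fav_color` index mapping "red" to the users who chose it); the cost is that every write must also update the index objects.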

2. Choosing a Serialization Strategy

What kind of serialization strategy for both primitive data types (String, Number, Boolean, etc) and Blob data (images, audio, video, and any sort of file) is best for this kind of key-value store? Also, this simple key-value does not have a way to define what type of value is stored in the key (is it a string, number, binary, etc?), how can that be solved?

Any format works, but your choice depends entirely on system requirements:

  • JSON is incredibly popular because it's human-readable and easy to debug. However, it is not space-efficient, and complex data types don't always map to it easily.

  • There are many binary formats one can use. They may be more efficient to read and more space-efficient, but you lose human readability. In the end, this is another trade-off where you have to decide what is more important to you.

  • Columnar formats (like Parquet) give you massive querying benefits and strictly define data types, but they can be slightly larger in file size for small payloads and are more computationally expensive to serialize and write.
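One way to solve the "what type is this value?" problem is to wrap every stored value in a small envelope that carries an explicit type tag, with blob data base64-encoded so it survives a text format like JSON. This is a format sketch of my own, not a standard:

```python
import base64
import json

def serialize(value) -> bytes:
    """Wrap a value in a JSON envelope with an explicit type tag."""
    if isinstance(value, bytes):
        data = base64.b64encode(value).decode("ascii")
        envelope = {"type": "blob", "data": data}
    elif isinstance(value, bool):  # check bool first: bool is a subclass of int
        envelope = {"type": "boolean", "data": value}
    elif isinstance(value, (int, float)):
        envelope = {"type": "number", "data": value}
    else:
        envelope = {"type": "string", "data": str(value)}
    return json.dumps(envelope).encode("utf-8")

def deserialize(payload: bytes):
    """Recover the original typed value from an envelope."""
    envelope = json.loads(payload)
    if envelope["type"] == "blob":
        return base64.b64decode(envelope["data"])
    return envelope["data"]

round_tripped = deserialize(serialize(b"\x89PNG\r\n"))
print(round_tripped == b"\x89PNG\r\n")  # True
```

Note that base64 inflates blobs by roughly a third; for large files it is usually better to store the raw bytes as the object body and keep the type information alongside it, for example in the object's Content-Type or user-defined metadata, which S3 supports natively.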

3. Handling Transactions

How can you achieve ACID-like transactions? (e.g., store a username ONLY if associated data is also stored).

The easiest workaround is to package the transactional data together into a single file/object with conditional writes. If it's stored as one unit, a failure means none of it is written, and a success means all of it is written.

If your storage backend supports strong read-after-write consistency (which S3 now does natively), you can implement custom locking mechanisms. You can write a "lock" key that signals a certain area of the datastore is being written to, forcing other reads/writes to wait until the operation finishes and the lock key is removed. However, doing this manually over HTTPS can be slow and expensive.
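The lock-key protocol only needs one primitive from the backend: "put this key if it does not already exist" (real S3 offers this through conditional writes). The sketch below simulates that primitive with an in-memory dict to show the protocol's shape; all names are illustrative, and a production version would also need lock expiry for crashed writers.

```python
# Lock-key protocol over a "put if absent" primitive.
# The dict stands in for the object store.

class LockError(Exception):
    pass

store = {}  # key -> bytes

def put_if_absent(key: str, value: bytes) -> None:
    """Fail if the key already exists (S3: a conditional write)."""
    if key in store:
        raise LockError(f"{key} already exists")
    store[key] = value

def transactional_write(lock_key: str, writes: dict) -> None:
    put_if_absent(lock_key, b"locked")     # acquire: fails if someone holds it
    try:
        for key, value in writes.items():  # the "transaction" body
            store[key] = value
    finally:
        del store[lock_key]                # release, even if a write failed

transactional_write("locks/users-jane", {
    "users/jane": b"...",
    "index/by-name/jane": b"users/jane",
})
print(sorted(store))  # ['index/by-name/jane', 'users/jane']
```

A competing writer that tries to acquire the same lock key while it exists simply gets a failure and must retry, which is how the backend's read-after-write consistency is turned into mutual exclusion.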

AWS Lake Formation offered transactional capabilities across the various ways you query data from S3 (Athena, Glue, Redshift), which was usually a better route than building custom locks; however, it is now deprecated.

Table formats such as Iceberg also offer transactions.

Final Design Considerations

If you plan to use S3 as your main database for a production application, always think about the requirements, costs, and constraints. Keep a close eye on the operations you are executing and the associated costs—both on the S3 billing side and the compute overhead on your client side. Map out your read/write patterns beforehand. S3 is a highly reliable and infinitely scalable storage solution, but it only acts as a performant database if you design your data structures and read/write patterns explicitly for it.