Implementing an S3 File Index with DynamoDB and S3 Events
In one of our projects, we had a couple of thousand JSON files stored in an AWS S3 bucket which we needed to filter based on certain criteria. DynamoDB was an obvious choice for this index since we already knew its access patterns. While searching for an optimal way to deliver file changes to the index in this scenario, we found an elegant event-based approach using S3 Events.
S3 events are generated when an object in an S3 bucket is created, updated, or deleted. These events can be delivered to SNS, SQS, EventBridge, or a Lambda function. They are delivered at least once, and while no upper bound on delivery time is guaranteed, they usually arrive within seconds. These constraints mean that the index operations need to be idempotent and that we have to accept eventual consistency, which was acceptable for our scenario. Another limitation is that a given prefix in a bucket can only have a single receiver configured for a given event type.
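For illustration, here is a minimal boto3 sketch of wiring object-level events for a prefix to a Lambda function; the bucket name, prefix, and function ARN are placeholders, and the permission allowing S3 to invoke the function is assumed to already be in place:

```python
import boto3

s3 = boto3.client("s3")

# Route create/overwrite/delete events under the data/ prefix to a Lambda
# function. Note that updates arrive as ObjectCreated events, since updating
# an S3 object is a PUT that overwrites it.
s3.put_bucket_notification_configuration(
    Bucket="my-data-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:index-updater",
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "data/"}]}
                },
            }
        ]
    },
)
```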
S3 events delivered directly to the handler
The first iteration of building this file index was to deliver these notifications directly to a Lambda function and let it update the DynamoDB table with the relevant index details.
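A minimal handler sketch, assuming a DynamoDB table named file-index keyed by the object key (both hypothetical names). Writes and deletes keyed on the object key are naturally idempotent, which fits the at-least-once delivery of S3 events:

```python
import urllib.parse

import boto3

table = boto3.resource("dynamodb").Table("file-index")  # hypothetical table name


def handler(event, context):
    # A single S3 notification can carry multiple records.
    for record in event["Records"]:
        # Object keys arrive URL-encoded in the event payload.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if record["eventName"].startswith("ObjectRemoved"):
            # Deleting by key is idempotent: re-delivered events are harmless.
            table.delete_item(Key={"objectKey": key})
        else:
            # Upsert the index entry; a re-delivered event writes the same item.
            table.put_item(
                Item={
                    "objectKey": key,
                    "size": record["s3"]["object"].get("size", 0),
                    "lastModified": record["eventTime"],
                }
            )
```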
S3 events delivered through a queue
Due to the frequently changing nature of the JSON files, in the second iteration we opted to deliver these events to an SQS queue first, decoupling the event rate from the Lambda concurrency requirements.
S3 events are buffered in the queue, and when we set the Lambda as the consumer we can limit its concurrency. For example, in our case we opted for a maximum Lambda concurrency of 2, combined with a batch window of 1 minute and a batch size of 5 messages.
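These knobs map onto the SQS event source mapping parameters. A sketch with boto3, using placeholder ARNs and the values from our setup:

```python
import boto3

lambda_client = boto3.client("lambda")

# Attach the queue to the Lambda with the settings described above:
# batches of up to 5 messages, a 1-minute batching window, and at most
# 2 concurrent Lambda instances consuming the queue.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:s3-events-queue",
    FunctionName="index-updater",
    BatchSize=5,
    MaximumBatchingWindowInSeconds=60,
    ScalingConfig={"MaximumConcurrency": 2},
)
```

One thing to keep in mind: when the Lambda consumes from SQS, each SQS record's body is the S3 event serialized as a JSON string, so the handler has to json.loads(record["body"]) before reading the S3 records.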
S3 events delivered through an SNS topic
What if you want different handlers for different S3 prefixes? While this can be achieved by ignoring irrelevant messages in the handlers themselves, that approach is often inefficient as it consumes Lambda GB-seconds and concurrency unnecessarily.
This is where SNS topics come into play. SNS allows filter-based message delivery through subscription filter policies. Until recently these policies could only filter on message attributes, but SNS now supports filtering on the message body as well.
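Since the S3 event nests the object key under Records[].s3.object.key, a body-scoped filter policy can match on a key prefix. A sketch with boto3, using placeholder ARNs and an assumed images/ prefix:

```python
import json

import boto3

sns = boto3.client("sns")

# Subscribe a queue to the topic, delivering only events whose object key
# starts with "images/". FilterPolicyScope=MessageBody applies the policy
# to the message payload rather than to message attributes.
sns.subscribe(
    TopicArn="arn:aws:sns:eu-west-1:123456789012:s3-events-topic",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:eu-west-1:123456789012:image-events-queue",
    Attributes={
        "FilterPolicyScope": "MessageBody",
        "FilterPolicy": json.dumps(
            {"Records": {"s3": {"object": {"key": [{"prefix": "images/"}]}}}}
        ),
        # Deliver the raw S3 event so consumers need not unwrap the SNS envelope.
        "RawMessageDelivery": "true",
    },
)
```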
This pattern has the additional advantage of delivering the same S3 event to multiple consumers if you want to do something other than indexing with it (e.g. invalidate a cache path). While the same is technically possible with SQS, it is not considered best practice, as a queue should ideally serve a single purpose.
S3 events with fan-out pattern
We can combine the above two patterns to have buffered message delivery with different handlers for different paths.
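Sketching the combined wiring with the same placeholder names: the bucket publishes every object-level event to a single SNS topic, and each prefix then gets its own filtered queue subscription (as in the previous section) and its own event source mapping (as in the SQS section):

```python
import boto3

s3 = boto3.client("s3")

# Publish all object-level events to one SNS topic; per-prefix filtering
# happens at the SNS subscription level instead of inside the handlers.
s3.put_bucket_notification_configuration(
    Bucket="my-data-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:eu-west-1:123456789012:s3-events-topic",
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    },
)
```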
Conclusion
S3 events are a useful tool for indexing files in S3 buckets in an event-driven manner. AWS DynamoDB (DDB) serves as a good indexing database if we know the access patterns for the index beforehand. Based on the different destination options available for S3 events and on the indexing requirements, we can formulate different event delivery pipelines.
If we want a simple setup with minimal delay in event delivery, we can deliver S3 events directly to a Lambda handler and let the handler update the DDB table.
If the files in the bucket change frequently and we want to decouple the Lambda handler from the S3 event rate, we can buffer the events using an SQS queue.
If we want to handle different S3 prefixes with different handlers, or deliver the same S3 event to multiple handlers, we can use an SNS topic with subscription filters.
S3 events are not without the limitations mentioned in the introduction of this article. Before adopting them, you should consider whether those constraints are acceptable for your requirements.