Controlling Expenses on Cloud Storage for Large-Scale Data Applications

In today's data-driven world, companies across various sectors rely on high-capacity, highly scalable data storage solutions such as Amazon S3, Google Cloud Storage, and Azure Blob Storage. These services have become indispensable, especially for projects involving AWS Glue, Databricks, and cloud migrations of the kind implemented by firms such as UBS and Commerzbank, and for teams working with AI, customer data platforms, and real-time analytics.

One of the key considerations when using these cloud-based storage services is understanding the cost factors. Amazon S3 pricing, for instance, is made up of six cost components; the ones that matter most for large-scale data applications are the overall size of the storage used, data transfer into, out of, or within the cloud, and API requests such as PUT and GET operations.

When it comes to large files, Amazon S3 recommends multipart upload for anything larger than 100 MB. In a multipart transfer, a file is split into parts that are uploaded or downloaded concurrently to speed up the transfer of large files. The downside for cost is that each part is billed as a separate API request, so a single large file can generate many PUT or GET charges instead of one. Keep in mind that client libraries such as Boto3 use multipart transfers by default once a file exceeds a size threshold, so these per-part charges can accumulate without being obvious; if limiting request costs is the priority, the threshold can be raised to force single-part transfers, as sketched below.
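The sketch below is a minimal illustration of tuning Boto3's managed transfers, assuming a placeholder bucket and file names; it raises the multipart threshold so that files below 5 GB are sent and fetched as single requests.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Boto3 switches to multipart transfers above ~8 MB by default.
# Raising the threshold forces single-part PUT/GET requests for files
# up to the chosen size (a single PUT is limited to 5 GB by S3).
single_part_config = TransferConfig(multipart_threshold=5 * 1024**3)

s3.upload_file(
    Filename="samples_batch_0001.bin",          # placeholder file name
    Bucket="my-data-bucket",                     # placeholder bucket name
    Key="batches/samples_batch_0001.bin",
    Config=single_part_config,
)

s3.download_file(
    Bucket="my-data-bucket",
    Key="batches/samples_batch_0001.bin",
    Filename="/tmp/samples_batch_0001.bin",
    Config=single_part_config,
)
```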

A compromise you might consider is to group many samples into each uploaded file while keeping individual samples accessible through an index file that records the location of every sample. This technique saves money not only on PUT and GET calls but on every Amazon S3 cost component that depends on the number of objects rather than on the overall size of the data; one possible implementation is sketched below.
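The following is only a minimal sketch of that index-file idea, with placeholder bucket and key names and the assumption that each sample is already serialized as bytes. It stores the byte range of every sample in a small JSON index next to the grouped object and uses an S3 ranged GET to fetch a single sample.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"  # placeholder bucket name

def upload_group(samples: list[bytes], key: str) -> None:
    """Concatenate samples into one object and store a byte-offset index."""
    offsets, blob, pos = [], bytearray(), 0
    for sample in samples:
        offsets.append({"start": pos, "end": pos + len(sample) - 1})
        blob.extend(sample)
        pos += len(sample)
    s3.put_object(Bucket=BUCKET, Key=key, Body=bytes(blob))
    s3.put_object(Bucket=BUCKET, Key=key + ".index.json",
                  Body=json.dumps(offsets).encode())

def read_sample(key: str, sample_idx: int) -> bytes:
    """Fetch a single sample with a ranged GET, using the stored index."""
    index = json.loads(
        s3.get_object(Bucket=BUCKET, Key=key + ".index.json")["Body"].read()
    )
    entry = index[sample_idx]
    resp = s3.get_object(
        Bucket=BUCKET, Key=key,
        Range=f"bytes={entry['start']}-{entry['end']}",
    )
    return resp["Body"].read()
```

Note that a ranged GET is still billed as a GET request, so the savings come chiefly from cutting the number of PUTs and per-object costs, and from reading whole groups at once during batch transformations rather than sample by sample.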

Let's consider a data transformation application that acts on 1 billion data samples. If every sample is stored as its own object, a simple calculation shows that the Amazon S3 API calls alone can tally a bill of about $5,400. By grouping the samples into 2 million files of 500 samples each and applying the transformation without multipart data transfer, the cost of the API calls drops to about $10.80, a reduction of roughly 99.8% (the arithmetic is sketched below).
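One way to reproduce these figures is to count one PUT and one GET per object at S3 Standard request prices of roughly $0.005 per 1,000 PUTs and $0.0004 per 1,000 GETs (illustrative values; check the current pricing for your region and storage class):

```python
# Back-of-the-envelope check of the figures above.
PUT_PER_1K = 0.005    # assumed price per 1,000 PUT requests
GET_PER_1K = 0.0004   # assumed price per 1,000 GET requests

def request_cost(num_objects: int) -> float:
    """One PUT (write) and one GET (read back for transformation) per object."""
    return num_objects / 1_000 * (PUT_PER_1K + GET_PER_1K)

per_sample = request_cost(1_000_000_000)   # one object per sample
grouped    = request_cost(2_000_000)       # 500 samples per object

print(f"one object per sample: ${per_sample:,.2f}")    # $5,400.00
print(f"grouped (500/object):  ${grouped:,.2f}")       # $10.80
print(f"reduction: {(1 - grouped / per_sample):.1%}")  # 99.8%
```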

In conclusion, optimising cost in cloud storage requires understanding all potential cost factors and designing your data storage around your specific data needs and usage patterns. By grouping samples into larger files and running transformations on batches of samples, companies can significantly reduce their cloud storage costs.
