AWS S3 is a block storage space and for our projects, we have been using it both as a data lake as well as a storage space for our structured data. We very recently made use of the AWS S3 Select functionality. S3 Select gave us the ability to query our data in place and can help reduce data transfer costs on aws as you do not need to download an entire object. Apart from reduction of data transfer cost, the other major uplift has been on efficiency w.r.t time taken to get to the results. We are seeing closer to 300% uplift in efficiency.
With AWS S3 Select, you are able to get a few rows or a few data points instead of getting the entire object from storage and then doing the processing yourself. This helps you get efficiency because of lower bandwidth requirement and also lower processing as you can now directly get the result, instead of the object (from which you will then find the result).
What is AWS S3 Select?
In order to understand S3 Select, think of a scenario where you have an object residing in your AWS S3 bucket and for the sake of argument, let’s say it a csv file containing names of employees, their employee id , employee grades and salary per month. Let’s say you want to query all those employees who are part of a certain Employee Grade. Without S3 Select you will need to first download the entire file from S3, load it in your application and then pick up the records which meet your criteria (detailed bit more under the Application flow with AWS S3 section). With S3 Select however, you can write a simple SQL query and have the specific records returned to you. A two-step (or depending on the complexity of the architecture, a multi-step) process is reduced to a single step process because of the “Query in place” that we are able to achieve on the object residing within S3. The other S3 tools that can do Query in place (and are also recommended services to use if you wish to build an Analytics service) are AWS Redshift Spectrum and AWS Athena. S3 Select is designed more from selective retrieval of data from an object.
Application flow with AWS S3 (without Select)
While the actual application logic is fairly complex to detail out here, i am giving below a schematic representation of the Application we were using earlier. In our case, while we were using ec2 for application, once we started using S3 Select, we have now moved this entire setup to a Lambda function. This same python application for instance can be sitting anywhere (we also have processes running map reduce with very similar architecture abstraction)
- Download object from AWS S3.
- Read the object by loading it into a dataframe.
- Do the necessary value fetch and process next step.
Application Flow with AWS S3 Select
With AWS S3 Select, our flow has got simplified as follows
- We make a call to AWS S3 using the SELECT call.
- We pass the necessary SQL
- We get the result and we use that result.
So in short, no more downloading the file, extracting the file, processing it and then getting the result.
For S3 Select to work, your objects should be stored in a CSV, JSON or Apache Parquet format. One other very interesting thing is that S3 Select also supports compression using GZIP or BZIP2 on CSV and JSON objects. Compressed Parquet is not supported.
With Amazon S3 Select, you only retrieve the subset of data that you need from your object, you can expect to get better performance when objects are compressed and larger in size.
As per some data published by AWS, for a test CSV object of 133,975,755-byte (~128 MB) consisting of 1 million lines. In testing such a file, the file size was reduced by ~60% down to 50,308,104 bytes (~50.3 MBytes) with GZIP compression.
Using S3 Select with Python and AWS SDK
S3 Select is available through AWS SDK, AWS CLI and REST API. You can use the latest boto library and get started with it.
Accessing S3 Select using AWS Console
While its possible to use S3 Select using AWS SDK, Rest API and also the CLI, its also available as a feature on AWS Console. If you are using the AWS Console, you can extract upto 40 MB of records from an object which is 128 MB in size. Here are the steps you need to follow to use S3 Select from AWS Console
a. First head to the bucket where you have the object. In my case, i have created a demo bucket and a csv file isstored there. First select the file and then click on the menu item “Actions”.
b. Next select “Query with S3 Select” from the drop down.
c. You next specify your input and output settings. Our object is a csv file which is comma separated and has no compression. You will need to select appropriate settings for your object.
d. Next step after this is to write your SQL Query. Do note only SELECT is allowed. In this example here we have done a SELECT * with a limit of 5, which returns the first 5 results to the console.
Once you are happy with the results, you can also download the results as a CSV or JSON depending on what you selected under Output settings.
What kind of SQL does S3 Select support?
S3 Select supports standard SQL but there is no support for GROUPBY, JOIN. One thing you need to understand here is that AWS has provided this feature because probably their goal with SELECT is to help with data retrieval and not Analytics. Therefore if your goal is to do analytics directly, i will suggest either build your own pipeline or you try out Athena.
How are you Charged with AWS S3 Select?
You are charged for the amount of data scanned and the amount of data returned. One feature of S3 select that can help you, if you understand your data well is using selective scanning. With Selective Scanning, if you have a 500 megabyte file and you only scan for the first 50 megabytes, you will only be charged for 50 and not 500.
Originally published at https://www.techandproduct.com.