Optimising AWS Glue Efficiency: Enhancing Performance with File Consolidation Strategies

Shah Shweta
Apr 20, 2024


In this digital era, data plays a crucial role in almost every decision-making process and in the smooth running of business operations. With the volume of generated data increasing day by day, it becomes necessary to manage and process this data effectively to derive meaningful business insights. This is where AWS Glue comes into the picture: it offers a serverless data integration service that makes it easy to prepare and load data for analytics.

A common challenge is dealing with a large number of small files, which can hurt the performance of data processing jobs. In this post, we will explore how merging small files into larger ones using AWS Glue can significantly enhance job performance and reduce costs.

Merging multiple small files into larger ones boosts performance and reduces costs in AWS Glue jobs. When processing data in AWS Glue, the number and size of files can have a remarkable impact on job performance and cost. A large number of small files leads to inefficiencies, as each file incurs overhead to open, read, and process, resulting in longer job runtimes and higher costs. By combining these small files into larger ones, we reduce the number of files processed, thereby improving performance and optimizing costs.

Merging multiple small files into larger ones offers several benefits:

1. Reduced overhead: Processing a large number of small files incurs overhead for opening and closing files, fetching metadata, reading data, and managing other resources. Consolidating small files into larger ones minimizes this overhead, which leads to faster job execution.

2. Improved parallelism: When processing larger files, AWS Glue can leverage its parallel processing capabilities to divide the workload across multiple nodes, resulting in improved performance.

3. Cost savings: With fewer files to process, we can reduce the resources required for data processing in AWS Glue, leading to savings in both compute and storage.

How to merge small files into larger files using AWS Glue

1. groupFiles and groupSize parameters: To improve the performance of ETL tasks in AWS Glue, it is important to configure our job parameters appropriately. One key aspect is the grouping of files within an S3 data partition. Setting the groupFiles parameter to inPartition allows Glue to automatically group multiple input files. In addition, we can set the groupSize parameter to define the target size of each group in bytes.

To set up these parameters, follow these steps in the AWS Glue console:

1. Go to the AWS Glue console -> Tables, and then select the table you are working on.

2. Click on the “Edit Table” option to edit the table properties.

3. In the table properties, add the following parameter values:

a. groupFiles = inPartition

b. groupSize = 209715200

By setting groupSize to 209715200 (200 × 1024 × 1024 bytes), we define the target size of each group of files read within a partition as 200 MB. This configuration helps improve read throughput during data processing.

After configuring the above grouping parameters, we can run an AWS Glue job and validate the impact on the data processing workflow.
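For example, a minimal job script reading that table might look like the following sketch. The database and table names (sales_db, raw_events) are hypothetical; also note that the same groupFiles and groupSize values can be passed at read time via additional_options instead of table properties.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Minimal sketch: read a catalog table with file grouping applied at read time.
# "sales_db" and "raw_events" are hypothetical names; substitute your own.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_events",
    # Equivalent to the table properties above: group input files within
    # each S3 partition into chunks of roughly 200 MB.
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "209715200",  # 200 * 1024 * 1024 bytes
    },
)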

If we are reading from AWS S3 directly using the create_dynamic_frame.from_options method, we can add these connection options.

For example, the following attempts to group files into 1 MB groups.

df = glueContext.create_dynamic_frame.from_options(
    "s3",
    {"paths": ["s3://s3path/"], "recurse": True, "groupFiles": "inPartition", "groupSize": "1048576"},
    format="csv",
)

For more information on the groupFiles and groupSize parameters for reading input files in larger groups, refer to the AWS Glue documentation.

Running AWS Glue jobs with the groupFiles and groupSize parameters can remarkably improve the performance and efficiency of data processing workflows. By fine-tuning these parameters, we can achieve faster execution and reduce the complexity of ETL tasks.

2. Coalesce or Repartition: To improve the efficiency of data writing, use coalesce() or repartition() before saving data to S3.

What are coalesce and repartition?

Coalesce() — coalesce() reuses existing partitions to reduce the amount of data that is shuffled. It can produce partitions of quite different sizes, but because it avoids a full shuffle, it generally performs better than repartition() when decreasing the number of partitions.

Repartition() — repartition() creates new partitions and performs a full shuffle of the data, producing roughly equal-sized partitions compared with coalesce(). It is generally recommended when increasing the number of partitions, since that requires shuffling all the data anyway.
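To see the difference concretely, the following plain PySpark snippet (using an arbitrary in-memory dataset) shows how each call changes the partition count:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start with a dataset spread over 100 partitions (arbitrary example).
df = spark.range(1_000_000).repartition(100)
print(df.rdd.getNumPartitions())                   # 100

# coalesce() merges existing partitions; no full shuffle.
print(df.coalesce(10).rdd.getNumPartitions())      # 10

# repartition() performs a full shuffle into equal-sized partitions.
print(df.repartition(200).rdd.getNumPartitions())  # 200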

The catch here is to adjust the number of partitions to control the number and size of output files written to S3. We can achieve this with coalesce() or repartition(), and a general formula can be applied to calculate a suitable number of partitions.

For example, if our input size is 1 GB and we want each output file to be about 100 MB, the formula is:

NumberOfPartitions = input size / target file size = 1000 MB / 100 MB,

resulting in 10 partitions.

For example, the following code shows how to use coalesce() or repartition().

df.coalesce(NumberOfPartitions).write.parquet(outputPath)

df.repartition(NumberOfPartitions).write.parquet(outputPath)
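Putting the formula and the write step together, a minimal end-to-end sketch might look like this. The S3 paths are hypothetical, and the input size is hard-coded; in practice it could be computed by summing the input object sizes in S3 (for example, with boto3).

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TARGET_FILE_MB = 100   # desired size of each output file
input_size_mb = 1000   # assumed total input size (1 GB); in practice,
                       # sum the sizes of the input objects in S3

# NumberOfPartitions = input size / target file size
num_partitions = max(1, math.ceil(input_size_mb / TARGET_FILE_MB))

df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical path
df.coalesce(num_partitions).write.mode("overwrite").parquet(
    "s3://my-bucket/output/"                      # hypothetical path
)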

By managing the number of output files this way, we can effectively control file sizes in a data processing workflow.

In conclusion, merging small files into larger ones in AWS Glue can have a significant impact on job performance, considerably reduce runtime, and optimise costs. By merging small files, we streamline data processing, improve parallelism, and achieve cost savings. As data continues to grow in volume and complexity, leveraging AWS Glue to manage and process data efficiently is essential to stay competitive in today's data-driven environment.
