
Apache Spark JUnit Performance Improvement

  • Writer: Vishakh Rameshan
  • Dec 30, 2020
  • 2 min read

Updated: Jan 1, 2021


Recently I was working on a batch application that ran a series of Spark jobs implementing complex business logic. I had written a JUnit test class for each business class and its methods. While testing individual classes everything ran swiftly, taking a couple of seconds to complete. After I developed an entire module, I ran mvn clean install to confirm all the JUnit tests were passing before committing my code to GitHub.


I found something very interesting while doing mvn clean install on my local machine: the entire module ran for 2 hours. I thought this might be down to my local machine's configuration, so I committed to a feature branch and raised a pull request, which triggered the Jenkins job. Not so surprisingly, the Jenkins job took 3 hours to complete, as the jobs ran on worker pods in a Google Kubernetes Engine cluster.




With more modules containing business logic still to be developed, the Jenkins pipeline that deploys the jars to Nexus would eventually take a day or more to complete. So I decided to find the root cause of the slowness in my application module and fix it before merging to the development branch.





The following were my findings and the solutions I used, which reduced the time to run the entire module from 2 hours to about 15 minutes locally (and approximately 20 minutes in Jenkins).


  • Make the Spark Session a Singleton

If you have multiple test classes, one per business class, always share a single Spark session between them: creating and tearing down a Spark session is very time consuming, and a session per test class multiplies that cost.


import org.apache.spark.sql.SparkSession;

public class SessionBuilder {

  private SessionBuilder() {}

  // One shared session for every test class, initialized once when the class is first loaded
  public static final SparkSession sparkSession = SparkSession
      .builder()
      .master("local[1]")
      .appName("test-app")
      .getOrCreate();
}
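
As a rough sketch of how a test class might then reuse it (the test class and the assertion here are illustrative, not from the original module):

import static org.junit.Assert.assertEquals;

import org.apache.spark.sql.Dataset;
import org.junit.Test;

public class SessionReuseTest {

  @Test
  public void filtersEvenNumbers() {
    // Reuse the shared session instead of building (and later stopping) one per class
    Dataset<Long> evens = SessionBuilder.sparkSession
        .range(10)
        .filter("id % 2 = 0");
    assertEquals(5, evens.count());
  }
}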

  • Override/Reduce Shuffle Partitions

The default number of shuffle partitions is 200, and it is static (it does not change with the input dataset size). Keeping the default can cause issues:

  • For smaller data, 200 partitions are overkill and often make processing slower because of scheduling overhead.

  • For larger data, 200 is too few and does not make effective use of all the resources in the cluster.

For test data, 200 is a particularly bad choice: the input datasets are tiny and run on a single core, so a single partition is enough.


sparkSession.conf().set("spark.sql.shuffle.partitions", 1);
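
Alternatively, the setting can be baked into the singleton builder itself so that every test picks it up automatically; a minimal variant of the SessionBuilder field above:

  public static final SparkSession sparkSession = SparkSession
      .builder()
      .master("local[1]")
      .appName("test-app")
      .config("spark.sql.shuffle.partitions", "1") // one partition is plenty for tiny test data
      .getOrCreate();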

  • Avoid Dataset Show

It is better to avoid show() and count() where possible. Spark evaluates lazily, and each of these actions forces the whole plan to execute, with show() pulling rows back to the driver, in the same way that we should avoid group-by operations as much as possible because of the shuffles they trigger. But these are not hard rules: we cannot avoid them completely, and there are scenarios where we need them.


In my case I needed show() and count() for some business requirements.
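
Where show() is only used to eyeball output, one option is to assert on a small collected sample instead; a minimal sketch (the dataset and the test names are illustrative):

import static org.junit.Assert.assertEquals;

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.junit.Test;

public class SampleAssertionTest {

  @Test
  public void checksFirstRowsWithoutShow() {
    Dataset<Row> dataset = SessionBuilder.sparkSession
        .range(100)
        .toDF("id");

    // Collect only the rows the assertion needs instead of calling show()
    List<Row> firstRows = dataset.limit(5).collectAsList();
    assertEquals(5, firstRows.size());
  }
}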


  • Use persist() or cache()

persist() and cache() improve performance by keeping a computed dataset in memory or on disk, so that actions further down the line reuse the cached result instead of recomputing the whole lineage each time.


dataset.persist(StorageLevel.MEMORY_AND_DISK());

In my case some of the business test classes used show() heavily, so I called persist() first and then show(), which greatly reduced the time. A sketch of the pattern follows.
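
A minimal sketch of that persist-then-act pattern (the helper method and the logging are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {

  static void inspect(Dataset<Row> dataset) {
    // Mark the dataset for caching; partitions are cached as actions compute them
    dataset.persist(StorageLevel.MEMORY_AND_DISK());

    dataset.show();              // action: computes the plan and caches computed partitions
    long rows = dataset.count(); // reuses cached partitions instead of recomputing from source
    System.out.println("rows = " + rows);

    dataset.unpersist();         // release the cache once the test is done with it
  }
}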
