Spark File Format Conversion Utility
- Vishakh Rameshan
- Jan 3, 2021
- 1 min read
I am a Data Engineer who runs business operations on Big Data files in formats such as Avro, Parquet, CSV and text on HDFS. The processed data is then made available to Data Scientists or Data Stewards.
Writing a Spark job is easy compared with testing it and creating mock test data. It becomes even harder when the input and output files need to be encrypted and decrypted. Avro and Parquet files are binary and not human-readable, so mock data has to be created in CSV or text and then converted to Avro or Parquet. If the test cases fail, the input files have to be converted back, or the CSV files edited and converted to Avro again, which is tedious.
To make this work easier for me and my colleagues, I have created a utility that runs independently as a Spring Boot application with Spark integrated and has a Swagger UI to interact with.
GitHub Code - https://github.com/Hitman007IN/Spark_Conversion
The utility currently supports the following conversions:
PARQUET Conversion
TEXT to PARQUET
CSV to PARQUET
AVRO to PARQUET
CSV Conversion
TEXT to CSV
AVRO to CSV
PARQUET to CSV
AVRO Conversion
TEXT to AVRO
CSV to AVRO
PARQUET to AVRO
TEXT Conversion
CSV to TEXT
AVRO to TEXT
PARQUET to TEXT
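The list above covers every pair of distinct formats among TEXT, CSV, AVRO and PARQUET. As a minimal sketch of how such a support matrix could be checked before a job is submitted, here is a small plain-Java example. The names `FileFormat`, `ConversionMatrix`, `isSupported`, `sourcesFor` and `describe` are hypothetical and are not taken from the Spark_Conversion repository; the `sparkName` values are the short format names Spark accepts, with `avro` assuming the external spark-avro module is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not the repository's code.
enum FileFormat {
    TEXT("text"), CSV("csv"), AVRO("avro"), PARQUET("parquet");

    // Short name Spark accepts in read().format(...) and write().format(...).
    // "avro" additionally needs the external spark-avro module.
    final String sparkName;

    FileFormat(String sparkName) {
        this.sparkName = sparkName;
    }
}

final class ConversionMatrix {
    private ConversionMatrix() {}

    // Every pair of distinct formats appears in the list of supported conversions.
    static boolean isSupported(FileFormat source, FileFormat target) {
        return source != target;
    }

    // Formats that can be converted into the given target, in declaration order.
    static List<FileFormat> sourcesFor(FileFormat target) {
        List<FileFormat> sources = new ArrayList<>();
        for (FileFormat candidate : FileFormat.values()) {
            if (isSupported(candidate, target)) {
                sources.add(candidate);
            }
        }
        return sources;
    }

    // Label in the same "X to Y" style used in the list.
    static String describe(FileFormat source, FileFormat target) {
        if (!isSupported(source, target)) {
            throw new IllegalArgumentException("Unsupported conversion: " + source + " to " + target);
        }
        return source + " to " + target;
    }
}
```

In a Spark job, the `sparkName` values could be passed to `spark.read().format(...)` and `dataset.write().format(...)`. Format-specific details, such as CSV header options or the text writer's requirement of a single string column, are left out of this sketch.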
Interact with Swagger UI

Note: this utility does not support complex Avro schemas.