Convert CSV File to Parquet
This is a sample Scala application that performs the following:
- Reads a WSPRnet CSV from an input path, e.g. /data/wspr/csv/wsprspots-2020-02.csv
- Writes a Parquet file set to an output path, e.g. /data/wspr/parquet/2020/02
- If you re-run the script, the output Parquet directory will be overwritten.
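The behavior above can be sketched in Spark Scala roughly as follows. This is a minimal sketch, not the repository's actual source; the schema's column names are illustrative assumptions based on the WSPRnet CSV layout:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ConvertCsvToParquetSketch {
  def main(args: Array[String]): Unit = {
    val Array(inFile, outDir) = args

    // Create a Spark Session
    val spark = SparkSession.builder
      .appName("ConvertCsvToParquet")
      .getOrCreate()

    // Create the spot schema (column names are illustrative assumptions)
    val spotSchema = StructType(List(
      StructField("SpotID", LongType, nullable = false),
      StructField("Timestamp", IntegerType, nullable = false),
      StructField("Reporter", StringType, nullable = false)
      // ... remaining WSPR spot columns would follow here
    ))

    // Read the CSV file (WSPRnet archives have no header row)
    val ds = spark.read
      .option("header", "false")
      .schema(spotSchema)
      .csv(inFile)

    // Write the Parquet file set; "overwrite" mode is what replaces
    // the output directory on a re-run
    ds.write.mode("overwrite").parquet(outDir)

    spark.stop()
  }
}
```

Supplying an explicit schema rather than inferring one avoids an extra pass over a ~4 GB file, which matters at 47 million rows.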
File Specs
The specs on the test file are:
- Test File : wsprspots-2020-02.csv
- Rows : 47,310,649 spots
- File Size Decompressed : 3.964 GB
Build and Run
Run the following commands in order, and check your results.
#
# All commands are run from a terminal
#
# change the download location to whatever you prefer
cd ~/Downloads
wget -c http://wsprnet.org/archive/wsprspots-2020-02.csv.gz
gzip -dk wsprspots-2020-02.csv.gz
# set the path of the downloaded and extracted CSV file
infile=$PWD/wsprspots-2020-02.csv
outdir=$PWD/wspr/parquet/2020/02
# clone the repo
git clone https://github.com/KI7MT/wspr-analytics.git
# change directories and build the assembly
cd ./wspr-analytics/scala/ConvertCsvToParquet
# clean and build
sbt clean assembly
# Run the following command
# NOTE : replace 16 in local[16] with half of your total CPU count
spark-submit --master local[16] target/scala-2.12/ConvertCsvToParquet-assembly-1.0.jar $infile $outdir
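If you want to derive that worker count programmatically rather than looking it up, half the JVM-visible CPU count can be computed from the Scala REPL (a sketch; clamped to a minimum of 1):

```scala
// Derive N for --master local[N] as half the available processors
val half = math.max(1, Runtime.getRuntime.availableProcessors / 2)
println(s"--master local[$half]")
```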
Results
You should get results similar to the following:
- Out Directory : $PWD/wspr/parquet/2020/02
- Compressed Size : ~615 MB on disk
- Process Time : <= 21 sec
The example below will differ somewhat from your run due to my CSV input and output path choices.
NOTE : Processing time will depend on your system resources (CPU, RAM, etc.).
Object : ConvertCsvToParquet
Process File : /data/wspr/raw/csv/wsprspots-2020-02.csv
File Out Path : /data/wspr/raw/parquet/2020/02
Timestamp : 2020-12-28 T 04:36:29.941
Description : Convert CSV to Parquet
Process Steps to Create Parquet File(s)
- Create a Spark Session
- Create the Spot Schema
- Read the CSV file into a DataSet
- Write Parquet File(s), please wait...
Elapsed Time : 20.456 sec
Finished