Skip to content

Query Column Using Parquet Files

This is a sample Application using Scala that performs the following:

  • Reads a Parquet folder specificed by the user e.g. /data/wspr/raw/parquet/2020/02
  • Performs a Query Count on a column specified by the user e.g. Reporter
  • Reports GoubyBy Count and returns the Top (10)

File Specs

The specs on the test file are:

  • Test File : wsprspots-2020-02.csv
  • Rows : 47,310,649 spots
  • File Size Decompressed : 3.964 GB

Build and Run

Run the following commands in order, and check your results.

#
# All commands are run from a terminal
#

# NOTE: You must have previously run ConvertCvsToParquet before
#       using this app, as it is looking for a Parquet folder,
#       not a CSV file.

# set the path to where you created the Parquet partition
inFolder="/data/wspr/raw/parquet/2020/02"

# set the colum you wish to GroupBy and Count
column="Reporter"

# clone the repo
git clone https://github.com/KI7MT/wspr-analytics.git

# change directories and build the assembly
cd ./wspr-analytics/scala/QueryColumnParquet

# clean and build
sbt clean assembly

# Run the following command
# NOTE : set local[8] to half of your total CPU count. 
spark-submit --master local[8] target/scala-2.12/QueryColumnParquet-assembly-1.0.jar $inFolder $column

Results

You should get results similar to the following:

NOTE The time it takes will depend on your system resources (CPU, RAM, etc)

Application   : QueryColumnParquet
Folder        : /data/wspr/raw/parquet/2020/02
Column        : Reporter
Tiimestame    : 2020-12-28T07:36:35.389
Description   : Query Column and Count using Parquet Folders

Process Steps to Query Reporter from Parquet Files(s)
- Create a Spark Session
- Read Parquet File(s)
- Select Reporter
- GroupBy and Count Reporter
- Sort Reporter Descending
- Execute the Query

+--------+------+
|Reporter| count|
+--------+------+
|   DK6UG|838081|
|  OE9GHV|690104|
|  EA8BFK|648670|
|   KD2OM|589003|
|KA7OEI-1|576788|
|   K4RCG|571445|
|     KPH|551690|
|    K9AN|480759|
|   DF5FH|480352|
|   DJ9PC|474211|
+--------+------+
only showing top 10 rows

Elapsed Time : 2.843 sec

Query by Reporter

For Comparrison, here is the data from 2020-11. November 2020 had over 70 Million reports.

Application   : QueryColumnParquet
Folder        : /data/wspr/raw/parquet/2020/11
Column        : Reporter
Tiimestame    : 2020-12-28 T 07:42:09.275
Description   : Query Column and Count using Parquet Folders

Process Steps to Query Reporter from Parquet Files(s)
- Create a Spark Session
- Read Parquet File(s)
- Select Reporter
- GroupBy and Count Reporter
- Sort Reporter Descending
- Execute the Query

+--------+-------+
|Reporter|  count|
+--------+-------+
|  EA8BFK|1120739|
|  OE9GHV|1103335|
|   WA2TP| 847124|
|   KD2OM| 834896|
|  IW2NKE| 818315|
|   LX1DQ| 803347|
|   DK6UG| 794675|
|KA7OEI-1| 748356|
|   ON5KQ| 744727|
|  OE9HLH| 733580|
+--------+-------+
only showing top 10 rows

Elapsed Time : 2.8 sec

Query by CallSign

This query is from the CallSign column.

Application   : QueryColumnParquet
Folder        : /data/wspr/raw/parquet/2020/11
Column        : CallSign
Tiimestame    : 2020-12-28 T 07:44:16.711
Description   : Query Column and Count using Parquet Folders

Process Steps to Query CallSign from Parquet Files(s)
- Create a Spark Session
- Read Parquet File(s)
- Select CallSign
- GroupBy and Count CallSign
- Sort CallSign Descending
- Execute the Query

+--------+------+
|CallSign| count|
+--------+------+
|   DK2DB|662784|
|   K4APC|600895|
|   ON7KO|557477|
|   KD6RF|541418|
|  WA4KFZ|531006|
|   W6LVP|466346|
|   DL6NL|438741|
|   N8VIM|438260|
|   DK8JP|403667|
|   G0CCL|386570|
+--------+------+
only showing top 10 rows

Elapsed Time : 2.445 sec

Query by Version

Application   : QueryColumnParquet
Folder        : /data/wspr/raw/parquet/2020/11
Column        : Version
Tiimestame    : 2020-12-28 T 07:50:17.725
Description   : Query Column and Count using Parquet Folders

Process Steps to Query Version from Parquet Files(s)
- Create a Spark Session
- Read Parquet File(s)
- Select Version
- GroupBy and Count Version
- Sort Version Descending
- Execute the Query

+---------+--------+
|  Version|   count|
+---------+--------+
|     null|27015499|
| 1.3 Kiwi|13207118|
|    2.2.2|11199415|
|    2.1.2| 5906973|
|2.3.0-rc1| 2952306|
|    2.1.0| 2602903|
|2.3.0-rc2| 2366520|
|0.9_r4178| 2033622|
|    2.2.1| 2027954|
|    2.0.0| 1482597|
+---------+--------+
only showing top 10 rows

Elapsed Time : 2.492 sec