Top Ten Reporters¶

This is a sample application using Scala that performs the following:

Reads the Original CSV into a Spark DataFrame
Performas a Query Count on Reporters Ordered Descending
Reports Top (10) by spot count

File Specs¶

The specs on the test file are:

Test File : wsprspots-2020-02.csv
Rows : 47,310,649 spots
File Size Decompressed : 3.964 GB

If you use a different archive, make sure to you pass the relative location to the script when running.

Build and Run¶

Run the following commands in order, and check your results.

#
# All commands are run from a terminal
#

cd ~/Downloads
wget -c http://wsprnet.org/archive/wsprspots-2020-02.csv.gz
gzip -dk wsprspots-2020-02.csv.gz

# set the path of the downloaded and extracted CSV file
csvfile=$PWD/wsprspots-2020-02.csv

# clone the repo
git clone https://github.com/KI7MT/wspr-analytics.git

# change directories and build the assembly
cd ./wspr-analytics/scala/TopTenReporters

# clean and build
sbt clean assembly

# Runs the following command
spark-submit --master local[8] target/scala-2.12/TopTenReporter-assembly-1.0.jar $csvfile

Results¶

You should get results similar to the following:

NOTE The time it takes will depend on your system resources (CPU, RAM, etc)

Application  : TopTenReporter
Process File : wsprspots-2020-02.csv
Tiimestame   : 2020-12-27 T 02:36:01.265
Description  : Returns the Top Ten Reporters Grouped by Count

Process Steps for this application
- Creating the Schema
- Reading CSV into DataSet
- Selecting Reporters
- GroupBy and Count Reporters
- Sort Reporters Descending
- Query Execution

+--------+------+
|Reporter| count|
+--------+------+
|   DK6UG|838081|
|  OE9GHV|690104|
|  EA8BFK|648670|
|   KD2OM|589003|
|KA7OEI-1|576788|
|   K4RCG|571445|
|     KPH|551690|
|    K9AN|480759|
|   DF5FH|480352|
|   DJ9PC|474211|
+--------+------+
only showing top 10 rows

Query Time : 5.821 sec