
Overview

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

-- Apache Spark Project

Apache Spark is written in a programming language called Scala, which runs on the Java Virtual Machine (JVM). To allow easy integration between Spark and Python, the Spark project provides a set of Python bindings (libraries/modules/scripts) called PySpark. Other languages such as Ruby, R, and Java also have bindings that support or work with the Spark framework.
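To make that relationship concrete, here is a minimal standalone PySpark script (a sketch for illustration only; the application name is an arbitrary placeholder). The Python code builds a SparkSession, which hands the actual computation to the Spark engine on the JVM:

from pyspark.sql import SparkSession

# Entry point to the Spark engine; the app name is a hypothetical placeholder.
spark = SparkSession.builder.appName("wspr-example").getOrCreate()

# A trivial distributed computation: Spark does the work, Python drives it.
df = spark.range(10)
print(df.count())  # prints 10

spark.stop()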

It can get a bit confusing trying to distinguish the many moving parts of the Spark ecosystem. To make it easier, just remember that Spark is the foundation, and PySpark is merely a Python integration built on that foundation.

For the purposes of the WSPR Analytics project, we'll be using PySpark as an easy gateway to Jupyter Notebooks in a distributed manner. As with Spark, PySpark has its own shell when not being forwarded to Jupyter Notebooks. Either way, Spark remains the compute engine; in the shell, Python is simply the interface at the command line rather than in a web page. One common way to forward the shell to Jupyter is shown below.
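One way to do that forwarding (an illustrative setup, not a project requirement) is to point the pyspark launcher at Jupyter through its driver environment variables before starting it:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
pyspark   # now launches a notebook server instead of the command-line shell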

You can see from the shell session below that I was running Anaconda Python v3.7.9 with Spark v3.0.1. You can also see that the pyspark shell creates a SparkSession (spark), along with its underlying SparkContext (sc), for us upon entry. This will become clearer when executing the example scripts.

Python 3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Python version 3.7.9 (default, Aug 31 2020 12:42:55)
SparkSession available as 'spark'.

>>> spark.version
'3.0.1'

>>> sc.version
'3.0.1'
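As a quick illustration of those two entry points (example one-liners, not project code), both drive the same Spark engine from the prompt above:

>>> spark.range(5).count()            # DataFrame API via the SparkSession
5
>>> sc.parallelize([1, 2, 3]).sum()   # RDD API via the SparkContext
6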

Additional Resources

There are many high-quality sites that cover all aspects of PySpark in great detail: