Apache Spark
Overview
Spark is a unified analytics engine for large-scale data processing. If you need to process data at scale, Spark is a fast and efficient tool for the job.
What is it?
Spark is a big data processing platform and the de facto successor to Hadoop's now-moribund MapReduce framework. Another Hadoop alternative is the Apache Flink project, though Flink is designed primarily for working with streams of data.
Spark is written in Scala and runs on the JVM. There is also good support for Python, using the PySpark library. Spark is being actively developed and improved.
Who is it for?
Software engineers and data engineers who need to process large amounts of data.
When should you use it?
Spark distributes data processing across a cluster of machines. If you have to process more data than will fit into the memory of a single machine, Spark may be useful. Spark holds data in DataFrames, which can be thought of as similar to tables, with operations applied to selections of columns and rows. Queries are optimised and compiled by Spark's Catalyst and Tungsten engines, which makes querying and transforming large amounts of data very efficient, and PySpark can use Apache Arrow to move data efficiently between the JVM and Python.
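As a rough sketch of what this looks like in PySpark (the file path and column names below are made up for illustration), a DataFrame is read in and then queried through column selections, row filters and aggregations:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a Spark session.
    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read a CSV file into a DataFrame, inferring column types from the data.
    df = spark.read.csv("s3://my-bucket/transactions.csv", header=True, inferSchema=True)

    # Select columns, filter rows and aggregate, much as you would with a table.
    summary = (
        df.select("region", "amount")
          .where(F.col("amount") > 0)
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount"))
    )

    summary.show()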
Spark has a lazy execution model, which works well for interactive use. Internally, Spark builds a graph of operations and only executes it when an action, such as collecting or writing results, requires output to be produced.
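A small illustrative example of this, using made-up data: the transformations below only extend the query plan, and nothing is computed until an action such as show(), count() or a write is called.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lazy-example").getOrCreate()

    df = spark.createDataFrame(
        [("north", 10), ("south", 25), ("north", 5)],
        ["region", "amount"],
    )

    # Transformations are lazy: these lines only build up the plan.
    filtered = df.where(F.col("amount") > 5)
    totals = filtered.groupBy("region").agg(F.sum("amount").alias("total"))

    # An action forces Spark to execute the plan and produce output.
    totals.show()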
If you need to process smaller quantities of data you may not need the complexity of Spark. If your data fits into memory you may be able to use pandas instead, which also uses DataFrames, performs very well, and requires no supporting infrastructure or configuration.
Why should you use it?
If you are already using pandas but running out of memory, or operations are taking too long to compute, then Spark could be a useful next step. There is a library called koalas that is compatible with the pandas API (with a few omissions) and translates its calls into PySpark calls. This lets you have a single codebase that can target either the pandas or the Spark back-end.
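As an illustrative sketch (the file path and column names are hypothetical; note also that in newer Spark releases the koalas API has been folded into PySpark as pyspark.pandas), pandas-style code runs against Spark like this:

    # pandas-style code executed on Spark via koalas.
    import databricks.koalas as ks

    kdf = ks.read_csv("s3://my-bucket/transactions.csv")

    # These calls look like pandas, but are translated into PySpark
    # operations and executed on the cluster.
    totals = kdf.groupby("region")["amount"].sum()
    print(totals.head())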
How do you use it?
Spark comes as part of a number of commercial offerings such as Databricks and AWS Glue. These are hosted Spark instances with extra bells and whistles.
Databricks is a platform that brings notebooks and Spark together, which can be very handy if users need to interrogate the data with ad-hoc queries in Scala, Python or R notebooks. The platform consists of a control plane that runs on CoreOS, and users with appropriate permissions can start clusters of machines to process data. Databricks has its own notebook implementation, somewhat similar to Jupyter but not quite as feature-rich. The notebook server runs as part of the control plane, and users can choose which Spark cluster their notebooks run on. This is a very powerful and flexible arrangement compared to the traditional approach of starting a cluster and running a notebook server on it.
Databricks is available as Azure Databricks or as a Private Virtual Cloud (PVC) offering in which you can host and control your own Databricks environment, although this latter approach does incur additional organisational overhead.
AWS Glue is a managed Spark service for specifying and running ETL jobs on data held in S3, DynamoDB, RDS, Redshift, databases running on EC2 instances and JDBC connections. DocumentDB and MongoDB are not currently supported. Glue can output data to Athena, EMR, S3 and Redshift. Jobs can be written in Scala or Python, and many common Python libraries are available. This built-in Python support means Glue can be used for generating analytics data and some machine learning models, as well as for data ETL. For Glue to access your data, an AWS Glue Crawler typically analyses its structure and adds it to the AWS Glue Data Catalog.
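A hedged sketch of what a PySpark Glue job can look like is below; the database, table and S3 path names are hypothetical, and it assumes a crawler has already added the source table to the Data Catalog.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Standard Glue job boilerplate: wrap the Spark context in a GlueContext.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read a table that a Glue Crawler has registered in the Data Catalog.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="my_database", table_name="transactions"
    )

    # DynamicFrames convert to and from ordinary Spark DataFrames.
    cleaned = source.toDF().where("amount > 0")

    # Write the result back to S3 as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(cleaned, glueContext, "cleaned"),
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/cleaned/"},
        format="parquet",
    )

    job.commit()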
Are there any common gotchas?
The Databricks permission model is generally not quite fine-grained enough. Separate permissions can be granted for running notebooks on clusters, stopping and starting those clusters, and modifying the cluster configuration. A similar permissions model exists for shared directories of notebooks; however, to allow users to write to a directory they must be granted permissions that also let them delete everything in it, which is not ideal from an information governance perspective.
Databricks has very basic support for Git integration with notebooks, but the UX is clunky and you have to ensure that you have a Git repo server that is accessible over the network. If you're running Databricks inside a secure environment without internet access this can be a challenge.
Databricks clusters can either be run in a mode that allows access to all of the data in a particular metastore, or they can be locked down using the table ACL feature so they can only see some of the data. R notebooks and the koalas Python library will not run on table ACL clusters, however.
Take care when using null values in Spark DataFrames. Spark follows SQL semantics for nulls, so comparing a null with = or != yields null rather than true or false, which means filters and joins can include or exclude rows in unexpected ways when the value of a key you're querying is null.
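A small sketch of the behaviour with made-up data: a plain comparison against a null key evaluates to null, while the null-safe equality operator treats two nulls as equal.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("null-example").getOrCreate()

    df = spark.createDataFrame(
        [("a", 1), (None, 2), ("b", 3)],
        ["key", "value"],
    )

    # key != 'a' evaluates to null for the null key, and filters treat
    # null as false, so the row with the null key is silently dropped.
    df.where(F.col("key") != "a").show()

    # eqNullSafe (<=> in SQL) treats two nulls as equal, so this filter
    # does return the row whose key is null.
    df.where(F.col("key").eqNullSafe(None)).show()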