What is Apache Spark?

Apache Spark is a free and open source framework for parallel, distributed data processing that enables you to process all kinds of data at massive scale.


Features and benefits of Apache Spark

Spark delivers an advanced set of capabilities for today's data engineers, data scientists, and analysts:


  • Write custom parallel, distributed data processing applications in popular languages including Python, Scala, and Java
  • Use Spark MLlib to train and evaluate machine learning models at scale, and Spark GraphX to process large graphs with high numbers of vertices and edges
  • Write SQL queries and apply common data warehousing techniques to big data with Spark SQL
  • Write continuous data processing applications that reliably process streaming data using the Spark Streaming API

Spark is widely used to build sophisticated data pipelines, composed of long-lived event stream processing applications or batch jobs that run on a schedule. Spark is also widely adopted by data scientists for data analysis and for machine learning tasks, including data preparation.


Why choose Spark?

  • Flexible and versatile

  • Processes big data

  • Support for multiple programming languages


Why do companies use Spark?


Proven solution

Apache Spark has been tested in real world deployments around the globe for more than a decade.


Massive scalability

Spark is designed to run at web scale – with horizontal scaling capabilities built in.


Large user community

Spark is a mature, actively developed project and has a vast user community.


How do companies use Spark?


Streaming data

You can use Spark to develop applications that process continuous data streams, for example for analyzing web clickstream data and providing real time insights and alerts at web scale.


Data science

Spark is a popular choice for empowering data scientists. Spark helps data scientists to prepare data, train models, and explore data sets using a compute cluster. With Spark's support for Python and R, data scientists can work with common and easy-to-learn languages for data analysis.


Analytics

With Spark, you can use industry-standard Structured Query Language (SQL) to query your big data. Alternatively, you can use Spark's DataFrame API to analyze massive datasets from Python in a familiar way.


How does Spark work?

Spark runs as an application on a compute cluster, under a resource scheduler.

Users write Spark applications using the language of their choice – Python, SQL, Scala, R, or Java – and submit them to the cluster, where the Spark application can be run on many compute nodes at once, dividing the workload into tasks in order to parallelize efforts and complete processing faster.

Spark makes use of in-memory caching in order to accelerate data processing even further, but falls back to persistent storage media if memory is insufficient.

Spark applications are composed of a driver and executors. The driver can run on the cluster, or on the user's local machine when using an interactive session such as spark-shell. The driver coordinates tasks, while executors perform the actual processing. Data is partitioned and distributed among the executors to enable distributed processing.

Diagram: a Spark client connects to a Spark driver on a Kubernetes cluster. The driver coordinates a distributed, parallel processing job that reads and writes data to and from a remote object storage system.

Feature breakdown

  • Horizontally scalable

    Spark applications can be scaled to increase processing capacity by adding additional executors, which enables Spark to work at petabyte dataset scale.
  • Distributed processing

    Spark workloads are distributed across many executors. Executors can be distributed in a compute cluster, which can significantly accelerate data processing times by dividing data processing tasks and distributing them across the cluster.
  • Data warehousing

    When used in conjunction with solutions like Apache Kyuubi, Spark makes an effective lakehouse engine for data warehousing at data lake scale.

Installing Spark

Spark is a distributed system that runs on the Java Virtual Machine (its core is written in Scala), designed for computers running Ubuntu and other Linux distributions.

You can use Charmed Apache Spark to deploy a fully supported Spark solution on Kubernetes.


Charmed Apache Spark

Charmed Spark delivers up to 15 years of support and security maintenance for Apache Spark as an integrated, turnkey solution with advanced management features.

Ubuntu Pro is Canonical's comprehensive subscription for open source security, support, and compliance. It covers the Spark framework and its vast ecosystem of dependencies, ensuring the integrity of your big data processing and ETL pipelines and securing your toolchains and frameworks across your compute clusters.


Charmed Spark

Included in Ubuntu Pro + Support

When you purchase an Ubuntu Pro + Support plan, you also get support for the full Charmed Apache Spark solution.


  • Up to 15 years of Spark support per release track
  • 24/7 or weekday phone and ticket support
  • Up to 15 years of security maintenance for Spark covering critical and high severity CVEs

Charmed Spark allows you to automate the deployment and operation of Spark at massive scale in the environment of your choice – in the cloud or in your data center – with support for the most popular public clouds and CNCF-conformant Kubernetes distributions.


Spark OCI-compliant container image

Included in Ubuntu Pro

Also included in Ubuntu Pro is support for Canonical's OCI-compliant container image for Spark, published in the GitHub Container Registry (GHCR) – a securely designed Spark container image based on Ubuntu LTS.


  • Up to 15 years of support per release track
  • Same 24/7 or weekday phone and ticket support commitment
  • 10 years of security maintenance covering critical and high severity CVEs in the image

Spark consultancy and support

Advanced professional services for Spark, when you need them

Get help in designing, planning, building, and even operating a hyper-automated production Spark solution that perfectly fits your needs, with Canonical's expert services.


  • Help with design and build of both production and non-production Spark workloads with Charmed Spark
  • Managed services for Spark lakehouses in your cloud tenancy or data center, backed by an SLA
  • Firefighting support with a Spark operations expert, who works alongside your team when crisis hits

Learn more about Spark

Get an introduction to Apache Spark and learn how to prioritize your security requirements.


Spark resources


Apache®, Apache Spark, Spark®, and the Spark logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.