What is Apache Spark?

Apache Spark is a free and open source framework for parallel, distributed data processing that enables you to process all kinds of data at massive scale.


Features and benefits of Apache Spark

Spark delivers an advanced set of capabilities for today's data engineers, data scientists, and analysts:


  • Write custom parallel, distributed data processing applications in popular languages including Python, Scala, and Java
  • Use Spark MLlib to train and evaluate machine learning models at scale, and Spark GraphX to process large graphs with high numbers of vertices and edges
  • Write SQL queries and apply common data warehousing techniques to big data with Spark SQL
  • Write continuous data processing applications that reliably process streaming data using the Spark Streaming API

Spark is widely used to build sophisticated data pipelines, composed of long-lived event stream processing applications or batch jobs that run on a schedule. Spark is also widely adopted by data scientists for data analysis and for machine learning tasks, including data preparation.


Why choose Spark?

  • Flexible and versatile

  • Processes big data

  • Support for multiple programming languages


Why do companies use Spark?


Proven solution

Apache Spark has been tested in real world deployments around the globe for more than a decade.


Massive scalability

Spark is designed to run at web scale – with horizontal scaling capabilities built in.


Large user community

Spark is a mature, actively developed project and has a vast user community.


How do companies use Spark?


Streaming data

You can use Spark to develop applications that process continuous data streams, for example for analyzing web clickstream data and providing real time insights and alerts at web scale.


Data science

Spark is a popular choice for empowering data scientists. Spark helps data scientists to prepare data, train models, and explore data sets using a compute cluster. With Spark's support for Python and R, data scientists can work with common and easy-to-learn languages for data analysis.


Analytics

With Spark, you can use industry-standard Structured Query Language (SQL) to query your big data. Alternatively, you can use Spark's DataFrame API to analyze massive datasets from Python in a familiar way.


How does Spark work?

Spark runs as an application on a compute cluster, under a resource scheduler.

Users write Spark applications using the language of their choice – Python, SQL, Scala, R, or Java – and submit them to the cluster, where the Spark application can be run on many compute nodes at once, dividing the workload into tasks in order to parallelize efforts and complete processing faster.

Spark makes use of in-memory caching in order to accelerate data processing even further, but falls back to persistent storage media if memory is insufficient.

Spark applications are composed of a driver and executors. The driver can run on the cluster, or on the user's local machine when using an interactive session such as spark-shell. The driver coordinates tasks, while executors perform the actual processing. Data is partitioned and distributed among the executors to enable distributed processing.

Diagram: a Spark client connects to a Spark driver on a Kubernetes cluster. The driver coordinates a distributed, parallel processing job that reads and writes data to and from a remote object storage system.

Feature breakdown

  • Horizontally scalable

    Spark applications can be scaled to increase processing capacity by adding additional executors, which enables Spark to work at petabyte dataset scale.
  • Distributed processing

    Spark workloads are distributed across many executors. Executors can be distributed in a compute cluster, which can significantly accelerate data processing times by dividing data processing tasks and distributing them across the cluster.
  • Data warehousing

    When used in conjunction with solutions like Apache Kyuubi, Spark makes an effective lakehouse engine for data warehousing at data lake scale.

Installing Spark

Spark is a distributed system that runs on the Java Virtual Machine (its core is written in Scala), designed for computers running Ubuntu and other Linux distributions.

You can use Charmed Apache Spark to deploy a fully supported Spark solution on Kubernetes.


Charmed Apache Spark

Charmed Spark delivers up to 15 years of support and security maintenance for Apache Spark as an integrated, turnkey solution with advanced management features.

Ubuntu Pro is Canonical's comprehensive subscription for open source security, support, and compliance. It covers the Spark framework and its vast ecosystem of dependencies, ensuring the integrity of your big data processing and ETL pipelines and securing your toolchains and frameworks across your compute clusters.


Charmed Spark

Included in Ubuntu Pro + Support

When you purchase an Ubuntu Pro + Support plan, you also get support for the full Charmed Apache Spark solution.


  • Up to 15 years of Spark support per release track
  • 24/7 or weekday phone and ticket support
  • Up to 15 years of security maintenance for Spark covering critical and high severity CVEs

Charmed Spark allows you to automate the deployment and operation of Spark at massive scale in the environment of your choice – in the cloud or in your data center – with support for the most popular public clouds and CNCF-conformant Kubernetes distributions.


Spark OCI-compliant container image

Included in Ubuntu Pro

Also included in Ubuntu Pro is support for Canonical's OCI-compliant container image for Spark, published in the GitHub Container Registry (GHCR) – a securely designed Spark container image based on Ubuntu LTS.


  • Up to 15 years of support per release track
  • Same 24/7 or weekday phone and ticket support commitment
  • 10 years of security maintenance covering critical and high severity CVEs in the image

Spark consultancy and support

Advanced professional services for Spark, when you need them

Get help in designing, planning, building, and even operating a hyper-automated production Spark solution that perfectly fits your needs, with Canonical's expert services.


  • Help with design and build of both production and non-production Spark workloads with Charmed Spark
  • Managed services for Spark lakehouses in your cloud tenancy or data center, backed by an SLA
  • Firefighting support with a Spark operations expert, who works alongside your team when crisis hits

Learn more about Spark

Get an introduction to Apache Spark and learn how to prioritize your security requirements.


Spark resources


Apache®, Apache Spark, Spark®, and the Spark logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.