Apache Spark Notes 01: Architecture and Introduction - What Are the Spark Application, Spark Context, Stage, and Shuffle in Spark?

- December 11, 2024

Apache Spark architecture diagram

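What is Apache Spark?

Apache Spark is a fast, general-purpose engine for large-scale data processing.
For workloads that fit in memory it is commonly cited as running up to about 100 times faster than MapReduce, and about 10 times faster when it has to use disk.
It is built on a paradigm similar to MapReduce.
It integrates well with Hadoop because it can run on top of YARN and can access HDFS.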

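Components in the architecture diagram above

Resource / cluster resource manager:

A software component that manages the resources, such as memory, disk, and CPU, of the machines connected in a cluster.

Apache Spark can run on top of several cluster resource managers, such as YARN, Mesos, or Kubernetes. The cluster itself can be hosted on cloud infrastructure such as Amazon EC2, which supplies the machines rather than acting as a resource manager.

If no resource manager is available yet, Apache Spark can be used in standalone mode.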

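What is Spark Core?

The underlying engine, more formally known as "Spark Core", is the fault-tolerant base engine of Spark.
It is used for large-scale parallel and distributed data processing.
Spark Core manages memory and task scheduling.
Spark Core also contains the APIs used to define RDDs and other data types.
Spark Core also parallelizes a distributed collection of elements across the cluster.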

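Spark application architecture

Understanding this architecture helps explain how Spark scales with big data.
A Spark application consists of a driver program and executor processes.
Executors run on worker nodes.
If enough memory and cores are available, Spark can start additional executor processes on a worker.
Likewise, an executor can use multiple cores for multithreaded computation.
Spark distributes RDDs among the executors.
The driver and the executors communicate with each other. The driver contains the Spark jobs that the application needs to run and splits each job into tasks that are submitted to the executors. When the executors finish their tasks, the driver receives the task results.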
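The driver/executor exchange described above can be sketched in plain Python. This is only a toy model, not the Spark API: the function name run_job and its parameters are made up for illustration, and a thread pool stands in for executor threads on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_tasks, num_executor_threads, task_fn):
    """Hypothetical driver: split a job into tasks, hand them to worker
    threads, and collect the per-task results."""
    # Split the input into roughly equal chunks, one chunk per task.
    chunk_size = max(1, -(-len(data) // num_tasks))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # The thread pool stands in for executor threads on worker nodes.
    with ThreadPoolExecutor(max_workers=num_executor_threads) as pool:
        # map() preserves task order, so results line up with the chunks.
        return list(pool.map(task_fn, chunks))


# The "driver" submits 4 tasks to 2 executor threads and gets results back.
partial_sums = run_job(list(range(1, 101)), num_tasks=4,
                       num_executor_threads=2, task_fn=sum)
total = sum(partial_sums)  # the driver combines the task results
```

Here the four partial sums come back in task order and add up to 5050, which mirrors the driver receiving task results from the executors.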

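As an everyday analogy, imagine that Apache Spark is a large organization or company.

The driver code is the company's executive management, which makes decisions about allocating work, obtaining capital, and so on.
The executors are more like the rank-and-file employees, who use the resources they are given to complete the work assigned to them.
The worker nodes correspond to the office space and factory floor that the employees occupy.
More worker nodes can be added to scale big data processing incrementally.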

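Components of a Spark application (two main processes)

Driver Program
Runs as a single process per application. It can run on a cluster node or on another machine as a client to the cluster. The driver program runs the application's user code, creates work, and sends it to the cluster.
Executors
Executors are processes that run multiple threads to perform the cluster's work concurrently.
Executors work independently. There can be many of them across the cluster, with one or more per node, depending on the configuration.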

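What is the Spark Context?

The Spark Context starts when the application launches and must be created in the driver program before any DataFrames or RDDs.
Any DataFrame or RDD created under the context is tied to it, and the context must remain active for their entire lifetime.
The work that the driver program creates from the user code is called "jobs", that is, computations that can be performed in parallel.
The Spark Context in the driver program divides these jobs into tasks to be executed on the cluster.

The tasks of a given job operate on different subsets of the data, called partitions.

This also means that tasks can run in parallel in the executors.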
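To make the relationship between the context, the datasets created under it, and partitions more concrete, here is a small pure-Python sketch. The classes ToyContext and ToyDataset are hypothetical and are not the PySpark API; the tasks also run one after another here, whereas in Spark they may run in parallel on executors.

```python
class ToyContext:
    """Hypothetical stand-in for a Spark Context (not the PySpark API)."""

    def __init__(self):
        self.active = True

    def parallelize(self, data, num_partitions):
        # Deal the elements into partitions round-robin.
        partitions = [list(data[i::num_partitions])
                      for i in range(num_partitions)]
        return ToyDataset(self, partitions)

    def stop(self):
        self.active = False


class ToyDataset:
    """A dataset tied to the context that created it."""

    def __init__(self, context, partitions):
        self.context = context
        self.partitions = partitions

    def run_job(self, task_fn):
        # The dataset is only usable while its context is still active.
        if not self.context.active:
            raise RuntimeError("context has been stopped")
        # One task per partition; each task sees only its own subset.
        return [task_fn(partition) for partition in self.partitions]


ctx = ToyContext()
numbers = ctx.parallelize(list(range(10)), num_partitions=3)
task_results = numbers.run_job(len)  # one result per task/partition
ctx.stop()  # after this, numbers.run_job(...) raises RuntimeError
```

Ten elements dealt into three partitions give three tasks, and once the context is stopped the dataset can no longer run jobs, which is the lifetime rule described above.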

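A Spark worker is a cluster node that carries out work.
A Spark executor uses a set portion of the local resources, such as memory and compute cores, and runs one task per available core.

Each executor manages its data caching as directed by the driver.

In general, increasing the number of executors and available cores increases the cluster's parallelism.

Tasks run in separate threads until all cores are in use.

When a task finishes, the executor either puts the result in a new RDD partition or transfers it back to the driver program.

Ideally, the number of cores in use should stay within the total number of cores available on each node. For example, an 8-core node could have 1 executor process using 8 cores.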
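The sizing rule above is simple arithmetic, sketched below. The helper parallel_task_slots is a hypothetical name, and the calculation assumes one task per core, as stated in the text.

```python
def parallel_task_slots(num_nodes, executors_per_node, cores_per_executor,
                        cores_per_node):
    """Return how many tasks can run at once, assuming one task per core."""
    cores_used_per_node = executors_per_node * cores_per_executor
    # Stay within the cores that each node actually has.
    if cores_used_per_node > cores_per_node:
        raise ValueError("executors would use more cores than the node has")
    return num_nodes * cores_used_per_node


# The 8-core example from the text: one executor using all 8 cores.
single_executor = parallel_task_slots(num_nodes=1, executors_per_node=1,
                                      cores_per_executor=8, cores_per_node=8)
# An alternative layout on the same node: two executors with 4 cores each.
two_executors = parallel_task_slots(num_nodes=1, executors_per_node=2,
                                    cores_per_executor=4, cores_per_node=8)
```

Both layouts yield 8 task slots on the node; they differ only in how the cores are grouped into executor processes.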

https://events.prace-ri.eu/event/896/sessions/2721/attachments/997/1666/Spark_Cluster.pdf

What is a Stage in Spark?

A stage is a set of tasks that an executor can complete on the current data partition.

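What is a Shuffle in Spark?

When a task requires other data partitions, Spark must perform a "shuffle".
Shuffles are costly because they require data serialization, disk I/O, and network I/O. They are needed because they give tasks access to data held in other dataset partitions; when Spark performs a shuffle, it redistributes the dataset across the cluster.
A shuffle marks the boundary between stages.
Tasks in a subsequent stage must wait until the preceding stage has completed before they can start, which creates a dependency from one stage to the next.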
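A word count is a convenient way to see where the stage boundary falls. The sketch below is plain Python, not Spark code: the function names are made up, and the partitioner (sum of character codes modulo the partition count) is a toy choice picked only because it is deterministic.

```python
from collections import defaultdict


def stage_one(partition):
    """Stage 1 task: count words using only the data in this partition."""
    counts = defaultdict(int)
    for word in partition:
        counts[word] += 1
    return dict(counts)


def shuffle(per_partition_counts, num_output_partitions):
    """Redistribute records so that all counts for a word land together."""
    outputs = [defaultdict(list) for _ in range(num_output_partitions)]
    for counts in per_partition_counts:
        for word, count in counts.items():
            # Toy deterministic partitioner: route each record by its key.
            target = sum(map(ord, word)) % num_output_partitions
            outputs[target][word].append(count)
    return outputs


def stage_two(shuffled_partition):
    """Stage 2 task: combine the partial counts for each word."""
    return {word: sum(counts) for word, counts in shuffled_partition.items()}


partitions = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
partial_counts = [stage_one(p) for p in partitions]          # stage 1
shuffled = shuffle(partial_counts, num_output_partitions=2)  # stage boundary
final_partitions = [stage_two(p) for p in shuffled]          # stage 2
word_counts = {w: c for part in final_partitions for w, c in part.items()}
```

Stage 2 cannot start until every stage 1 task has produced its partial counts, because any input partition may contribute to any output partition. That is the dependency between stages that the shuffle introduces.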

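Apache Spark ships with a useful set of libraries:

DataFrames provide a general way to represent data in a tabular structure.
They make it possible to query data with R or SQL without writing a lot of code.
The Streaming library makes it possible to process large, fast-arriving data streams with Spark.
MLlib is a very rich machine learning library.
It provides sophisticated algorithms that run in a distributed fashion.
GraphX makes it simple to represent large amounts of data as graphs. It provides a library of algorithms for processing graphs across multiple computers.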

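Use cases such as the following are a good fit for Spark:

Running ETL (extract-transform-load) on big data stored in HDFS
Generating recommendations with collaborative filtering for a large number of users
Running complex graph computations, such as PageRank (Google Search)
Performing real-time fraud detection with machine learning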

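Why is Spark faster than Hadoop MapReduce?

Reviewing the MapReduce flow

In MapReduce, the user needs to write two programs or functions, map and reduce, and then submit them as a job.

A MapReduce run works as follows:

Data is read from HDFS, and the MapReduce result is written back to HDFS.

Multi-stage MapReduce jobs are very common. If a job has several MapReduce stages, the first stage writes its data to HDFS and the second stage then reads that same data back from HDFS, which introduces a great deal of latency.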

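The drawbacks of Hadoop MapReduce are as follows:

Batchwise design: every MapReduce cycle has to read data from HDFS and write data back to it, which introduces a great deal of latency.
The MapReduce structure is very rigid, so expressing logic in the MapReduce pattern is difficult.
In-memory computing is not possible with Hadoop MapReduce.

Spark: it replaces the HDFS handoff between stages with memory. Instead of writing data to HDFS every time, Spark updates data directly in memory, which can be on the order of 80 times faster than going through disk.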
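The difference between the two handoff styles can be illustrated with a small pure-Python sketch. It is an analogy under stated assumptions, not a benchmark: a local temporary JSON file stands in for HDFS, and the stage functions and runner names are invented for the example.

```python
import json
import os
import tempfile


def square_all(values):
    return [v * v for v in values]


def keep_even(values):
    return [v for v in values if v % 2 == 0]


def run_with_disk_handoff(values, stages):
    """MapReduce-style: every stage writes its output to a file that the
    next stage reads back (a local temp file stands in for HDFS)."""
    stage_writes = 0
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "stage_data.json")
        with open(path, "w") as f:
            json.dump(values, f)  # the job's initial input
        for stage in stages:
            with open(path) as f:
                current = json.load(f)        # read the previous output
            with open(path, "w") as f:
                json.dump(stage(current), f)  # write this stage's output
            stage_writes += 1
        with open(path) as f:
            return json.load(f), stage_writes


def run_in_memory(values, stages):
    """Spark-style: intermediate results stay in memory between stages."""
    current = values
    for stage in stages:
        current = stage(current)
    return current


stages = [square_all, keep_even]
disk_result, stage_writes = run_with_disk_handoff(list(range(6)), stages)
memory_result = run_in_memory(list(range(6)), stages)
```

Both runners produce the same result, but the disk-based one performs a read and a write for every stage, which is the per-stage latency the list above describes.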

Graded Quiz: Apache Spark

Q1.What are the three main components of Apache Spark architecture?

Scala; Java; Python

(O)Data; compute interface; resource management

Mesos; YARN; Kubernetes

Storage; HDFS; Python

Explanation

The three main components of Apache Spark architecture are data, compute interface, and resource management.

Q2.What are DataFrames in Apache Spark?

DataFrames is a distributed file system in Spark used for storing large data sets efficiently.

(O)DataFrames are a distributed collection of data organized into named columns.

DataFrames are Spark’s built-in machine learning models for predictive analytics.

DataFrames is a data format for storing graph data structures in Spark.

Explanation

A DataFrame is a distributed collection of data organized into named columns, and it is built on top of the Spark SQL RDD (Resilient Distributed Dataset) API.

Q3.What is Apache Spark?

Closed-source data analysis tool

(O)In-memory framework for distributed data processing

Hardware manufacturer

Cloud storage service

Explanation

Spark is an open-source, in-memory application framework for distributed data processing and iterative analysis on massive data volumes.

Q4.What is functional programming?

A programming approach that focuses solely on graphical functions and visual designs

(O)A style of programming that follows the mathematical function format

A programming method that prioritizes procedural programming over the use of mathematical functions

A programming approach that emphasizes the how to of the solution as opposed to the what of the solution

Explanation

In functional programming, the code or program emphasizes the what of the solution instead of the how to of the solution.
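A short Python contrast may help with the "what" versus "how to" distinction. This is just an illustrative sketch with made-up variable names: both versions compute the sum of the squares of the odd numbers in a list.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Imperative style: spell out how to build the result step by step.
total_imperative = 0
for n in numbers:
    if n % 2 == 1:
        total_imperative += n * n

# Functional style: describe what the result is by composing functions.
odd_squares = map(lambda n: n * n, filter(lambda n: n % 2 == 1, numbers))
total_functional = reduce(lambda a, b: a + b, odd_squares, 0)
```

The functional version never mutates a running total; it states the result as a filter, a map, and a reduce over the input.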

Q5.Which of the following statements defines Resilient Distributed Datasets (RDDs)? Select all that apply.

(O)RDD is a collection of fault-tolerant elements.

(O)RDD is capable of receiving parallel operations.

RDD is a distributed database management system.

(O)RDDs are immutable.

Explanation

RDD is a collection of fault-tolerant elements partitioned across the cluster's nodes.

RDDs are capable of receiving parallel operations.

RDDs are immutable, meaning these datasets cannot be changed once created.

Q6.What is the primary purpose of parallel programming?

To break a problem into discrete parts that can be solved sequentially

To employ specific control and coordination mechanisms

(O)To use multiple compute resources to solve a computational problem

Explanation

Parallel programming uses multiple compute resources to solve a computational problem.

Q7.Which of the following is a benefit of DataFrames?

To scale small-scale data on a laptop

To scale from kilobytes of data on multiple laptops to petabytes on a large cluster

(O)To scale from kilobytes of data on a single laptop to petabytes on a large cluster

Supports specific data formats and storage systems

Explanation

DataFrames have the ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster.
