Spark MLlib Introduction and Usage (2): Linear Regression
[cloudera@cdh6 ~]$ nano lr_train.csv
[cloudera@cdh6 ~]$ cat lr_train.csv
21.6 1:208
15.5 1:152
10.4 1:113
31.0 1:227
13.0 1:137
32.4 1:238
19.0 1:178
10.4 1:104
19.0 1:191
11.8 1:130
26.5 1:220
16.0 1:140
9.5 1:100
28.3 1:200
20.1 1:150
22.6 1:170
24.5 1:200
25 1:185
14.3 1:120
[cloudera@cdh6 ~]$ hdfs dfs -put lr_train.csv
[cloudera@cdh6 ~]$ nano lr_test.csv
[cloudera@cdh6 ~]$ cat lr_test.csv
16 1:150
9 1:100
28 1:200
20 1:130
[cloudera@cdh6 ~]$ hdfs dfs -put lr_test.csv
>>> lr_train = spark.read.format("libsvm").load("lr_train.csv")
>>> lr_test = spark.read.format("libsvm").load("lr_test.csv")
>>> lr_train.show()
+-----+---------------+
|label| features|
+-----+---------------+
| 21.6|(1,[0],[208.0])|
| 15.5|(1,[0],[152.0])|
| 10.4|(1,[0],[113.0])|
| 31.0|(1,[0],[227.0])|
| 13.0|(1,[0],[137.0])|
| 32.4|(1,[0],[238.0])|
| 19.0|(1,[0],[178.0])|
| 10.4|(1,[0],[104.0])|
| 19.0|(1,[0],[191.0])|
| 11.8|(1,[0],[130.0])|
| 26.5|(1,[0],[220.0])|
| 16.0|(1,[0],[140.0])|
| 9.5|(1,[0],[100.0])|
| 28.3|(1,[0],[200.0])|
| 20.1|(1,[0],[150.0])|
| 22.6|(1,[0],[170.0])|
| 24.5|(1,[0],[200.0])|
| 25.0|(1,[0],[185.0])|
| 14.3|(1,[0],[120.0])|
+-----+---------------+
>>> lr = LinearRegression()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'LinearRegression' is not defined
>>> from pyspark.ml.regression import LinearRegression
>>> lr = LinearRegression()
>>> lrModel = lr.fit(lr_train)
>>> trainingSummary = lrModel.summary
>>> print("r2: %f" % trainingSummary.r2)
r2: 0.885768
>>> lr_test_preds = lrModel.transform(lr_test)
>>> lr_test_preds.show()
+-----+---------------+------------------+
|label| features| prediction|
+-----+---------------+------------------+
| 16.0|(1,[0],[150.0])| 16.95222846400671|
| 9.0|(1,[0],[100.0])| 9.155477795656429|
| 28.0|(1,[0],[200.0])| 24.74897913235698|
| 20.0|(1,[0],[130.0])|13.833528196666594|
+-----+---------------+------------------+
>>>
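As a sanity check, the RMSE and R² of the four test predictions above can be recomputed by hand. This is a pure-Python sketch of the underlying arithmetic, not Spark code; in practice pyspark.ml.evaluation.RegressionEvaluator would report the same metrics:

```python
from math import sqrt

# (label, prediction) pairs taken from the lr_test_preds output above
pairs = [
    (16.0, 16.95222846400671),
    (9.0, 9.155477795656429),
    (28.0, 24.74897913235698),
    (20.0, 13.833528196666594),
]

n = len(pairs)
ss_res = sum((y - yhat) ** 2 for y, yhat in pairs)   # residual sum of squares
mean_y = sum(y for y, _ in pairs) / n
ss_tot = sum((y - mean_y) ** 2 for y, _ in pairs)    # total sum of squares

rmse = sqrt(ss_res / n)        # root mean squared error
r2 = 1.0 - ss_res / ss_tot     # coefficient of determination

print("RMSE: %f" % rmse)
print("r2: %f" % r2)
```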
Reading the data:
spark.read.format("libsvm") is used to read the data stored in "lr_train.csv" and "lr_test.csv". The "libsvm" format is commonly used in machine learning and is well suited to storing sparse data.
lr_train.show() displays the training data.
Each row contains a label and a set of features.
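Each libsvm line is just "label index:value index:value ...", with 1-based feature indices. A minimal pure-Python sketch of how one line decomposes (for illustration only; Spark's libsvm reader does this for you):

```python
def parse_libsvm_line(line):
    """Split a libsvm-formatted line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        # libsvm indices are 1-based; Spark stores them 0-based in SparseVector
        features[int(idx) - 1] = float(val)
    return label, features

# First line of lr_train.csv -> the (21.6, (1,[0],[208.0])) row shown above
label, features = parse_libsvm_line("21.6 1:208")
print(label, features)
```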
Error before model training:
The first attempt to create a LinearRegression object fails with a NameError because the class has not been imported.
After adding the import with from pyspark.ml.regression import LinearRegression, the linear regression class is available.
A LinearRegression instance is then created and fit to the training data lr_train.
Model fitting and evaluation:
lr.fit(lr_train) returns a fitted model, lrModel.
trainingSummary = lrModel.summary provides a summary of the training run, which can be used to evaluate model performance.
print("r2: %f" % trainingSummary.r2) prints the coefficient of determination R².
In this case the model's R² is 0.885768, a relatively high value. R² ranges from 0 to 1, and the closer it is to 1, the better the model fits the data. In many practical applications a value above 0.8 is already considered a good result, meaning the model explains most of the variability.
Interpreting R²:
Above 0.8: the model usually predicts or explains the variation in the data effectively.
0.5 to 0.8: the model has some predictive power but may need further tuning or additional explanatory variables.
Below 0.5: the model may not fit the data, or the variables are only weakly related; a value near 0 suggests linear regression is not suitable for this dataset.
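Since there is only one feature, the same fit can be reproduced with the closed-form ordinary least squares formulas, confirming the R² of 0.885768 reported by trainingSummary (a pure-Python sketch, not part of the Spark session):

```python
# Training points from lr_train.csv: (feature, label)
data = [(208, 21.6), (152, 15.5), (113, 10.4), (227, 31.0), (137, 13.0),
        (238, 32.4), (178, 19.0), (104, 10.4), (191, 19.0), (130, 11.8),
        (220, 26.5), (140, 16.0), (100, 9.5), (200, 28.3), (150, 20.1),
        (170, 22.6), (200, 24.5), (185, 25.0), (120, 14.3)]

n = len(data)
mean_x = sum(x for x, _ in data) / float(n)
mean_y = sum(y for _, y in data) / float(n)

# Ordinary least squares: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
sxx = sum((x - mean_x) ** 2 for x, _ in data)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R^2 = 1 - SS_res / SS_tot
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in data)
ss_tot = sum((y - mean_y) ** 2 for _, y in data)
r2 = 1.0 - ss_res / ss_tot
print("r2: %f" % r2)
```

With Spark's default settings (regParam=0.0, fitIntercept=True) the fit is plain least squares, so the slope and intercept also match the model behind the test predictions shown above.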
Testing with another dataset
Additional notes on the libsvm format:
The first field is the label, followed by 10 features (index:value pairs).
[cloudera@cdh6 ~]$ head sample_linear_regression_data.txt
-9.490009878824548 1:0.4551273600657362 2:0.36644694351969087 3:-0.38256108933468047 4:-0.4458430198517267 5:0.33109790358914726 6:0.8067445293443565 7:-0.2624341731773887 8:-0.44850386111659524 9:-0.07269284838169332 10:0.5658035575800715
0.2577820163584905 1:0.8386555657374337 2:-0.1270180511534269 3:0.499812362510895 4:-0.22686625128130267 5:-0.6452430441812433 6:0.18869982177936828 7:-0.5804648622673358 8:0.651931743775642 9:-0.6555641246242951 10:0.17485476357259122
-4.438869807456516 1:0.5025608135349202 2:0.14208069682973434 3:0.16004976900412138 4:0.505019897181302 5:-0.9371635223468384 6:-0.2841601610457427 7:0.6355938616712786 8:-0.1646249064941625 9:0.9480713629917628 10:0.42681251564645817
-19.782762789614537 1:-0.0388509668871313 2:-0.4166870051763918 3:0.8997202693189332 4:0.6409836467726933 5:0.273289095712564 6:-0.26175701211620517 7:-0.2794902492677298 8:-0.1306778297187794 9:-0.08536581111046115 10:-0.05462315824828923
-7.966593841555266 1:-0.06195495876886281 2:0.6546448480299902 3:-0.6979368909424835 4:0.6677324708883314 5:-0.07938725467767771 6:-0.43885601665437957 7:-0.608071585153688 8:-0.6414531182501653 9:0.7313735926547045 10:-0.026818676347611925
-7.896274316726144 1:-0.15805658673794265 2:0.26573958270655806 3:0.3997172901343442 4:-0.3693430998846541 5:0.14324061105995334 6:-0.25797542063247825 7:0.7436291919296774 8:0.6114618853239959 9:0.2324273700703574 10:-0.25128128782199144
-8.464803554195287 1:0.39449745853945895 2:0.817229160415142 3:-0.6077058562362969 4:0.6182496334554788 5:0.2558665508269453 6:-0.07320145794330979 7:-0.38884168866510227 8:0.07981886851873865 9:0.27022202891277614 10:-0.7474843534024693
2.1214592666251364 1:-0.005346215048158909 2:-0.9453716674280683 3:-0.9270309666195007 4:-0.032312290091389695 5:0.31010676221964206 6:-0.20846743965751569 7:0.8803449313707621 8:-0.23077831216541722 9:0.29246395759528565 10:0.5409312755478819
1.0720117616524107 1:0.7880855916368177 2:0.19767407429003536 3:0.9520689432368168 4:-0.845829774129496 5:0.5502413918543512 6:-0.44235539500246457 7:0.7984106594591154 8:-0.2523277127589152 9:-0.1373808897290778 10:-0.3353514432305029
-13.772441561702871 1:-0.3697050572653644 2:-0.11452811582755928 3:-0.807098168238352 4:0.4903066124307711 5:-0.6582805242342049 6:0.6107814398427647 7:-0.7204208094262783 8:-0.8141063661170889 9:-0.9459402662357332 10:0.09666938346350307
[cloudera@cdh6 ~]$ hdfs dfs -put sample_linear_regression_data.txt
[cloudera@cdh6 ~]$
[cloudera@cdh6 ~]$ pyspark
Python 2.7.5 (default, Apr 2 2020, 13:16:51)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0-cdh6.3.2
/_/
Using Python version 2.7.5 (default, Apr 2 2020 13:16:51)
SparkSession available as 'spark'.
>>> from pyspark.ml.regression import LinearRegression
>>> training_data = spark.read.format("libsvm").load("sample_linear_regression_data.txt")
>>> type(training_data)
<class 'pyspark.sql.dataframe.DataFrame'>
>>> training_data.show()
+-------------------+--------------------+
| label| features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
| 7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
| 5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+
only showing top 20 rows
>>>
Before training, lr holds the definition of the algorithm; after training, the fitted model is stored in lrModel.
Printing the coefficients of the 10 features gives the model
y = a0*X0 + a1*X1 + a2*X2 + ... + a8*X8 + a9*X9 + b
>>> lr = LinearRegression()
>>> lrModel = lr.fit(training_data)
>>> print("Coefficients: %s" % str(lrModel.coefficients))
Coefficients: [0.0073350710225801715,0.8313757584337543,-0.8095307954684084,2.441191686884721,0.5191713795290003,1.1534591903547016,-0.2989124112808717,-0.5128514186201779,-0.619712827067017,0.6956151804322931]
>>>
The coefficient of the second feature (vector index 1) is 0.8313757584337543, and so on.
Written out with the intercept (lrModel.intercept, 0.142285582604), the fitted model is
y = 0.0073350710225801715*X0 + 0.8313757584337543*X1 - 0.8095307954684084*X2 + ... - 0.619712827067017*X8 + 0.6956151804322931*X9 + 0.142285582604
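Plugging the first row of sample_linear_regression_data.txt into this equation reproduces the first entry of the residuals table shown further below (residual = label - prediction). A pure-Python check, taking 0.142285582604 as the intercept:

```python
# Coefficients printed by lrModel.coefficients above
coefficients = [0.0073350710225801715, 0.8313757584337543, -0.8095307954684084,
                2.441191686884721, 0.5191713795290003, 1.1534591903547016,
                -0.2989124112808717, -0.5128514186201779, -0.619712827067017,
                0.6956151804322931]
intercept = 0.142285582604

# Feature values and label of the first row of sample_linear_regression_data.txt
features = [0.4551273600657362, 0.36644694351969087, -0.38256108933468047,
            -0.4458430198517267, 0.33109790358914726, 0.8067445293443565,
            -0.2624341731773887, -0.44850386111659524, -0.07269284838169332,
            0.5658035575800715]
label = -9.490009878824548

# prediction = dot(coefficients, features) + intercept
prediction = sum(c * x for c, x in zip(coefficients, features)) + intercept
residual = label - prediction
print(residual)   # close to the first entry of trainingSummary.residuals
```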
>>> training_data.first()
Row(label=-9.490009878824548, features=SparseVector(10, {0: 0.4551, 1: 0.3664, 2: -0.3826, 3: -0.4458, 4: 0.3311, 5: 0.8067, 6: -0.2624, 7: -0.4485, 8: -0.0727, 9: 0.5658}))
>>> trainingSummary = lrModel.summary
>>> print("numIterations: %d" % trainingSummary.totalIterations)
numIterations: 1
>>> print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
objectiveHistory: [0.0]
>>> trainingSummary.residuals.show()
+-------------------+
| residuals|
+-------------------+
|-11.011130022096554|
| 0.9236590911176538|
|-4.5957401897776675|
| -20.4201774575836|
|-10.339160314788181|
|-5.9552091439610555|
|-10.726906349283922|
| 2.122807193191233|
| 4.077122222293811|
|-17.316168071241652|
| -4.593044343959059|
| 6.380476690746936|
| 11.320566035059846|
|-20.721971774534094|
| -2.736692773777401|
| -16.66886934252847|
| 8.242186378876315|
|-1.3723486332690233|
|-0.7060332131264666|
|-1.1591135969994064|
+-------------------+
only showing top 20 rows
>>> print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
RMSE: 10.163092
>>> print("r2: %f" % trainingSummary.r2)
r2: 0.027839