Spark MLlib Introduction and Usage (2): Linear Regression

 
[cloudera@cdh6 ~]$ nano lr_train.csv
[cloudera@cdh6 ~]$ cat lr_train.csv
21.6 1:208
15.5 1:152
10.4 1:113
31.0 1:227
13.0 1:137
32.4 1:238
19.0 1:178
10.4 1:104
19.0 1:191
11.8 1:130
26.5 1:220
16.0 1:140
9.5  1:100
28.3 1:200
20.1 1:150
22.6 1:170
24.5 1:200
25   1:185
14.3 1:120
[cloudera@cdh6 ~]$ hdfs dfs -put lr_train.csv
[cloudera@cdh6 ~]$ nano lr_test.csv
[cloudera@cdh6 ~]$ cat lr_test.csv
16 1:150
9  1:100
28 1:200
20 1:130
[cloudera@cdh6 ~]$ hdfs dfs -put lr_test.csv


>>> lr_train = spark.read.format("libsvm").load("lr_train.csv")
>>> lr_test = spark.read.format("libsvm").load("lr_test.csv")
>>> lr_train.show()
+-----+---------------+
|label|       features|
+-----+---------------+
| 21.6|(1,[0],[208.0])|
| 15.5|(1,[0],[152.0])|
| 10.4|(1,[0],[113.0])|
| 31.0|(1,[0],[227.0])|
| 13.0|(1,[0],[137.0])|
| 32.4|(1,[0],[238.0])|
| 19.0|(1,[0],[178.0])|
| 10.4|(1,[0],[104.0])|
| 19.0|(1,[0],[191.0])|
| 11.8|(1,[0],[130.0])|
| 26.5|(1,[0],[220.0])|
| 16.0|(1,[0],[140.0])|
|  9.5|(1,[0],[100.0])|
| 28.3|(1,[0],[200.0])|
| 20.1|(1,[0],[150.0])|
| 22.6|(1,[0],[170.0])|
| 24.5|(1,[0],[200.0])|
| 25.0|(1,[0],[185.0])|
| 14.3|(1,[0],[120.0])|
+-----+---------------+

>>> lr = LinearRegression()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'LinearRegression' is not defined
>>> from pyspark.ml.regression import LinearRegression
>>> lr = LinearRegression()
>>> lrModel = lr.fit(lr_train)
>>> trainingSummary = lrModel.summary
>>> print("r2: %f" % trainingSummary.r2)
r2: 0.885768
>>> lr_test_preds = lrModel.transform(lr_test)
>>> lr_test_preds.show()
+-----+---------------+------------------+
|label|       features|        prediction|
+-----+---------------+------------------+
| 16.0|(1,[0],[150.0])| 16.95222846400671|
|  9.0|(1,[0],[100.0])| 9.155477795656429|
| 28.0|(1,[0],[200.0])| 24.74897913235698|
| 20.0|(1,[0],[130.0])|13.833528196666594|
+-----+---------------+------------------+

>>>


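Reading the data:
spark.read.format("libsvm") loads the data stored in "lr_train.csv" and "lr_test.csv". Despite the .csv extension, both files are written in libsvm format (a label followed by index:value pairs), a text format widely used in machine learning and well suited to sparse data.
lr_train.show() prints up to the first 20 rows, which here covers all 19 training rows.
Each row holds a label and a features vector. Feature indices are 1-based in the file and 0-based in the loaded vector, so 1:208 appears as (1,[0],[208.0]).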
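
To make the file format concrete, here is a minimal plain-Python sketch of how one libsvm line maps to the (label, features) pair printed by lr_train.show(). The helper name parse_libsvm_line is hypothetical, and this is not Spark's reader; it only illustrates the shift from 1-based file indices to 0-based vector indices.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {zero_based_idx: val}).

    Hypothetical helper for illustration only; Spark's own reader is not used.
    """
    parts = line.split()  # split() tolerates repeated spaces, e.g. '9.5  1:100'
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        # The file uses 1-based indices; the loaded vector uses 0-based ones.
        features[int(index) - 1] = float(value)
    return label, features


print(parse_libsvm_line("21.6 1:208"))  # (21.6, {0: 208.0})
print(parse_libsvm_line("9.5  1:100"))  # (9.5, {0: 100.0})
```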

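The error before training:
Creating a LinearRegression object fails with a NameError at first, because the class has not been imported yet.
After importing it with
from pyspark.ml.regression import LinearRegression
the estimator can be created: instantiate LinearRegression, then fit it to the training data lr_train.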

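Model fitting and evaluation:
lr.fit(lr_train) returns the fitted model, lrModel.
trainingSummary = lrModel.summary exposes a summary of the training run, which can be used to evaluate the model on the training data.
print("r2: %f" % trainingSummary.r2) prints the coefficient of determination, R2.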
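
As a cross-check on the numbers in the session above, the following plain-Python sketch fits the same 19 training rows with the closed-form least-squares formulas for a single feature. It assumes that LinearRegression() with default settings behaves as ordinary least squares (no regularization). The name fit_simple_ols is made up for this example, and Spark's solver works differently internally.

```python
# (x, y) pairs copied from lr_train.csv: x is feature 1, y is the label.
TRAIN = [
    (208, 21.6), (152, 15.5), (113, 10.4), (227, 31.0), (137, 13.0),
    (238, 32.4), (178, 19.0), (104, 10.4), (191, 19.0), (130, 11.8),
    (220, 26.5), (140, 16.0), (100, 9.5), (200, 28.3), (150, 20.1),
    (170, 22.6), (200, 24.5), (185, 25.0), (120, 14.3),
]


def fit_simple_ols(points):
    """Closed-form least squares for y = slope * x + intercept."""
    n = float(len(points))
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-feature least-squares fit this equals 1 - SS_res / SS_tot.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


slope, intercept, r2 = fit_simple_ols(TRAIN)
print("slope=%f intercept=%f r2=%f" % (slope, intercept, r2))
print("prediction for x=150: %f" % (intercept + slope * 150))
```

If the assumption holds, the printed r2 and the prediction for x=150 should agree with the session output (0.885768 and about 16.95) up to rounding.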


In this example, the coefficient of determination R2 of the linear regression model is 0.885768, which is relatively high.

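For a least-squares fit with an intercept, R2 on the training data lies between 0 and 1, and the closer it is to 1, the better the model fits the data. (On data the model was not fitted to, R2 can even be negative.) As a rule of thumb in many practical applications, a value above 0.8 is considered a good result: the model explains most of the variance in the label.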

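Interpreting R2 (rough guidelines; what counts as good depends on the domain)

Above 0.8: the model usually predicts or explains the variance in the data well.
0.5 to 0.8: the model has some predictive power, but may need further tuning or additional explanatory variables.
Below 0.5: the model may not suit the data, or the relationship in the data is weak; a value near 0 suggests linear regression is not a good fit for this dataset.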
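
The thresholds above are easier to read with the definition in hand: R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the labels from their mean. The sketch below is plain Python with made-up toy values (not data from this post); the function name r_squared is an assumption for illustration.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / float(len(labels))
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0, 4.0]  # toy labels, not from the post's datasets
print(r_squared(labels, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions -> 1.0
print(r_squared(labels, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 0.0
print(r_squared(labels, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean -> -3.0
```

A value near 0 therefore means the model does little better than always predicting the mean label.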


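Testing another dataset
A note on the libsvm format:
the first field on each line is the label, followed by 10 features written as index:value pairs.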
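
For readers who want to produce such a file from ordinary rows instead of typing it by hand, here is a small plain-Python sketch that formats one row in this layout. The helper name to_libsvm_line is hypothetical. It skips zero-valued features, which is what makes the format compact for sparse data.

```python
def to_libsvm_line(label, dense_features):
    """Format one row as 'label 1:v1 2:v2 ...', skipping zero-valued features."""
    tokens = [str(label)]
    # libsvm feature indices start at 1 and must appear in ascending order.
    for position, value in enumerate(dense_features, start=1):
        if value != 0:
            tokens.append("%d:%s" % (position, value))
    return " ".join(tokens)


print(to_libsvm_line(21.6, [208]))            # 21.6 1:208
print(to_libsvm_line(-1.5, [0.25, 0, 0.75]))  # -1.5 1:0.25 3:0.75
```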



[cloudera@cdh6 ~]$ head sample_linear_regression_data.txt
-9.490009878824548 1:0.4551273600657362 2:0.36644694351969087 3:-0.38256108933468047 4:-0.4458430198517267 5:0.33109790358914726 6:0.8067445293443565 7:-0.2624341731773887 8:-0.44850386111659524 9:-0.07269284838169332 10:0.5658035575800715
0.2577820163584905 1:0.8386555657374337 2:-0.1270180511534269 3:0.499812362510895 4:-0.22686625128130267 5:-0.6452430441812433 6:0.18869982177936828 7:-0.5804648622673358 8:0.651931743775642 9:-0.6555641246242951 10:0.17485476357259122
-4.438869807456516 1:0.5025608135349202 2:0.14208069682973434 3:0.16004976900412138 4:0.505019897181302 5:-0.9371635223468384 6:-0.2841601610457427 7:0.6355938616712786 8:-0.1646249064941625 9:0.9480713629917628 10:0.42681251564645817
-19.782762789614537 1:-0.0388509668871313 2:-0.4166870051763918 3:0.8997202693189332 4:0.6409836467726933 5:0.273289095712564 6:-0.26175701211620517 7:-0.2794902492677298 8:-0.1306778297187794 9:-0.08536581111046115 10:-0.05462315824828923
-7.966593841555266 1:-0.06195495876886281 2:0.6546448480299902 3:-0.6979368909424835 4:0.6677324708883314 5:-0.07938725467767771 6:-0.43885601665437957 7:-0.608071585153688 8:-0.6414531182501653 9:0.7313735926547045 10:-0.026818676347611925
-7.896274316726144 1:-0.15805658673794265 2:0.26573958270655806 3:0.3997172901343442 4:-0.3693430998846541 5:0.14324061105995334 6:-0.25797542063247825 7:0.7436291919296774 8:0.6114618853239959 9:0.2324273700703574 10:-0.25128128782199144
-8.464803554195287 1:0.39449745853945895 2:0.817229160415142 3:-0.6077058562362969 4:0.6182496334554788 5:0.2558665508269453 6:-0.07320145794330979 7:-0.38884168866510227 8:0.07981886851873865 9:0.27022202891277614 10:-0.7474843534024693
2.1214592666251364 1:-0.005346215048158909 2:-0.9453716674280683 3:-0.9270309666195007 4:-0.032312290091389695 5:0.31010676221964206 6:-0.20846743965751569 7:0.8803449313707621 8:-0.23077831216541722 9:0.29246395759528565 10:0.5409312755478819
1.0720117616524107 1:0.7880855916368177 2:0.19767407429003536 3:0.9520689432368168 4:-0.845829774129496 5:0.5502413918543512 6:-0.44235539500246457 7:0.7984106594591154 8:-0.2523277127589152 9:-0.1373808897290778 10:-0.3353514432305029
-13.772441561702871 1:-0.3697050572653644 2:-0.11452811582755928 3:-0.807098168238352 4:0.4903066124307711 5:-0.6582805242342049 6:0.6107814398427647 7:-0.7204208094262783 8:-0.8141063661170889 9:-0.9459402662357332 10:0.09666938346350307
[cloudera@cdh6 ~]$ hdfs dfs -put sample_linear_regression_data.txt
[cloudera@cdh6 ~]$


[cloudera@cdh6 ~]$ pyspark
Python 2.7.5 (default, Apr  2 2020, 13:16:51)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.3.2
      /_/

Using Python version 2.7.5 (default, Apr  2 2020 13:16:51)
SparkSession available as 'spark'.
>>> from pyspark.ml.regression import LinearRegression
>>> training_data = spark.read.format("libsvm").load("sample_linear_regression_data.txt")
>>> type(training_data)
<class 'pyspark.sql.dataframe.DataFrame'>
>>> training_data.show()
+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+
only showing top 20 rows

>>>
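Before training, lr only defines the algorithm (the estimator).
Training turns the algorithm into a model, which is stored in lrModel.
Print the coefficients of features 1 through 10.
y=a0*X0+a1*X1+a2*X2+...+a8*X8+a9*X9+b
Here a0..a9 are the coefficients, X0..X9 are the 10 feature values, and b is the intercept.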

>>> lr = LinearRegression()
>>> lrModel = lr.fit(training_data)
>>> print("Coefficients: %s" % str(lrModel.coefficients))
Coefficients: [0.0073350710225801715,0.8313757584337543,-0.8095307954684084,2.441191686884721,0.5191713795290003,1.1534591903547016,-0.2989124112808717,-0.5128514186201779,-0.619712827067017,0.6956151804322931]
>>>

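Coefficient of feature 1 (X0) -> 0.0073350710225801715
Coefficient of feature 2 (X1) -> 0.8313757584337543
... and so on.

y=0.0073350710225801715*X0+0.8313757584337543*X1-0.8095307954684084*X2+...-0.619712827067017*X8+0.6956151804322931*X9+0.142285582604
The last term, 0.142285582604, is the intercept b.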
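
To see the formula at work, the sketch below applies it in plain Python to the first row of sample_linear_regression_data.txt. The coefficients are copied from the session output, the features from the `head` output, and the intercept from the formula above. The function name predict is an assumption for this example, not the Spark API.

```python
# Coefficients printed by lrModel.coefficients in the session.
COEFFICIENTS = [
    0.0073350710225801715, 0.8313757584337543, -0.8095307954684084,
    2.441191686884721, 0.5191713795290003, 1.1534591903547016,
    -0.2989124112808717, -0.5128514186201779, -0.619712827067017,
    0.6956151804322931,
]
INTERCEPT = 0.142285582604  # constant term quoted in the formula

# First row of sample_linear_regression_data.txt (from the `head` output).
LABEL = -9.490009878824548
FEATURES = [
    0.4551273600657362, 0.36644694351969087, -0.38256108933468047,
    -0.4458430198517267, 0.33109790358914726, 0.8067445293443565,
    -0.2624341731773887, -0.44850386111659524, -0.07269284838169332,
    0.5658035575800715,
]


def predict(features):
    """Dot product of coefficients and features, plus the intercept."""
    return sum(a * x for a, x in zip(COEFFICIENTS, features)) + INTERCEPT


prediction = predict(FEATURES)
residual = LABEL - prediction  # residual = label minus prediction
print("prediction=%f residual=%f" % (prediction, residual))
```

The residual computed this way should land close to the first value that trainingSummary.residuals.show() prints later in the session (about -11.0111).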


>>> training_data.first()
Row(label=-9.490009878824548, features=SparseVector(10, {0: 0.4551, 1: 0.3664, 2: -0.3826, 3: -0.4458, 4: 0.3311, 5: 0.8067, 6: -0.2624, 7: -0.4485, 8: -0.0727, 9: 0.5658}))
>>> trainingSummary = lrModel.summary
>>> print("numIterations: %d" % trainingSummary.totalIterations)
numIterations: 1
>>> print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
objectiveHistory: [0.0]
>>> trainingSummary.residuals.show()
+-------------------+
|          residuals|
+-------------------+
|-11.011130022096554|
| 0.9236590911176538|
|-4.5957401897776675|
|  -20.4201774575836|
|-10.339160314788181|
|-5.9552091439610555|
|-10.726906349283922|
|  2.122807193191233|
|  4.077122222293811|
|-17.316168071241652|
| -4.593044343959059|
|  6.380476690746936|
| 11.320566035059846|
|-20.721971774534094|
| -2.736692773777401|
| -16.66886934252847|
|  8.242186378876315|
|-1.3723486332690233|
|-0.7060332131264666|
|-1.1591135969994064|
+-------------------+
only showing top 20 rows

>>> print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
RMSE: 10.163092
>>> print("r2: %f" % trainingSummary.r2)
r2: 0.027839
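
For reference, RMSE is the square root of the mean squared residual. The plain-Python sketch below shows the calculation on made-up residuals; the function name rmse is an assumption for illustration. Feeding it only the 20 residuals displayed above would not reproduce 10.163092, because the summary is computed over the whole training set and the table shows just the top 20 rows.

```python
import math


def rmse(residuals):
    """Root mean squared error from a list of residuals."""
    return math.sqrt(sum(r * r for r in residuals) / float(len(residuals)))


toy_residuals = [2.0, -2.0, 2.0, -2.0]  # made-up values, not from the session
print(rmse(toy_residuals))  # 2.0
```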





