Exploratory Data Analysis (EDA): Pearson Product-Moment Correlation Coefficient (Boston Housing Prices), Heatmap & Scatter Plots

- March 01, 2025

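First make sure you are using scikit-learn 1.1.3; later releases removed the Boston housing toy dataset.

NumPy also must not be 2.x; use 1.26.4. Otherwise you run into the binary-incompatibility error discussed here: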

https://stackoverflow.com/questions/78634235/numpy-dtype-size-changed-may-indicate-binary-incompatibility-expected-96-from
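A quick way to check the environment before running anything is sketched below. The post pins scikit-learn 1.1.3 and NumPy 1.26.4; the looser bounds used here (scikit-learn below 1.2, NumPy below 2.0) are my assumption, based on `load_boston` having been removed in scikit-learn 1.2. The helper names `parse_version`, `environment_ok`, and `installed` are made up for this illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a string such as '1.26.4' into a tuple of ints, e.g. (1, 26, 4)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def environment_ok(sklearn_version, numpy_version):
    """Assumed bounds: scikit-learn < 1.2 (still has load_boston), NumPy < 2.0."""
    return (parse_version(sklearn_version) < (1, 2)
            and parse_version(numpy_version) < (2,))


def installed(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


sk_version = installed("scikit-learn")
np_version = installed("numpy")
if sk_version and np_version:
    print("scikit-learn", sk_version, "numpy", np_version,
          "OK" if environment_ok(sk_version, np_version) else "needs pinning")
else:
    print("scikit-learn and/or numpy is not installed")
```

If the check reports "needs pinning", installing the two versions named in the post is the safest route.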

Attribute Information (in order):

- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centres

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town

- LSTAT % lower status of the population

- MEDV Median value of owner-occupied homes in $1000's

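In plain terms:

CRIM: per-capita crime rate of each town

ZN: proportion of residential land zoned for lots over 25,000 sq. ft.

INDUS: proportion of non-retail business land in each town

CHAS: Charles River dummy variable (1 if the tract borders the river, 0 otherwise)

NOX: nitric oxide concentration (parts per 10 million)

AGE: proportion of owner-occupied units built before 1940

DIS: weighted distance to five Boston employment centres

RAD: index of accessibility to radial highways

TAX: full-value property-tax rate per $10,000

PTRATIO: pupil-teacher ratio of the town

RM: average number of rooms per house

LSTAT: share of low-income population in the community

MEDV: median value of owner-occupied homes (in $1000s)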

Pearson Correlation Coefficient

https://homepage.ntu.edu.tw/~clhsieh/biostatistic/9/9-5.htm

Compute the Pearson correlation coefficient between each feature and the target output, PRICE.
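To make the coefficient concrete, here is a minimal pure-Python sketch of the textbook formula: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the small sample lists are invented for illustration and are not the Boston data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Made-up values: more rooms go with higher prices, higher LSTAT with lower prices.
price = [15.0, 20.0, 25.0, 30.0, 35.0]
rooms = [5.0, 5.8, 6.1, 6.9, 7.4]
lstat = [20.0, 14.0, 11.0, 6.0, 4.0]

print("rooms vs price:", round(pearson_r(rooms, price), 3))
print("lstat vs price:", round(pearson_r(lstat, price), 3))
```

In the scripts that follow, `DataFrame.corr()` does the same calculation for every pair of columns at once.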

getData.py

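from sklearn.datasets import load_boston  # only available before scikit-learn 1.2
import pandas as pd

# Show every column and row when a DataFrame is printed.
display = pd.options.display
display.max_columns = None
display.max_rows = None
display.width = None
display.max_colwidth = None


def getRawData():
    """Return the Boston data as a DataFrame with the target in a PRICE column."""
    data = load_boston()
    df = pd.DataFrame(data=data.data, columns=data.feature_names)
    # Put the target (MEDV) in the first column, named PRICE.
    df.insert(0, column='PRICE', value=data.target)
    print(data.DESCR)  # dataset description, including the attribute list
    return df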

EDA.py

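from getData import getRawData
import matplotlib.pyplot as plt
import seaborn as sns  # statistical plotting

df = getRawData()
# print(df)  # 13 feature columns plus PRICE, 506 rows

# Distribution of the target.
sns.histplot(df['PRICE'])
plt.show()

# Pearson's r lies between -1 and 1; the closer it is to -1 or 1, the stronger the linear relationship.
# Near -1: negative correlation (the higher the crime rate, the lower the price).
# Near +1: positive correlation (the more rooms, the higher the price).
corrm = df.corr()
print(corrm)
# Order the rows by their correlation with PRICE, largest first.
corrm = corrm.nlargest(len(corrm), columns='PRICE')
sns.set(font_scale=0.7)
sns.heatmap(corrm, annot=True, annot_kws={"size": 7}, fmt='.2f')
plt.show()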

Start with the histogram of PRICE: the bulk of the data falls in the PRICE 15 to 30 range.
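The "bulk of the data" reading of the histogram can also be put into a number. The sketch below uses a handful of made-up prices rather than the real column, and `share_in_range` is a hypothetical helper name; on the real data you would pass `df['PRICE']` instead.

```python
import pandas as pd

# Made-up prices (in $1000s) standing in for df['PRICE'].
prices = pd.Series([13.5, 18.2, 21.0, 22.4, 24.9, 27.1, 33.0, 50.0], name="PRICE")


def share_in_range(values, low, high):
    """Fraction of values with low <= value <= high (both ends inclusive)."""
    return values.between(low, high).mean()


share = share_in_range(prices, 15, 30)
print(f"{share:.1%} of the sample prices fall between 15 and 30")
```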

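Use seaborn to draw the heatmap.

In its default, unsorted order the matrix is hard on the eyes.

Sorting it by correlation coefficient, from largest to smallest, makes it easier to read.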
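Note that `nlargest` in EDA.py reorders only the rows of the matrix. If you also want the columns in the same order, so the heatmap stays symmetric around the diagonal, one option is to index both axes with the sorted labels. This is a sketch on made-up toy values, and `sorted_corr` is a hypothetical helper, not part of the original scripts.

```python
import pandas as pd

# Made-up toy values standing in for the Boston DataFrame.
toy = pd.DataFrame({
    "PRICE": [15.0, 20.0, 25.0, 30.0, 35.0, 28.0],
    "RM":    [5.0, 5.8, 6.1, 6.9, 7.4, 6.3],
    "LSTAT": [20.0, 14.0, 11.0, 6.0, 4.0, 9.0],
    "AGE":   [90.0, 60.0, 75.0, 40.0, 55.0, 80.0],
})


def sorted_corr(df, target="PRICE"):
    """Correlation matrix with rows and columns ordered by correlation with target."""
    corr = df.corr()
    order = corr[target].sort_values(ascending=False).index
    return corr.loc[order, order]


ordered = sorted_corr(toy)
print(ordered.round(2))
```

The result can be passed to `sns.heatmap` exactly like `corrm` in EDA.py.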

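The first pass over the correlations picks out two features:

RM: average number of rooms per house => positive correlation with PRICE

LSTAT: share of low-income population in the community => negative correlation with PRICE

Plotting each of them against PRICE as a scatter plot shows whether those trends really are there.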

from getData import getRawData
import matplotlib.pyplot as plt

df = getRawData()
features = ['LSTAT', 'RM']
# features = ['AGE', 'RAD']  # alternative pair to compare
plt.figure(figsize=(15, 5))
# One scatter panel per feature, each plotted against PRICE.
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = df[col]
    y = df['PRICE']
    plt.scatter(x, y)
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('PRICE')
plt.show()
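Alongside the scatter plots, the same two features can be checked numerically with `DataFrame.corrwith`, which returns one coefficient per selected column. The values below are made up for illustration and `trend_check` is a hypothetical helper; with the real data you would pass the `df` returned by `getRawData()`.

```python
import pandas as pd

# Made-up toy values: RM rises with PRICE, LSTAT falls as PRICE rises.
toy = pd.DataFrame({
    "PRICE": [15.0, 20.0, 25.0, 30.0, 35.0],
    "RM":    [5.0, 5.8, 6.1, 6.9, 7.4],
    "LSTAT": [20.0, 14.0, 11.0, 6.0, 4.0],
})


def trend_check(df, features, target="PRICE"):
    """Pearson correlation of each listed feature with the target column."""
    return df[features].corrwith(df[target])


r = trend_check(toy, ["LSTAT", "RM"])
print(r.round(2))
```

A clearly positive value for RM and a clearly negative one for LSTAT is what the upward and downward clouds in the two panels correspond to.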
