Problem: loading the California housing dataset from sklearn:
from sklearn.datasets import fetch_california_housing, get_data_home
import numpy as np
print(get_data_home())
features, labels = fetch_california_housing(return_X_y=True)
print(features.shape, labels.shape)
It fails with the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
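The exception is raised by the download step (urllib), before any data is parsed. A minimal sketch of how the failure could be surfaced with a clearer message, assuming a hypothetical wrapper named `fetch_or_explain` (not part of scikit-learn) that accepts any fetch function:

```python
import urllib.error


def fetch_or_explain(fetch, *args, **kwargs):
    """Call `fetch`; re-raise an HTTP 403 with a hint about the workaround.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    try:
        return fetch(*args, **kwargs)
    except urllib.error.HTTPError as exc:
        if exc.code == 403:
            raise RuntimeError(
                "Download refused (HTTP 403); "
                "consider downloading the archive manually."
            ) from exc
        raise  # any other HTTP error is passed through unchanged
```

It would be called as `fetch_or_explain(fetch_california_housing, return_X_y=True)`.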
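Solution
Open ...\site-packages\sklearn\datasets\_california_housing.py. The dataset URL is given in a comment near line 42 (the exact line number may differ between scikit-learn versions):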
# The original data can be found at:
# https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
Download the archive manually and put it in the folder returned by get_data_home():
from sklearn.datasets import fetch_california_housing, get_data_home
print(get_data_home())
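The copy step can also be scripted. The sketch below is only an illustration: `place_archive` and `default_data_home` are hypothetical helpers, and `target_name` is an assumption, since the file name must match whatever path the patched loader opens in your installed version. The fallback directory mirrors the documented default of get_data_home() (the SCIKIT_LEARN_DATA environment variable, otherwise ~/scikit_learn_data); passing the value returned by get_data_home() explicitly is the safer choice.

```python
import os
import shutil
from pathlib import Path


def default_data_home():
    """Documented default of get_data_home(): SCIKIT_LEARN_DATA if set,
    otherwise ~/scikit_learn_data."""
    root = os.environ.get("SCIKIT_LEARN_DATA", "~/scikit_learn_data")
    return Path(root).expanduser()


def place_archive(downloaded, data_home=None, target_name="cal_housing.tgz"):
    """Copy a manually downloaded archive into the data home.

    `target_name` is an assumption: use the file name your loader reads.
    """
    data_home = Path(data_home) if data_home is not None else default_data_home()
    data_home.mkdir(parents=True, exist_ok=True)
    target = data_home / target_name
    shutil.copyfile(downloaded, target)
    return target
```

For example, `place_archive("Downloads/cal_housing.tgz", data_home=get_data_home())` would copy the file and return its new path.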
Finally, modify _california_housing.py at line 154 so that it reads the downloaded archive instead of the joblib cache:
# Original line, commented out:
# cal_housing = joblib.load(filepath)
# Replacement: parse the .tgz archive located at filepath
with tarfile.open(mode="r:gz", name=filepath) as f:
    cal_housing = np.loadtxt(
        f.extractfile("CaliforniaHousing/cal_housing.data"), delimiter=","
    )
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]
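To see what the column permutation does without touching the installed package, the same read-and-reorder logic can be tried on a tiny in-memory archive. This is a sketch under stated assumptions: the two sample rows and the raw column order (longitude, latitude, housingMedianAge, totalRooms, totalBedrooms, population, households, medianIncome, medianHouseValue) are illustrative, and `make_archive` and `load_reordered` are hypothetical helper names.

```python
import io
import tarfile

import numpy as np

# Two illustrative rows in the assumed raw column order:
# longitude, latitude, housingMedianAge, totalRooms, totalBedrooms,
# population, households, medianIncome, medianHouseValue
RAW = (
    b"-122.23,37.88,41,880,129,322,126,8.3252,452600\n"
    b"-122.22,37.86,21,7099,1106,2401,1138,8.3014,358500\n"
)
MEMBER = "CaliforniaHousing/cal_housing.data"


def make_archive(payload):
    """Build a small .tgz in memory with the same member path."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=MEMBER)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def load_reordered(fileobj):
    """Read the member with np.loadtxt and apply the column permutation."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        data = np.loadtxt(tar.extractfile(MEMBER), delimiter=",", ndmin=2)
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    return data[:, columns_index]


reordered = load_reordered(make_archive(RAW))
print(reordered.shape)  # (2, 9)
print(reordered[0, 0])  # 452600.0, the house value is moved to column 0
```

After the permutation the target value comes first and latitude and longitude come last, which is the layout the rest of the loader works from.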
Then run:
from sklearn.datasets import fetch_california_housing, get_data_home
import numpy as np
print(get_data_home())
features, labels = fetch_california_housing(return_X_y=True)
print(features.shape, labels.shape)
print(features[0])
print(labels[0])
The output is:
(20640, 8) (20640,)
[ 8.3252 41. 6.98412698 1.02380952 322.
2.55555556 37.88 -122.23 ]
4.526