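I. Overview:
1. Getting datasets:
- Small datasets (bundled with scikit-learn): sklearn.datasets.load_*
- Large datasets (downloaded on first use): sklearn.datasets.fetch_*
2. What the dataset loaders return:
- The return value is a Bunch, a dictionary-like object
- Its attributes:
(1) data: array of feature data
(2) target: array of labels (targets)
(3) DESCR: description of the dataset
(4) feature_names: names of the features
(5) target_names: names of the labels (target values)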
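As a small sketch of what "dictionary-like" means for the attributes listed above, the snippet below loads the iris data and reads the same fields both as keys and as attributes. The variable names `same_data` and `same_target` are only illustrative, and the exact set of keys may vary between scikit-learn versions.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict, so its fields can be listed as keys.
print(sorted(iris.keys()))

# Attribute access and key access return the same underlying object.
same_data = iris.data is iris["data"]
same_target = iris.target is iris["target"]
print(same_data, same_target)
```

Either access style can be used; the later examples in this post mix both (`iris.data` and `iris["target"]`).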
3. Splitting a dataset:
- from sklearn.model_selection import train_test_split
- Parameters:
(1) x -- feature values
(2) y -- target values
(3) test_size -- size of the test set
(4) random_state -- random seed
- Return value (note the order):
(1) x_train, x_test, y_train, y_test
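The return order is easy to get wrong, so here is a minimal sketch on made-up toy data (the arrays `x` and `y` are hypothetical, not part of the iris example). It unpacks the four return values and checks that each feature row is still paired with its own label after the shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, label = row index.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Order of the returned values: features first (train, test), then targets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)

# Rows stay paired with their labels: row i of x is [2*i, 2*i + 1].
rows_match = bool(np.all(x_test[:, 0] == 2 * y_test))
print(rows_match)
```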
II. Details:
1. Getting data:
1.1 Loading a small dataset bundled with sklearn, the iris data:
# coding:utf-8
from sklearn.datasets import load_iris
iris = load_iris()
print(iris)
- 1.1 Output (parts 1 and 2 combined):
1.2 Fetching a large dataset, for example the news data:
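# coding:utf-8
from sklearn.datasets import fetch_20newsgroups
import ssl

# Workaround for certificate errors during the download. It turns off HTTPS
# certificate verification for the whole process, so use it with care.
ssl._create_default_https_context = ssl._create_unverified_context

# Fetch a large dataset, for example the 20 newsgroups data
news = fetch_20newsgroups()
print(news)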
- 1.2 Output (easier to read when pasted into an online JSON viewer):
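2. What the sklearn dataset loaders return:
- 2.1 load_* and fetch_* return a sklearn.utils.Bunch (datasets.base.Bunch in older versions), a dictionary-like object
data: feature data; for numeric datasets such as iris, a 2-D numpy.ndarray of shape [n_samples, n_features]
target: labels, a 1-D numpy.ndarray of length n_samples
DESCR: description of the dataset
feature_names: feature names; not every dataset has them (the news text data, for example, does not)
target_names: label names
- 2.2 Code example for the return value: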
# coding:utf-8
from sklearn.datasets import load_iris, fetch_20newsgroups
import ssl

# Workaround for download certificate errors (disables HTTPS verification)
ssl._create_default_https_context = ssl._create_unverified_context

# 1. Getting datasets
# 1.1 Load a small bundled dataset, for example iris:
iris = load_iris()
print(iris)

# 1.2 Fetch a large dataset, for example the news data
# news = fetch_20newsgroups()
# print(news)

# 2. Attributes of the returned dataset
print("Feature values:\n", iris.data)
print("Target values:\n", iris["target"])
print("Feature names:\n", iris.feature_names)
print("Target names:\n", iris.target_names)
print("Dataset description:\n", iris.DESCR)
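Printing whole arrays makes the shapes hard to see, so the following sketch checks the shapes from 2.1 directly on the iris data. The names `n_features` and `n_classes` are only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data is 2-D: one row per sample, one column per feature.
print(type(iris.data).__name__, iris.data.shape)      # ndarray (150, 4)

# target is 1-D: one integer class label per sample.
print(type(iris.target).__name__, iris.target.shape)  # ndarray (150,)

# feature_names lines up with the columns of data,
# target_names with the distinct label values.
n_features = len(iris.feature_names)
n_classes = len(iris.target_names)
print(n_features, n_classes)                          # 4 3
```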
- 2.3 Output of the return-value example:
3. Data visualization:
- 3.1 Code example:
# coding:utf-8
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris, fetch_20newsgroups
import ssl

# Font that can render Chinese characters; only needed when plot text contains them
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Workaround for download certificate errors (disables HTTPS verification)
ssl._create_default_https_context = ssl._create_unverified_context

# 1. Getting datasets
# 1.1 Load a small bundled dataset, for example iris:
iris = load_iris()
# print(iris)

# 1.2 Fetch a large dataset, for example the news data
# news = fetch_20newsgroups()
# print(news)

# 2. Attributes of the returned dataset
# print("Feature values:\n", iris.data)
# print("Target values:\n", iris["target"])
# print("Feature names:\n", iris.feature_names)
# print("Target names:\n", iris.target_names)
# print("Dataset description:\n", iris.DESCR)

# 3. Visualizing the dataset
iris_d = pd.DataFrame(data=iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_d["target"] = iris.target
print(iris_d)


def iris_plot(data, col1, col2):
    # Scatter plot of col1 against col2, colored by class, without a regression line
    sns.lmplot(x=col1, y=col2, data=data, hue="target", fit_reg=False)
    plt.title("Iris data")
    plt.show()


iris_plot(iris_d, 'Sepal_Width', 'Petal_Length')
iris_plot(iris_d, 'Sepal_Length', 'Petal_Width')
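Hand-typed column names are easy to misspell. One possible alternative, sketched below, takes the column names from `iris.feature_names` and adds a readable species column. The names `iris_df` and `species` are assumptions made for this sketch and are not used elsewhere in the post.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Take the column names from the dataset instead of typing them by hand.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

# Map the integer labels to species names for more readable legends.
iris_df["species"] = iris.target_names[iris.target]

print(iris_df.head())
print(iris_df["species"].value_counts())
```

Passing `hue="species"` instead of `hue="target"` to the plotting function would then label the legend with the species names.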
- 3.2 Iris visualization plots:
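4. Splitting a dataset
4.1 In machine learning a dataset is usually split into two parts:
(1) Training set: used to train and build the model
(2) Test set: used when validating the model, to evaluate whether it is effective
4.2 Split ratios:
(1) Training set: 70%, 80%, 75%
(2) Test set: 30%, 20%, 25%
4.3 Dataset splitting API:
sklearn.model_selection.train_test_split(*arrays, **options)
Parameters:
(1) x: feature values of the dataset
(2) y: target values of the dataset
(3) test_size: size of the test set, usually a float (the fraction of samples placed in the test set)
(4) random_state: random seed; different seeds give different random samples, the same seed gives the same sample
Returns: x_train, x_test, y_train, y_test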
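To make the effect of `test_size` concrete, here is a minimal sketch that splits the 150 iris samples with two of the ratios listed above and records how many samples land in each part. The dictionary `sizes` is just an illustrative helper.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Map each test_size to the resulting (train count, test count).
sizes = {}
for test_size in (0.2, 0.25):
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=test_size, random_state=22
    )
    sizes[test_size] = (len(x_train), len(x_test))
    print(test_size, sizes[test_size])
```

With 150 samples, `test_size=0.2` gives 120 training and 30 test samples. With `test_size=0.25` the test count is rounded up, giving 112 and 38.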
- 4.4 Code example:
# coding:utf-8
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris, fetch_20newsgroups
from sklearn.model_selection import train_test_split
import ssl

# Font that can render Chinese characters; only needed when plot text contains them
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Workaround for download certificate errors (disables HTTPS verification)
ssl._create_default_https_context = ssl._create_unverified_context

# 1. Getting datasets
# 1.1 Load a small bundled dataset, for example iris:
iris = load_iris()
# print(iris)

# 1.2 Fetch a large dataset, for example the news data
# news = fetch_20newsgroups()
# print(news)

# 2. Attributes of the returned dataset
# print("Feature values:\n", iris.data)
# print("Target values:\n", iris["target"])
# print("Feature names:\n", iris.feature_names)
# print("Target names:\n", iris.target_names)
# print("Dataset description:\n", iris.DESCR)

# 3. Visualizing the dataset
iris_d = pd.DataFrame(data=iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_d["target"] = iris.target
print(iris_d)


def iris_plot(data, col1, col2):
    # Scatter plot of col1 against col2, colored by class, without a regression line
    sns.lmplot(x=col1, y=col2, data=data, hue="target", fit_reg=False)
    plt.title("Iris data")
    plt.show()


# iris_plot(iris_d, 'Sepal_Width', 'Petal_Length')
# iris_plot(iris_d, 'Sepal_Length', 'Petal_Width')

# 4. Splitting the dataset
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
print("Training set features:\n", x_train)
print("Training set targets:\n", y_train)
print("Test set features:\n", x_test)
print("Test set targets:\n", y_test)
print("Shape of the training set targets:\n", y_train.shape)
print("Shape of the test set targets:\n", y_test.shape)

# A different random_state (2 instead of 22) gives a different split;
# the same random_state (2 and 2) reproduces the same split
x_train1, x_test1, y_train1, y_test1 = train_test_split(iris.data, iris.target, test_size=0.2, random_state=2)
x_train2, x_test2, y_train2, y_test2 = train_test_split(iris.data, iris.target, test_size=0.2, random_state=2)
print("Test set targets:\n", y_test)
print("Test set targets 1:\n", y_test1)
print("Test set targets 2:\n", y_test2)
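Comparing printed arrays by eye is error-prone. As a sketch, the same random_state comparison can be done programmatically with `np.array_equal`. The helper `split_test_targets` is a hypothetical name introduced here, not part of sklearn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()


def split_test_targets(seed):
    """Return the test-set targets for a given random_state (helper for this sketch)."""
    _, _, _, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed
    )
    return y_test


# Same seed twice: the two splits are identical.
same_seed_equal = np.array_equal(split_test_targets(2), split_test_targets(2))

# Different seeds: the splits differ for this data (seeds 22 and 2).
diff_seed_equal = np.array_equal(split_test_targets(22), split_test_targets(2))

print(same_seed_equal, diff_seed_equal)
```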
- 4.5 Output:
Training set features:
[[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[5.7 2.8 4.5 1.3]
[5. 3.4 1.6 0.4]
[5.1 3.4 1.5 0.2]
[4.9 3.6 1.4 0.1]
[6.9 3.1 5.4 2.1]
[6.7 2.5 5.8 1.8]
[7. 3.2 4.7 1.4]
[6.3 3.3 4.7 1.6]
[5.4 3.9 1.3 0.4]
[4.4 3.2 1.3 0.2]
[6.7 3. 5. 1.7]
[5.6 3. 4.1 1.3]
[5.7 2.5 5. 2. ]
[6.5 3. 5.8 2.2]
[5. 3.6 1.4 0.2]
[6.1 2.8 4. 1.3]
[6. 3.4 4.5 1.6]
[6.7 3. 5.2 2.3]
[5.7 4.4 1.5 0.4]
[5.4 3.4 1.7 0.2]
[5. 3.5 1.3 0.3]
[4.8 3. 1.4 0.1]
[5.5 4.2 1.4 0.2]
[4.6 3.6 1. 0.2]
[7.2 3.2 6. 1.8]
[5.1 2.5 3. 1.1]
[6.4 3.2 4.5 1.5]
[7.3 2.9 6.3 1.8]
[4.5 2.3 1.3 0.3]
[5. 3. 1.6 0.2]
[5.7 3.8 1.7 0.3]
[5. 3.3 1.4 0.2]
[6.2 2.2 4.5 1.5]
[5.1 3.5 1.4 0.2]
[6.4 2.9 4.3 1.3]
[4.9 2.4 3.3 1. ]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[5.9 3.2 4.8 1.8]
[5.4 3.9 1.7 0.4]
[6. 2.2 4. 1. ]
[6.4 2.8 5.6 2.1]
[4.8 3.4 1.9 0.2]
[6.4 3.1 5.5 1.8]
[5.9 3. 4.2 1.5]
[6.5 3. 5.5 1.8]
[6. 2.9 4.5 1.5]
[5.5 2.4 3.8 1.1]
[6.2 2.9 4.3 1.3]
[5.2 4.1 1.5 0.1]
[5.2 3.4 1.4 0.2]
[7.7 2.6 6.9 2.3]
[5.7 2.6 3.5 1. ]
[4.6 3.4 1.4 0.3]
[5.8 2.7 4.1 1. ]
[5.8 2.7 3.9 1.2]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]
[4.6 3.1 1.5 0.2]
[5.8 2.8 5.1 2.4]
[5.1 3.5 1.4 0.3]
[6.8 3.2 5.9 2.3]
[4.9 3.1 1.5 0.1]
[5.5 2.3 4. 1.3]
[5.1 3.7 1.5 0.4]
[5.8 2.7 5.1 1.9]
[6.7 3.1 4.4 1.4]
[6.8 3. 5.5 2.1]
[5.2 2.7 3.9 1.4]
[6.7 3.1 5.6 2.4]
[5.3 3.7 1.5 0.2]
[5. 2. 3.5 1. ]
[6.6 2.9 4.6 1.3]
[6. 2.7 5.1 1.6]
[6.3 2.3 4.4 1.3]
[7.7 3. 6.1 2.3]
[4.9 3. 1.4 0.2]
[4.6 3.2 1.4 0.2]
[6.3 2.7 4.9 1.8]
[6.6 3. 4.4 1.4]
[6.9 3.1 4.9 1.5]
[4.3 3. 1.1 0.1]
[5.6 2.7 4.2 1.3]
[4.8 3.4 1.6 0.2]
[7.6 3. 6.6 2.1]
[7.7 2.8 6.7 2. ]
[4.9 2.5 4.5 1.7]
[6.5 3.2 5.1 2. ]
[5.1 3.3 1.7 0.5]
[6.3 2.9 5.6 1.8]
[6.1 2.6 5.6 1.4]
[5. 3.4 1.5 0.2]
[6.1 3. 4.6 1.4]
[5.6 3. 4.5 1.5]
[5.1 3.8 1.5 0.3]
[5.6 2.8 4.9 2. ]
[4.4 3. 1.3 0.2]
[5.5 2.4 3.7 1. ]
[4.7 3.2 1.6 0.2]
[6.7 3.3 5.7 2.5]
[5.2 3.5 1.5 0.2]
[6.4 2.7 5.3 1.9]
[6.3 2.8 5.1 1.5]
[4.4 2.9 1.4 0.2]
[6.1 3. 4.9 1.8]
[4.9 3.1 1.5 0.2]
[5. 2.3 3.3 1. ]
[4.8 3. 1.4 0.3]
[5.8 4. 1.2 0.2]
[6.3 3.4 5.6 2.4]
[5.4 3. 4.5 1.5]
[7.1 3. 5.9 2.1]
[6.3 3.3 6. 2.5]
[5.1 3.8 1.9 0.4]
[6.4 2.8 5.6 2.2]
[7.7 3.8 6.7 2.2]]
Training set targets:
[0 0 1 1 1 0 0 0 2 2 1 1 0 0 1 1 2 2 0 1 1 2 0 0 0 0 0 0 2 1 1 2 0 0 0 0 1
0 1 1 1 1 1 0 1 2 0 2 1 2 1 1 1 0 0 2 1 0 1 1 2 2 0 2 0 2 0 1 0 2 1 2 1 2
0 1 1 1 1 2 0 0 2 1 1 0 1 0 2 2 2 2 0 2 2 0 1 1 0 2 0 1 0 2 0 2 2 0 2 0 1
0 0 2 1 2 2 0 2 2]
Test set features:
[[5.4 3.7 1.5 0.2]
[6.4 3.2 5.3 2.3]
[6.5 2.8 4.6 1.5]
[6.3 2.5 5. 1.9]
[6.1 2.9 4.7 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3.1 4.7 1.5]
[6. 3. 4.8 1.8]
[5.6 2.9 3.6 1.3]
[5. 3.2 1.2 0.2]
[6.9 3.2 5.7 2.3]
[5.7 3. 4.2 1.2]
[7.4 2.8 6.1 1.9]
[7.2 3.6 6.1 2.5]
[5. 3.5 1.6 0.6]
[7.9 3.8 6.4 2. ]
[5.6 2.5 3.9 1.1]
[5.7 2.8 4.1 1.3]
[6. 2.2 5. 1.5]
[5.7 2.9 4.2 1.3]
[5.1 3.8 1.6 0.2]
[6.9 3.1 5.1 2.3]
[5.5 3.5 1.3 0.2]
[5.8 2.6 4. 1.2]
[5.8 2.7 5.1 1.9]
[4.7 3.2 1.3 0.2]
[7.2 3. 5.8 1.6]
[6.5 3. 5.2 2. ]
[6.7 3.3 5.7 2.1]
[6.2 2.8 4.8 1.8]]
Test set targets:
[0 2 1 2 1 1 1 2 1 0 2 1 2 2 0 2 1 1 2 1 0 2 0 1 2 0 2 2 2 2]
Shape of the training set targets:
(120,)
Shape of the test set targets:
(30,)
Test set targets:
[0 2 1 2 1 1 1 2 1 0 2 1 2 2 0 2 1 1 2 1 0 2 0 1 2 0 2 2 2 2]
Test set targets 1:
[0 0 2 0 0 2 0 2 2 0 0 0 0 0 1 1 0 1 2 1 1 1 2 1 1 0 0 2 0 2]
Test set targets 2:
[0 0 2 0 0 2 0 2 2 0 0 0 0 0 1 1 0 1 2 1 1 1 2 1 1 0 0 2 0 2]