Goal: scrape Bilibili user profiles and analyze regional distribution, follower counts, and play counts. (In the end only a slice of the earliest users was scraped, and bugs remain.)
To make it easy to save the data, I used a MySQL database.
Creating the MySQL database
Create a new database:
create database bili;
Create the data table:
CREATE TABLE userinfo (
    id BIGINT NOT NULL AUTO_INCREMENT,
    uid BIGINT,
    name VARCHAR(225),
    sex CHAR(8),
    regtime DATETIME,
    coins INT,
    birthday DATE,
    fans INT,
    attention INT,
    place VARCHAR(80),
    playNum BIGINT,
    level INT,
    exp INT,
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(id)
);
Unicode setup: convert all string columns to utf8mb4, so that 4-byte characters (such as emoji in usernames) can be stored.
ALTER DATABASE bili CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE userinfo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE userinfo CHANGE name name VARCHAR(225) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE userinfo CHANGE sex sex CHAR(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE userinfo CHANGE place place VARCHAR(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
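To confirm the conversion took effect, you can inspect the column collations. A minimal check via pymysql (a sketch; host and credentials are placeholders):

import pymysql

conn = pymysql.connect(host="localhost", user="user", passwd="password",
                       db="bili", charset="utf8mb4")
cur = conn.cursor()
cur.execute("SHOW FULL COLUMNS FROM userinfo;")
for row in cur.fetchall():
    # row[0] is the column name, row[2] its collation;
    # string columns should report utf8mb4_unicode_ci
    print(row[0], row[2])
conn.close()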
Crawler code
import requests
import json
import pymysql
from multiprocessing.dummy import Pool as ThreadPool
import time
import random

user = "user"
passwd = "password"
db = "bili"
uids = range(1, 10001)  # uid 0 does not exist, so start from 1

def get_data(mid):
    header = {
        'Referer': 'http://space.bilibili.com/' + str(mid) + '/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Origin': 'http://space.bilibili.com',
        'Host': 'space.bilibili.com',
        'AlexaToolbar-ALX_NS_PH': 'AlexaToolbar/alx-4.0',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    }
    payload = {'_': int(round(time.time() * 1000)), 'mid': mid}
    # time.sleep(random.random())  # uncomment to throttle and dodge the captcha
    try:
        jscontent = requests.post('http://space.bilibili.com/ajax/member/GetInfo',
                                  headers=header, data=payload).content
        jsDict = json.loads(jscontent.decode('utf-8'))
        jsData = jsDict['data']
        mid = jsData['mid']
        name = jsData['name']
        sex = jsData['sex']
        regtime = jsData['regtime']
        coins = jsData['coins']
        birthday = jsData['birthday']
        fans = jsData['fans']
        attention = jsData['attention']
        place = jsData['place']
        playNum = jsData['playNum']
        level = jsData['level_info']['current_level']
        exp = jsData['level_info']['current_exp']
        # regtime comes back as a Unix timestamp; format it for the DATETIME column
        regtime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(regtime))
        into_mysql([mid, name, sex, regtime, coins, birthday, fans,
                    attention, place, playNum, level, exp])
    except Exception:
        pass  # skip uids that fail: deleted accounts, captcha pages, missing fields

def into_mysql(data):
    try:
        cur.execute('insert into userinfo (uid, name, sex, regtime, coins, birthday, '
                    'fans, attention, place, playNum, level, exp) '
                    'values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)', data)
        conn.commit()
    except Exception:
        conn.rollback()  # discard the failed insert instead of silently ignoring it

if __name__ == '__main__':
    # charset must be utf8mb4 to match the converted table, or inserts
    # containing 4-byte characters (emoji in usernames) will fail
    conn = pymysql.connect(host="localhost", user=user, passwd=passwd, db=db,
                           use_unicode=True, charset="utf8mb4")
    cur = conn.cursor()
    pool = ThreadPool(1)  # one worker: the shared pymysql connection is not thread-safe
    results = pool.map(get_data, uids)
    pool.close()
    pool.join()
    cur.close()
    conn.close()
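Re-running the script inserts the same users again, which is why the analysis below has to deduplicate. A sketch of a database-level fix (assuming you are free to alter the table): make uid unique, then switch the insert to INSERT IGNORE.

import pymysql

conn = pymysql.connect(host="localhost", user="user", passwd="password",
                       db="bili", charset="utf8mb4")
cur = conn.cursor()
# one-time schema change: a unique key on uid makes re-runs idempotent
cur.execute('ALTER TABLE userinfo ADD UNIQUE KEY uk_uid (uid);')
conn.commit()
cur.close()
conn.close()
# afterwards, change "insert into" to "insert ignore into" in into_mysql()
# so rows whose uid already exists are skipped instead of duplicated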
Data analysis
As of this writing (2017-02-15 13:44), Bilibili has 90,568,280 users, and the number keeps climbing. I don't know much about multithreading or distributed crawling, so scraping from one machine is slow; worse, Bilibili has anti-crawler measures, and requesting too frequently triggers a captcha. Reluctantly, then, the analysis below makes do with the data collected so far.
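One way to soften the captcha problem is to slow down and retry. A hedged sketch (the helper name fetch_with_retry is illustrative, not from the script above): wait a random interval before each request, and back off exponentially when the response is not valid JSON, which is what a captcha page looks like to the parser.

import time
import random
import requests

def fetch_with_retry(url, payload, headers, retries=3):
    for attempt in range(retries):
        time.sleep(random.uniform(0.5, 1.5))  # base politeness delay
        resp = requests.post(url, headers=headers, data=payload)
        try:
            return resp.json()            # parsed dict on success
        except ValueError:                # not JSON: probably a captcha page
            time.sleep(2 ** attempt)      # back off: 1 s, 2 s, 4 s ...
    return None  # give up on this uid after `retries` attempts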
Reading the data
import pandas as pd
import pymysql  # needed by the fallback branch below
import matplotlib.pyplot as plt
import seaborn as sns

try:
    # use the cached CSV export if it already exists
    data = pd.read_csv('D:/bili_user.csv', encoding='utf-8', index_col='id')
except FileNotFoundError:
    # otherwise pull everything from MySQL and cache it as CSV
    user = "user"
    passwd = "password"
    db = "bili"
    conn = pymysql.connect(host="localhost", user=user, passwd=passwd, db=db,
                           use_unicode=True, charset="utf8mb4")
    cur = conn.cursor()
    cur.execute('select DISTINCT * from userinfo;')
    raw_data = cur.fetchall()
    cur.close()
    conn.close()
    columns = ['id', 'userID', 'name', 'sex', 'regtime', 'coins', 'birthday',
               'fans', 'attention', 'place', 'playNum', 'level', 'exp', 'created']
    # DataFrame() has no index_col argument; build the frame, then set the index
    df = pd.DataFrame(list(raw_data), columns=columns).set_index('id')
    print(df.head())
    df.to_csv('D:/bili_user.csv', encoding='utf-8')
    data = pd.read_csv('D:/bili_user.csv', encoding='utf-8', index_col='id')

data.drop_duplicates('userID', inplace=True)
data['sex'].fillna("未填写", inplace=True)  # "未填写" = "not specified"
A little over ten thousand rows in total:
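A one-line check on the frame loaded above:

print(data.shape)  # (rows, columns)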
Gender ratio
data.groupby(['sex']).count()['userID'].plot.pie()
plt.show()
Nowadays the male/female split on Bilibili is actually close to even, and users who leave gender blank are only a small fraction. Because this crawl only reached users registered before 2010 (the lowest uids), the result differs from the current picture.
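The default pie chart has no percentage labels; a small variant on the snippet above makes the split easier to read:

counts = data.groupby(['sex']).count()['userID']
counts.plot.pie(autopct='%1.1f%%')  # print each share as a percentage
plt.ylabel('')  # drop the redundant axis label
plt.show()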
The 20 accounts with the most followers
data.sort_values(by='fans', ascending=False, inplace=True)
data[['userID','name','fans']].head(20)
data['fans'].head(20).plot.bar()
plt.show()
'''
userID name fans
id
18052 122879 敖厂长 2115125
17478 883968 暴走漫画 2045366
18231 221648 柚子木字幕组 1599162
17230 777536 LexBurner 1590006
18758 375375 伊丽莎白鼠 1529054
18947 486183 排骨教主 1276675
18368 585267 纯黑哥居然被用了 1222528
17668 1643718 山下智博 1180853
19309 423895 怕上火暴王老菊 1026379
19117 391679 A路人 867697
2151 7714 女孩为何穿短裙 590793
3841 11073 hanser 409310
4189 13046 少年Pi 378439
4373 14082 山新 195274
74 79 saber酱 162584
427 608 晚香玉 129534
2 2 碧诗 128235
9526 33696 Lov 115770
5801 19919 百合花开 114397
14011 44524 螺螺螺螺螺螺螺 108256
'''
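The bar chart above labels bars by row id; a variant (a sketch) that uses account names reads better, though matplotlib's default font may not render Chinese glyphs without extra font configuration:

top20 = data[['name', 'fans']].head(20).set_index('name')
top20['fans'].plot.barh()   # horizontal bars, labeled by account name
plt.gca().invert_yaxis()    # put the biggest account at the top
plt.show()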
Regional distribution
datap = data[['userID', 'place']].dropna()
# place is typically "province city" (e.g. "广东 深圳"); keep only the province
datap['place_s'] = datap['place'].str.split(' ').str.get(0)
datap.groupby(['place_s']).count()['userID'].plot.bar()
plt.show()
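Sorting the provinces by user count (a small variant of the snippet above) makes the ranking obvious:

counts = datap.groupby(['place_s']).count()['userID'].sort_values(ascending=False)
counts.plot.bar()
plt.show()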