A friend happened to have a crawling task, so I took it on as practice:
Requirements
Given a list of appids from my friend, scrape the corresponding app info for each one.
Finding the target page
Open the Qimai site in Chrome: https://www.qimai.cn/. There is an appid search box in the top-right corner.
Enter an appid and press Enter (for example 1279207754), and it redirects to that app's info page: https://www.qimai.cn/app/rank/appid/1279207754/country/cn.
On the left there is an "应用信息" (app info) link; clicking it goes to https://www.qimai.cn/app/baseinfo/appid/1279207754/country/cn, which is clearly the page we want to scrape.
Analyzing the page
Open the Chrome DevTools and inspect the app-name element and the app-details block.
You can see that the app's name lives in an element under the .app-body .p-title class, while the detailed info sits inside the element with the .appinfo-pkg .baseinfo-list classes. The name is easy to grab; for the details we need to walk each li in the list, pull out the .type and .info values, and assemble them into a dict.
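To make the target concrete, the dict should end up looking roughly like this (the keys are the Chinese labels the page displays; the values here are made-up placeholders, not real data for this appid):
info_dict = {
    'Bundle ID': 'com.example.app',   # placeholder value
    '开发者': 'Example Developer',     # placeholder value
    '版本': '1.0.0',
    '价格': '免费',
}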
Writing the code
First, write a headless-capable Selenium WebDriver (the headless flag is commented out here so you can watch it work):
from selenium import webdriver
from selenium.webdriver.common.by import By
from fake_useragent import UserAgent
import time
options = webdriver.ChromeOptions()
options.add_argument("user-agent=" + UserAgent().random)
# options.add_argument('--headless')  # headless mode
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1366,768')
options.add_argument('disable-infobars')
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(options=options)
Then open the target page. The hard-coded appid below will later be replaced by a loop that reads ids from a file:
url = 'https://www.qimai.cn/app/baseinfo/appid/{}/country/cn'
appid = '1279207754'
driver.get(url.format(appid))
Scrape the app name
time.sleep(5)
title = driver.find_element(By.CSS_SELECTOR, '.app-body .p-title').get_attribute('innerText')
print(title)
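A side note: the fixed time.sleep(5) is a blunt wait. An explicit wait is usually more reliable; here is a minimal sketch using Selenium's WebDriverWait with the same selector (the 10-second timeout is my own arbitrary choice; the code in this post sticks with the simple sleep):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until the title element is present, up to 10 seconds
title_ele = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.app-body .p-title'))
)
print(title_ele.get_attribute('innerText'))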
Scrape the app details
info_dict = {}
info_list = driver.find_elements(By.CSS_SELECTOR, '.appinfo-pkg .baseinfo-list li')
for ele in info_list:
    one_type = ele.find_element(By.CSS_SELECTOR, '.type').get_attribute('innerText')
    one_info = ele.find_element(By.CSS_SELECTOR, '.info').get_attribute('innerText')
    info_dict[one_type] = one_info
print(info_dict)
driver.quit()
I expected this to go smoothly, but instead the page errored out and jumped straight to a 404 page!
This was a first for me. Some searching revealed that the site detects crawlers with a JavaScript check on navigator.webdriver. The fix is simple: add the following right after creating the driver:
# Patch the navigator.webdriver flag
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
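To confirm the patch took effect, you can check the flag from the Python side after loading any page (this check is just for verification, not part of the crawler itself):
# The injected script runs on every new document, so navigate first;
# expect None (i.e. undefined) if the patch worked
driver.get('https://www.qimai.cn/')
print(driver.execute_script('return navigator.webdriver'))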
This time the browser opened the target page without issue, but the script still threw an error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".info"}
The culprit is clearly one_info = ele.find_element(By.CSS_SELECTOR, '.info').get_attribute('innerText'); some li entries apparently have no .info child, so let's wrap the lookup in an exception handler:
try:
    one_type = ele.find_element(By.CSS_SELECTOR, '.type').get_attribute('innerText')
    one_info = ele.find_element(By.CSS_SELECTOR, '.info').get_attribute('innerText')
    info_dict[one_type] = one_info
except Exception as e:
    print(e)
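As an alternative to catching the exception, find_elements (plural) returns an empty list instead of raising when nothing matches, so the same guard can be written without try/except; a sketch:
for ele in info_list:
    types = ele.find_elements(By.CSS_SELECTOR, '.type')
    infos = ele.find_elements(By.CSS_SELECTOR, '.info')
    if types and infos:  # skip li entries missing either part
        info_dict[types[0].get_attribute('innerText')] = infos[0].get_attribute('innerText')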
Either way, the info now prints out correctly.
Reading and writing files
Save the appid list from my friend into a qimai_ids.txt file and read it with open:
with open('qimai_ids.txt', 'r', newline='') as f:
appid_list = [line.strip() for line in f.readlines()]
print(appid_list)
Then write the app info into the qimai.csv spreadsheet:
with open('qimai.csv', 'a', newline='') as f:
    # Create the csv writer
    writer = csv.writer(f)
    # Write one row
    try:
        writer.writerow(app_info)
    except Exception as e:
        print(e)
        print('!!!write failed: %s' % app_info[0])
Note that app_info here is a new list combining title and info_dict; the exact assembly is in the full code at the end.
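Since the details are already a dict, csv.DictWriter would be a natural alternative; here is a minimal sketch, assuming the same field labels the page uses (the full code below sticks with the plain-list approach):
import csv

fieldnames = ['appid', 'BundleID', '名称', '开发者', '供应商', '发布日期', '更新日期', '版本', '价格', '年龄评级']
row = {'appid': appid, 'BundleID': info_dict.get('Bundle ID', '-'), '名称': title}
for key in fieldnames[3:]:
    # The remaining columns map 1:1 onto the labels scraped from the page
    row[key] = info_dict.get(key, '-')
with open('qimai.csv', 'a', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writerow(row)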
Finally, write a method that loops over the appids and crawls each one. Be sure to add a delay between requests: in my tests an interval of roughly 10s was needed, otherwise the site detects the crawler and bans your IP.
Full code
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv
import os
import random
from fake_useragent import UserAgent
options = webdriver.ChromeOptions()
options.add_argument("user-agent=" + UserAgent().random)
options.add_argument('--headless')  # headless mode
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1366,768')
options.add_argument('disable-infobars')
options.add_argument('--no-sandbox')
# Anti-detection settings
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
# Patch the navigator.webdriver flag
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
class QimaiSpider(object):
    def __init__(self):
        self.csv_file = 'qimai.csv'
        self.url = 'https://www.qimai.cn/app/baseinfo/appid/{}/country/cn'
        self.record_list = []
        self.appid_list = []
        # Initialize the csv file
        csv_exists = os.path.exists(self.csv_file) and os.path.getsize(self.csv_file) != 0
        if not csv_exists:
            with open(self.csv_file, 'a', newline='') as f:
                print('csv file not found, writing header row')
                # Create the csv writer
                writer = csv.writer(f)
                writer.writerow(['appid', 'BundleID', '名称', '开发者', '供应商', '发布日期', '更新日期', '版本', '价格', '年龄评级'])
        else:
            with open(self.csv_file, 'r', newline='') as f:
                reader = csv.DictReader(f)
                # Rows with a real BundleID were scraped successfully
                self.record_list = [row['appid'] for row in reader if row['BundleID'] and row['BundleID'] != '-']
                print('loaded %d already-scraped records' % len(self.record_list))
        # Load the appid list, skipping ids we already have
        with open('qimai_ids.txt', 'r', newline='') as f:
            self.appid_list = [line.strip() for line in f.readlines() if line.strip() not in self.record_list]
        print('%d appids left to scrape' % len(self.appid_list))

    def run(self):
        count = 0
        total = len(self.appid_list)
        for appid in self.appid_list:
            self.load_app_info(appid)
            count += 1
            rand = random.uniform(8.0, 10.0)
            print('done %d / %d, next in %ds' % (count, total, int(rand)))
            if count < total:
                time.sleep(rand)
        driver.quit()

    def load_app_info(self, appid):
        print('---start loading appid: %s' % appid)
        print(self.url.format(appid))
        driver.get(self.url.format(appid))
        time.sleep(5)
        try:
            info_dict = {}
            title = driver.find_element(By.CSS_SELECTOR, '.app-body .p-title').get_attribute('innerText')
            info_list = driver.find_elements(By.CSS_SELECTOR, '.appinfo-pkg .baseinfo-list li')
            for ele in info_list:
                try:
                    one_type = ele.find_element(By.CSS_SELECTOR, '.type').get_attribute('innerText')
                    one_info = ele.find_element(By.CSS_SELECTOR, '.info').get_attribute('innerText')
                    info_dict[one_type] = one_info
                except Exception as e:
                    print(e)
            bundle_id = info_dict.get('Bundle ID', '-')
            developer = info_dict.get('开发者', '-')
            supplier = info_dict.get('供应商', '-')
            release_date = info_dict.get('发布日期', '-')
            update_date = info_dict.get('更新日期', '-')
            version = info_dict.get('版本', '-')
            price = info_dict.get('价格', '-')
            age = info_dict.get('年龄评级', '-')
            app_info = [appid, bundle_id, title, developer, supplier, release_date, update_date, version, price, age]
            self.write_csv(app_info)
            print(app_info)
            print('===finished appid: %s' % appid)
        except Exception as e:
            print(e)
            print('!!!failed to load appid: %s' % appid)
            self.write_csv([appid, '-'])

    def write_csv(self, app_info):
        with open(self.csv_file, 'a', newline='') as f:
            # Create the csv writer
            writer = csv.writer(f)
            # Write one row
            try:
                writer.writerow(app_info)
            except Exception as e:
                print(e)
                print('!!!write failed: %s' % app_info[0])

if __name__ == '__main__':
    spider = QimaiSpider()
    spider.run()
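For completeness: qimai_ids.txt is expected to hold one appid per line, for example (the second id is a made-up placeholder):
1279207754
1234567890
Run the script with python qimai_spider.py (assuming you saved it under that name). Since __init__ skips any appid whose row in qimai.csv already has a real BundleID, you can simply re-run the script to retry only the ones that failed.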