Chapter 2: Crawler Fundamentals

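Technology Selection

Scrapy vs. requests + BeautifulSoup

requests and BeautifulSoup are just third-party libraries, whereas Scrapy is a framework.
requests and BeautifulSoup can still be used inside a Scrapy project.
Scrapy is built on Twisted, an asynchronous networking framework, so performance is its biggest advantage.
Scrapy is easy to extend and ships with many built-in features.
Scrapy's built-in CSS and XPath selectors are very convenient, while BeautifulSoup's biggest drawback is that it is slow.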

Web Page Categories

Common page types

Static pages
Dynamic pages
Web services (REST API)

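What Crawlers Can Do

Uses of crawlers

Search engines: Baidu, Google, vertical search engines
Recommendation engines: Toutiao (今日头条)
Data samples for machine learning
Data analysis (for example, financial data analysis), public opinion analysis, and so on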

Regular Expressions

Regular expressions

Special characters

[1]. ^ $ * ? + {2} {2,} {2,5} |

[2]. [] [^] [a-z] .

[3]. \s \S \w \W

[4]. [\u4E00-\u9FA5] () \d

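#!/usr/bin/env python3
# _*_ coding: utf-8 _*_
"""
 @author 金全 JQ
 @version 1.0 , 2017/9/30
 @description Regular expressions
"""
import re

line = 'jinquan123'

# Starts with "j", followed by anything
regex_str = "^j.*"
# Anything, but must end with "3"
regex_end = ".*3$"
# Starts with "j", anything in between, ends with "3"
regex_all = "^j.*3$"

# result "jinquan123"
match_result = re.match(regex_str, line)
if match_result:
    print(match_result.group())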

# Greedy vs. non-greedy matching
line_greedy = 'booobbb123'
# Greedy is the default: the leading .* consumes as much as possible,
# so the group () captures the rightmost possible match
regex_greedy_right = ".*(b.*b).*"
match_result_right = re.match(regex_greedy_right, line_greedy)
# result "bb"
if match_result_right:
    print(match_result_right.group(1))

# Non-greedy prefix .*? stops at the first "b";
# the greedy group then extends to the last "b"
regex_greedy_all = ".*?(b.*b).*"
match_result_all = re.match(regex_greedy_all, line_greedy)
# result "booobbb"
if match_result_all:
    print(match_result_all.group(1))

# Non-greedy prefix and non-greedy group: the shortest match from the left
regex_greedy_left = ".*?(b.*?b).*"
match_result_left = re.match(regex_greedy_left, line_greedy)
# result "booob"
if match_result_left:
    print(match_result_left.group(1))

# Quantifier +
line_limit = "booobbbaabb123"
regex_limit_greedy = ".*(b.*b).*"
match_result_limit = re.match(regex_limit_greedy, line_limit)
# result "bb"
if match_result_limit:
    print(match_result_limit.group(1))

# .+ requires at least one character between the two "b"s
regex_limit_between = ".*(b.+b).*"
match_result_limit_between = re.match(regex_limit_between, line_limit)
# result "baabb" (.+ is greedy, so the group runs to the last "b")
if match_result_limit_between:
    print(match_result_limit_between.group(1))

# Quantifiers {n} {n,} {n,m}: these are trickier, so try them out yourself
line_list = "booooobbbaaab123"
# Exactly one character between the two "b"s
regex_list_low = ".*(b.{1}b).*"
match_result_list_low = re.match(regex_list_low, line_list)
# result "bbb"
if match_result_list_low:
    print(match_result_list_low.group(1))
else:
    print("none")

# Two or more characters between the two "b"s
regex_list_all = ".*(b.{2,}b).*"
match_result_list_all = re.match(regex_list_all, line_list)
# result "baaab"
if match_result_list_all:
    print(match_result_list_all.group(1))

# Two or three characters between the two "b"s
regex_list_high = ".*(b.{2,3}b).*"
match_result_list_high = re.match(regex_list_high, line_list)
# result "baaab"
if match_result_list_high:
    print(match_result_list_high.group(1))

# | means "or"
line_or = "jinquan123"
# Alternatives are tried left to right, so the first one that matches wins
regex_or_one = "jinquan|jinquan123"
match_result_one = re.match(regex_or_one, line_or)
# result "jinquan"
if match_result_one:
    print(match_result_one.group())

regex_or_two = "(jiquan|jinquan)123"
match_result_or_two = re.match(regex_or_two, line_or)
# result "jinquan"
if match_result_or_two:
    print(match_result_or_two.group(1))

# The outer parentheses are group 1, the inner ones are group 2
regex_or_three = "((jiquan|jinquan)123)"
match_result_or_three = re.match(regex_or_three, line_or)
# result "jinquan123"
if match_result_or_three:
    print(match_result_or_three.group(1))

# [] matches any one of the listed characters
line_number = "18146456231"
regex_number_one = "(1[385][0-9]{9})"
match_result_number_one = re.match(regex_number_one, line_number)
# result "18146456231"
if match_result_number_one:
    print(match_result_number_one.group(1))

# [^0] matches any character except "0" (not only digits)
regex_number_one = "(1[385][^0]{9})"
match_result_number_one = re.match(regex_number_one, line_number)
# result "18146456231"
if match_result_number_one:
    print(match_result_number_one.group(1))

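# \s matches a whitespace character; \S matches any non-whitespace character
line_str_nbsp = "你 好"
regex_nbsp_one = r"(你\s好)"
match_result_nbsp = re.match(regex_nbsp_one, line_str_nbsp)
# result "你 好"
if match_result_nbsp:
    print(match_result_nbsp.group(1))

# \w is like [A-Za-z0-9_]; for str patterns in Python 3 it also matches
# Unicode word characters unless re.ASCII is set. \W matches everything else
line_str_w = "你m好"
regex_w_one = r"(你\w好)"
match_result_w = re.match(regex_w_one, line_str_w)
# result "你m好"
if match_result_w:
    print(match_result_w.group(1))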

# [\u4E00-\u9FA5] matches Chinese characters
line_str_c = "你好"
regex_c_one = "([\u4E00-\u9FA5]+)"
match_result_c = re.match(regex_c_one, line_str_c)
# result "你好"
if match_result_c:
    print(match_result_c.group(1))
else:
    print("none")
line_str_c_two = "study in 滁州学院"
regex_c_two = ".*?([\u4E00-\u9FA5]+学院)"
match_result_c_two = re.match(regex_c_two, line_str_c_two)
# result "滁州学院"
if match_result_c_two:
    print(match_result_c_two.group(1))
else:
    print("none")

# \d extracts digits
line_number_year = "xxx出生于1994年12月12日"
regex_year_month_day = r".*?((\d+)年(\d+)月(\d+)日)"
match_result_year_month_day = re.match(regex_year_month_day, line_number_year)
# result "1994年12月12日"
if match_result_year_month_day:
    print(match_result_year_month_day.group(1))
else:
    print("none")

# Date extraction across several formats
lines_year_month_day = [
    "XXX出生于1994年1月12日",
    "XXX出生于1994-1-12",
    "XXX出生于1994/1/12",
    "XXX出生于1994-01-12",
    "XXX出生于1994-01",
]
regex_year_month_day_all = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"
# results: "1994年1月12" (the trailing 日 is outside the group),
# "1994-1-12", "1994/1/12", "1994-01-12", "1994-01"
for line_year_month_day in lines_year_month_day:
    match_result_year_month_day = re.match(regex_year_month_day_all, line_year_month_day)
    if match_result_year_month_day:
        print(match_result_year_month_day.group(1))
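
The script above always uses re.match, which only matches at the start of the string; that is why most patterns begin with .* or .*?. As a minimal sketch of an alternative, re.search scans the whole string, and named groups make each date part addressable. The helper name extract_birth_date and the group names year, month, and day are illustrative assumptions, not part of the original script.

```python
import re

# Named groups make each date part addressable; the day part is optional.
DATE_PATTERN = re.compile(
    r"出生于(?P<year>\d{4})[年/-](?P<month>\d{1,2})(?:[月/-](?P<day>\d{1,2}))?"
)


def extract_birth_date(text):
    """Return (year, month, day) as ints, with day=None when it is absent."""
    match = DATE_PATTERN.search(text)  # search scans the whole string
    if match is None:
        return None
    day = match.group("day")
    return (
        int(match.group("year")),
        int(match.group("month")),
        int(day) if day is not None else None,
    )


for sample in ["XXX出生于1994年1月12日", "XXX出生于1994-01", "no date here"]:
    print(extract_birth_date(sample))
```

Compiling the pattern once with re.compile also avoids re-parsing it on every call, which matters when the same expression is applied to many pages.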

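Depth-First and Breadth-First Traversal

Website tree structure
Depth-first algorithm and implementation (stack)
Breadth-first algorithm and implementation (queue)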
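
The two traversal orders can be sketched as follows. This is only an illustration under stated assumptions: SITE_TREE is a made-up in-memory link graph standing in for a real website, and the function names depth_first and breadth_first are hypothetical. A real crawler would fetch each page and extract its links instead of reading them from a dict.

```python
from collections import deque

# Hypothetical site tree: each URL path maps to the links found on that page.
SITE_TREE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def depth_first(start, tree):
    """Visit pages depth-first using an explicit stack (LIFO)."""
    visited, order = set(), []
    stack = [start]
    while stack:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(tree.get(url, [])))
    return order


def breadth_first(start, tree):
    """Visit pages level by level using a queue (FIFO)."""
    visited, order = {start}, [start]
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for child in tree.get(url, []):
            if child not in visited:
                visited.add(child)
                order.append(child)
                queue.append(child)
    return order


print(depth_first("/", SITE_TREE))
print(breadth_first("/", SITE_TREE))
```

Both functions keep a visited set, because real sites contain cycles (pages that link back to each other) and a crawler without one would loop forever.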

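Crawler Deduplication

Store visited URLs in a database.
Store URLs in an in-memory set: lookups cost O(1), but memory use is high. 100,000,000 URLs * 50 characters * 2 bytes / 1024 / 1024 / 1024 ≈ 9 GB.
Hash each URL with MD5 (or a similar function) and store the digest in the set; at 16 bytes per MD5 digest, the same 100,000,000 URLs need roughly 1.5 GB for the digests alone.
Use a bitmap: a hash function maps each URL to one bit.
Use a Bloom filter, which improves on the bitmap by using multiple hash functions to reduce collisions.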
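
A minimal sketch of the last few options, using only the standard library. The class names SeenSet and BloomFilter, the default sizes, and the trick of salting MD5 with the hash index to get several hash functions are all assumptions made for illustration; a production crawler would more likely rely on a tuned library or the framework's own duplicate filter.

```python
import hashlib


def url_fingerprint(url):
    """Return the 16-byte MD5 digest of a URL."""
    return hashlib.md5(url.encode("utf-8")).digest()


class SeenSet:
    """Exact deduplication: store fixed-size fingerprints in a set."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record the URL and return True only the first time it is seen."""
        fp = url_fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


class BloomFilter:
    """Probabilistic deduplication: several bit positions per URL."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive one position per hash by salting the URL with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


seen = SeenSet()
print(seen.add_if_new("http://example.com/a"))  # first visit
print(seen.add_if_new("http://example.com/a"))  # duplicate

bloom = BloomFilter()
bloom.add("http://example.com/a")
print("http://example.com/a" in bloom)
```

The trade-off is that a Bloom filter can report a URL as seen when it was not (a false positive, so a page is skipped), but it never misses a URL that was actually added, and its memory use is fixed by num_bits rather than by the number of URLs.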

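Original video course by imooc (慕课网): 聚焦Python分布式爬虫必学框架Scrapy 打造搜索引擎
This post was written by XiaoJinZi. Please credit the source when reposting.
I am a student with limited experience; corrections for any shortcomings or mistakes are welcome at 986209501@qq.com.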
