Day 3 of learning web scraping: downloading Taylor Swift photos.
The code is as follows:
#!/usr/bin/env python
# coding:utf-8
__author__ = 'lucky'
from bs4 import BeautifulSoup
import requests
import urllib.request
urls = ['http://weheartit.com/inspirations/taylorswift?scrolling=true&page={}'.format(number) for number in range(1,21)]
header = { 'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'Cookie':'locale=zh-cn; __whiAnonymousID=cedf556d59434a78a518b36279b59bd4; auth=no; _session=06742c2ee80e676adfa76366d2b522ed; _ga=GA1.2.1879005139.1467165244; _weheartit_anonymous_session=%7B%22page_views%22%3A1%2C%22search_count%22%3A0%2C%22last_searches%22%3A%5B%5D%2C%22last_page_view_at%22%3A1467202282156%7D'}
img_links = []
def get_links(url, data=None):
    wb_data = requests.get(url, headers=header)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    imgs = soup.select('body > div > div > div > a > img')
    if data is None:
        for img in imgs:
            img_links.append(img.get('src'))

for url in urls:
    get_links(url)  # collect the image links from each page
print('OK')

i = 0  # image filenames: 0.jpg, 1.jpg, ...
folder_path = '/Users/lucky/Life/pic/'
for img in img_links:
    urllib.request.urlretrieve(img, folder_path + str(i) + '.jpg')
    i += 1
print('Done')
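Two pitfalls the script above does not handle: urlretrieve raises an error if folder_path does not exist, and img.get('src') can return None for some img tags, which would also crash urlretrieve. A minimal guard, assuming the same names as in the script, would be to add this before the download loop:

import os

os.makedirs(folder_path, exist_ok=True)           # create the target folder if it is missing
img_links = [link for link in img_links if link]  # drop any None src values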
The downloaded images are shown below:
A single photo:
Summary:
1. Got more practice defining and calling functions.
2. Added a header to masquerade as a browser, plus cookies, in order to scrape the target pages.
3. Downloading the images used urllib.request, specifically the urllib.request.urlretrieve() function; under the hood it ends up writing the file with a call like open(filename, 'wb') (see the first sketch after this list).
4. Gained a better understanding of locating page elements with CSS/HTML selectors and of using the Chrome developer tools (see the second sketch after this list).
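To make point 3 concrete, here is a minimal sketch of what a urlretrieve-style download boils down to, written with requests instead; the download() helper is my own illustration, not part of the original script or of urllib itself:

import requests

def download(url, filename):
    # The idea behind urlretrieve: fetch the raw bytes,
    # then write them out in binary mode.
    resp = requests.get(url)
    resp.raise_for_status()          # fail loudly on HTTP errors
    with open(filename, 'wb') as f:  # 'wb' = write binary, as noted in point 3
        f.write(resp.content)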
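And for point 4, a toy snippet (with made-up HTML, not from weheartit.com) showing how the selector used in the script walks down the element hierarchy:

from bs4 import BeautifulSoup

html = '<body><div><div><div><a href="#"><img src="http://example.com/1.jpg"></a></div></div></div></body>'
soup = BeautifulSoup(html, 'lxml')
# Each '>' step matches a direct child, mirroring the nesting in the HTML
for img in soup.select('body > div > div > div > a > img'):
    print(img.get('src'))  # prints http://example.com/1.jpg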