I've been practicing web scraping a little obsessively lately, and this time the jokes on Budejie (百思不得姐) became my target.
Inspecting the first page with F12 shows that the content of every joke can be located with the selector
'div.j-r-list > ul > li > div.j-r-list-c > div.j-r-list-c-desc'
All we have to do is grab each joke and write it into a txt file.
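To see what that selector actually pulls out, here is a tiny standalone sketch. The HTML fragment below is a made-up mock of the page's list markup, assumed for illustration only; it is not copied from the site:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as BS

# Mock fragment imitating Budejie's list structure (assumed, for demonstration)
html = '''
<div class="j-r-list">
  <ul>
    <li>
      <div class="j-r-list-c">
        <div class="j-r-list-c-desc">First joke text</div>
      </div>
    </li>
  </ul>
</div>
'''

soup = BS(html, "lxml")
for node in soup.select('div.j-r-list > ul > li > div.j-r-list-c > div.j-r-list-c-desc'):
    print(node.get_text())   # -> First joke text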
The full code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup as BS
import time

budejie_url = "http://www.budejie.com/"
first_page_url = "http://www.budejie.com/text/1"

# Set proxy (replace with your own, or drop the proxies argument below)
proxies = {
    "http": "http://yourproxy.com:8080/",
    "https": "https://yourproxy.com:8080/",
}

# Pretend to be a regular browser so the site does not reject the request
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

text_of_each_page = ""
# Crawl the first 100 pages of text jokes
for i in range(100):
    url_of_each_page = budejie_url + "text/" + str(i + 1)
    # print url_of_each_page
    r = requests.get(url_of_each_page, headers=headers, proxies=proxies)
    # print r.status_code
    if r.status_code == 200:
        soup = BS(r.text, "lxml")
        # Each joke body matches this selector (found via F12)
        text_lists = soup.select('div.j-r-list > ul > li > div.j-r-list-c > div.j-r-list-c-desc')
        for text_of_duanzi in text_lists:
            text_of_each_page += text_of_duanzi.get_text()
        # Be polite: wait a few seconds between page requests
        time.sleep(3)
    else:
        continue

# Write everything out as UTF-8 (Python 2: encode the unicode text to bytes first)
myfile = open("budejie.txt", "w")
myfile.write(text_of_each_page.encode('utf-8'))
myfile.close()
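Note that the final write assumes Python 2, where encode('utf-8') turns the collected unicode text into bytes before writing. If you run this under Python 3, a minimal adjustment (my own tweak, not part of the original script) is to let the file object handle the encoding and write the string directly:

# Python 3 variant of the last three lines
with open("budejie.txt", "w", encoding="utf-8") as myfile:
    myfile.write(text_of_each_page)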