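This was my first time learning how to find the real URL behind an asynchronously loaded web page. I spent a whole afternoon on it, and it was honestly pretty hard. But I have this quirk: the harder something is to find, the more I want to find it.
Now I have finally found the real URL I was after. Tears of joy...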
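Let's take Huangshan (Mt. Huangshan) as the example: after searching for Huangshan, the reviews look like the figure below:
What does asynchronous loading mean? When I select a review language, the URL in the address bar does not change, which tells me something fishy is going on: the new reviews are being fetched by a separate background request.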
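Once I understood what capturing network traffic means and how to do it, I set off on the long hunt for the right request. I won't go through the whole process here.
The first thing I found was that adding browser information (request headers) to the request for the starting page is enough to get the English interface back. But!!!
There is a "More" link here, and that is yet another asynchronous load! So I had to keep looking.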
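To make the "browser information" step concrete, here is a minimal sketch of how such headers could be assembled. The helper name `build_headers` and its parameters are hypothetical. The header values are copied from the parsing code at the end of this post, and the idea that the `TALanguage` cookie alone selects the review language is an assumption that has not been tested here.

```python
# Sketch only: build_headers is a hypothetical helper, not part of the original code.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')


def build_headers(referer, language='en'):
    """Return browser-like headers for a request to www.tripadvisor.cn."""
    return {
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Host': 'www.tripadvisor.cn',
        'Referer': referer,
        'User-Agent': USER_AGENT,
        # Assumption: the TALanguage cookie is what selects the review language.
        'Cookie': 'TALanguage={}'.format(language),
    }


page_url = ('http://www.tripadvisor.cn/Attraction_Review-g303685-d550738-Reviews-'
            'Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html')
headers = build_headers(referer=page_url)
print(headers['Cookie'])
```

The resulting dictionary could be passed as `headers=` to `requests.get` or `requests.post`. Whether the server accepts such a trimmed-down cookie is not verified; the full cookie captured from the browser is the safer choice.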
In the developer tools, click Clear to empty the list of captured requests.
After clicking "More" several times, this thing showed up:
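The captured request is the `ExpandedUserReviews` URL used in the parsing code at the end of this post. As a rough sketch, it can be taken apart and put back together with the standard library. The helper names `extract_review_ids` and `build_expanded_reviews_url`, and the parameter names `g_id` and `d_id`, are hypothetical. That `target` always equals the first review ID is only an assumption based on this one captured URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# URL copied from the parsing code in this post.
captured_url = ('http://www.tripadvisor.cn/ExpandedUserReviews-g303685-d550738'
                '?target=410115359&context=1'
                '&reviews=410115359,409344604,407255372,401140048,400179383,'
                '398229741,396111020,395334568,394200191,393782571'
                '&servlet=Attraction_Review&expand=1')


def extract_review_ids(url):
    """Return the comma-separated review IDs from the 'reviews' query parameter."""
    query = parse_qs(urlsplit(url).query)
    return query['reviews'][0].split(',')


def build_expanded_reviews_url(g_id, d_id, review_ids):
    """Hypothetical helper: rebuild the endpoint URL from its visible parts."""
    base = 'http://www.tripadvisor.cn/ExpandedUserReviews-g{}-d{}'.format(g_id, d_id)
    params = {
        'target': review_ids[0],  # assumption: target is the first review ID
        'context': 1,
        'reviews': ','.join(review_ids),
        'servlet': 'Attraction_Review',
        'expand': 1,
    }
    # safe=',' keeps the commas between review IDs unescaped, as in the captured URL.
    return base + '?' + urlencode(params, safe=',')


review_ids = extract_review_ids(captured_url)
print(len(review_ids))  # 10
rebuilt = build_expanded_reviews_url('303685', '550738', review_ids)
print(rebuilt == captured_url)  # True
```

This only shows how the URL is structured. It does not explain where the review IDs themselves come from.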
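Is that the end of it?
Definitely not. Where does that long string of numbers come from? I'll cover that in the next post. To be continued...
As usual, here is the standalone parsing code: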
import requests
from lxml import etree
url='http://www.tripadvisor.cn/ExpandedUserReviews-g303685-d550738?target=410115359&context=1&reviews=410115359,409344604,407255372,401140048,400179383,398229741,396111020,395334568,394200191,393782571&servlet=Attraction_Review&expand=1'
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'zh-CN,zh;q=0.8',
'Connection': 'keep-alive',
'Cookie': 'ServerPool=X; TATravelInfo=V2*A.2*MG.-1*HP.2*FL.3*RVL.550738_100*RS.1; TASSK=enc%3AAGMMZ%2Bwe98u9po0Y%2FIY8pNbyuAGi9fbnqnNLKXa4%2BK5cWP0RMuCHTRZhu0uFf1yydRIPPAQ%2FpF7EdW0NLOpBZZId19ek1a9GHWZKvnuTIJ0QcXx1ULQXtiMx%2F%2BHhNCUrIg%3D%3D; TAUnique=%1%enc%3AjrXWw0qqncCEQMzfl5keG315t9yL8iOg6jLwcPiP6q8%3D; _jzqckmp=1; bdshare_firstime=1491815789350; __gads=ID=e5060e1a6b1ed08f:T=1491815796:S=ALNI_MbFkpxx2-zq7ubsIoe4wvdJnbQWoA; TALanguage=en; TAReturnTo=%1%%2FAttraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html; TASession=%1%V2ID.DA0C735ECBB05FFBD2F31EA11943410C*SQ.15*LP.%2FAttraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui%5C.html*LS.Attraction_Review*GR.70*TCPAR.53*TBR.19*EXEX.62*ABTR.65*PHTB.78*FS.82*CPU.26*HS.popularity*ES.popularity*AS.popularity*DS.5*SAS.popularity*FPS.oldFirst*LF.en*FA.1*DF.0*MS.-1*RMS.-1*FLO.550738*TRA.false*LD.550738; CM=%1%HanaPersist%2C%2C-1%7CPremiumMobSess%2C%2C-1%7Ct4b-pc%2C%2C-1%7CHanaSession%2C%2C-1%7CRCPers%2C%2C-1%7CWShadeSeen%2C%2C-1%7CFtrPers%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C%2C-1%7CPremiumSURPers%2C%2C-1%7CPremiumMCSess%2C%2C-1%7Csesscoestorem%2C%2C-1%7CCpmPopunder_1%2C1%2C1491902222%7CCCSess%2C%2C-1%7CCpmPopunder_2%2C1%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7C%24%2C%2C-1%7CPremiumORSess%2C%2C-1%7Ct4b-sc%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CLaFourchette+Banners%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7Cperscoestorem%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CSaveFtrPers%2C%2C-1%7CTheForkRRSess%2C%2C-1%7Cpers_rev%2C%2C-1%7CMetaFtrSess%2C%2C-1%7CRBAPers%2C%2C-1%7CWAR_RESTAURANT_FOOTER_PERSISTANT%2C%2C-1%7CFtrSess%2C%2C-1%7CHomeAPers%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CRCSess%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7Cbookstickcook%2C%2C-1%7Csh%2C%2C-1%7CLastPopunderId%2C137-1859-null%2C-1%7Cpssamex%2C%2C-1%7CTheForkMCCSess%2C%2C-1%7C2016sticksess%2C%2C-1%7CCCPers%2C%2C-1%7CWAR_RESTAURANT_FOOTER_SESSION%2C%2C-1%7Cb2bmcsess%2C%2C-1%7C2016stickpers%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2C%2C-1%7CSaveFtrSess%2C%2C-1%7CPremiumORPers%2C%2C-1%7CRBASess%2C%2C-1%7Cbookstickpers%2C%2C-1%7Cperssticker%2C%2C-1%7CMetaFtrPers%2C%2C-1%7C; TAUD=LA-1491815815299-1*LG-14277644-2.1.F.*LD-14277645-.....; roybatty=TNI1625!AP9YRq1oHIHfPtXcJCINRrDe7hLPCe8L8uurjbOYo996M1NrdEF3UC8F2w%2BA%2FvgIK20Ptfm2qFK2Y7gBNq3fPyswrYVGd%2BwBp%2FhQTse54C7MDQU3%2FCl9pe%2FrrYw8WiSNYgQ6pewgJ',
'Host': 'www.tripadvisor.cn',
'Referer': 'http://www.tripadvisor.cn/Attraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
}
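# Request the expanded-review URL found in the developer tools and parse the returned HTML.
html = requests.post(url, headers=headers).content
selector = etree.HTML(html)
# Each review body sits inside a <div class="entry"> element.
infos = selector.xpath('//div[@class="entry"]')
print(len(infos))
for info in infos:
    paragraphs = info.xpath('p/text()')
    if paragraphs:  # skip entries with no text paragraph instead of raising IndexError
        print(paragraphs[0])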