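Tips and the big picture
- Look for a "print this page" link or the mobile version of the site; their markup is usually simpler to parse.
- Look for the information inside the page's JavaScript.
- The information may already be in the URL.
- Try another website that offers the same information.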
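
Two of the tips above (data hidden in JavaScript, data carried in the URL) can be sketched with the standard library alone. The page source, the `product` variable name, and the URL below are made up for illustration; a real page will need its own pattern.

```python
import json
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical page source: the data lives in a <script> tag, not in the markup.
html = (
    '<html><body><script>'
    'var product = {"name": "Vegetable Basket", "price": 15.0};'
    '</script></body></html>'
)

# Pull the JSON literal out of the script and parse it.
match = re.search(r"var product = (\{.*?\});", html)
product = json.loads(match.group(1))
print(product["name"], product["price"])

# Hypothetical URL: the item id and page number are already in the query string.
url = "https://example.com/items?id=42&page=3"
query = parse_qs(urlparse(url).query)
print(query["id"][0], query["page"][0])
```
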
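get_text()
Strips all tags and returns only the text. Use it last, because once the tags are gone you can no longer navigate or filter by them.
```python
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
```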
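
To make the effect of `get_text()` concrete, here is a minimal self-contained sketch. The HTML string is invented for the example, and it uses `find_all`, the current bs4 spelling of `findAll`, with the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just enough to show nested tags inside a match.
html = (
    '<p>The <span class="green">Prince <b>Andrew</b></span> spoke to '
    '<span class="green">Anna</span>.</p>'
)
bsObj = BeautifulSoup(html, "html.parser")

span = bsObj.find("span", {"class": "green"})
print(span)             # still a Tag: the nested <b> is intact and searchable
print(span.get_text())  # plain string with every tag removed

# Collect the text of every green span in one pass.
names = [s.get_text() for s in bsObj.find_all("span", {"class": "green"})]
print(names)
```
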
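findAll()
```python
# Signature: findAll(tag, attributes, recursive, text, limit, keywords)
bsObj.findAll(["h1", "h2", "h3", "h4", "h5", "h6"])  # tags whose name is any of these
bsObj.findAll("span", {"class": ["green", "red"]})  # <span> tags whose class is green or red (a dict cannot repeat the "class" key)
nameList = bsObj.findAll(text="the prince")  # text nodes that are exactly "the prince"; len(nameList) gives the count
allText = bsObj.findAll(id="text")  # keyword argument: match on an attribute value
allText = bsObj.findAll("", {"id": "text"})  # equivalent to the line above
bsObj.findAll(class_="green")  # use class_ because class is a reserved word in Python
bsObj.findAll(lambda tag: len(tag.attrs) == 2)  # a function as filter: tags with exactly two attributes
```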
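
The filters listed for `findAll()` can be tried against a small document. This is a sketch under stated assumptions: the HTML is invented, `find_all` is used as the current bs4 name for `findAll`, and `string=` is used as the newer name for the `text=` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical document covering each kind of filter.
html = """
<h1>Title</h1>
<h2>Subtitle</h2>
<p id="text">Intro with <span class="green">Anna</span> and
<span class="red">the prince</span>.</p>
<span class="blue">other</span>
<a href="/next" title="Next page">next</a>
"""
bsObj = BeautifulSoup(html, "html.parser")

headings = bsObj.find_all(["h1", "h2"])                         # any of several tag names
colored = bsObj.find_all("span", {"class": ["green", "red"]})   # any of several attribute values
exact = bsObj.find_all(string="the prince")                     # text nodes with exactly this content
by_id = bsObj.find_all(id="text")                               # keyword argument on an attribute
first_span = bsObj.find_all("span", limit=1)                    # stop after the first match
two_attrs = bsObj.find_all(lambda tag: len(tag.attrs) == 2)     # function filter

print([t.name for t in headings])
print([t.get_text() for t in colored])
print(len(exact), by_id[0].name, two_attrs[0].name)
```
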
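children, descendants
```python
# Both are attributes that yield iterators, not methods, so they take no parentheses.
bsObj.find("tr", {"id": "gift1"}).children  # direct (first-level) children of the matching tag
bsObj.find("tr", {"id": "gift1"}).descendants  # everything nested inside the matching tag, at any depth
```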
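
The difference between the two shows up as soon as a tag has more than one level of nesting. The table row below is made up for the example; text nodes are filtered out with `isinstance(..., Tag)` so that only tag names are listed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical row: two cells, the second one containing nested tags.
html = (
    '<table><tr id="gift1">'
    '<td>Vegetable Basket</td>'
    '<td><img src="img1.jpg"><b>$15.00</b></td>'
    '</tr></table>'
)
bsObj = BeautifulSoup(html, "html.parser")
row = bsObj.find("tr", {"id": "gift1"})

# .children stops at the first level below <tr>.
child_tags = [c.name for c in row.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in row.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```
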
next_siblings, previous_siblings
```python
bsObj.find("table", {"id": "giftList"}).tr.next_siblings  # siblings that come after the current <tr>
bsObj.find("table", {"id": "giftList"}).previous_siblings  # siblings that come before the current tag
```
parent
```python
# Go up to the parent of the <img>, step to that parent's previous sibling, and read its text.
bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text()
```
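
Sibling and parent navigation can be combined as in this small sketch, which looks up the price that sits in the cell before an image. The table is invented and written without whitespace between tags; in pretty-printed HTML the nearest sibling is often a whitespace text node, so the chain may need an extra step.

```python
from bs4 import BeautifulSoup

# Hypothetical gift table with a single row.
html = (
    '<table id="giftList"><tr>'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)
bsObj = BeautifulSoup(html, "html.parser")

img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                    # the <td> that wraps the image
price = cell.previous_sibling.get_text()
print(price)

# Siblings in the other direction: every cell after the first one.
first_cell = bsObj.find("table", {"id": "giftList"}).tr.td
later_cells = [td.get_text() for td in first_cell.next_siblings]
print(later_cells)
```
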
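regular expressions
```python
import re

# A compiled regex can be passed to findAll as an attribute filter; the raw string keeps the backslashes literal.
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
```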
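
A quick way to check that the pattern selects only the product images is to run it on a few sample tags. The `src` values here are invented, and `find_all` is the current bs4 spelling of `findAll`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical images: only two of them follow the ../img/gifts/img*.jpg naming.
html = (
    '<img src="../img/gifts/logo.jpg">'
    '<img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg">'
    '<img src="../img/banner.png">'
)
bsObj = BeautifulSoup(html, "html.parser")

pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
images = bsObj.find_all("img", {"src": pattern})
srcs = [img["src"] for img in images]
print(srcs)
```
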
Getting tag attributes
```python
myImgTag.attrs  # a dict holding all of this tag's attributes
myImgTag.attrs["src"]  # the value of the src attribute
```
Other options besides bs4
1. lxml: parses HTML and XML; very fast.
2. HTML Parser: built into Python's standard library (the `html.parser` module).
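
For comparison, here is roughly what the green-span example looks like with the built-in parser alone. This is a minimal sketch: the `GreenSpanCollector` class and the sample HTML are made up, and it does not handle spans nested inside other spans.

```python
from html.parser import HTMLParser


class GreenSpanCollector(HTMLParser):
    """Collects the text inside <span class="green"> elements."""

    def __init__(self):
        super().__init__()
        self.in_green = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "span":
            self.in_green = dict(attrs).get("class") == "green"

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_green = False

    def handle_data(self, data):
        if self.in_green:
            self.names.append(data)


html = (
    '<p><span class="green">Anna</span> met '
    '<span class="red">Boris</span> and '
    '<span class="green">Pierre</span></p>'
)
collector = GreenSpanCollector()
collector.feed(html)
print(collector.names)
```

The event-driven style needs more bookkeeping than bs4, which is the trade-off for having no third-party dependency.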