Three different methods in data scraping: six.urllib, BeautifulSoup, RE-xpath
Just writing down what I've learned about web data scraping so that I won't forget everything and have to start all over next time I need to use the technique.
To keep the same code working under Python 2.x (and 3.x), use the lib "six":
from six.moves import urllib
Typical request format would be:
url = ...
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'}  # the User-Agent string just identifies a browser; copy one from whatever browser/OS you actually use
req = urllib.request.Request(url, headers=hdr)
doc = urllib.request.urlopen(req).read()  # the raw HTML of the page
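One thing to watch: under Python 3 that read() returns bytes, so if a str-based regex or parser wants text, decode it first. A one-line sketch, assuming the page is UTF-8 (that encoding is my assumption, not something the page guarantees):
doc = doc.decode('utf-8', errors='replace')  # assumption: the page is UTF-8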
Now it comes down to the choice of parsing tool you prefer: BeautifulSoup, regular expressions, and so on. What I have tried are RE-xpath, RE pattern matching, and BeautifulSoup.
For RE-xpath:
IP_ADDRESS_PATH = '//td[2]/text()'
PORT_ADDRESS_PATH = '//tr/td[3]/text()'
You need to understand the HTML file and know how to construct the XPath to the nodes you want to extract. So the above IP_ADDRESS_PATH is actually saying: starting from the root, find the text of every second td.
IP_list = list(set(re.findall(IP_ADDRESS_PATH, doc)))
Then use the re.findall() method to find all the contents of the nodes you want. set() makes the elements unique and list() turns the result back into a list.
** This wasn't working for me, even though the XPath itself checked out in an online HTML/XPath tester. The most likely reason is that re.findall() treats the path as a regular expression, not as an XPath query, so it never actually walks the HTML tree; see the lxml sketch below.
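If I had to do it again, a minimal sketch of the fix, assuming the lxml package is installed: let a real XPath engine parse the document and evaluate the path.
from lxml import html  # assumption: lxml is installed

tree = html.fromstring(doc)  # doc is the HTML fetched earlier
IP_list = list(set(tree.xpath(IP_ADDRESS_PATH)))  # the XPath is actually evaluated here
port_list = tree.xpath(PORT_ADDRESS_PATH)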
For RE-pattern match:
prep = re.compile(r"""<tr\s.*>….\n....</tr>""", re.VERBOSE)
\s matches a whitespace character in the pattern, \n matches a newline, and .* matches anything. This captures the pattern of the specific block that gets repeated many times and is the part you're interested in.
proxy_list = prep.findall(doc)
proxy_list = list(set(proxy_list))
proxy_list now contains all the blocks of HTML that match the pattern.
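From each matched block you'd then usually run a second, smaller regex to pull out the actual fields. A rough sketch, assuming each <tr> block contains an IP followed by a port (this field regex is my own assumption about the layout, not taken from the original page):
field_re = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3}).*?(\d{2,5})', re.DOTALL)
proxies = []
for block in proxy_list:
    m = field_re.search(block)  # first IPv4-looking number, then a port-looking number
    if m:
        proxies.append(m.group(1) + ':' + m.group(2))  # e.g. '10.0.0.1:8080'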
For beautifulsoup:
You still need six.moves' urllib to open the URL.
req = urllib.request.Request(url, headers=hdr)
doc = urllib.request.urlopen(req).read()
from bs4 import BeautifulSoup as bs
soup = bs(doc, 'lxml')  # the 'lxml' parser needs the lxml package; 'html.parser' also works
So now you've opened up the html file and can start parsing with the beautiful beautifulsoup.
list1 = [tr.find_all('td') for tr in soup.find_all('tr')]
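list1 is now a list of the <td> cells of every table row. To turn that into something usable, a sketch that mirrors the td[2]/td[3] indices used above (XPath counts from 1, Python from 0), assuming the page keeps the same IP/port column layout:
proxies = [tds[1].get_text(strip=True) + ':' + tds[2].get_text(strip=True)
           for tds in list1 if len(tds) >= 3]  # skip rows that don't have enough cells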