This time I am learning to write a crawler that crawls an async-loading page and downloads all of the images from the website to my local PC. On this kind of page, the content is loaded dynamically, which means new content is loaded every time you scroll and touch the bottom of the page.
Here is the website I am crawling:
Here is the code:
From this session, I have learnt the skills below:
1)
A clear picture of the overall code structure is critical to writing good code. The main program can be divided into three parts (a minimal skeleton follows this list):
- import all the modules, either from third-party packages or from your own modules
- compose the main code, including all the functions that form the working flow, then wrap them in a main() function
- add "if __name__ == '__main__': main()" to start the whole program
2)
A proxy, or at least a user agent, should be inserted into the crawling program so the crawling process does not get blocked:
"r = requests.get(url, proxies=proxies, headers=headers)"
"headers" come from the html file, and proxies come in the form of "proxies = {"http": "127.0.0.1:8888}". For my program, I am using a public university VPN so I realized I don't need to use any proxy and user agent.
However, I ran into trouble: the crawling process was terminated after some time. I need to solve this problem in the future.
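One direction I may try for this (my own assumption about the cause, not code from this session): let requests retry failed connections through urllib3's Retry helper, and pause between pages so the site is less likely to cut the crawler off.
"""
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry a request a few times, with an increasing back-off between attempts
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

r = session.get("https://example.com/page", timeout=10)  # placeholder URL
time.sleep(1)  # short pause between page requests
"""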
3)
I learnt how to download an image to the local PC. The critical code is:
"""
for page in range(1,10):
url = "base_url{}".format(page)
if r.status_code != 200:
continue
soup = BeautifulSoap(r.text, 'html.parser')
imgs = soup.select('css selector')
for img in imgs:
src = soup.select('css selector)
download(src)
"""
"""
def download(url):
r = requests.get(url, proxies=proxies, headers=headers)
if r.status_code != 200:
return
filename = url.split('?')[0].split('/')[-2]
target = "./{}.jpg".format(filename)
with open(target, 'wb') as fs:
fs.write(r.content)
print("%s => %s" %(url,target))
"""