A month ago I planned to write an article about Python. It is the first new language I have picked up since I left IT five years ago. Because I am familiar with Java, C++, C# and so on, I know how to get started with a new language quickly, so I decided to build a web crawler instead of the usual "hello world" program. At the same time I wanted to do some sales analysis for DGBB (deep groove ball bearings), the catalogue bearing for the retail market. So I combined the two and built an analysis tool for this business using data from Taobao.
So the contents are listed below:
1) Results: data analysis and reports.
2) What's the logic of this tool?
3) Python source code.
1) Data analysis and reports.
Step 1: Data crawled from the Taobao mobile app.
At the beginning I wanted to get the data from the Taobao website, but it is protected against crawlers. So I looked for other people's experience on the internet, and some suggested that the data might be reachable through the mobile app, which also serves HTML documents. It worked in the end. The data may not be complete, but it is enough for some basic statistical analysis.
Step 2: Reports
You can see that the top three are Shanghai, Zhejiang and Jiangsu, which means the major market is in the Yangtze River Delta (YRD).
This market share report is based on sales. We can see that most sales also happen in the YRD. Why is almost nothing sold in the Pearl River Delta (PRD)? My guess is that the YRD focuses on upstream and mid-stream industry, while the PRD focuses on downstream industry.
From the brand report, we can see that the biggest local brand (HRB) already holds a 39% market share. NSK is second.
2) Logic of this tool
Step 1: Use the browser's developer tools to identify the connection with the server. The browser (IE in my case) sends header fields to the server together with the URL, so to avoid being rejected by the server, the Python program prepares the same header information in advance.
Step 2: Generate the Taobao URLs needed to crawl the DGBB data, such as location, product name, store name and sales quantity. If the program needs to dig deeper, the URLs are kept in a list (in Excel or JSON).
Step 3: While not all URLs are finished, the program takes one URL and downloads the page. Any new URLs found are appended to the URL list if they are not already included. The program then parses the page as JSON and writes the necessary fields to the Excel file, as sketched below.
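To make Step 3 concrete, here is a minimal sketch of that crawl loop. The names fetch_page and parse_items are placeholders for the download and JSON-parsing code shown in section 3; only the queue and de-duplication logic matters here.

# Minimal sketch of the crawl loop described above.
# fetch_page() and parse_items() stand in for the real download/parse code in section 3.
def crawl(start_urls, fetch_page, parse_items):
    pending = list(start_urls)      # URLs still to be downloaded
    seen = set(pending)             # URLs already queued, to avoid duplicates
    rows = []                       # fields that will later be written to Excel
    while pending:
        url = pending.pop(0)
        page = fetch_page(url)                   # download one page
        items, new_urls = parse_items(page)      # adapt the page to JSON and extract fields
        rows.extend(items)
        for u in new_urls:
            if u not in seen:                    # only queue URLs not included yet
                seen.add(u)
                pending.append(u)
    return rows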
Step 4: Create the sales report with Python.
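The market-share numbers in section 1 are simple aggregations of the crawled rows. As a rough sketch (assuming each row has the ship-from location in column 5 and the units sold in column 8, matching the itemlist built in the main program below), the share by province could be computed like this:

# Sketch: market share by ship-from location, based on units sold.
# Assumes rows follow the layout written by the crawler below:
# [page, store, title, price, location, comments, original price, sold, ...]
def share_by_location(rows):
    totals = {}
    for row in rows[1:]:                  # skip the header row
        location = row[4]
        try:
            sold = int(row[7])            # assumes 'sold' is a plain number
        except (ValueError, TypeError):
            sold = 0
        totals[location] = totals.get(location, 0) + sold
    grand_total = sum(totals.values()) or 1
    return {loc: qty / grand_total for loc, qty in
            sorted(totals.items(), key=lambda x: -x[1])}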
3) Python source code.
Step 1: Pre-conditions of the program.
1. The Excel enhancement package xlsxwriter. Install method:
pip3 install xlsxwriter
2. If you want to store the product images, an image-handling package is also needed. My installation of it failed, so I skipped the product pictures.
Step 2: Define a function to download the pages.
import http.cookiejar
import urllib.parse
import urllib.request

def getHtml(url, pro='', postdata={}):
    # Download the HTML page with cookie support.
    # The first argument is the URL, the third argument is the POST data.
    filename = 'cookie.txt'
    # Declare a MozillaCookieJar object stored in the file.
    cj = http.cookiejar.MozillaCookieJar(filename)
    handlers = [urllib.request.HTTPCookieProcessor(cj)]
    if pro:
        handlers.append(urllib.request.ProxyHandler({'http': 'http://' + pro}))
    opener = urllib.request.build_opener(*handlers)
    # Set the header information so the Taobao server treats us as a mobile browser.
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'),
                         ('Referer', 'http://s.m.taobao.com'),
                         ('Host', 'h5.m.taobao.com')]
    # Install the opener and request the URL.
    urllib.request.install_opener(opener)
    if postdata:
        postdata = urllib.parse.urlencode(postdata)
        html_bytes = urllib.request.urlopen(url, postdata.encode()).read()
    else:
        html_bytes = urllib.request.urlopen(url).read()
    cj.save(ignore_discard=True, ignore_expires=True)
    return html_bytes
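A quick way to check the function is to fetch a single mobile search page and save the raw response. The query string below is only an example built from the same parameters the main program uses later:

# Example call: download one mobile search page (GET request, no POST data).
html_bytes = getHtml('http://s.m.taobao.com/search?q=bearing&m=api4h5&n=20&page=0')
with open('test.json', 'wb') as f:
    f.write(html_bytes)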
Step 3: Define a function to write the data to an Excel file.
import xlsxwriter as wx

def writeexcel(path, dealcontent):
    workbook = wx.Workbook(path)
    worksheet = workbook.add_worksheet()
    for i in range(0, len(dealcontent)):
        for j in range(0, len(dealcontent[i])):
            # The last column of each data row holds an image path, so try to insert it as a picture.
            if i != 0 and j == len(dealcontent[i]) - 1:
                if dealcontent[i][j] == '':
                    worksheet.write(i, j, ' ')
                else:
                    try:
                        worksheet.insert_image(i, j, dealcontent[i][j])
                    except Exception:
                        worksheet.write(i, j, ' ')
            else:
                if dealcontent[i][j]:
                    worksheet.write(i, j, dealcontent[i][j].replace(' ', ''))
                else:
                    worksheet.write(i, j, '')
    workbook.close()
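For example, a small two-row table can be written like this; the first row is treated as the header, and the last column of each data row is expected to be an image path (empty here):

# Example: header row plus one data row; the last column may hold an image path.
rows = [['Title', 'Sold', 'Image'],
        ['6204 deep groove ball bearing', '120', '']]
writeexcel('test.xlsx', rows)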
Step 4: Write a main program.
import json
import time
import urllib.parse

def begin():
    # Start-up banner could be printed here.
    pass

if __name__ == '__main__':
    begin()
    today = time.strftime('%Y%m%d', time.localtime())
    a = time.time()
    keyword = input('Key words:')
    sort = input('Sort by sales 1, price ascending 2, price descending 3, credit 4, overall 5:')
    try:
        pages = int(input('Pages to crawl (default 100):'))
        if pages > 100 or pages <= 0:
            print('Page number should be between 1 and 100')
            pages = 100
    except ValueError:
        pages = 100
    try:
        man = int(input('Suspend time between pages in seconds (default 4):'))
        if man <= 0:
            man = 4
    except ValueError:
        man = 4
    if sort == '1':
        sortss = '_sale'
    elif sort == '2':
        sortss = 'bid'
    elif sort == '3':
        sortss = '_bid'
    elif sort == '4':
        sortss = '_ratesum'
    elif sort == '5':
        sortss = ''
    else:
        sortss = '_sale'
    namess = time.strftime('%Y%m%d%H%S', time.localtime())
    root = '../data/' + today + '/' + namess + keyword
    roota = '../excel/' + today
    mulu = '../image/' + today + '/' + namess + keyword
    createjia(root)
    createjia(roota)
    for page in range(0, pages):
        time.sleep(man)
        print('Suspend ' + str(man) + ' seconds')
        if sortss == '':
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': 'provide the search',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'page': page
            }
        else:
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': 'provide the searches',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'sort': sortss,
                'page': page
            }
        postdata = urllib.parse.urlencode(postdata)
        taobao = 'http://s.m.taobao.com/search?' + postdata
        print(taobao)
        try:
            content1 = getHtml(taobao)
            file = open(root + '/' + str(page) + '.json', 'wb')
            file.write(content1)
            file.close()
        except Exception as e:
            if hasattr(e, 'code'):
                print('Page does not exist or the request timed out.')
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                print("Can't connect to the server.")
                print('Reason:', e.reason)
            else:
                print(e)
    files = listfiles(root, '.json')
    total = []
    total.append(['Page', 'Store name', 'Product title', 'Discounted price', 'Ship-from location',
                  'Comments', 'Original price', 'Units sold', 'Promotion', 'Payers',
                  'Coin discount', 'Product URL', 'Image URL', 'Image'])
    for filename in files:
        try:
            doc = open(filename, 'rb')
            doccontent = doc.read().decode('utf-8', 'ignore')
            doc.close()
            product = doccontent.replace(' ', '').replace('\n', '')
            product = json.loads(product)
            onefile = product['listItem']
        except Exception:
            print("Can't read file " + filename)
            continue
        for item in onefile:
            itemlist = [filename, item['nick'], item['title'], item['price'],
                        item['location'], item['commentCount']]
            itemlist.append(item['originalPrice'])
            # itemlist.append(item['mobileDiscount'])
            itemlist.append(item['sold'])
            itemlist.append(item['zkType'])
            itemlist.append(item['act'])
            itemlist.append(item['coinLimit'])
            itemlist.append('http:' + item['url'])
            total.append(itemlist)
    if len(total) > 1:
        writeexcel(roota + '/' + namess + keyword + 'taobao.xlsx', total)
    else:
        print('Nothing was fetched from the server.')
    b = time.time()
    print('Run time: ' + timetochina(b - a))
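The main program also calls three helpers, createjia, listfiles and timetochina, whose definitions are not included in this post. A minimal sketch of what they are assumed to do (create a folder, list files by extension, and format the run time) is given below; define them together with the functions above before running the main program:

import os

def createjia(path):
    # Create the folder (and any parent folders) if it does not exist yet.
    os.makedirs(path, exist_ok=True)

def listfiles(dirname, suffix):
    # Return the full paths of all files in dirname that end with the given suffix.
    return [os.path.join(dirname, f) for f in os.listdir(dirname) if f.endswith(suffix)]

def timetochina(seconds):
    # Format a duration in seconds as hours, minutes and seconds.
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return '{} hours {} minutes {} seconds'.format(h, m, s)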
The source code is adapted from the work of "一只尼玛".