I Scraped Taobao Lingerie Shop Data to Explore Women's Underwear Preferences!

Date: 2020-01-22 06:28:33

Author: 躲猫猫的猫

Link:

/zhaww/p/9636383.html

Hello everyone. I hope you will read this article in an upright, rigorous, and professional frame of mind. ヾ(๑╹◡╹)ﾉ"

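Fate has been kind to me: I finally have a girlfriend. So that I can buy her underwear she will actually like, in this article we use Python to scrape Tmall underwear sales data and analyze it to find the cup-size distribution among Chinese women (as reflected in buyers' reviews), the most popular underwear colors, and the keywords in the reviews. Let's first look at the results of the analysis. (The walkthrough is detailed, and I recommend typing along.)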

(So happy just to be buying underwear)

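If the images are hard to read, open them in a separate window. These conclusions come from about ten thousand records, so they may be off, but I still hope the single readers among you find someone in that 0.06% group. Below I explain in detail how to scrape the Tmall underwear sales data and then store, analyze, and display it.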

Studying the Tmall site

Scraping Tmall review data

Storing and analyzing the data

Visualization

Studying the Tmall site

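Open any product's purchase page (the one where the reviews are visible), press F12 to open the developer tools, go to the Network tab, refresh the page, and search for list_ in the position shown in the figure. You will see a request named list_detail_rate.htm?itemId= ….

As shown below: [single-click] this URL and you can see that it returns JSON. Inspect it and you will find that this JSON is the product's review data, under ['rateDetail']['rateList'].

[Double-click] this URL and a new page opens, as shown:

Take a look at this information:

The path here is the URL for fetching review data. It has many parameters, and you can work out what each one does.

itemId is the product ID, sellerId is the shop ID, and currentPage is the current page. sellerId can be set to any value without affecting the data returned.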
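As a rough illustration of how these three parameters fit together, the sketch below assembles the query string with the standard library. The host is not shown in this article, so BASE_URL is a placeholder assumption, and build_rate_url is a hypothetical helper name rather than anything from the original code.

```python
from urllib.parse import urlencode

# Placeholder: the article only shows the path, so this host is an assumption.
BASE_URL = "https://example.invalid/list_detail_rate.htm"


def build_rate_url(item_id, current_page, seller_id=1):
    """Build a review-list URL from the three parameters discussed above."""
    params = {
        "itemId": item_id,            # product ID
        "sellerId": seller_id,        # shop ID; per the article any value works
        "currentPage": current_page,  # page of reviews to fetch
    }
    # urlencode also escapes any characters that are not URL-safe
    return BASE_URL + "?" + urlencode(params)


print(build_rate_url(123456, 2))
```

Building the query from a dict rather than by string concatenation makes it harder to drop a separator between two parameters.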

Scraping Tmall review data

Write a function that fetches Tmall review data, getCommentDetail.

# Fetch one page of review data for a product
def getCommentDetail(itemId, currentPage):
    # itemId is the product ID; sellerId (shop ID) must be present, but any value works
    url = ('/list_detail_rate.htm?itemId=' + str(itemId)
           + '&sellerId=2451699564&order=3&currentPage=' + str(currentPage)
           + '&append=0&callback=jsonp128')
    html = common.getUrlContent(url)  # fetch the response body
    # Strip the JSONP wrapper from the response
    html = html.replace('jsonp128(', '')  # confirm that your callback name really is jsonp128
    html = html.replace(')', '')
    html = html.replace('false', '"false"')
    html = html.replace('true', '"true"')
    # Convert the string into a dict
    tmalljson = json.loads(html)
    return tmalljson

Note that you need to check the jsonp128 value yourself; yours will probably differ from mine.
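Because the callback name varies, one option is to strip the wrapper without hard-coding it. The following is a minimal sketch of that idea, not the author's method: parse_jsonp is a hypothetical helper and the sample response is made up for illustration.

```python
import json
import re

# Matches a wrapper of the form  name( ... )  with an optional trailing semicolon
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_jsonp(text):
    """Return the JSON payload of a JSONP response as Python objects."""
    match = _JSONP_RE.match(text)
    # Fall back to the raw text when there is no wrapper at all
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample shaped like the structure described in this article
sample = 'jsonp128({"rateDetail": {"paginator": {"lastPage": 3}, "rateList": []}})'
data = parse_jsonp(sample)
print(data["rateDetail"]["paginator"]["lastPage"])
```

Only the outermost parentheses are removed, so parentheses inside review text survive, and JSON true/false are parsed as real booleans rather than being turned into strings.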

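The function above takes two variables, itemId and currentPage, which we control dynamically. So we need a batch of product IDs, plus the maximum review page number for each product, to iterate over.

Write a function that gets the maximum review page number, getLastPage.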

# Get the maximum review page number for a product
def getLastPage(itemId):
    tmalljson = getCommentDetail(itemId, 1)
    return tmalljson['rateDetail']['paginator']['lastPage']  # last page

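So how do we get the list of product IDs? Search for a product keyword on Tmall and inspect the result page in developer mode.

Look at how the elements on this page are laid out and the product ID is easy to spot. You can, of course, find a way to confirm it.

Now write a function that collects the product IDs, getProductIdList.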

# Get the product IDs
def getProductIdList():
    url = '/search_product.htm?q=内衣'  # q is the search keyword (内衣 = underwear)
    html = common.getUrlContent(url)  # fetch the page
    soup = BeautifulSoup(html, 'html.parser')
    idList = []
    # Use BeautifulSoup to extract every product ID on the result page
    productList = soup.find_all('div', {'class': 'product'})
    for product in productList:
        idList.append(product['data-id'])
    return idList
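To try the extraction logic without hitting the site, here is a small offline sketch. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; ProductIdParser and the sample markup are assumptions made for illustration, modeled only on the div.product / data-id structure described above.

```python
from html.parser import HTMLParser


class ProductIdParser(HTMLParser):
    """Collect data-id from every <div> whose class list contains 'product'."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        # Skip divs without a data-id instead of raising KeyError
        if tag == "div" and "product" in classes and attr_map.get("data-id"):
            self.ids.append(attr_map["data-id"])


# Made-up markup standing in for a search result page
sample_html = """
<div class="product" data-id="1001"><p>item one</p></div>
<div class="product item" data-id="1002"></div>
<div class="banner" data-id="9999"></div>
<div class="product"></div>
"""

parser = ProductIdParser()
parser.feed(sample_html)
print(parser.ids)
```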

Now all the basic pieces are in place, and it is time to put them together.

Write the remaining assembly code in the main block.

if __name__ == '__main__':
    productIdList = getProductIdList()  # get the product IDs
    initial = 0
    while initial < len(productIdList) - 30:  # there are 60 products in total; I only take the first 30
        try:
            itemId = productIdList[initial]
            print('----------', itemId, '------------')
            maxPage = getLastPage(itemId)  # maximum review page number for this product
            num = 1
            # At most 20 pages per product, 20 reviews per page, so at most 400 reviews per product
            while num <= maxPage and num <= 20:
                try:
                    # Fetch one page of reviews for one product
                    tmalljson = getCommentDetail(itemId, num)
                    rateList = tmalljson['rateDetail']['rateList']
                    commentList = []
                    n = 0
                    while n < len(rateList):
                        comment = []
                        # SKU description (color and size)
                        colorSize = rateList[n]['auctionSku']
                        m = re.split('[:;]', colorSize)
                        rateContent = rateList[n]['rateContent']
                        dtime = rateList[n]['rateDate']
                        comment.append(m[1])
                        comment.append(m[3])
                        comment.append('天猫')  # data source: Tmall
                        comment.append(rateContent)
                        comment.append(dtime)
                        commentList.append(comment)
                        n += 1
                    print(num)
                    sql = "insert into bra(bra_id, bra_color, bra_size, resource, comment, comment_time) value(null, %s, %s, %s, %s, %s)"
                    common.patchInsertData(sql, commentList)  # MySQL batch insert
                    num += 1
                except Exception as e:
                    num += 1
                    print(e)
                    continue
            initial += 1
        except Exception as e:
            print(e)
            initial += 1  # skip this product so that a failure cannot loop forever
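The indices m[1] and m[3] in the loop above depend on the shape of the auctionSku string. The sketch below shows what re.split produces for one plausible value and a slightly more defensive alternative. The sample string is an assumption about the format (the article does not show a raw value), and parse_sku is a hypothetical helper.

```python
import re


def parse_sku(auction_sku):
    """Turn 'key:value;key:value' into a dict; pieces without ':' are skipped."""
    result = {}
    for part in auction_sku.split(';'):
        key, sep, value = part.partition(':')
        if sep:
            result[key.strip()] = value.strip()
    return result


# Assumed sample: "color category:skin tone;size:75B"
sample = '颜色分类:肤色;尺码:75B'

parts = re.split('[:;]', sample)
print(parts[1], parts[3])  # the two fields the loop above stores
print(parse_sku(sample))
```

Looking values up by key avoids an IndexError when a SKU string has fewer fields than expected, at the cost of needing to know the key names.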

That completes all the code. Here are the full listings of common.py and tmallbra.py.

# -*- coding: utf-8 -*-
# Author: zww
import requests
import time
import random
import socket
import http.client
import pymysql
import csv


# Thin wrapper around requests, csv and pymysql
class Common(object):
    def getUrlContent(self, url, data=None):
        # Request headers
        header = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
            'cache-control': 'max-age=0',
        }
        timeout = random.choice(range(80, 180))
        while True:
            try:
                rep = requests.get(url, headers=header, timeout=timeout)  # request the URL and get the response
                # rep.encoding = 'utf-8'
                break
            except socket.timeout as e:  # the handlers below wait and then retry
                print('3:', e)
                time.sleep(random.choice(range(8, 15)))
            except socket.error as e:
                print('4:', e)
                time.sleep(random.choice(range(20, 60)))
            except http.client.BadStatusLine as e:
                print('5:', e)
                time.sleep(random.choice(range(30, 80)))
            except http.client.IncompleteRead as e:
                print('6:', e)
                time.sleep(random.choice(range(5, 15)))
        print('request success')
        return rep.text  # full response text

    def writeData(self, data, url):
        with open(url, 'a', errors='ignore', newline='') as f:
            f_csv = csv.writer(f)
            f_csv.writerows(data)
        print('write_csv success')

    def queryData(self, sql):
        db = pymysql.connect(host="localhost", user="zww", password="960128", database="test")
        cursor = db.cursor()
        results = []
        try:
            cursor.execute(sql)  # run the query
            results = cursor.fetchall()
        except Exception as e:
            print('query failed:', e)
            # Roll back on error
            db.rollback()
        # Close the database connection
        db.close()
        return results

    def insertData(self, sql):
        # Open the database connection
        db = pymysql.connect(host="localhost", user="zww", password="000000", database="zwwdb")
        # Create a cursor object with cursor()
        cursor = db.cursor()
        try:
            # sql = "INSERT INTO WEATHER(w_id, w_date, w_detail, w_temperature) VALUES (null, '%s', '%s', '%s')" % (data[0], data[1], data[2])
            cursor.execute(sql)  # insert a single row
            # Commit to the database
            db.commit()
        except Exception as e:
            print('insert failed:', e)
            # Roll back on error
            db.rollback()
        # Close the database connection
        db.close()
        print('insert data success')

    def patchInsertData(self, sql, datas):
        # Open the database connection
        db = pymysql.connect(host="localhost", user="zww", password="960128", database="test")
        # Create a cursor object with cursor()
        cursor = db.cursor()
        try:
            # Batch insert
            # cursor.executemany('insert into WEATHER(w_id, w_date, w_detail, w_temperature_low, w_temperature_high) value(null, %s, %s, %s, %s)', datas)
            cursor.executemany(sql, datas)
            # Commit to the database
            db.commit()
        except Exception as e:
            print('insert failed:', e)
            # Roll back on error
            db.rollback()
        # Close the database connection
        db.close()
        print('insert data success')

Pay attention to the database configuration above and change it to match your own setup.

# -*- coding: utf-8 -*-
# Author: zww
from common import Common
from bs4 import BeautifulSoup
import json
import re
import pymysql

common = Common()


# Get the product IDs
def getProductIdList():
    # q is the search keyword (内衣 = underwear); change it to scrape data for anything else you are curious about
    url = '/search_product.htm?q=内衣'
    html = common.getUrlContent(url)  # fetch the page
    soup = BeautifulSoup(html, 'html.parser')
    idList = []
    # Use BeautifulSoup to extract every product ID on the result page
    productList = soup.find_all('div', {'class': 'product'})
    for product in productList:
        idList.append(product['data-id'])
    return idList


# Fetch one page of review data for a product
def getCommentDetail(itemId, currentPage):
    # itemId is the product ID; sellerId (shop ID) must be present, but any value works
    url = ('/list_detail_rate.htm?itemId=' + str(itemId)
           + '&sellerId=2451699564&order=3&currentPage=' + str(currentPage)
           + '&append=0&callback=jsonp128')
    html = common.getUrlContent(url)  # fetch the response body
    # Strip the JSONP wrapper from the response
    html = html.replace('jsonp128(', '')  # confirm that your callback name really is jsonp128
    html = html.replace(')', '')
    html = html.replace('false', '"false"')
    html = html.replace('true', '"true"')
    # Convert the string into a dict
    tmalljson = json.loads(html)
    return tmalljson


# Get the maximum review page number for a product
def getLastPage(itemId):
    tmalljson = getCommentDetail(itemId, 1)
    return tmalljson['rateDetail']['paginator']['lastPage']  # last page


if __name__ == '__main__':
    productIdList = getProductIdList()  # get the product IDs
    initial = 0
    while initial < len(productIdList) - 30:  # there are 60 products in total; I only take the first 30
        try:
            itemId = productIdList[initial]
            print('----------', itemId, '------------')
            maxPage = getLastPage(itemId)  # maximum review page number for this product
            num = 1
            # At most 20 pages per product, 20 reviews per page, so at most 400 reviews per product
            while num <= maxPage and num <= 20:
                try:
                    # Fetch one page of reviews for one product
                    tmalljson = getCommentDetail(itemId, num)
                    rateList = tmalljson['rateDetail']['rateList']
                    commentList = []
                    n = 0
                    while n < len(rateList):
                        comment = []
                        # SKU description (color and size)
                        colorSize = rateList[n]['auctionSku']
                        m = re.split('[:;]', colorSize)
                        rateContent = rateList[n]['rateContent']
                        dtime = rateList[n]['rateDate']
                        comment.append(m[1])
                        comment.append(m[3])
                        comment.append('天猫')  # data source: Tmall
                        comment.append(rateContent)
                        comment.append(dtime)
                        commentList.append(comment)
                        n += 1
                    print(num)
                    sql = "insert into bra(bra_id, bra_color, bra_size, resource, comment, comment_time) value(null, %s, %s, %s, %s, %s)"
                    common.patchInsertData(sql, commentList)  # MySQL batch insert
                    num += 1
                except Exception as e:
                    num += 1
                    print(e)
                    continue
            initial += 1
        except Exception as e:
            print(e)
            initial += 1  # skip this product so that a failure cannot loop forever

Storing and analyzing the data

All the code is in place; the only thing missing is the database. I use MySQL here.

CREATE TABLE `bra` (
  `bra_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
  `bra_color` varchar(25) NULL COMMENT 'color',
  `bra_size` varchar(25) NULL COMMENT 'cup size',
  `resource` varchar(25) NULL COMMENT 'data source',
  `comment` varchar(500) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT 'review',
  `comment_time` datetime NULL COMMENT 'review time',
  PRIMARY KEY (`bra_id`)
) character set utf8;

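Two things need attention here. The comment column must use the utf8mb4 encoding, because reviews may contain emoji. The table itself must use the utf8 character set, otherwise it cannot store Chinese text.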
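The reason emoji need utf8mb4 is that MySQL's utf8 character set stores at most three bytes per character, while many emoji take four bytes in UTF-8. A quick check in Python makes the difference visible; utf8_byte_length is just an illustrative helper name.

```python
def utf8_byte_length(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# An ASCII letter, a Chinese character, and an emoji
for ch in ["A", "衣", "😀"]:
    print(repr(ch), utf8_byte_length(ch))
```

The Chinese character fits in three bytes, which is why plain utf8 is enough for the other columns, while the emoji needs four.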

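With the table created, you can run the complete program. (It may take a while; it could be made multithreaded.) Afterward, check whether the database contains data.

The data is there, but some values carry extra descriptive text, so we can tidy them up a little.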
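For the multithreading idea, one possible shape is a thread pool that fans out over (product, page) pairs. This is only a sketch under assumptions: fetch_page is a stand-in that returns fake data instead of calling getCommentDetail, and fetch_all is a hypothetical helper, so nothing here touches the network or the database.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(item_id, page):
    # Stand-in for getCommentDetail(item_id, page); returns fake data here
    return {"itemId": item_id, "page": page, "rateList": []}


def fetch_all(item_ids, max_pages, workers=4):
    """Fetch every (item, page) pair concurrently, keeping input order."""
    tasks = [(item_id, page)
             for item_id in item_ids
             for page in range(1, max_pages + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the tasks
        return list(pool.map(lambda task: fetch_page(*task), tasks))


results = fetch_all(["1001", "1002"], max_pages=3)
print(len(results), results[0])
```

In a real run it would be wise to keep the worker count small and retain the random sleeps between requests, since many parallel requests are more likely to be throttled.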

update bra set bra_color = REPLACE(bra_color, '2B6521-无钢圈4-', '');
update bra set bra_color = REPLACE(bra_color, '-1', '');
update bra set bra_color = REPLACE(bra_color, '5', '');
update bra set bra_size = substr(bra_size, 1, 3);

Adjust these statements to your own data. Once the data is reasonably clean, we can analyze what is in the database.
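If you would rather clean the size field before inserting it, a regular expression is one alternative to the substr call. The sketch below is an assumption-laden example: the raw values are invented, and normalize_size is a hypothetical helper that keeps only a band number followed by a cup letter from A to H.

```python
import re

# Two or three digits (band size) followed by a cup letter A-H
_SIZE_RE = re.compile(r'(\d{2,3})\s*([A-Ha-h])')


def normalize_size(raw):
    """Return e.g. '75B' from a noisy size string, or None if nothing matches."""
    match = _SIZE_RE.search(raw)
    if not match:
        return None
    return match.group(1) + match.group(2).upper()


# Invented examples of noisy size values
for raw in ['75B', '80C(薄款)', '70a', '均码']:
    print(raw, '->', normalize_size(raw))
```

Returning None for values that do not match makes it easy to count or inspect the rows that fall outside the expected pattern.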

-- Column aliases: 罩杯 = cup size, 比例 = share, 销量 = number sold
select 'A罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%A'
union all
select 'B罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%B'
union all
select 'C罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%C'
union all
select 'D罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%D'
union all
select 'E罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%E'
union all
select 'F罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%F'
union all
select 'G罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%G'
union all
select 'H罩杯' as 罩杯, CONCAT(ROUND(COUNT(*) / (select count(*) from bra) * 100, 2), "%") as 比例, COUNT(*) as 销量
  from bra where bra_size like '%H'
order by 销量 desc;
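The same tally can be reproduced in a few lines of Python, which may help when checking the SQL result against a small sample. This is a sketch with invented data: cup_distribution is a hypothetical helper that, like the query, counts by the last character of the size and divides by the total number of rows.

```python
from collections import Counter


def cup_distribution(sizes):
    """Return (cup, count, percent) tuples, most common cup first."""
    total = len(sizes)
    if total == 0:
        return []
    # Count by trailing cup letter, mirroring  bra_size like '%A'  and so on
    cups = Counter(size[-1] for size in sizes if size and size[-1] in "ABCDEFGH")
    return [(cup, count, round(count / total * 100, 2))
            for cup, count in cups.most_common()]


# Invented sample of cleaned size values
sample = ['75B', '80B', '75A', '85C', '70B', '75A']
for cup, count, pct in cup_distribution(sample):
    print(cup, count, str(pct) + '%')
```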

(I'd love to know which 6 young ladies bought the G cups (～￣▽￣)～)

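Data visualization

To display the data I use the pyecharts module. If you are not familiar with it, you can read up on it at /#/zh-cn/prepare

I won't go into detail here; let's go straight to the code.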

# encoding: utf-8
from common import Common
from pyecharts import Pie  # assumes the pyecharts 0.x API, where Pie is imported from the top-level package

if __name__ == '__main__':
    common = Common()
    # Count the rows for each cup size
    results = common.queryData("""select count(*) from bra where bra_size like '%A'
        union all select count(*) from bra where bra_size like '%B'
        union all select count(*) from bra where bra_size like '%C'
        union all select count(*) from bra where bra_size like '%D'
        union all select count(*) from bra where bra_size like '%E'
        union all select count(*) from bra where bra_size like '%F'
        union all select count(*) from bra where bra_size like '%G'""")
    # Labels: A cup, G cup, B cup, C cup, D cup, E cup, F cup
    attr = ["A罩杯", 'G罩杯', "B罩杯", "C罩杯", "D罩杯", "E罩杯", "F罩杯"]
    v1 = [results[0][0], results[6][0], results[1][0], results[2][0],
          results[3][0], results[4][0], results[5][0]]
    pie = Pie("内衣罩杯", width=1300, height=620)  # chart title: underwear cup sizes
    pie.add("", attr, v1, is_label_show=True)
    pie.render('size.html')
    print('success')

    # Count the rows for each color
    results = common.queryData("""select count(*) from bra where bra_color like '%肤%'
        union all select count(*) from bra where bra_color like '%灰%'
        union all select count(*) from bra where bra_color like '%黑%'
        union all select count(*) from bra where bra_color like '%蓝%'
        union all select count(*) from bra where bra_color like '%粉%'
        union all select count(*) from bra where bra_color like '%红%'
        union all select count(*) from bra where bra_color like '%紫%'
        union all select count(*) from bra where bra_color like '%绿%'
        union all select count(*) from bra where bra_color like '%白%'
        union all select count(*) from bra where bra_color like '%褐%'
        union all select count(*) from bra where bra_color like '%黄%'""")
    # Labels: skin tone, gray, black, blue, pink, red, purple, green, white, brown, yellow
    attr = ["肤色", '灰色', "黑色", "蓝色", "粉色", "红色", "紫色", '绿色', "白色", "褐色", "黄色"]
    v1 = [results[0][0], results[1][0], results[2][0], results[3][0], results[4][0], results[5][0],
          results[6][0], results[7][0], results[8][0], results[9][0], results[10][0]]
    pieColor = Pie("内衣颜色", width=1300, height=620)  # chart title: underwear colors
    pieColor.add("", attr, v1, is_label_show=True)
    pieColor.render('color.html')
    print('success')

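That's it for this chapter. What you should know, you now know; what you shouldn't know, you now know too.

Who wears a G cup and who wears an A cup: you won't have to wonder about that anymore when looking for a partner.

All the code is on GitHub: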

/zwwjava/python_capture
