Scraping and Visualizing Lagou Job Listings

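This article shows how to scrape job listings from a site with anti-scraping measures, such as lagou, and how to visualize the data once it has been collected. Data that is collected but never visualized does not deliver the value it should. To fetch and store the data we use time (from the standard library) together with the third-party libraries requests and pymysql; for the visualization we use matplotlib.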

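Background you may need to follow this article

Basic database operations
Working with JSON data
Some familiarity with web scraping
Familiarity with the request methods in requests
Working knowledge of matplotlib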

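1. As with any scraping job, start by finding the right link:
Open the site and type python into the search box; this article visualizes data for Python positions only: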

Press F12 to open the developer tools, go to the Network tab, and click the first request listed for the page.


The data you want is not in that request. Keep looking: switch to the XHR filter, click the first request there, and you will find the data stored as JSON.

How to extract data stored as JSON is demonstrated, with comments, in the code.
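The nested lookup can be tried offline before touching the network. The following is a minimal sketch on a hand-written payload: only the key names (content, positionResult, result, city and so on) come from the article's code, while the sample values and the helper name extract_positions are made up for illustration.

```python
import json

# Hypothetical, trimmed-down payload; only the key names come from the article's code.
sample = '''
{"content": {"positionResult": {"totalCount": 2, "result": [
  {"city": "北京", "education": "本科", "workYear": "1-3年"},
  {"city": "上海", "education": "大专", "workYear": "3-5年"}
]}}}
'''


def extract_positions(payload):
    """Return the list of job dicts, or an empty list if any level is missing."""
    return (payload.get("content", {})
                   .get("positionResult", {})
                   .get("result", []))


data = json.loads(sample)          # JSON text -> nested dicts and lists
positions = extract_positions(data)
cities = [p["city"] for p in positions]
print(cities)
```

Using .get() with defaults is one possible design choice: a response that lacks the expected keys yields an empty list instead of raising KeyError.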
2. Step 1 told us where the data lives; next we need to work out how to request it. Switch to the Headers tab:

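The items marked in red in the screenshot are what the request has to carry to fetch this page. The code below sends the origin, referer, user-agent and cookie headers, the needAddtionalResult query parameter, and the first/pn/kd form data.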

Code to fetch the first page:

import requests

if __name__ == "__main__":
    # Form data: pn is the page number, kd is the search keyword
    data = {'first': 'true', 'pn': '1', 'kd': 'python'}
    url = 'https://www.lagou.com/jobs/positionAjax.json'
    headers = {
        'origin': 'https://www.lagou.com',
        'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'cookie': 'xxx',  # placeholder: paste the cookie from your own browser session
    }
    params = {'needAddtionalResult': 'false'}
    response = requests.post(url=url, headers=headers, params=params, data=data)
    # response.json() parses the JSON body into nested dicts and lists
    data = response.json()
    # The job records sit under content -> positionResult -> result
    results = data["content"]["positionResult"]["result"]
    for result in results:
        city = result["city"]
        companyFullName = result["companyFullName"]
        createTime = result["createTime"]
        education = result["education"]
        firstType = result["firstType"]
        workYear = result["workYear"]
        district = result["district"]
        print(city, companyFullName, createTime, education, firstType, workYear, district)

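3. Now let's scrape multiple pages. Since we are fetching several pages, something in each request must differ to tell the pages apart.
Let's find the pattern. Clicking the page numbers 1, 2, 3 at the bottom of the listing loads many positionAjax.json?needAddtionalResult=false requests in the left-hand panel. Open each one and the difference is easy to spot: the pn field in the form data changes with the page number. So we can switch pages simply by changing pn.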
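The paging rule can be captured in a small helper. This is a minimal sketch, assuming the form fields first, pn and kd shown in the article's requests; the function name build_form_data is made up for illustration, and 'first' is sent as 'true' on every page simply because the article's code does so.

```python
def build_form_data(page, keyword="python"):
    """Form data for one result page; only 'pn' changes between pages."""
    return {"first": "true", "pn": str(page), "kd": keyword}


# Build the form data for the first three pages
forms = [build_form_data(page) for page in range(1, 4)]
page_numbers = [form["pn"] for form in forms]
print(page_numbers)
```

Each dict can then be passed as the data argument of requests.post, one request per page.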

Code to fetch multiple pages and store the data in the database:

import time
import requests
import pymysql

conn = pymysql.connect(host="xxx", port=3306, passwd="xxx", db="xxx", user="xxx")
cursor = conn.cursor()
url = 'https://www.lagou.com/jobs/positionAjax.json'
headers = {
    'origin': 'https://www.lagou.com',
    'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'cookie': 'xxx',  # placeholder: paste the cookie from your own browser session
}
params = {'needAddtionalResult': 'false'}


# Fetch pages 1..page_number and insert every record into the lagou table
def get_infor(page_number):
    try:
        for i in range(1, page_number + 1):
            # Pause before every request so the crawl stays slow enough
            time.sleep(5)
            data = {'first': 'true', 'pn': str(i), 'kd': 'python'}
            response = requests.post(url=url, headers=headers, params=params, data=data)
            data = response.json()
            results = data["content"]["positionResult"]["result"]
            for result in results:
                city = result["city"]
                companyFullName = result["companyFullName"]
                createTime = result["createTime"]
                education = result["education"]
                firstType = result["firstType"]
                workYear = result["workYear"]
                district = result["district"]
                cursor.execute(
                    "insert into lagou values (%s,%s,%s,%s,%s,%s,%s)",
                    (str(city), str(companyFullName), str(createTime), str(education),
                     str(firstType), str(workYear), str(district)))
                print(city, companyFullName, createTime, education, firstType, workYear, district)
    except KeyError:
        # A response without the expected keys ends the crawl quietly
        pass


# Work out how many pages to fetch
def get_page(data):
    time.sleep(3)
    response = requests.post(url=url, headers=headers, params=params, data=data)
    data = response.json()
    page_count = data["content"]["positionResult"]["totalCount"]
    # totalCount divided by 15, capped at 30 pages
    page_number = int(page_count / 15) if page_count / 15 < 30 else 30
    get_infor(page_number)


def main():
    start_time = time.time()
    data = {'first': 'true', 'pn': '1', 'kd': 'python'}
    get_page(data)
    conn.commit()
    end_time = time.time()
    print("Crawl finished in {} seconds!".format(end_time - start_time))


if __name__ == "__main__":
    main()
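The page-count arithmetic in get_page can be isolated and checked on its own. The sketch below is a variation rather than the article's exact logic: the divisor 15 and the cap of 30 pages come from the article's code, but rounding up with math.ceil (so a final, partly filled page is also fetched) is an assumption added here, and the name page_limit is made up for illustration.

```python
import math


def page_limit(total_count, per_page=15, max_pages=30):
    """Number of pages to request: round up, then cap at max_pages."""
    return min(math.ceil(total_count / per_page), max_pages)


print(page_limit(31))    # 31 records need a third, partly filled page
print(page_limit(1000))  # large result sets are capped
```

The article's int(page_count / 15) rounds down instead, which skips the last partial page.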

Result:

Lagou has anti-scraping measures, so use the time module to limit the request rate while crawling; otherwise your IP address will be banned.
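One way to keep the pauses in a single place is a small generator that waits between items. This is a minimal sketch, not the article's method: the article sleeps for fixed intervals of 3 and 5 seconds, whereas the random interval, the 3 to 6 second range and the name throttled are assumptions made here for illustration.

```python
import random
import time


def throttled(items, min_delay=3.0, max_delay=6.0, sleep=time.sleep):
    """Yield items one by one, pausing a random interval before each item after the first."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield item


# For the demonstration, record the pauses instead of actually sleeping
pauses = []
pages = list(throttled(range(1, 4), sleep=pauses.append))
print(pages, len(pauses))
```

In a real crawl the sleep argument is left at its default, so a loop such as for page in throttled(range(1, 31)) waits between requests.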

4. We now have the data, so it is time to analyze it and put it to use. I made a chart showing how Python positions are distributed across education requirements.

Here is the code:

# coding=utf-8
from collections import Counter

import matplotlib as mpl
import matplotlib.pyplot as plt
import pymysql

# Use a font with Chinese glyphs so the category labels render correctly
mpl.rcParams["font.sans-serif"] = ["SimHei"]
mpl.rcParams["axes.unicode_minus"] = False

conn = pymysql.connect(user="xxx", passwd="xxx", port=3306, db="xxx", host="xxx")
cursor = conn.cursor()


# Draw a pie chart of the education requirements
def display_pie():
    cursor.execute("select education from lagou")
    # fetchall() returns 1-tuples such as ("本科",), one per row
    rows = cursor.fetchall()
    counts = Counter(row[0].strip() for row in rows)
    # Keep the four categories handled here, with labels and counts in the same order
    labels = [name for name in ("本科", "不限", "大专", "硕士") if counts[name] > 0]
    number_list = [counts[name] for name in labels]
    plt.pie(number_list, autopct="%3.1f%%", shadow=True, labels=labels)
    plt.title("Python positions by education requirement")
    plt.show()


if __name__ == "__main__":
    display_pie()
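The counting step in display_pie can be checked without a database or a plot window. The following is a minimal sketch, assuming rows shaped like the 1-tuples that cursor.fetchall() returns; the function name education_distribution and the sample rows are made up for illustration.

```python
from collections import Counter


def education_distribution(rows):
    """Return (labels, sizes) with matching order, ready to pass to plt.pie."""
    counts = Counter(row[0].strip() for row in rows)
    labels = list(counts)                       # order of first appearance
    sizes = [counts[label] for label in labels]
    return labels, sizes


# Hypothetical sample rows standing in for the query result
rows = (("本科",), ("大专",), ("本科",), ("不限",))
labels, sizes = education_distribution(rows)
print(labels, sizes)
```

Deriving both lists from the same Counter keeps each label paired with its own count, whatever order the categories appear in.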