
Java Crawler: Scraping Douban Rental Listings

Date: 2020-07-10 10:49:59


I've been planning to move recently. Douban's rental groups are relatively trustworthy, but the search feature isn't very friendly, so I decided to crawl all the posts into a database and write my own SQL to filter them. Let's get to it!

Each step below comes with complete code. If you're interested, follow the whole process; if you're short on time, just scroll to the bottom, copy the final version, and try it out.

1. Getting each page's URL

First, look for a pattern in the URLs.

Link: the Longgang rental group (龙岗租房小组)

Page 1: https://www.douban.com/group/longgangzufang/discussion?start=0

Page 2: https://www.douban.com/group/longgangzufang/discussion?start=25

It's easy to see that the start parameter is the index of the first post on each page, with 25 posts per page. So we can write a loop around this parameter, adding 25 each time to fetch every page.

Note: the loop body needs a try block to catch exceptions. The connection timeout here is set to 10 seconds; if a request times out because of the network, the exception would otherwise break the loop. With the try block, at worst you lose that one page of data. If you want a complete data set, you can decrement pageStart by 25 in the catch block so the next iteration retries the same page; a sketch of that retry idea follows, then the full fetch loop.
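Here's a minimal sketch of that retry idea. fetchPage() is a hypothetical stand-in for the HttpURLConnection logic in the full program below:

package douban;

// Minimal retry sketch: on failure, rewind pageStart so the same page
// is fetched again on the next iteration. fetchPage() is a placeholder.
public class RetrySketch {

    public static void main(String[] args) {
        int pageStart = 0;
        while (true) {
            try {
                fetchPage(pageStart); // may throw, e.g. on a 10-second timeout
            } catch (Exception e) {
                e.printStackTrace();
                // rewind: the +25 below will then land on the same page again
                pageStart -= 25;
            }
            pageStart += 25;
        }
    }

    private static void fetchPage(int pageStart) throws Exception {
        // placeholder for the HTTP fetch shown in the full program below
    }
}

Note that this retries forever on a persistent failure; a real run might want a retry cap.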

package douban;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.SQLException;

public class Main2 {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    public static void main(String[] args) throws IOException, ClassNotFoundException, SQLException, InterruptedException {
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                // set the request method
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                // connect
                connection.connect();
                // read the response code
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // read the response body
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    String returnStr = "";
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr += line + "\r\n";
                    }
                    // all done, remember to close the connection
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    System.out.println(returnStr);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}

Run the program and you get each page's full HTML, stored in the variable returnStr.

2. Extracting the post-detail URLs from each page's HTML

Next we need to find each post's detail-page URL and extract it with a regular expression.

Two classes do the work here:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Regular expressions themselves you can study on your own. Below is a minimal, isolated look at the matching step, followed by the full program that extracts the matched URLs.
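As a standalone sketch, the core matching step looks like this. The sample row is hypothetical, modeled on the list-page markup the regex expects:

package douban;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Isolated demo of the link-extraction regex. The html string below is a
// made-up sample in the shape the regex assumes, not real Douban output.
public class LinkRegexDemo {

    public static void main(String[] args) {
        String html = "<a href=\"https://www.douban.com/group/topic/123456789/\""
                + " title=\"2BR near the metro\" class=\"\">2BR near the metro</a>";
        Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
        Matcher m = p.matcher(html);
        while (m.find()) {
            // group(1) is the href, i.e. the post's detail-page URL
            System.out.println("Post detail link: " + m.group(1));
        }
    }
}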

package douban;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.SQLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main2 {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    public static void main(String[] args) throws IOException, ClassNotFoundException, SQLException, InterruptedException {
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                // set the request method
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                // connect
                connection.connect();
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    String returnStr = "";
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr += line + "\r\n";
                    }
                    // all done, remember to close the connection
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    //System.out.println(returnStr);
                    Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
                    Matcher m = p.matcher(returnStr);
                    while (m.find()) {
                        System.out.println("Post detail link: " + m.group(1));
                    }
                    System.out.println("Next page");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}

Run it and the log prints each matched detail link.

3. Visiting each post's detail page to grab the title and body

This step is really the same as step one: loop over the post-detail URLs, fetch each page, then analyze its content.

package douban;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.SQLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main2 {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    public static void main(String[] args) throws IOException, ClassNotFoundException, SQLException, InterruptedException {
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                connection.connect();
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    String returnStr = "";
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr += line + "\r\n";
                    }
                    // all done, remember to close the connection
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    //System.out.println(returnStr);
                    Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
                    Matcher m = p.matcher(returnStr);
                    while (m.find()) {
                        Thread.sleep(1000);
                        try {
                            String tempUrlStr = m.group(1);
                            System.out.println("Current link: " + tempUrlStr);
                            URL tempUrl = new URL(tempUrlStr);
                            HttpURLConnection tempConnection = (HttpURLConnection) tempUrl.openConnection();
                            tempConnection.setRequestMethod("GET");
                            // 10-second timeouts
                            tempConnection.setConnectTimeout(10000);
                            tempConnection.setReadTimeout(10000);
                            tempConnection.connect();
                            int tempResponseCode = tempConnection.getResponseCode();
                            if (tempResponseCode == HttpURLConnection.HTTP_OK) {
                                InputStream tempInputStream = tempConnection.getInputStream();
                                BufferedReader tempReader = new BufferedReader(new InputStreamReader(tempInputStream));
                                String tempReturnStr = "";
                                String tempLine;
                                while ((tempLine = tempReader.readLine()) != null) {
                                    tempReturnStr += tempLine + "\r\n";
                                }
                                // the post's data sits in an embedded script block; pull the fields out
                                Pattern p2 = Pattern.compile("\"text\": \"([^\"]*)\",\r\n"
                                        + "\"name\": \"([^\"]*)\",\r\n"
                                        + "\"url\": \"([^\"]*)\",\r\n"
                                        + " \"commentCount\": \"[^\"]*\",\r\n"
                                        + " \"dateCreated\": \"([^\"]*)\",");
                                Matcher m2 = p2.matcher(tempReturnStr);
                                while (m2.find()) {
                                    System.out.println(m2.group(1));
                                    System.out.println(m2.group(2));
                                    System.out.println(m2.group(3));
                                    System.out.println(m2.group(4));
                                }
                                tempReader.close();
                                tempInputStream.close();
                                tempConnection.disconnect();
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                    System.out.println("Next page");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}

Analyzing this chunk of the detail page's HTML shows where the title and body live.

So, time to write a regex! But while reading through the HTML, I noticed a script block that contains the post's information directly.

Even better: where the markup spreads those fields far apart, the data in this block is written close together, so the regex is much easier to write. That's the regex used in the code above.
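To make that concrete, here's a self-contained sketch that runs a slightly more forgiving variant of that regex (\s* instead of the literal \r\n line breaks) against a hypothetical sample of the embedded data; the sample values are made up, modeled on the structure described above:

package douban;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demo of extracting the fields from the embedded data block.
// The sample string is hypothetical; real pages may differ in whitespace,
// which is why this variant uses \s* between fields.
public class JsonBlockRegexDemo {

    public static void main(String[] args) {
        String sample = "\"text\": \"Two-bedroom, move-in ready\",\r\n"
                + "\"name\": \"Longgang rental near the metro\",\r\n"
                + "\"url\": \"https://www.douban.com/group/topic/123456789/\",\r\n"
                + " \"commentCount\": \"3\",\r\n"
                + " \"dateCreated\": \"2020-07-01T12:00:00\",";
        Pattern p2 = Pattern.compile("\"text\": \"([^\"]*)\",\\s*"
                + "\"name\": \"([^\"]*)\",\\s*"
                + "\"url\": \"([^\"]*)\",\\s*"
                + "\"commentCount\": \"[^\"]*\",\\s*"
                + "\"dateCreated\": \"([^\"]*)\",");
        Matcher m2 = p2.matcher(sample);
        while (m2.find()) {
            System.out.println("body:   " + m2.group(1));
            System.out.println("title:  " + m2.group(2));
            System.out.println("url:    " + m2.group(3));
            System.out.println("posted: " + m2.group(4));
        }
    }
}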

Print the captured groups to check the output.

That's basically it. All that's left is to store the title, body, URL, and post time in the database.

4. Anti-crawling!!!

Once it was running, I assumed there were no anti-crawling measures, but after about 700 posts my IP was banned. I waited two days for the ban to lift and tried again.

Add Cookie, Host, Referer, and User-Agent to the request headers; you can copy these values straight from a Douban page in your browser. If you still get banned, log in first and copy the cookie from the logged-in page.

Some sites' anti-crawling checks key off exactly these request-header fields, so it's worth getting familiar with them; that knowledge helps in working around such checks. A sketch of attaching the headers follows.
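A minimal sketch of setting those headers on an HttpURLConnection. The Cookie and User-Agent values are placeholders; copy real ones from your own browser's dev tools:

package douban;

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: open a connection with the anti-anti-crawler headers attached.
// All header values here are placeholders, not working credentials.
public class HeaderSketch {

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn =
                openWithHeaders("https://www.douban.com/group/longgangzufang/discussion?start=0");
        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }

    static HttpURLConnection openWithHeaders(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        conn.setRequestProperty("Cookie", "<cookie copied from a logged-in page>");
        conn.setRequestProperty("Host", "www.douban.com");
        conn.setRequestProperty("Referer", "https://www.douban.com/group/longgangzufang/discussion?start=25");
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) ...");
        return conn;
    }
}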

You can also pause inside the loop: Thread.sleep(1000);

Some anti-crawling rules limit the number of requests within a time window, and pausing also keeps you from putting too much concurrent load on the target site. After all, we're here to collect data, not to attack anyone. It's also best to run crawlers in the middle of the night so you don't compete with regular users for resources.

5. The full code

Here's the final version, with the anti-crawling request headers added and JDBC used to persist the data.

I only store four fields: title, body, original URL, and post time. A sketch of the table setup comes first, then the full program.
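The table definition never appears in the original write-up; here's an assumed minimal schema, inferred from the INSERT statement in the full program below. Adjust column types and lengths to taste:

package douban;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// One-off setup: creates the `house` table the crawler inserts into.
// The schema is an assumption inferred from the INSERT below, not from the original article.
public class CreateHouseTable {

    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/douban?characterEncoding=UTF-8", "root", "VisionKi");
             Statement stat = conn.createStatement()) {
            stat.executeUpdate("CREATE TABLE IF NOT EXISTS house ("
                    + "id INT AUTO_INCREMENT PRIMARY KEY,"
                    + "title VARCHAR(255),"
                    + "content TEXT,"
                    + "time DATETIME,"
                    + "house_url VARCHAR(255))");
        }
    }
}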

package douban;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    public static void main(String[] args) throws IOException, ClassNotFoundException, SQLException, InterruptedException {
        Class.forName("com.mysql.jdbc.Driver");
        String sqlUrl = "jdbc:mysql://localhost:3306/douban?characterEncoding=UTF-8";
        Connection conn = DriverManager.getConnection(sqlUrl, "root", "VisionKi");
        Statement stat = conn.createStatement();
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                // anti-anti-crawler headers; log in and grab the cookie via F12 as described above
                connection.setRequestProperty("Cookie", "paste the cookie copied from a logged-in page here");
                connection.setRequestProperty("Host", "www.douban.com");
                connection.setRequestProperty("Referer", "https://www.douban.com/group/longgangzufang/discussion?start=25");
                connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
                connection.connect();
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    String returnStr = "";
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr += line + "\r\n";
                    }
                    // all done, remember to close the connection
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    //System.out.println(returnStr);
                    Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
                    Matcher m = p.matcher(returnStr);
                    while (m.find()) {
                        Thread.sleep(500);
                        try {
                            String tempUrlStr = m.group(1);
                            System.out.println("Current link: " + tempUrlStr);
                            URL tempUrl = new URL(tempUrlStr);
                            HttpURLConnection tempConnection = (HttpURLConnection) tempUrl.openConnection();
                            tempConnection.setRequestMethod("GET");
                            tempConnection.setConnectTimeout(10000);
                            tempConnection.setReadTimeout(10000);
                            tempConnection.setRequestProperty("Cookie", "paste the cookie copied from a logged-in page here");
                            tempConnection.setRequestProperty("Host", "www.douban.com");
                            tempConnection.setRequestProperty("Referer", "https://www.douban.com/group/longgangzufang/discussion?start=25");
                            tempConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
                            tempConnection.connect();
                            int tempResponseCode = tempConnection.getResponseCode();
                            if (tempResponseCode == HttpURLConnection.HTTP_OK) {
                                InputStream tempInputStream = tempConnection.getInputStream();
                                BufferedReader tempReader = new BufferedReader(new InputStreamReader(tempInputStream));
                                String tempReturnStr = "";
                                String tempLine;
                                while ((tempLine = tempReader.readLine()) != null) {
                                    tempReturnStr += tempLine + "\r\n";
                                }
                                Pattern p2 = Pattern.compile("\"text\": \"([^\"]*)\",\r\n"
                                        + "\"name\": \"([^\"]*)\",\r\n"
                                        + "\"url\": \"([^\"]*)\",\r\n"
                                        + " \"commentCount\": \"[^\"]*\",\r\n"
                                        + " \"dateCreated\": \"([^\"]*)\",");
                                Matcher m2 = p2.matcher(tempReturnStr);
                                while (m2.find()) {
                                    // strip 4-byte characters (emoji) that a utf8 MySQL column rejects,
                                    // and turn the ISO timestamp's 'T' into a space for DATETIME
                                    stat.executeUpdate("INSERT INTO house(title,content,time,house_url) VALUES ('"
                                            + m2.group(2).replaceAll("[\\x{10000}-\\x{10FFFF}]", "") + "','"
                                            + m2.group(1).replaceAll("[\\x{10000}-\\x{10FFFF}]", "") + "','"
                                            + m2.group(4).replace("T", " ") + "','"
                                            + m2.group(3) + "');");
                                }
                                tempReader.close();
                                tempInputStream.close();
                                tempConnection.disconnect();
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                    System.out.println("Next page");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}

After running for a while it had collected more than 4,000 posts, and the IP hasn't been banned again.
