I've been planning to move recently. The rental groups on Douban are relatively authentic, but the search feature isn't very friendly, so I decided to crawl all the posts into a database and filter them with my own SQL queries. Let's get started!
The complete code is posted at every step. If you're interested in the process, read along; if you're short on time, just scroll to the bottom, copy the final version, and try it out.
1. Get the URL of each page
First, look for a pattern in the URLs.
Group link (the Longgang rental group): https://www.douban.com/group/longgangzufang/
Page 1: https://www.douban.com/group/longgangzufang/discussion?start=0
Page 2: https://www.douban.com/group/longgangzufang/discussion?start=25
It's easy to see that the start parameter is the offset of the first post on the page, and each page shows 25 posts. So we can write a loop that increases start by 25 on each iteration to fetch every page.
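That pagination rule boils down to a few lines. Here's a minimal standalone sketch (the class and method names are mine, not part of the crawler itself):

```java
public class PageUrls {
    // Listing URL pattern observed above; start is the offset of the
    // first post on the page.
    static final String TEMPLATE =
            "https://www.douban.com/group/longgangzufang/discussion?start=%d";

    // Each page shows 25 posts, so page n starts at offset n * 25.
    static String pageUrl(int page) {
        return String.format(TEMPLATE, page * 25);
    }

    public static void main(String[] args) {
        for (int page = 0; page < 3; page++) {
            System.out.println(pageUrl(page));
        }
    }
}
```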
Note: the loop body needs a try block to catch exceptions. The connection timeout here is 10 seconds; if the network times out, an exception is thrown, and without the try block it would break the loop. With the try block, at worst you lose that one page. If you want the data to be complete, subtract 25 from pageStart in the catch block so the next iteration retries the same page.
```java
package douban;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class Main2 {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    public static void main(String[] args) {
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + url);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                // Request method
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                // Connect
                connection.connect();
                // Response code
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response body line by line
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    StringBuilder returnStr = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr.append(line).append("\r\n");
                    }
                    // All done -- remember to close everything
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    System.out.println(returnStr);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}
```
Run the program and you get the full HTML of each page, accumulated in the variable returnStr.
2. Extract each post's detail URL from the page HTML
Next, analyze the URL of each post's detail page and extract it with a regular expression.
These two classes do the matching:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
```
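As a minimal standalone sketch of the extraction, here's the pattern run against a made-up snippet shaped like the group's listing markup (the class name and sample HTML are mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Matches the anchor tags used in the group's topic list;
    // the href lands in group(1).
    static final Pattern TOPIC_LINK = Pattern.compile(
            "<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");

    public static void main(String[] args) {
        // Made-up snippet in the same shape as the real listing page
        String html = "<a href=\"https://www.douban.com/group/topic/123456789/\""
                + " title=\"Example post\" class=\"\">Example post</a>";
        Matcher m = TOPIC_LINK.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
```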
I'll leave learning regular expressions themselves to you. The code below extracts the matched URLs.
```java
package douban;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main2 {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    public static void main(String[] args) {
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + url);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                // Request method
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                // Connect
                connection.connect();
                // Response code
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response body
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    StringBuilder returnStr = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr.append(line).append("\r\n");
                    }
                    // All done -- remember to close everything
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    // System.out.println(returnStr);
                    // Pull out the detail-page links
                    Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
                    Matcher m = p.matcher(returnStr);
                    while (m.find()) {
                        System.out.println("Post detail link: " + m.group(1));
                    }
                    System.out.println("Next page");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}
```
Run it and the matched links show up in the log.
3. Open each post's detail page and grab the title and content
This step is really the same as step one: loop over the detail-page URLs, then analyze the content of each detail page.
```java
package douban;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main2 {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    public static void main(String[] args) {
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + url);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                connection.connect();
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    StringBuilder returnStr = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr.append(line).append("\r\n");
                    }
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    // System.out.println(returnStr);
                    Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
                    Matcher m = p.matcher(returnStr);
                    while (m.find()) {
                        // Pause between detail-page requests
                        Thread.sleep(1000);
                        try {
                            String tempUrlStr = m.group(1);
                            System.out.println("Current link: " + tempUrlStr);
                            URL tempUrl = new URL(tempUrlStr);
                            HttpURLConnection tempConnection = (HttpURLConnection) tempUrl.openConnection();
                            tempConnection.setRequestMethod("GET");
                            // 10-second timeouts
                            tempConnection.setConnectTimeout(10000);
                            tempConnection.setReadTimeout(10000);
                            tempConnection.connect();
                            int tempResponseCode = tempConnection.getResponseCode();
                            if (tempResponseCode == HttpURLConnection.HTTP_OK) {
                                InputStream tempInputStream = tempConnection.getInputStream();
                                BufferedReader tempReader = new BufferedReader(new InputStreamReader(tempInputStream));
                                StringBuilder tempReturnStr = new StringBuilder();
                                String tempLine;
                                while ((tempLine = tempReader.readLine()) != null) {
                                    tempReturnStr.append(tempLine).append("\r\n");
                                }
                                // The detail page embeds its data in a JSON block;
                                // this pattern relies on the block's exact line breaks
                                Pattern p2 = Pattern.compile("\"text\": \"([^\"]*)\",\r\n"
                                        + "\"name\": \"([^\"]*)\",\r\n"
                                        + "\"url\": \"([^\"]*)\",\r\n"
                                        + " \"commentCount\": \"[^\"]*\",\r\n"
                                        + " \"dateCreated\": \"([^\"]*)\",");
                                Matcher m2 = p2.matcher(tempReturnStr);
                                while (m2.find()) {
                                    System.out.println(m2.group(1)); // body text
                                    System.out.println(m2.group(2)); // title
                                    System.out.println(m2.group(3)); // canonical URL
                                    System.out.println(m2.group(4)); // post time
                                }
                                tempReader.close();
                                tempInputStream.close();
                                tempConnection.disconnect();
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                    System.out.println("Next page");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}
```
Looking at this part of the detail page's HTML, my first thought was: fine, write another regex against the markup. But while reading the HTML I noticed a script block that already contains all the information as JSON.
Even better, unlike the markup, where the fields sit far apart, the data here is grouped together, so the regex is much easier to write. That's the pattern used in the code above.
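Here's a stripped-down sketch of that extraction using just the first three fields, run against a made-up fragment shaped like the embedded block (the class name and sample data are mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DetailParser {
    // Same idea as the crawler's second regex: the pattern leans on the
    // exact line breaks of the embedded JSON block. Brittle, but fine
    // while the page layout stays stable.
    static final Pattern DETAIL = Pattern.compile(
            "\"text\": \"([^\"]*)\",\r\n"
            + "\"name\": \"([^\"]*)\",\r\n"
            + "\"url\": \"([^\"]*)\",");

    // Returns {title, body, url} if the block is found, else null.
    static String[] parse(String page) {
        Matcher m = DETAIL.matcher(page);
        if (m.find()) {
            return new String[] { m.group(2), m.group(1), m.group(3) };
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up fragment shaped like the real embedded block
        String page = "\"text\": \"Two bedrooms near the subway\",\r\n"
                + "\"name\": \"Renting in Longgang\",\r\n"
                + "\"url\": \"https://www.douban.com/group/topic/123456/\",";
        for (String field : parse(page)) {
            System.out.println(field);
        }
    }
}
```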
Print the four groups and check the output.
That's basically it. All that's left is to save the title, content, URL, and post time to the database.
4. Anti-scraping!!!
Once it was running I assumed there was no anti-scraping, but after 700-odd posts my IP was banned. I stopped for two days, and after the ban lifted I tried again.
Add Cookie, Host, Referer, and User-Agent parameters to the request headers; you can take their values straight from a Douban page in the browser. If you still get banned, log in first and then copy the cookie from the logged-in page.
Some sites' anti-scraping checks key on exactly these request headers, so it's worth getting familiar with what each of them does; it helps you get around such checks.
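As a sketch, setting those headers on a HttpURLConnection looks like this. The cookie value is a placeholder you'd copy from your own browser session, and note that the JDK treats Host as a restricted header and may silently drop it:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class Headers {
    // Builds a connection that mimics a normal browser request.
    static HttpURLConnection browserLike(String urlStr) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        conn.setRequestMethod("GET");
        // Placeholder -- copy your own cookie from the browser (F12)
        conn.setRequestProperty("Cookie", "PASTE_YOUR_COOKIE_HERE");
        // Restricted header: the JDK may ignore it unless explicitly allowed
        conn.setRequestProperty("Host", "www.douban.com");
        conn.setRequestProperty("Referer",
                "https://www.douban.com/group/longgangzufang/discussion?start=25");
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36"
                + " (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // openConnection() does not touch the network yet
        HttpURLConnection conn = browserLike("https://www.douban.com/");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```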
Also, add a pause inside the loop: Thread.sleep(1000);
Some anti-scraping rules limit the number of requests within a time window, and pausing also keeps us from hitting the target site with too much load at once; after all, we're here to fetch data, not to attack anyone. It's also best to run crawlers in the middle of the night so you don't compete with real users for resources.
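A fixed one-second pause works fine. As an optional variation (my suggestion, not something the original run needed), adding random jitter avoids a perfectly regular request rate, which some sites also treat as a bot signal:

```java
import java.util.Random;

public class Throttle {
    private static final Random RAND = new Random();

    // Base interval plus random jitter: for (1000, 500) the delay is
    // somewhere in [1000, 1500) milliseconds.
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        return baseMillis + (long) (RAND.nextDouble() * jitterMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = nextDelayMillis(1000, 500);
        System.out.println("Sleeping " + delay + " ms");
        Thread.sleep(delay);
    }
}
```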
5. Complete code
Here's the final version, with the request headers added to get past the anti-scraping checks, and JDBC to persist the data to MySQL.
I only store four fields: title, content, original URL, and post time.
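The insert statement assumes a house table roughly like the following; the exact column types are my guess from how the fields are used, since the DDL wasn't shown:

```sql
CREATE TABLE house (
  id        INT AUTO_INCREMENT PRIMARY KEY,
  title     VARCHAR(512),
  content   TEXT,
  time      DATETIME,
  house_url VARCHAR(512)
) DEFAULT CHARSET = utf8;
```

With plain utf8, MySQL rejects 4-byte characters such as emoji, which is why the code strips them; declaring utf8mb4 instead would make that workaround unnecessary.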
```java
package douban;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

    // MySQL's plain utf8 charset cannot store 4-byte characters such as
    // emoji, so strip them; with utf8mb4 this would be unnecessary
    static String stripFourByteChars(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    // The headers that get us past the anti-scraping checks
    static void addBrowserHeaders(HttpURLConnection connection) {
        // Copy the cookie from the browser (F12) after logging in, as described above
        connection.setRequestProperty("Cookie", "PASTE_YOUR_COOKIE_HERE");
        connection.setRequestProperty("Host", "www.douban.com");
        connection.setRequestProperty("Referer", "https://www.douban.com/group/longgangzufang/discussion?start=25");
        connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
    }

    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Connector/J 5.x driver class; in 8.x it is com.mysql.cj.jdbc.Driver
        Class.forName("com.mysql.jdbc.Driver");
        String sqlUrl = "jdbc:mysql://localhost:3306/douban?characterEncoding=UTF-8";
        Connection conn = DriverManager.getConnection(sqlUrl, "root", "VisionKi");
        // A PreparedStatement keeps quotes in titles/bodies from breaking the SQL
        PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO house(title,content,time,house_url) VALUES (?,?,?,?)");
        int pageStart = 0;
        while (true) {
            try {
                URL url = new URL(DOU_BAN_URL.replace("{pageStart}", pageStart + ""));
                System.out.println("Current page: " + url);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                connection.setRequestMethod("GET");
                // 10-second timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                addBrowserHeaders(connection);
                connection.connect();
                if (connection.getResponseCode() == HttpURLConnection.HTTP_OK) {
                    InputStream inputStream = connection.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                    StringBuilder returnStr = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        returnStr.append(line).append("\r\n");
                    }
                    reader.close();
                    inputStream.close();
                    connection.disconnect();
                    Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
                    Matcher m = p.matcher(returnStr);
                    while (m.find()) {
                        // Pause between detail-page requests
                        Thread.sleep(500);
                        try {
                            String tempUrlStr = m.group(1);
                            System.out.println("Current link: " + tempUrlStr);
                            HttpURLConnection tempConnection = (HttpURLConnection) new URL(tempUrlStr).openConnection();
                            tempConnection.setRequestMethod("GET");
                            tempConnection.setConnectTimeout(10000);
                            tempConnection.setReadTimeout(10000);
                            addBrowserHeaders(tempConnection);
                            tempConnection.connect();
                            if (tempConnection.getResponseCode() == HttpURLConnection.HTTP_OK) {
                                InputStream tempInputStream = tempConnection.getInputStream();
                                BufferedReader tempReader = new BufferedReader(new InputStreamReader(tempInputStream));
                                StringBuilder tempReturnStr = new StringBuilder();
                                String tempLine;
                                while ((tempLine = tempReader.readLine()) != null) {
                                    tempReturnStr.append(tempLine).append("\r\n");
                                }
                                // Pull the fields out of the embedded JSON block
                                Pattern p2 = Pattern.compile("\"text\": \"([^\"]*)\",\r\n"
                                        + "\"name\": \"([^\"]*)\",\r\n"
                                        + "\"url\": \"([^\"]*)\",\r\n"
                                        + " \"commentCount\": \"[^\"]*\",\r\n"
                                        + " \"dateCreated\": \"([^\"]*)\",");
                                Matcher m2 = p2.matcher(tempReturnStr);
                                while (m2.find()) {
                                    insert.setString(1, stripFourByteChars(m2.group(2))); // title
                                    insert.setString(2, stripFourByteChars(m2.group(1))); // content
                                    insert.setString(3, m2.group(4).replace("T", " "));   // post time
                                    insert.setString(4, m2.group(3));                     // original URL
                                    insert.executeUpdate();
                                }
                                tempReader.close();
                                tempInputStream.close();
                                tempConnection.disconnect();
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                    System.out.println("Next page");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            pageStart += 25;
        }
    }
}
```
After running for a while it had collected over 4,000 posts, with no further IP ban.