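Notes to myself, summarizing something new I learned. I recently needed to crawl the posts and replies in some Baidu Tieba threads, so I went and learned how to write a crawler in Java. Let's go straight to the code! If you only need to fetch the page source, the httpclient dependency alone is enough (Maven resolves its transitive dependencies such as httpcore). If you also want to parse the page, you need to add jsoup.jar as well; I'll write about parsing later when I have time.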
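1. Add the dependency

<!-- Add the following to the pom.xml of a Maven project (jar or war packaging both work) to import the jar; I use the Aliyun mirror repository -->
<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.3.1</version>
    </dependency>
</dependencies>

2. Create the class

The code is as follows: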
package com.myself.crawl;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

/*
 * @author sjia
 * @Date April 14, 2017, 12:40:50 PM
 */
public class HttpGetUtils {

    public static void main(String[] args) {
        String str = get("http://www.baidu.com");
        System.out.println(str);
    }

    /**
     * Sends a GET request.
     * @param url the address to fetch
     * @return the response body, or an empty string if the request fails
     *         or the status is not 200
     */
    public static String get(String url) {
        String result = "";
        // Create the GET method instance
        HttpGet httpGet;
        try {
            httpGet = new HttpGet(url);
        } catch (Exception e) {
            e.printStackTrace();
            return result;
        }
        // try-with-resources closes the response first, then the client,
        // even when execute() or the read throws
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(httpGet)) {
            // Only read the body when the request succeeded with 200 OK
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                // Decode the body from the input stream
                result = readResponse(entity, "utf-8");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }

    /**
     * Reads the entity's stream as text; the charset can be passed in.
     * @param resEntity the response entity (may be null)
     * @param charset the charset name used to decode the bytes
     * @return the decoded text, or an empty string if there is no entity
     */
    private static String readResponse(HttpEntity resEntity, String charset) {
        if (resEntity == null) {
            return "";
        }
        StringBuilder res = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(resEntity.getContent(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so put one back
                res.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return res.toString();
    }
}

3. Test

Run the main method. Using Baidu as the example, the program prints the response status line followed by the HTML source of the page.
And that's it!
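One detail of readResponse worth illustrating is its charset parameter: the same bytes decoded with the wrong charset come out garbled. The following is only a minimal sketch using the JDK alone, with no HttpClient involved. StreamText and readAll are hypothetical names introduced for illustration, and the in-memory byte array stands in for a real response body. It reads raw characters rather than lines, so line breaks in the source are kept as they are.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamText {

    // Hypothetical helper: decode the whole stream with the given charset,
    // keeping line breaks exactly as they appear in the input.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two Chinese characters (written as Unicode escapes) inside some HTML
        String text = "<p>\u8d34\u5427</p>\n<p>ok</p>";
        byte[] body = text.getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(text)); // prints true
        System.out.println(wrong.equals(text)); // prints false: wrong charset garbles the text
    }
}
```

If a fetched page looks garbled, the charset passed to readResponse is the first thing to check against what the site actually serves.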