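For a project, I looked into how to scrape data from web pages for analysis. jsoup can handle this on its own, but during testing I found that it sometimes hit connection timeouts when fetching pages, so I combined httpclient and jsoup: httpclient fetches the page and jsoup parses it.
The Maven dependencies for httpclient and jsoup are as follows: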
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.6</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
Inspecting the target page showed that it is requested via POST, so httpclient is used to wrap the POST request. Here is the code:
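import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

/**
 * Wraps a POST request.
 * @param url the URL to request
 * @param map the form parameters
 * @param charset the character encoding
 * @return the response body, or null if the request failed
 */
public static String doPost(String url, Map<String, String> map, String charset) {
    String result = null;
    // DefaultHttpClient is deprecated as of 4.3; try-with-resources also closes the client
    try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
        HttpPost httpPost = new HttpPost(url);
        // Set the form parameters
        List<NameValuePair> list = new ArrayList<NameValuePair>();
        for (Map.Entry<String, String> elem : map.entrySet()) {
            list.add(new BasicNameValuePair(elem.getKey(), elem.getValue()));
        }
        if (!list.isEmpty()) {
            httpPost.setEntity(new UrlEncodedFormEntity(list, charset));
        }
        // Closing the response releases the underlying connection
        try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
            HttpEntity resEntity = response.getEntity();
            if (resEntity != null) {
                result = EntityUtils.toString(resEntity, charset);
            }
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return result;
}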
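To make it clearer what UrlEncodedFormEntity puts into the request body, here is a minimal sketch that builds roughly the same `name=value&name=value` string by hand, using only the JDK. The class FormBodySketch, the method encodeForm, and the parameter names and values are all made up for illustration; they are not part of httpclient, and the library's exact escaping rules may differ in minor details.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of httpclient: builds a form-encoded body by hand.
class FormBodySketch {

    static String encodeForm(Map<String, String> params, String charset) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&'); // pairs are separated by '&'
                }
                // Names and values are percent-encoded; a space becomes '+'
                body.append(URLEncoder.encode(e.getKey(), charset))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), charset));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, ex);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Made-up parameters; LinkedHashMap keeps insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("page", "1");
        params.put("keyword", "a b&c");
        System.out.println(encodeForm(params, "UTF-8")); // prints page=1&keyword=a+b%26c
    }
}
```

In practice there is no reason to do this by hand; the sketch only shows why the charset argument matters, since it decides how non-ASCII parameter values are percent-encoded.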
The returned content is then parsed with jsoup, using the Jsoup.parse method, wrapped as follows:
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static List<String> getElement(String content) {
    // Parse HTML that has already been fetched.
    // Alternatively, Jsoup.connect(url).get() fetches and parses a URL directly (it throws IOException).
    Document document = Jsoup.parse(content);
    List<String> list = new ArrayList<>();
    Elements tableElements = document.getElementsByClass("viewTable");
    if (tableElements.isEmpty()) {
        return list; // no element with class "viewTable" in the page
    }
    Elements trElements = tableElements.get(0).getElementsByTag("tr");
    // Start at index 1 to skip the first row (presumably the table header)
    for (int i = 1; i < trElements.size(); i++) {
        // Take the row's text and turn the spaces into commas
        list.add(trElements.get(i).text().replace(" ", ","));
    }
    return list;
}
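One caveat of replacing spaces with commas: a cell whose own text contains a space ends up split into two fields. If the cells are collected one by one instead (for example by calling getElementsByTag("td") on each row and taking text() of each cell), they can be joined safely. The following is only a sketch under that assumption; the class CsvRowSketch, the method toCsvLine, and the sample row are hypothetical and use only the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: joins the cells of one table row into a CSV line.
class CsvRowSketch {

    static String toCsvLine(List<String> cells) {
        List<String> out = new ArrayList<>();
        for (String cell : cells) {
            boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                    || cell.contains("\n") || cell.contains("\r");
            // Inside a quoted field, a double quote is written twice
            out.add(needsQuotes ? "\"" + cell.replace("\"", "\"\"") + "\"" : cell);
        }
        return String.join(",", out);
    }

    public static void main(String[] args) {
        // Made-up row: the second cell contains a space, the third a comma
        List<String> cells = Arrays.asList("1", "New York", "3,200");
        System.out.println(toCsvLine(cells)); // prints 1,New York,"3,200"
    }
}
```

Cells with spaces stay intact this way, and cells containing commas or quotes are quoted so that a CSV reader can recover them.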
Testing gave the following results:
The results can then be processed, stored in a database, analysed, queried, and displayed as needed.