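My manager recently asked me to maintain an old crawler. One look at the code and I lost all desire to touch it, so I decided to build a new crawler from scratch. Java does not have many crawler frameworks, at least compared with Python, and in the end I picked WebMagic as the development framework.
I'm writing this post because I ran into quite a few headaches during actual development and want to keep a record of them.
WebMagic Chinese documentation: Introduction · WebMagic Documents
The best way to learn something from scratch is to read the official documentation, or to read an article in which someone more experienced walks through it.
I'm using Spring Boot here and pulling in the dependencies through Maven: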
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
The Java code copied from the official site is as follows:
package com.example.testspringboot.processor;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
Run main: go!!!
09:22:53.972 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)
    at sun.security.ssl.HandshakeContext.<init>(HandshakeContext.java:171)
    at sun.security.ssl.ClientHandshakeContext.<init>(ClientHandshakeContext.java:101)
    at sun.security.ssl.TransportContext.kickstart(TransportContext.java:238)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:394)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:373)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
    at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
    at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
    at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
    at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
09:22:54.081 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.
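An error at the very first step did not make me happy. After some searching, I found that the problem comes from the TLS security settings in newer JDK builds, which disable the older protocol versions.
My current JDK version is 1.8.0_291: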
C:\Users\MI>java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
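To see what the JDK itself allows, independent of WebMagic, a small check like the one below can help. It is only a sketch that uses the standard JSSE API; the class name TlsProtocolReport is made up for this post. On a JDK where TLSv1 and TLSv1.1 are disabled, they should show up in the supported list but not in the enabled one.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

// Hypothetical helper class, not part of WebMagic.
final class TlsProtocolReport {

    /** Protocols the default SSLContext knows about, whether enabled or not. */
    static List<String> supportedProtocols() {
        return Arrays.asList(defaultContext().getSupportedSSLParameters().getProtocols());
    }

    /** Protocols the default SSLContext enables when nothing is configured explicitly. */
    static List<String> enabledProtocols() {
        return Arrays.asList(defaultContext().getDefaultSSLParameters().getProtocols());
    }

    private static SSLContext defaultContext() {
        try {
            return SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("supported    = " + supportedProtocols());
        System.out.println("enabled      = " + enabledProtocols());
    }
}
```

Running it before and after touching the JDK configuration makes it easy to confirm that a change actually took effect.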
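Find the JDK installation directory and edit the java.security configuration file (on JDK 8 it lives under jre/lib/security).
Look for the following entry in that file: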
jdk.tls.disabledAlgorithms=SSLv3, TLSv1, TLSv1.1, RC4, DES, MD5withRSA, \
DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, \
include jdk.disabled.namedCurves
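Remove SSLv3, TLSv1, and TLSv1.1 from the list, so that the entry reads:
jdk.tls.disabledAlgorithms=RC4, DES, MD5withRSA, \
DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, \
include jdk.disabled.namedCurves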
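If it is unclear whether the running JVM picked up the edited file (several JDKs on one machine is a common cause), the property can be read back at runtime. The following is a minimal sketch using java.security.Security; the class DisabledAlgorithms and its methods are hypothetical names used only for illustration.

```java
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for inspecting jdk.tls.disabledAlgorithms.
final class DisabledAlgorithms {

    /** Splits the comma-separated property value into trimmed entries. */
    static List<String> parse(String propertyValue) {
        List<String> entries = new ArrayList<>();
        if (propertyValue == null) {
            return entries;
        }
        for (String part : propertyValue.split(",")) {
            String entry = part.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** True if the given name appears as an exact entry in the property value. */
    static boolean isDisabled(String propertyValue, String name) {
        return parse(propertyValue).contains(name);
    }

    public static void main(String[] args) {
        // Value as loaded by this JVM; line continuations are already joined.
        String value = Security.getProperty("jdk.tls.disabledAlgorithms");
        System.out.println("jdk.tls.disabledAlgorithms = " + value);
        for (String protocol : new String[] {"SSLv3", "TLSv1", "TLSv1.1"}) {
            System.out.println(protocol + " disabled: " + isDisabled(value, protocol));
        }
    }
}
```

The exact-match comparison matters here: a plain substring search for TLSv1 would also match TLSv1.1.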
What are these protocols for? This article explains them clearly: 一文讲清SSL协议_sslv3-CSDN博客
Run it again.
The solution first: upgrade WebMagic to 0.10.0.
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.10.0</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.10.0</version>
</dependency>
The reason for upgrading: in 0.7.3 the default HttpClient only uses TLSv1 for its requests, so sites that support only TLS 1.2 fail with an error. See: Https下无法抓取只支持TLS1.2的站点 · Issue #701 · code4craft/webmagic · GitHub
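To get a feel for what "only uses TLSv1" means, the sketch below pins a plain JSSE SSLEngine to a single protocol and tries to start a handshake, with no network involved. This is not WebMagic's or HttpClient's code, just an illustration with the standard API; PinnedProtocolProbe is a made-up name. On a JDK that disables TLSv1, I would expect the TLSv1 probe to be refused before anything is sent, much like the "No appropriate protocol" error above, but the result depends on the local java.security settings.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLException;

// Hypothetical probe class; it never opens a network connection.
final class PinnedProtocolProbe {

    /**
     * Returns true if a client-side engine restricted to the given protocol
     * can begin a handshake, false if JSSE refuses to start it.
     */
    static boolean canStartHandshake(String protocol) {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine();
            engine.setUseClientMode(true);
            // Pin the engine to exactly one protocol version.
            engine.setEnabledProtocols(new String[] {protocol});
            engine.beginHandshake();
            return true;
        } catch (SSLException e) {
            // JSSE refused to start, e.g. because the protocol is disabled.
            return false;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        for (String protocol : new String[] {"TLSv1", "TLSv1.1", "TLSv1.2"}) {
            System.out.println(protocol + " can start handshake: " + canStartHandshake(protocol));
        }
    }
}
```

Even if the JDK is edited to allow TLSv1 again, a server that accepts only TLS 1.2 will still reject the connection, which is why upgrading the library is the cleaner fix.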
WebMagic releases: Releases · code4craft/webmagic · GitHub
Run it again, and this time it works.
The next post starts the actual crawling work.