webmagic Part 4: Crawling farmer.com.cn into the Database (First Steps)

8 participants | April 23, 2024, 14:20 | Category: Casual Reading



Series Overview

A series on integrating webmagic with Spring Boot, storing results in MySQL, and crawling websites with scheduled tasks.


Note: the complete code is linked at the end.

Contents

Preface
I. The target site
II. Steps
    1. Environment
    2. Create the FarmerPageProcessor class
    3. Troubleshooting
    4. The fix
    5. Code
Summary


Preface

The earlier articles in this series were titled around Spring Boot scheduled tasks, yet they had very little to do with Spring Boot; only in this article does Spring Boot actually come into play. I had not written many blog posts before, let alone a series, so I tended to write wherever my thoughts took me. If anything seems disorganized, feel free to leave a comment, though I may not take your advice.

I. The target site

Target site: the "Sannong Headlines" (三农头条) section of China Farmer's Daily, farmer.com.cn.

II. Steps

1. Environment

JDK version (screenshot not reproduced here).

MySQL table (screenshot not reproduced here):
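Since the table screenshot is missing, here is a rough reconstruction based on the MyBatis insert statement later in the post, which writes title and content into a news table. The id column, the column types, and the lengths are my assumptions, not taken from the original:

```sql
-- Hypothetical schema reconstructed from the insert into news (title, content);
-- adjust types and lengths to your actual data.
CREATE TABLE news (
    id      BIGINT PRIMARY KEY AUTO_INCREMENT,
    title   VARCHAR(512),
    content LONGTEXT
) DEFAULT CHARSET = utf8mb4;
```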

Key pom.xml dependencies (note that I am not using the author's webmagic-selenium module):

```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!--
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-selenium</artifactId>
    <version>${webmagic.version}</version>
</dependency>
-->
<!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.1.1</version>
    <exclusions>
        <exclusion>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.0.1-jre</version>
</dependency>
```

2. Create the FarmerPageProcessor class

The code:

```java
package com.spiden.bazaspiden.processor;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

import java.util.List;

/**
 * @author 张高昌
 * @date 2024/1/9 10:22
 * @description: crawls farmer.com.cn
 */
public class FarmerPageProcessor implements PageProcessor {

    // Part 1: site-level crawl configuration: encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Part 2: define how page information is extracted and saved
        Html html = page.getHtml();
        List<String> all = html.links()
                .regex("https://www.farmer.com.cn/\\d+/\\d+/\\d+/\\d+\\.html")
                .all();
        page.addTargetRequests(all);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new FarmerPageProcessor())
                // the site to start crawling from
                .addUrl("https://www.farmer.com.cn/farmer/xw/sntt/list.shtml")
                .addPipeline(new ConsolePipeline())
                .run();
    }
}
```

I explained the regex rule in the previous article, so I won't repeat it here. If it is unclear, see my previous post: "webmagic Part 3: crawling CSDN (a first try)" on CSDN.
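The URL pattern can be checked in isolation with plain java.util.regex. The sketch below uses the same pattern string the processor passes to html.links().regex(...), with the sample URLs taken from this post; the class and method names are my own invention for illustration. (Note in passing that the unescaped dots in www.farmer.com.cn also match any character; that is harmless here but worth knowing.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ArticleUrlFilter {

    // Same shape of pattern the processor uses, with the dots escaped
    static final Pattern ARTICLE =
            Pattern.compile("https://www\\.farmer\\.com\\.cn/\\d+/\\d+/\\d+/\\d+\\.html");

    // Keep only links that look like article detail pages
    static List<String> filter(List<String> links) {
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            if (ARTICLE.matcher(link).matches()) {
                matched.add(link);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "https://www.farmer.com.cn/2024/01/11/99944535.html",   // article page: matches
                "https://www.farmer.com.cn/farmer/xw/sntt/list.shtml"); // list page: does not match
        System.out.println(filter(links));
    }
}
```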

After running this, the all list turns out to be empty, which is not what we want. Something is clearly wrong.

3. Troubleshooting

```java
List<String> all1 = html.links().all();
```

Add the line above and debug again.

Among all the links that are actually found, the article links we saw when opening the page in a browser are missing.

The links we want to crawl look like this (the screenshot was taken down; open the site to see for yourself):

https://www.farmer.com.cn/2024/01/11/99944535.html

https://www.farmer.com.cn/2024/01/11/99944532.html

Cause: this page is rendered dynamically via JS and AJAX, so the article links are not present in the initial HTML.

If you are not sure why, see the webmagic author's article "Using Selenium to crawl dynamically loaded pages" (黄亿华's space on OSCHINA).

4. The fix

Older Chrome builds can be downloaded here: Download older versions of Google Chrome for Windows, Linux and Mac

My current Chrome version: 109.0.5414.120

chromedriver download address (I am not sure why there is no driver build matching .120, but the .25 build also works):

https://chromedriver.storage.googleapis.com/index.html

If you cannot reach the old Chrome versions page, you can download from here instead:

https://download.csdn.net/download/qq_28827185/88730566

5. Code

The reworked FarmerPageProcessor class:

```java
package com.spiden.bazaspiden.processor;

import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

import java.util.List;

/**
 * @author 张高昌
 * @date 2024/1/9 10:22
 * @description: crawls farmer.com.cn
 */
@Component
public class FarmerPageProcessor implements PageProcessor {

    // Part 1: site-level crawl configuration: encoding, crawl interval, retry count, etc.
    private Site site;

    @Override
    public void process(Page page) {
        // Part 2: define how page information is extracted
        Html html = page.getHtml();
        List<String> all = html.links()
                .regex("https://www.farmer.com.cn/\\d+/\\d+/\\d+/\\d+\\.html")
                .all();
        page.addTargetRequests(all);
        // Part 3: stash the extracted fields in the page so a Pipeline can persist them
        page.putField("title", page.getHtml().xpath("/html/body/div[5]/div[2]/div[2]/div[1]").toString());
        page.putField("content", page.getHtml().xpath("/html/body/div[5]/div[2]/div[2]/div[4]").toString());
    }

    @Override
    public Site getSite() {
        if (site == null) {
            site = Site.me().setRetryTimes(3).setSleepTime(1000);
        }
        return site;
    }
}
```
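The absolute XPath expressions above are position-based (the fifth div under body, etc.), so they break whenever the site's layout changes. To show how such a path resolves, here is a stdlib-only sketch using javax.xml.xpath against a simplified, well-formed stand-in document; the markup and the class name are invented for illustration and are not the real farmer.com.cn page (which webmagic parses with its own Html selector, not the JDK XML parser):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {

    // Evaluate an absolute XPath against a well-formed HTML string
    static String extract(String html, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
    }

    public static void main(String[] args) throws Exception {
        // Invented stand-in layout: the headline sits in the second top-level div,
        // just as /html/body/div[5]/... targets the fifth div on the real page.
        String html = "<html><body>"
                + "<div>nav</div>"
                + "<div><div><div>Some headline</div></div></div>"
                + "</body></html>";
        System.out.println(extract(html, "/html/body/div[2]/div[1]/div[1]"));
    }
}
```

If the site inserts or removes a single wrapper div, the path silently returns the wrong node, which is exactly the maintenance problem raised in the summary below.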

Create the FarmerPipeline class:

```java
package com.spiden.bazaspiden.pipeline;

import com.spiden.bazaspiden.mapper.FarmerMapper;
import com.spiden.bazaspiden.model.Farmer;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import javax.annotation.Resource;

/**
 * @author 张高昌
 * @version 1.0
 * @date 2024/1/10 14:44
 */
@Component
public class FarmerPipeline implements Pipeline {

    @Resource
    private FarmerMapper farmerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String content = resultItems.get("content");
        Farmer farmer = new Farmer();
        if (title != null) {
            farmer.setTitle(title);
            farmer.setContent(content);
            farmerMapper.insertSelective(farmer);
        }
    }
}
```

Create the Farmer model:

```java
package com.spiden.bazaspiden.model;

import lombok.Data;

/**
 * @author 张高昌
 * @version 1.0
 * @date 2024/1/10 17:24
 */
@Data
public class Farmer {
    private String title;
    private String content;
}
```

Create the FarmerMapper interface:

```java
package com.spiden.bazaspiden.mapper;

import com.spiden.bazaspiden.model.Farmer;
import org.apache.ibatis.annotations.Mapper;

/**
 * @author 张高昌
 * @version 1.0
 * @date 2024/1/10 17:22
 */
@Mapper
public interface FarmerMapper {
    int insertSelective(Farmer record);
}
```

The MyBatis mapper XML:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.spiden.bazaspiden.mapper.FarmerMapper">
    <insert id="insertSelective">
        insert into news (title, content)
        values (#{title}, #{content})
    </insert>
</mapper>
```

Create the scheduled task:

```java
package com.spiden.bazaspiden.task;

import com.spiden.bazaspiden.pipeline.FarmerPipeline;
import com.spiden.bazaspiden.processor.FarmerPageProcessor;
import com.spiden.bazaspiden.utils.AppConfig;
import com.spiden.bazaspiden.webmagic.selenium.NewSeleniumDownloader;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

import javax.annotation.Resource;

/**
 * @author 张高昌
 * @date 2024/1/4 19:45
 * @description: crawler scheduled task
 */
@Component
public class NewsScheduled {

    /**
     * The Pipeline implementation must be managed by Spring.
     */
    @Resource
    private FarmerPipeline farmerPipeline;

    // The crawl URL lives here for now; later we will crawl multiple sites
    private String pathUrl = "https://www.farmer.com.cn/farmer/xw/sntt/list.shtml";

    /**
     * Runs every 10 minutes.
     */
    @Scheduled(cron = "0 0/10 * * * ? ")
    public void testFarmer() {
        Spider spider = Spider.create(new FarmerPageProcessor());
        spider.addUrl(pathUrl)
                .setDownloader(new NewSeleniumDownloader(AppConfig.getChromeDriverPath)
                        .setSleepTime(2000))
                // 5 crawl threads, which means 5 browser windows operating at once
                .thread(5)
                .addPipeline(farmerPipeline)
                .run();
//        spider.setExitWhenComplete(true);
//        spider.start();
//        spider.stop();
    }
}
```

The place that needs very, very close attention is the NewSeleniumDownloader class. I modified it because the webmagic author has abandoned PhantomJS; if you want to use the author's original class, you must downgrade your dependency versions or you will hit all kinds of problems. This cost me a lot of time.

Create NewSeleniumDownloader, WebDriverPool, and selenium.properties:

```java
package com.spiden.bazaspiden.webmagic.selenium;

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.AbstractDownloader;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.PlainText;

import java.io.Closeable;
import java.io.IOException;
import java.util.Map;

/**
 * @author 张高昌
 * @version 1.0
 * @date 2024/1/10 13:50
 */
public class NewSeleniumDownloader extends AbstractDownloader implements Closeable {

    private volatile WebDriverPool webDriverPool;

    private Logger logger = LoggerFactory.getLogger(getClass());

    private int sleepTime = 0;

    private int poolSize = 1;

    /**
     * @param chromeDriverPath chromeDriverPath
     */
    public NewSeleniumDownloader(String chromeDriverPath) {
        System.getProperties().setProperty("webdriver.chrome.driver", chromeDriverPath);
    }

    /**
     * Constructor without any field. Construct PhantomJS browser
     *
     * @author bob.li.0718@gmail.com
     */
    public NewSeleniumDownloader() {
        // System.setProperty("phantomjs.binary.path",
        // "/Users/Bingo/Downloads/phantomjs-1.9.7-macosx/bin/phantomjs");
    }

    /**
     * set sleep time to wait until load success
     *
     * @param sleepTime sleepTime
     * @return this
     */
    public NewSeleniumDownloader setSleepTime(int sleepTime) {
        this.sleepTime = sleepTime;
        return this;
    }

    @Override
    public Page download(Request request, Task task) {
        checkInit();
        WebDriver webDriver = null;
        Page page = Page.fail(request);
        try {
            webDriver = webDriverPool.get();
            logger.info("downloading page " + request.getUrl());
            webDriver.get(request.getUrl());
            try {
                if (sleepTime > 0) {
                    Thread.sleep(sleepTime);
                }
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            WebDriver.Options manage = webDriver.manage();
            Site site = task.getSite();
            if (site.getCookies() != null) {
                for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
                    Cookie cookie = new Cookie(cookieEntry.getKey(), cookieEntry.getValue());
                    manage.addCookie(cookie);
                }
            }

            /*
             * TODO You can add mouse event or other processes
             *
             * @author: bob.li.0718@gmail.com
             */
            WebElement webElement = webDriver.findElement(By.xpath("/html"));
            String content = webElement.getAttribute("outerHTML");
            page.setDownloadSuccess(true);
            page.setRawText(content);
            page.setHtml(new Html(content, request.getUrl()));
            page.setUrl(new PlainText(request.getUrl()));
            page.setRequest(request);
            onSuccess(page, task);
        } catch (Exception e) {
            logger.warn("download page {} error", request.getUrl(), e);
            onError(page, task, e);
        } finally {
            if (webDriver != null) {
                webDriverPool.returnToPool(webDriver);
            }
        }
        return page;
    }

    private void checkInit() {
        if (webDriverPool == null) {
            synchronized (this) {
                // double-check so concurrent callers do not create two pools
                if (webDriverPool == null) {
                    webDriverPool = new WebDriverPool(poolSize);
                }
            }
        }
    }

    @Override
    public void setThread(int thread) {
        this.poolSize = thread;
    }

    @Override
    public void close() throws IOException {
        webDriverPool.closeAll();
    }
}
```
```java
package com.spiden.bazaspiden.webmagic.selenium;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
//import org.openqa.selenium.phantomjs.PhantomJSDriver;
//import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * @author 张高昌
 * @version 1.0
 * @date 2024/1/10 13:59
 */
public class WebDriverPool {

    private Logger logger = LoggerFactory.getLogger(getClass());

    private final static int DEFAULT_CAPACITY = 5;

    private final int capacity;

    private final static int STAT_RUNNING = 1;

    private final static int STAT_CLOSED = 2;

    private AtomicInteger stat = new AtomicInteger(STAT_RUNNING);

    /*
     * fields originally used for configuring phantomJS
     */
    private WebDriver mDriver = null;
    private boolean mAutoQuitDriver = true;

    private static final String DEFAULT_CONFIG_FILE = "selenium.properties";
    private static final String DRIVER_FIREFOX = "firefox";
    private static final String DRIVER_CHROME = "chrome";
    private static final String DRIVER_PHANTOMJS = "phantomjs";

    protected static Properties sConfig;
    protected static DesiredCapabilities sCaps;

    /**
     * Configure the GhostDriver, and initialize a WebDriver instance. This part
     * of code comes from GhostDriver.
     * https://github.com/detro/ghostdriver/tree/master/test/java/src/test/java/ghostdriver
     *
     * @author bob.li.0718@gmail.com
     * @throws IOException
     */
    public void configure() throws IOException {
        // Read config file
        sConfig = new Properties();
        String configFile = DEFAULT_CONFIG_FILE;
        if (System.getProperty("selenuim_config") != null) {
            configFile = System.getProperty("selenuim_config");
        }
//        sConfig.load(new FileReader(configFile));
        // load the properties file from the classpath instead of the working directory
        sConfig.load(Thread.currentThread().getContextClassLoader().getResourceAsStream(configFile));

        // Prepare capabilities
        sCaps = new DesiredCapabilities();
        sCaps.setJavascriptEnabled(true);
        sCaps.setCapability("takesScreenshot", false);

        String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);

        // Fetch PhantomJS-specific configuration parameters
        if (driver.equals(DRIVER_PHANTOMJS)) {
            // "phantomjs_exec_path"
            if (sConfig.getProperty("phantomjs_exec_path") != null) {
//                sCaps.setCapability(
//                        PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
//                        sConfig.getProperty("phantomjs_exec_path"));
            } else {
                throw new IOException(
//                        String.format(
//                                "Property '%s' not set!",
//                                PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY)
                );
            }
            // "phantomjs_driver_path"
            if (sConfig.getProperty("phantomjs_driver_path") != null) {
                System.out.println("Test will use an external GhostDriver");
//                sCaps.setCapability(
//                        PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_PATH_PROPERTY,
//                        sConfig.getProperty("phantomjs_driver_path"));
            } else {
                System.out.println("Test will use PhantomJS internal GhostDriver");
            }
        }

        // Disable "web-security", enable all possible "ssl-protocols" and
        // "ignore-ssl-errors" for PhantomJSDriver
        ArrayList<String> cliArgsCap = new ArrayList<String>();
        cliArgsCap.add("--web-security=false");
        cliArgsCap.add("--ssl-protocol=any");
        cliArgsCap.add("--ignore-ssl-errors=true");
//        sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, cliArgsCap);

        // Control LogLevel for GhostDriver, via CLI arguments
//        sCaps.setCapability(
//                PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_CLI_ARGS,
//                new String[] { "--logLevel="
//                        + (sConfig.getProperty("phantomjs_driver_loglevel") != null
//                        ? sConfig.getProperty("phantomjs_driver_loglevel")
//                        : "INFO") });

        // Start the appropriate driver
        if (isUrl(driver)) {
            sCaps.setBrowserName("phantomjs");
            mDriver = new RemoteWebDriver(new URL(driver), sCaps);
        } else if (driver.equals(DRIVER_FIREFOX)) {
            mDriver = new FirefoxDriver(sCaps);
        } else if (driver.equals(DRIVER_CHROME)) {
            mDriver = new ChromeDriver(sCaps);
        } else if (driver.equals(DRIVER_PHANTOMJS)) {
//            mDriver = new PhantomJSDriver(sCaps);
        }
    }

    /**
     * check whether input is a valid URL
     *
     * @author bob.li.0718@gmail.com
     * @param urlString urlString
     * @return true means yes, otherwise no.
     */
    private boolean isUrl(String urlString) {
        try {
            new URL(urlString);
            return true;
        } catch (MalformedURLException mue) {
            return false;
        }
    }

    /**
     * store webDrivers created
     */
    private List<WebDriver> webDriverList = Collections
            .synchronizedList(new ArrayList<WebDriver>());

    /**
     * store webDrivers available
     */
    private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>();

    public WebDriverPool(int capacity) {
        this.capacity = capacity;
    }

    public WebDriverPool() {
        this(DEFAULT_CAPACITY);
    }

    public WebDriver get() throws InterruptedException {
        checkRunning();
        WebDriver poll = innerQueue.poll();
        if (poll != null) {
            return poll;
        }
        if (webDriverList.size() < capacity) {
            synchronized (webDriverList) {
                if (webDriverList.size() < capacity) {
                    // add a new WebDriver instance to the pool
                    try {
                        configure();
                        innerQueue.add(mDriver);
                        webDriverList.add(mDriver);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
        return innerQueue.take();
    }

    public void returnToPool(WebDriver webDriver) {
        checkRunning();
        innerQueue.add(webDriver);
    }

    protected void checkRunning() {
        if (!stat.compareAndSet(STAT_RUNNING, STAT_RUNNING)) {
            throw new IllegalStateException("Already closed!");
        }
    }

    public void closeAll() {
        boolean b = stat.compareAndSet(STAT_RUNNING, STAT_CLOSED);
        if (!b) {
            throw new IllegalStateException("Already closed!");
        }
        for (WebDriver webDriver : webDriverList) {
            logger.info("Quit webDriver" + webDriver);
            webDriver.quit();
        }
    }
}
```

The selenium.properties configuration file:

```properties
# What WebDriver to use for the tests
#driver=phantomjs
#driver=firefox
driver=chrome
#driver=http://localhost:8910
#driver=http://localhost:4444/wd/hub

# PhantomJS specific config (change according to your installation)
#phantomjs_exec_path=/Users/Bingo/bin/phantomjs-qt5
#phantomjs_exec_path=d:/phantomjs.exe
#chrome_exec_path=E:\\demo\\crawler\\chromedriver.exe
#phantomjs_driver_path=/Users/Bingo/Documents/workspace/webmagic/webmagic-selenium/src/main.js
#phantomjs_driver_loglevel=DEBUG
chrome_driver_loglevel=DEBUG
```

If you want to use the author's original downloader, or are curious how it is written, see:

https://github.com/code4craft/webmagic/blob/develop/webmagic-selenium/src/main/java/us/codecraft/webmagic/downloader/selenium/SeleniumDownloader.java

The last step: add @EnableScheduling to the Spring Boot application class.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@EnableScheduling
@SpringBootApplication
public class BazaSpidenApplication {
    public static void main(String[] args) {
        SpringApplication.run(BazaSpidenApplication.class, args);
    }
}
```

Start the project (log screenshot not reproduced here).

Check the database (screenshot not reproduced here).

Complete code: baza-spiden, Spring Boot + webmagic + MyBatis, storing to the database.


Summary

The things to watch most carefully are the Chrome version and the chromedriver version: if these two do not match, all sorts of strange problems appear. The selenium-java dependency version matters a great deal as well.

I originally meant to write up every problem I hit while coding this, one by one, but that turned out to be too tedious, so I skipped that part.

That covers the basics of crawling, but a few open problems remain:

1. Our company's actual requirement is to crawl more than 50 agriculture-related websites, each with a different page layout and different extraction rules.

2. Who writes the XPath rules? I can hardly visit every site one by one and then hard-code an XPath rule for each in the source.

3. Since I am crawling news sites, my manager wants the images saved to our own servers.

4. Every site styles and lays out its articles differently, and all of that styling has to be stripped out. (Well, the manager's requirements outrank everything.)




Permalink: http://zhangshiyu.com/post/99108.html
