
Scrape web content from dynamically loaded page (infinite scroll)

Posted: 2024-12-18 16:01  Category: 《随便一记》




Question:

I am trying to collect all image filenames from this website: https://www.shipspotting.com/

I have already collected a Python dict, cat_dict, of all the category names and their ID numbers. My strategy is to iterate through every category page, call the data-loading API, and save its response for every page.

I have identified https://www.shipspotting.com/ssapi/gallery-search as the request URL that loads the next page of content. However, when I request this URL with the requests library, I get a 404. What do I need to do to get the correct response when loading the next page of content?

    import requests
    from bs4 import BeautifulSoup

    cat_page = 'https://www.shipspotting.com/photos/gallery?category='

    for cat in cat_dict:
        cat_link = cat_page + str(cat_dict[cat])
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
            "Referer": cat_link,
        }
        response = requests.get('https://www.shipspotting.com/ssapi/gallery-search', headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

https://www.shipspotting.com/photos/gallery?category=169 is an example page (cat_link).

Answer:

Every time you scroll down the page, a new request is sent to the server (a POST request with a specific payload). You can verify this in the Network tab of your browser's developer tools.

The following works:

    import requests

    # Put the following in a for loop over the number of pages,
    # [total number of ship photos] / [12], or make it async; your choice.
    # The endpoint answers with JSON, so no HTML parser is needed here.
    data = {"category": "", "perPage": 12, "page": 2}
    r = requests.post('https://www.shipspotting.com/ssapi/gallery-search', data=data)
    print(r.json())

This returns a JSON response:

{'page': 1, 'items': [{'lid': 3444123, 'cid': 172, 'title': 'ELLBING II', 'imo_no': '0000000',....}
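
The answer's suggestion to wrap the request in a loop could be sketched roughly as follows. This is a minimal sketch, not the answerer's code. The function name fetch_all_items, the max_pages cap, and the rule of stopping at the first page with an empty 'items' list are all assumptions. So is the idea that the "category" field accepts a category ID from cat_dict, since the answer only shows it as an empty string. The session argument is any object with a requests-style post method, such as requests.Session().

```python
API_URL = "https://www.shipspotting.com/ssapi/gallery-search"


def fetch_all_items(session, category="", per_page=12, max_pages=1000):
    """Collect items page by page until a page comes back empty.

    `session` is any object exposing post(url, data=...) like
    requests.Session(). Stopping on an empty 'items' list is an
    assumption about how the API signals the end of the results.
    """
    items = []
    for page in range(1, max_pages + 1):
        # Same payload shape as in the answer; only 'page' changes.
        payload = {"category": category, "perPage": per_page, "page": page}
        response = session.post(API_URL, data=payload)
        response.raise_for_status()
        batch = response.json().get("items", [])
        if not batch:
            break
        items.extend(batch)
    return items


# Usage (requires the requests package; passing a category ID is an assumption):
#   import requests
#   items = fetch_all_items(requests.Session(), category=169)
#   titles = [item["title"] for item in items]
```

Stopping on an empty page avoids needing the total photo count up front. If the total is known, computing the page count as the answer suggests would work just as well.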




Article link: http://zhangshiyu.com/post/203124.html
