Batch-crawling a bilibili user's videos by uid
This article was last updated 102 days ago; the information in it may be out of date. If you spot an error, please email big_fw@foxmail.com.

Watching traffic with HttpCanary shows that the response of https://api.bilibili.com/x/space/wbi/arc/search returns, page by page, all of an uploader's videos with their titles, descriptions, durations, bvids, and so on. From this you can build the list of video URLs to crawl.
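As a sketch of that first step, a page of the response can be reduced to a list of watchable URLs. The JSON below is a trimmed stand-in for the real payload (the field names `vlist` and `bvid` match the actual response; the titles are made up):

```python
# Trimmed stand-in for one page of the arc/search response; in the real
# payload, data.list.vlist holds one dict per video with bvid, title, etc.
page = {
    "data": {
        "list": {
            "vlist": [
                {"bvid": "BV1xx411c7mD", "title": "demo video 1"},
                {"bvid": "BV1yy411c7mE", "title": "demo video 2"},
            ]
        }
    }
}

# Each bvid maps straight onto a video URL.
video_urls = [
    f"https://www.bilibili.com/video/{v['bvid']}"
    for v in page["data"]["list"]["vlist"]
]
print(video_urls[0])  # → https://www.bilibili.com/video/BV1xx411c7mD
```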

Trying to crawl the second page by changing pn, the request was blocked. Analysis showed that the timestamp and the signature parameter w_rid failed verification, so the front-end JavaScript was reverse-engineered, yielding the following algorithm:

import hashlib
import re
import time
from urllib.parse import quote

# Permutation table used by bilibili's WBI signing to shuffle the key string
_IDX = [46,47,18,2,53,8,23,32,15,50,10,31,58,3,45,35,27,43,5,49,33,9,42,19,29,28,14,39,12,38,41,13,37,48,7,16,24,55,40,61,26,17,0,1,60,51,30,4,22,25,54,21,56,59,6,63,57,62,11,36,20,34,44,52]

def build_mixin_key(img_url: str, sub_url: str) -> str:
    # The key material is the file names (without extension) of the two
    # wbi_img URLs, concatenated, shuffled, and truncated to 32 characters.
    o = img_url.split("/")[-1].split(".")[0]
    a = sub_url.split("/")[-1].split(".")[0]
    s = o + a
    return ''.join(s[i] for i in _IDX)[:32]

def make_wbi_signed_params(params: dict, img_url: str, sub_url: str) -> dict:
    # Drop any stale signature before re-signing
    p = {k: v for k, v in params.items() if k != "w_rid"}
    # Current Unix timestamp; the server also checks this
    p["wts"] = str(int(time.time()))
    pairs = []
    for k in sorted(p.keys()):
        v = p[k]
        if v is None:
            v = ""
        # Strip the characters the signer excludes, then URL-encode
        v = re.sub(r"[!'()*]", "", str(v))
        pairs.append(f"{quote(str(k), safe='-_.~')}={quote(str(v), safe='-_.~')}")
    m = "&".join(pairs)
    # w_rid = md5(sorted, encoded query string + mixin key)
    w_rid = hashlib.md5((m + build_mixin_key(img_url, sub_url)).encode("utf-8")).hexdigest()
    p["w_rid"] = w_rid
    return p
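A quick self-contained check of these two helpers (both functions are repeated here so the snippet runs standalone; the two key URLs are made-up 32-hex-character placeholders in the same shape as the real ones):

```python
import hashlib
import re
import time
from urllib.parse import quote

_IDX = [46,47,18,2,53,8,23,32,15,50,10,31,58,3,45,35,27,43,5,49,33,9,42,19,29,28,14,39,12,38,41,13,37,48,7,16,24,55,40,61,26,17,0,1,60,51,30,4,22,25,54,21,56,59,6,63,57,62,11,36,20,34,44,52]

def build_mixin_key(img_url, sub_url):
    s = img_url.split("/")[-1].split(".")[0] + sub_url.split("/")[-1].split(".")[0]
    return "".join(s[i] for i in _IDX)[:32]

def make_wbi_signed_params(params, img_url, sub_url):
    p = {k: v for k, v in params.items() if k != "w_rid"}
    p["wts"] = str(int(time.time()))
    pairs = []
    for k in sorted(p):
        v = "" if p[k] is None else re.sub(r"[!'()*]", "", str(p[k]))
        pairs.append(f"{quote(str(k), safe='-_.~')}={quote(v, safe='-_.~')}")
    m = "&".join(pairs)
    p["w_rid"] = hashlib.md5((m + build_mixin_key(img_url, sub_url)).encode()).hexdigest()
    return p

# Made-up placeholder keys; the real ones come from the nav endpoint each day.
img = "https://i0.hdslb.com/bfs/wbi/0123456789abcdef0123456789abcdef.png"
sub = "https://i0.hdslb.com/bfs/wbi/fedcba9876543210fedcba9876543210.png"
signed = make_wbi_signed_params({"mid": 123456, "pn": 2, "ps": 30}, img, sub)
print(sorted(signed))        # → ['mid', 'pn', 'ps', 'w_rid', 'wts']
print(len(signed["w_rid"]))  # → 32
```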

The img_url and sub_url used by the signing algorithm are refreshed daily and can be fetched from https://api.bilibili.com/x/web-interface/nav. Combining them with the timestamp and the other known parameters (such as the page number) yields w_rid. Once the signed request is accepted, each bvid in the returned vlist is used to crawl the corresponding video, which arrives as separate video and audio files; these are then merged with ffmpeg into the finished video.
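The final merge step can be sketched as below. The file names are placeholders, and the actual invocation is left commented out so the snippet runs even without ffmpeg installed:

```python
import subprocess

def merge_av(video_path: str, audio_path: str, out_path: str) -> list:
    """Build the ffmpeg command that remuxes the two streams into one file."""
    cmd = [
        "ffmpeg", "-y",
        "-i", video_path,   # video-only stream
        "-i", audio_path,   # matching audio-only stream
        "-c", "copy",       # copy both streams, no re-encoding
        out_path,
    ]
    return cmd

cmd = merge_av("video.m4s", "audio.m4s", "out.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually invoke ffmpeg
print(cmd[0])  # → ffmpeg
```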

To make the videos easier to manage, each video's save path is stored in MySQL:

for v in vlist:
    if num <= get_num(author_id) and mmm >= numm:
        title = v["title"]
        author = v["author"]
        author_id = v["mid"]
        length = v["length"]
        video_review = v["video_review"]
        comment = v["comment"]
        play = v["play"]
        address = "c/act/%s%s.mp4" % (author, num)
        # The table name cannot be bound as a query parameter, so it is
        # interpolated directly; all row values go through placeholders.
        sql = f"""
        INSERT INTO `id_{author_id}` (title, content, author_id, author, length, video_review, play, address)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        cursor.execute(sql, (title, comment, author_id, author, length, video_review, play, address))
        conn.commit()
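The INSERT above assumes a per-author table `id_<mid>` already exists. A minimal sketch of a matching schema follows; only the column names come from the INSERT, the column types are assumptions:

```python
# Hypothetical DDL matching the column list used by the INSERT; run once
# per author before crawling, e.g. with cursor.execute(ddl).
CREATE_TABLE_TEMPLATE = """
CREATE TABLE IF NOT EXISTS `id_{mid}` (
    title        VARCHAR(255),
    content      TEXT,
    author_id    BIGINT,
    author       VARCHAR(64),
    length       VARCHAR(16),
    video_review INT,
    play         INT,
    address      VARCHAR(255)
)
"""

ddl = CREATE_TABLE_TEMPLATE.format(mid=123456)
print("id_123456" in ddl)  # → True
```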

For an author who has been downloaded before, the number of already-downloaded videos is compared with the author's current video count. If they differ, the last saved video is deleted (in case an interrupted previous run left that file corrupted) and downloading resumes from the end of the video queue.
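A hedged sketch of that resume check: `get_num()` and the MySQL row count are stubbed with plain integers here, and the actual file deletion is left to a hypothetical helper not shown:

```python
def resume_start(downloaded: int, total_on_site: int):
    """Return (start_index, redownload_last) for resuming an author.

    `downloaded` is the row count already in MySQL; `total_on_site` is the
    author's current video count from get_num(). If they differ, the last
    saved file may be corrupt, so it is re-fetched along with the rest.
    """
    if downloaded == total_on_site:
        return downloaded, False          # everything is already saved
    # Mismatch: step back one so the possibly-corrupt last file is redone.
    start = max(downloaded - 1, 0)
    return start, downloaded > 0

print(resume_start(5, 8))  # → (4, True)
```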

def get_num(author_id):
    # Scrape the spm_prefix token from the uploader's space page; the video
    # count endpoint (NAVNUM_URL) expects it as the web_location parameter.
    VIDEO_SPACE = f'https://space.bilibili.com/{author_id}/upload/video'
    space_html = requests.get(VIDEO_SPACE, headers=headers, cookies=cookies, verify=False).text
    tree = etree.HTML(space_html)
    spm_prefix = tree.xpath("//meta[@name='spm_prefix']/@content")[0]
    params = {"mid": author_id, "web_location": spm_prefix}
    resp = requests.get(NAVNUM_URL, headers=headers, cookies=cookies, verify=False, params=params).json()
    # data.video is the author's total number of published videos
    return resp["data"]["video"]

    
def main_spider(author_id, num=1):
    from load import password, db_name
    conn = pymysql.connect(
        host='localhost',
        user='root',
        password=password,
        database=db_name,
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )
    cursor = conn.cursor()
    mmm = 1
    numm = num

For the full source code, see special-potato/bilispider at main · PlutoNebula/special-potato.
