[Python] Trying web scraping on the Japan Exchange Group (JPX) website with Beautiful Soup 4

Overview (What is Beautiful Soup?)
Environment
Setting up Beautiful Soup
Approach
The program
Result

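Overview (What is Beautiful Soup?)

Beautiful Soup is a Python library that makes it easy to extract data from web pages (web scraping).

It parses a page's HTML and lets you pull out text, links, and other elements.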
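To make "pull out text and links" concrete, here is a minimal sketch that parses a small HTML string instead of a live site. The HTML snippet and the variable names are made up for this illustration; only BeautifulSoup, find_all, and get_text come from the library itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<html><body>
  <h1>Sample page</h1>
  <p>First paragraph.</p>
  <a href="/docs/a.pdf">Report A</a>
  <a href="https://example.com/b.pdf">Report B</a>
</body></html>
"""

# Parse with Python's built-in parser (no extra install needed).
soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1> element.
heading = soup.h1.get_text(strip=True)

# (link text, href) pairs for every <a> tag that has an href attribute.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

print(heading)
for text, href in links:
    print(text, href)
```

The same two calls, find_all("a", href=True) and reading a["href"], are what the program later in this post relies on.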

This time, I fetch the "信用取引残高等" (margin trading balances, etc.) PDF from the Japan Exchange Group (JPX) website.

Environment

Windows 11
Python 3.9.7
Spyder IDE 5.1.5

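Setting up Beautiful Soup

Beautiful Soup is a Python library, so open Anaconda Prompt (or a similar terminal) and install it with the command below. The program later in this post also imports requests, so install that as well (pip install requests) if your environment does not already have it.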

pip install beautifulsoup4

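Approach

First, look at the actual site and check the data to be fetched.

The file I want this time is the PDF shown on the left in the image below.

The files appear to be updated weekly, each labeled with that week's Friday date.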

https://www.jpx.co.jp/markets/statistics-equities/margin/05.html

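Opening a few PDFs and comparing their links shows that only the date part differs.

The date is in yyyymmdd format (four-digit year, two-digit month, two-digit day), followed by "00".

Example: data as of the January 24, 2025 application date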

➡https://www.jpx.co.jp/markets/statistics-equities/margin/tvdivq0000001rnl-att/syumatsu2025012400.pdf

Example: data as of the January 17, 2025 application date

➡https://www.jpx.co.jp/markets/statistics-equities/margin/tvdivq0000001rnl-att/syumatsu2025011700.pdf
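The pattern in these two links can be written down as a small function. This is only a sketch: build_pdf_url and BASE are hypothetical names, and the directory and the trailing "00" are assumed to be constant purely because they match in the two examples above. Other weeks may not follow the same pattern.

```python
from datetime import date

# Assumption: this directory is copied from the two example URLs above.
BASE = "https://www.jpx.co.jp/markets/statistics-equities/margin/tvdivq0000001rnl-att/"


def build_pdf_url(d: date) -> str:
    """Hypothetical helper: rebuild the PDF URL for a given Friday.

    Assumes the file name is "syumatsu" + yyyymmdd + "00.pdf",
    as seen in the two observed examples.
    """
    return f"{BASE}syumatsu{d:%Y%m%d}00.pdf"


print(build_pdf_url(date(2025, 1, 24)))
print(build_pdf_url(date(2025, 1, 17)))
```

The program in this post does not build the URL this way. It searches the page for a link containing the date instead, which keeps working even if the directory part of the URL changes.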

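So the plan for the program is: compute the target Friday's date as a yyyymmdd string, fetch the listing page and search its links for one whose URL contains that date, then download that PDF and save it locally. I'm a beginner, so if there is a better approach I would love to hear it.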
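One piece of this approach is working out which Friday's file to look for. The sketch below uses the same arithmetic as the program in the next section, but takes the date as a parameter so it can be checked against known days. The function name friday_one_week_before is hypothetical and chosen only for this illustration.

```python
from datetime import date, timedelta


def friday_one_week_before(today: date) -> str:
    """Return, as yyyymmdd, the Friday one week before the most recent Friday.

    "Most recent Friday" includes today when today is itself a Friday.
    """
    # weekday(): Monday=0 ... Friday=4 ... Sunday=6
    days_since_friday = (today.weekday() - 4) % 7
    most_recent_friday = today - timedelta(days=days_since_friday)
    return (most_recent_friday - timedelta(weeks=1)).strftime("%Y%m%d")


print(friday_one_week_before(date(2025, 1, 31)))  # Friday    -> 20250124
print(friday_one_week_before(date(2025, 1, 29)))  # Wednesday -> 20250117
print(friday_one_week_before(date(2025, 2, 1)))   # Saturday  -> 20250124
```

Passing the date in, rather than calling datetime.now() inside the function, is a design choice that makes the weekday logic easy to test.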

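The program

HTTP requests are sent with the requests library. For parsing the HTML, I used html.parser, which ships with Python and needs no additional library.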

import os
from datetime import datetime, timedelta

import requests  # used below for requests.get
from bs4 import BeautifulSoup

# Return the Friday one week before the most recent Friday
def get_friday_one_week_ago():
    today = datetime.now()
    # Find the most recent Friday (today counts if it is a Friday)
    days_since_friday = (today.weekday() - 4) % 7
    most_recent_friday = today - timedelta(days=days_since_friday)

    # Go back one more week and format as yyyymmdd
    friday_one_week_ago = most_recent_friday - timedelta(weeks=1)
    return friday_one_week_ago.strftime("%Y%m%d")  # e.g. 20250117

# Target URL
url = "https://www.jpx.co.jp/markets/statistics-equities/margin/05.html"

# Send the request and fetch the HTML
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Compute the date of the Friday one week ago
friday_one_week_ago = get_friday_one_week_ago()
# Human-readable form for messages, e.g. 2025-01-17
formatted_date = f"{friday_one_week_ago[:4]}-{friday_one_week_ago[4:6]}-{friday_one_week_ago[6:]}"
pdf_link = None

# Look for a link whose href contains the target date
for link in soup.find_all("a", href=True):
    if friday_one_week_ago in link["href"]:
        pdf_link = link["href"]
        break

if pdf_link:
    # Build the full URL (site-relative links start with "/")
    pdf_url = f"https://www.jpx.co.jp{pdf_link}" if pdf_link.startswith("/") else pdf_link
    print(f"PDF URL: {pdf_url}")

    # Download the PDF
    pdf_response = requests.get(pdf_url)
    pdf_response.raise_for_status()

    # Save into a "pdf" folder inside the current working directory,
    # creating the folder if it does not exist yet
    save_dir = os.path.join(os.getcwd(), "pdf")
    os.makedirs(save_dir, exist_ok=True)
    pdf_filename = os.path.join(save_dir, os.path.basename(pdf_url))  # destination file path

    # Write the PDF to disk
    with open(pdf_filename, "wb") as pdf_file:
        pdf_file.write(pdf_response.content)
    print(f"Saved PDF: {pdf_filename}")
else:
    print(f"No link containing {formatted_date} was found.")
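The link-matching step can also be tried offline, without hitting the site. The sketch below is a variation rather than the program above: it uses urllib.parse.urljoin from the standard library to resolve the link against the page URL, which handles both site-relative and page-relative hrefs. The HTML fragment is hypothetical (the real page's markup may differ), and find_pdf_url is a name invented for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://www.jpx.co.jp/markets/statistics-equities/margin/05.html"

# Hypothetical markup for illustration; hrefs are modeled on the example URLs in this post.
html = """
<ul>
  <li><a href="/markets/statistics-equities/margin/tvdivq0000001rnl-att/syumatsu2025012400.pdf">PDF</a></li>
  <li><a href="/markets/statistics-equities/margin/tvdivq0000001rnl-att/syumatsu2025011700.pdf">PDF</a></li>
</ul>
"""


def find_pdf_url(html_text, date_str, base_url=PAGE_URL):
    """Return the absolute URL of the first PDF link containing date_str, or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if date_str in href and href.lower().endswith(".pdf"):
            # urljoin resolves relative hrefs against the page URL
            # and leaves already-absolute URLs unchanged.
            return urljoin(base_url, href)
    return None


print(find_pdf_url(html, "20250117"))
print(find_pdf_url(html, "20240101"))  # no matching link
```

Checking the ".pdf" suffix as well as the date is an extra guard I added for this sketch, in case another link on the page happens to contain the same digits.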

Result

The PDF was saved to the folder as expected!
