
I have the following snippet of code that downloads images from a list of URLs, but the images are not always downloaded. Running this code can yield one webp and one 403 response, two 403 responses, or two webps, and which image(s) successfully download changes each run. In roughly 5% of runs both images download; roughly 60% of the time both fail. Why could this be?

I've copied headers from browser requests and modified them numerous times; the site does not use cookies, so there aren't any for me to add. The URLs work in my browser. I've also added allow_redirects=True, added sleep() calls, and changed the chunk size, none of which helped. I'm out of ideas.
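One way to probe whether the 403s are transient (e.g. rate limiting or a rotating edge node) is to retry each request with backoff instead of a fixed sleep(). This is a minimal sketch using requests' built-in urllib3 Retry support; the retry count, backoff factor, and status list are illustrative, not tuned for any particular site:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # Retry up to 3 times on 403/429/503, backing off between attempts.
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[403, 429, 503],
        allowed_methods=["GET"],
        raise_on_status=False,  # hand back the last response instead of raising
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session
```

With this, swapping requests.get(...) for session.get(...) inside download_image would retry transient 403s automatically, and a session also reuses the underlying connection across the downloads.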

import requests
from time import sleep

def get_headers():
  headers = {
    # Note: "method", "credentials", and "mode" are fetch() options copied
    # from the browser's devtools, not real HTTP headers.
    "method": "GET",
    "credentials": "omit",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Alt-Used": "DOMAIN",
    "Referer": "https://DOMAIN",  # the header name is spelled with one "r": "Referer"
    "mode": "cors",
    'Connection': 'keep-alive',
    'X-Requested-With': 'XMLHttpRequest',
  }
  return headers


def download_image(url, dst):
  response = requests.get(url, stream=True, allow_redirects=True, headers=get_headers()) # <--- The request that fails
  if not response.ok:
    print(response)
    return  # don't create an empty file when the request fails
  with open(dst, 'wb') as f:
    for chunk in response.iter_content(1024):
      f.write(chunk)

urls = [
  "https://DOMAIN/covers/full/ta/FILENAME1.webp",
  "https://DOMAIN/covers/full/vr/FILENAME2.webp",
  "https://DOMAIN/covers/full/ui/FILENAME3.webp",
]
for i, url in enumerate(urls, start=1):
  dst = f'{i}.webp'
  sleep(1)
  download_image(url, dst)
  sleep(1)
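Since a 403 body would otherwise be written into the .webp file, it may also help to verify that the response actually contains an image before saving it. A small sketch of such a guard (the function name is my own, not from the snippet above):

```python
import requests

def save_if_image(response, dst):
    # Only write the body if the server says it is an image;
    # otherwise report the status and content type for debugging.
    content_type = response.headers.get("Content-Type", "")
    if response.ok and content_type.startswith("image/"):
        with open(dst, "wb") as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)
        return True
    print(response.status_code, content_type)
    return False
```

Logging the Content-Type (and other response headers such as Server) on failures can also reveal whether the 403s come from the origin server or from a CDN/WAF layer in front of it.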

Edit: a similar phenomenon happens with the selenium chrome webdriver...
