[Replay Bug]: Substack blog, images not displaying unless expanded, some scripts not working #379

Closed
JubilantJerry opened this issue Dec 10, 2024 · 1 comment
Labels
bug Something isn't working replay bug Archived content is not displaying as expected

Comments

JubilantJerry commented Dec 10, 2024

ReplayWeb.page Version

v2.2.4

What did you expect to happen? What happened instead?

Several images on the page display immediately when opening the live site. However, after archiving the page with grab-site and replaying it with ReplayWeb.page, those images do not load; they appear as broken images or blank spaces. Some of them do appear when expanded by clicking on the broken image.

Archived:
image

Live site:
image

Archived:
image

Live site:
image

I verified that the resource behind the broken images has been archived: I took the src attribute from the live page and found the same URL inside the archive, so I don't think the issue is in the crawling. For example, this thumbnail image appears on the live site and is part of the archive, but it does not display in the replay, as seen in the first screenshot.
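
The check described above can be scripted. This is a minimal sketch, not the method actually used in this report: it assumes the archive is a gzip-compressed WARC file and simply scans for `WARC-Target-URI` header lines using only the standard library. The function names and the file name in the usage comment are hypothetical.

```python
import gzip


def warc_target_uris(path):
    """Yield every WARC-Target-URI header value found in a .warc.gz file.

    Assumption: a line-based scan is good enough for a quick check. It does
    not parse record boundaries, so a payload line that happens to start
    with the header name would also be reported.
    """
    prefix = b"WARC-Target-URI:"
    # gzip.open reads concatenated gzip members transparently, which is how
    # .warc.gz files are usually written (one member per record).
    with gzip.open(path, "rb") as fh:
        for line in fh:
            if line.startswith(prefix):
                value = line[len(prefix):].strip()
                # Some WARC writers wrap the URI in angle brackets.
                yield value.strip(b"<>").decode("utf-8", "replace")


def is_archived(path, url):
    """Return True if some record in the archive targets exactly this URL."""
    return any(uri == url for uri in warc_target_uris(path))


# Usage (hypothetical file name):
#   is_archived("crawl.warc.gz", "https://example.com/image.jpeg")
```

An exact string comparison is deliberate here: a URL that differs only in its query string or encoding is a different resource as far as replay lookup is concerned.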

In addition, some scripts don't work properly. When navigating to the previous or next blog page, ReplayWeb.page first displays a page saying "Post not found". Refreshing the page makes it load properly (but still with the missing images).

image

I believe that both the missing images and the script errors are replay issues.

Step-by-step reproduction instructions

First I run:

grab-site --level=2 --concurrency=20 --page-requisites-level=2 --import-ignores=$(pwd)/ignores 'https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting' 'https://substackcdn.com/bundle/assets/store.modern-3dec36e9.js' 'https://substack-post-media.s3.amazonaws.com/public/images/4206cf36-9fcc-4b06-95e1-d751f9f4c3b7_388x388.jpeg'

I include the other two URLs so that their domains are not considered "offsite".

The contents of the ignores file are:

platform.openai.com
reddit.com
discord.com
discordapp.com
^https?://[^p][^.]+.substack.com
shopify.com
^https://static.airtable.com/esbuild/by_sha
https://promptingweekly.substack.com/account\?utm_medium=web&utm_source=subscribe-widget
https://promptingweekly.substack.com/p/[^?/]+\?utm_source=substack&utm_medium=email&utm_content=share&action=share&token=
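
To sanity-check which URLs these patterns exclude, the list can be exercised directly. This is a rough sketch that assumes each line is a Python regular expression searched anywhere in the URL; that matching behavior is an assumption about grab-site, and the helper name `ignored_by` is hypothetical. The patterns are copied verbatim from the file above.

```python
import re

# Patterns copied verbatim from the ignores file.
IGNORE_PATTERNS = [
    r"platform.openai.com",
    r"reddit.com",
    r"discord.com",
    r"discordapp.com",
    r"^https?://[^p][^.]+.substack.com",
    r"shopify.com",
    r"^https://static.airtable.com/esbuild/by_sha",
    r"https://promptingweekly.substack.com/account\?utm_medium=web&utm_source=subscribe-widget",
    r"https://promptingweekly.substack.com/p/[^?/]+\?utm_source=substack&utm_medium=email&utm_content=share&action=share&token=",
]


def ignored_by(url):
    """Return the first pattern that matches the URL, or None.

    Assumption: patterns are applied with search semantics (a match
    anywhere in the URL counts), not anchored full-string matching.
    """
    for pattern in IGNORE_PATTERNS:
        if re.search(pattern, url):
            return pattern
    return None
```

Under that assumption, none of the three URLs passed to grab-site is ignored, while a post on another Substack subdomain is. Note that the fifth pattern only skips subdomains whose first letter is not "p", so other blogs starting with "p" would still be crawled.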

Then I open the archive using ReplayWeb.page-2.2.4.AppImage, and navigate to the page: https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting

You can download the WARC here: https://drive.google.com/file/d/1fJuWwgSTVfh9IdD47RC2lw67tWSryG4S/view?usp=sharing

Additional details

I run Ubuntu 20.04 LTS.

In case the cause is still that some files didn't get crawled, I also made a 3.8 GB version of the archive with no upper bound on --level and with --page-requisites-level=20. The problem persists with the bigger archive, but it is too big to upload here, so I am providing the smaller one for repro.

JubilantJerry added the "bug" and "replay bug" labels on Dec 10, 2024
JubilantJerry (Author) commented

On further investigation, it does seem more like a problem with the crawling. The live page actually uses the <source srcset="..."> tag to display the images, not <img src="...">, and it seems that the URLs in the srcset attribute are not included in the archive.
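
To list the srcset URLs that a crawler would need to fetch, something like the following could be run against the saved HTML. It is a sketch using only the standard library; the names `parse_srcset`, `SrcsetCollector`, and `srcset_urls` are hypothetical. The srcset value is not split on commas directly, because image CDN URLs can themselves contain commas. Instead each URL is read up to the next whitespace, roughly as the HTML Standard describes.

```python
from html.parser import HTMLParser


def parse_srcset(value):
    """Return the URLs in a srcset attribute value, dropping descriptors."""
    urls = []
    pos, n = 0, len(value)
    while pos < n:
        # Skip whitespace and separator commas before the next candidate.
        while pos < n and (value[pos].isspace() or value[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # The URL runs up to the next whitespace character.
        start = pos
        while pos < n and not value[pos].isspace():
            pos += 1
        url = value[start:pos]
        if url.endswith(","):
            # Trailing commas are separators: this candidate has no descriptor.
            url = url.rstrip(",")
        else:
            # Skip the descriptor (e.g. "424w" or "2x") up to the next comma.
            while pos < n and value[pos] != ",":
                pos += 1
        if url:
            urls.append(url)
    return urls


class SrcsetCollector(HTMLParser):
    """Collect srcset URLs from <source> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("source", "img"):
            for name, value in attrs:
                if name == "srcset" and value:
                    self.urls.extend(parse_srcset(value))


def srcset_urls(html):
    """Return every srcset URL found in an HTML document string."""
    collector = SrcsetCollector()
    collector.feed(html)
    collector.close()
    return collector.urls
```

Each URL returned by `srcset_urls` can then be looked up in the archive to confirm which ones were never crawled.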

image

image

I'm now raising the issue with grab-site here and closing this one.
