This repository has been archived by the owner on Jan 28, 2022. It is now read-only.

crawl only shared_data #31

Open
Stiveknx opened this issue May 3, 2018 · 2 comments

@Stiveknx

Stiveknx commented May 3, 2018

Well, this is more of a suggestion than a "bug" report.

I think you shouldn't load the full Instagram page.
First, the element classes (inside elements.json) change fairly often (I have no idea exactly how often).

So my suggestion is to load just _sharedData from the profile page.
Don't load JavaScript, images, or styles. It's way faster.

Something like this:

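        this.page = await this.browser.newPage();
        // Block heavy resources; the inline _sharedData script is part of the HTML itself.
        await this.page.setRequestInterception(true);
        this.page.on('request', (request) => {
            if (['image', 'stylesheet', 'font', 'script'].includes(request.resourceType())) {
                request.abort();
            } else {
                request.continue();
            }
        });
        await this.page.setExtraHTTPHeaders({
            'Accept-Language': 'pt-BR'
        });
        await this.page.goto('https://instagram.com/' + username, {
            waitUntil: 'networkidle0'
        });
        // `document` only exists inside the page, so read the script text via evaluate().
        const sharedData = await this.page.evaluate(() => {
            const scripts = Array.from(document.querySelectorAll('script'));
            const found = scripts.find((s) => s.textContent.includes('window._sharedData'));
            return found ? found.textContent : '';
        });
        const match = /window\._sharedData = (.*);/.exec(sharedData);
        if (!match) throw new Error('window._sharedData not found');
        const profileData = JSON.parse(match[1]);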

/* Maybe from here you could reuse the "parseData" function from your version 1.0? */
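
The parsing step can also be sketched on its own, without a browser. This is only a rough illustration: it assumes the profile HTML still embeds an inline script of the form `window._sharedData = {...};`, which may no longer be true, and `extractSharedData` and the sample payload are hypothetical names and data, not part of the project.

```javascript
// Hypothetical helper (not part of the project): extract window._sharedData
// from a raw HTML string. Assumes the assignment sits in an inline <script>
// and is terminated by ";</script>".
function extractSharedData(html) {
    const match = /window\._sharedData\s*=\s*(\{.*?\})\s*;\s*<\/script>/s.exec(html);
    if (!match) {
        return null; // marker not found (page layout changed, login wall, ...)
    }
    try {
        return JSON.parse(match[1]);
    } catch (err) {
        return null; // captured text was not valid JSON
    }
}

// Example with a made-up payload, not real Instagram data.
const sampleHtml = '<html><body><script type="text/javascript">' +
    'window._sharedData = {"entry_data":{"ProfilePage":[{"user":"example"}]}};' +
    '</script></body></html>';
const data = extractSharedData(sampleHtml);
console.log(data.entry_data.ProfilePage[0].user);
```

With Puppeteer, the input string could come from `await page.content()`, so no element selectors are needed at all.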
@Stiveknx
Author

Stiveknx commented May 3, 2018

I ended up doing this in my own project (linked below), based on what you wrote.

If you want I can fork and send a PR. Just let me know.

https://gist.githubusercontent.com/Stiveknx/86342c6588371010a30d8239e07df0ad/raw/a11780c36f28b51958501e11285f05bb2ed1b2e0/profilecrawl.ts

@nacimgoura
Owner

Thanks for your comment. It's interesting, and I hadn't thought about it. Open a PR and I'll take a look.
