img tags with missing src which are set via javascript or noscript show as empty #18

Open
Pranoy1c opened this issue Jun 10, 2021 · 0 comments

Pranoy1c commented Jun 10, 2021

The following page:

https://netflixtechblog.com/full-cycle-developers-at-netflix-a08c31f83249

has img tags with no src attribute. I think the src is set via JavaScript on scroll; a fallback img carrying the real src sits in a noscript tag right after each img tag.

Here's a piece of the page's HTML:

<img alt="" class="iq ir t u v is ak c" width="687" height="60" role="presentation"><noscript><img alt="" class="t u v is ak" src="https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png" width="687" height="60" srcSet="https://miro.medium.com/max/552/1*JnixtUHJjNYXNT15P42eJQ.png 276w, https://miro.medium.com/max/1104/1*JnixtUHJjNYXNT15P42eJQ.png 552w, https://miro.medium.com/max/1280/1*JnixtUHJjNYXNT15P42eJQ.png 640w, https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png 687w" sizes="687px" role="presentation"/></noscript></div></div></div><figcaption class="jd je cm ck cl jf jg en b eo ep fv" data-selectable-paragraph="">SDLC components</figcaption></figure>

This causes Readability to return empty images in place of the large images, and only tiny thumbnails when using ReadabilityExtended.

I was able to work around the issue by finding every img tag with a missing or empty src, checking whether that element has a noscript sibling containing an img, and if so copying the src from the noscript img onto the original img.

I placed the following code at the very beginning of the protected open fun removeNoscripts(document: Document) {} function in Preprocessor.kt:

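// Requires: import org.jsoup.Jsoup and import org.jsoup.parser.Parser
document.select("img[src=\"\"], img:not([src])").forEach { img ->
    // Catch per image so one malformed noscript block does not skip the rest.
    try {
        // Look for a noscript sibling that carries the real image.
        val noscript = img.siblingElements().select("noscript").firstOrNull() ?: return@forEach
        // Re-parse the noscript content and read the src of its first img, if any.
        val fallbackSrc = Jsoup.parse(noscript.html(), "", Parser.xmlParser())
            .selectFirst("img")?.attr("src")
        // Only overwrite when the noscript img actually provides a src.
        if (!fallbackSrc.isNullOrEmpty()) {
            img.attr("src", fallbackSrc)
        }
    } catch (e: Exception) {
        println("Exception in setting img for missing src from noscript tags: ${e.message}")
    }
}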
