1. 8

  2. 3

    I’ve been doing some scraping work lately.

    I think this approach makes a lot of sense, since most apps are SPAs now and get their data via remote API calls. UIs that present the data are more likely to change than the underlying API calls that feed them, so this method should arguably be preferable to parsing the HTML source most of the time.

    Besides, since most HTML these days is (unfortunately) not really semantic, it can be unreasonably hard to get to the data I want by parsing HTML, and even when I succeed, the resulting implementation can be very brittle. For example, there are cases where I can only reach the data by relying on purely presentational element classes such as col-2-sm (or worse, a long element hierarchy) because I have no other reasonable choice. And I really don’t feel safe with this kind of implementation :)
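
To make the brittleness concrete, here is a minimal sketch using only the Python standard library. Everything in it is made up for illustration: the two HTML snippets, the JSON body, and the helper names are assumptions, not taken from any real site. It shows a scraper keyed on col-2-sm breaking after a purely cosmetic redesign, while the code that reads the API payload is unaffected.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup: the data is only reachable via a layout class.
HTML_V1 = (
    '<div class="row"><div class="col-2-sm">Widget</div>'
    '<div class="col-2-sm">19.99</div></div>'
)
# Same data after a redesign: the layout classes changed, the data did not.
HTML_V2 = (
    '<div class="grid"><span class="cell-md">Widget</span>'
    '<span class="cell-md">19.99</span></div>'
)
# Hypothetical JSON body returned by the page's API call in both versions.
API_BODY = '{"items": [{"name": "Widget", "price": 19.99}]}'


class ClassTextCollector(HTMLParser):
    """Collect text directly inside elements that carry a given class.

    Simplification: assumes matched elements contain no child elements.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.capturing = self.css_class in classes

    def handle_endtag(self, tag):
        self.capturing = False

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.texts.append(data.strip())


def scrape_html(html):
    """Extract values by presentational class name (the brittle way)."""
    parser = ClassTextCollector("col-2-sm")
    parser.feed(html)
    parser.close()
    return parser.texts


def scrape_api(body):
    """Extract the same values from the JSON payload."""
    return [(item["name"], item["price"]) for item in json.loads(body)["items"]]


print(scrape_html(HTML_V1))  # finds the two cells
print(scrape_html(HTML_V2))  # finds nothing after the redesign
print(scrape_api(API_BODY))  # unaffected by the redesign
```

The HTML scraper does not fail loudly after the redesign; it just returns an empty list, which is exactly the kind of silent breakage that makes me nervous.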

    The obvious downside of this method is that it requires a full-fledged web browser, which is much more costly in terms of resources, performance, and implementation complexity than the traditional method of just fetching the data via any HTTP client and parsing it. So there is a trade-off between cost and reliability here.
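
For a sense of what the browser side involves, here is a rough sketch assuming Playwright for Python (my choice for illustration; any browser automation tool with network hooks would do). The function names and the content-type heuristic are my own assumptions, and a real page may need scrolling, clicks, or a login before it issues the calls you care about.

```python
def is_json_response(content_type):
    """Heuristic filter: treat anything declared as JSON as an API response."""
    return "application/json" in (content_type or "").lower()


def capture_json_responses(page_url):
    """Load page_url in headless Chromium and collect its JSON responses."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    captured = []

    def on_response(response):
        # Playwright exposes response header names in lower case.
        if is_json_response(response.headers.get("content-type")):
            try:
                captured.append((response.url, response.json()))
            except Exception:
                pass  # body unavailable or not valid JSON; skip it

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        page.on("response", on_response)
        # Wait until network activity settles so late API calls are seen.
        page.goto(page_url, wait_until="networkidle")
        browser.close()

    return captured
```

Calling capture_json_responses with a page URL returns a list of (url, parsed JSON) pairs. Even this small version has to start a whole browser process per run, which is where the extra cost comes from.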