Gonçalo Palma
October 5, 2021

Web Scrapers in Dart

Imagine this - your hometown has a local exposition fair filled with many stands, from food to activities. Since you’re passionate about that yearly fair, you decide to create a small web application with a map where visitors can find their place and eventually rate their experience.

You start the project, build up the designs but then remember one important fact - “wait, I don’t have an API to get all the stands! Will I have to input them by hand?”.

And before you start panicking about creating a new Excel file and typing in dozens of entries by hand, let me introduce you to Web Scraping!

Web Scraping

In a nutshell, Web Scraping is when you create a small script whose objective is to:

  1. Access a website and read the HTML file
  2. Find relevant information that you need
  3. Organize that information

A quick Google search will show you dozens of tutorials that use Python - the language's low barrier to entry, plus all the great libraries already created specifically for web scraping and data science, make it one of the most popular choices for the task.

However, we are Dart developers, so do we really need to learn a new language to do Web Scraping?

Web Scraping in Dart

In Dart, we can quickly access a lot of web-related helpers with the dart:html import. However, if we want to develop either a multi-platform app or a console application that relies on dart:io, we will need to find an alternative.

Thankfully, universal_html helps us by making dart:html cross-platform! With it, we can:

  1. Open a website locally with WindowController
  2. Access the document for that HTML page
  3. Search the HTML document with querySelectorAll

As an example, let’s use the Expo 2020 Dubai website. Our objective is to have a list of all the countries that have pavilions plus a URL that points to the pavilions’ page. This information is stored on the following page: Country Pavilions | Expo 2020 Dubai.

Analysing the HTML Document

Before writing a single line of code, we must first answer the question: where in the HTML document is the information that we need?

So let’s start with a simple search for a country, for example, Portugal. If we do so, we will find 4 results, all included in this line:

<li class="search__results-item"><a class="search__results-link" data-country="Portugal Pavilion" data-category="Portugal Pavilion" href="/en/understanding-expo/participants/country-pavilions/portugal">Portugal Pavilion</a></li>

This <li> element is inside a <ul> that contains a list of all the countries that have pavilions in Expo 2020, which is exactly what we were looking for!

So the next question is - how can we access it?

The Document.querySelectorAll() JavaScript function will help us here. As seen from the documentation, we can quickly search for a type of Element with a specific class:

document.querySelectorAll("div.note, div.alert")

The above JavaScript snippet will search for all the div elements with class note or alert.

In our case, we want to search for li elements with the class search__results-item. In fact, if we use the browser’s Dev Tools console on the Expo 2020 pavilions page to search for it, we get a list of all the li elements:

document.querySelectorAll("li.search__results-item")
/// NodeList(191) [li.search__results-item, li.search__results-item, li.search__results-item, li.s

Looking at the results, we see that inside each <li> element we have an <a> element that holds all the information that we need: the country’s name and its URL.

<a class="search__results-link" data-country="Portugal Pavilion" data-category="Portugal Pavilion" href="/en/understanding-expo/participants/country-pavilions/portugal">Portugal Pavilion</a>

Since we have all our information, let’s code our simple web scraper!

Getting the HTML document

First things first - let’s create a new Dart project:

dart create expo_simple_scraper

Then, we will need to add universal_html as a dependency in the pubspec.yaml:

dependencies:
  universal_html: ^2.0.8

With this library, we are able to load a new website with the WindowController’s openUri function:

import 'package:universal_html/controller.dart';

void main(List<String> arguments) async {
  final controller = WindowController();
  await controller.openUri(Uri.parse("https://www.expo2020dubai.com/en/understanding-expo/participants/country-pavilions"));
}

Searching for the li elements

Now that we have loaded the website, we can go ahead and search for the <li> elements with the class name search__results-item. Thankfully, the syntax is the same as the one we used before in JavaScript:

void main(List<String> arguments) async {
  /// ...

  // The above URL will have a <ul> in which each item, from class `search__results-item`
  // will have as a value a country pavilion
  final pavillionsLiElements = controller.window?.document?.querySelectorAll("li.search__results-item") ?? [];
}

Obtaining data from the a elements

Now that we have a list with all the <li> elements we can:

  1. Search each element for the <a> element;
  2. Retrieve the country’s name via the text property
  3. Retrieve the URL via the href property

Dart has a representation for the HTML element classes, which all have the Element suffix - for example, LIElement for <li> and AnchorElement for <a>.

Furthermore, we know that we will want a resulting JSON file that has a list of all the countries such as:

{
    "country": "India",
    "url": "https://www.expo2020dubai.com/en/understanding-expo/participants/country-pavilions/india"
},
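To check that a list of maps really serializes to that shape, we can run dart:convert's jsonEncode on a one-entry list (the India entry above) in isolation:

```dart
import 'dart:convert';

void main() {
  // A list of maps serializes directly to the JSON shape shown above.
  final pavillions = <Map<String, String>>[
    {
      "country": "India",
      "url": "https://www.expo2020dubai.com/en/understanding-expo/participants/country-pavilions/india"
    },
  ];

  // jsonEncode produces compact JSON (no whitespace between tokens).
  print(jsonEncode(pavillions));
  // [{"country":"India","url":"https://www.expo2020dubai.com/en/understanding-expo/participants/country-pavilions/india"}]
}
```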

Combining it all together, we have the following:

import 'package:universal_html/html.dart';

const String baseExpoUrl = "www.expo2020dubai.com";

void main(List<String> arguments) async {
  /// ...

  final pavillions = <Map<String, String>>[];

  for (LIElement listItem in pavillionsLiElements) {
    final aElement = listItem.children.first as AnchorElement;

    // All countries will have " Pavilion" at the end, eg. "Portugal Pavilion",
    // so we must trim the string, and remove that.
    final country = aElement.text?.split(" Pavilion").first ?? "";

    // The href for each item will be relative, eg.: "en/understanding-expo/participants/country-pavilions/uk"
    // and [WindowController] will add `http://localhost` as the baseUrl for all relative URLs
    // which means we must remove the base url
    final url = Uri.parse(aElement.href ?? "").replace(scheme: "https", host: baseExpoUrl);
    pavillions.add(<String, String> {
      "country": country,
      "url" : url.toString(),
    });
  }
}
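The URL-rewriting step is worth isolating: since WindowController resolves relative hrefs against http://localhost, Uri.replace can swap in the real scheme and host while keeping the path. A standalone sketch:

```dart
void main() {
  // WindowController resolves relative hrefs against http://localhost,
  // so a scraped href comes back looking like this:
  final scraped = Uri.parse("http://localhost/en/understanding-expo/participants/country-pavilions/uk");

  // replace() swaps the scheme and host while keeping the path intact.
  final fixed = scraped.replace(scheme: "https", host: "www.expo2020dubai.com");

  print(fixed);
  // https://www.expo2020dubai.com/en/understanding-expo/participants/country-pavilions/uk
}
```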

Creating a JSON file with the output

Now that we have all the data we need, we just have to write it in a new file!

For that, we can use the File class:

import 'dart:convert';
import 'dart:io' as io;

void main(List<String> arguments) async {
  // ....

  // `recursive: true` will create any missing folders in the path
  final output = await io.File("output/countries.json").create(recursive: true);
  await output.writeAsString(jsonEncode(pavillions));
}

This will create a new file called countries.json inside the output folder with the JSON we created in the last step.

Now onto the moment of truth!

To run our simple script we can use:

dart run bin/expo_simple_scraper.dart

And voilà! 🚀

We see our new file created with all the information that we need!

Conclusion

With just a dozen lines of code, we have created a small script that gets us all the information we needed to create our product!

However, with this script, we can scrape multiple websites at the same time or even access different pages on the same website in the span of a few seconds. This means that if we are not careful with our scripts, we can quickly overload the servers, which can cause monetary loss for the company or individual that owns the website. This leads to a whole different topic - the ethics of web scraping. Thankfully, Roberto Rocha wrote an insightful article, “On the ethics of web scraping”, which delves into other important topics, such as whether we are allowed to take that data and what we can do with it.

Also, when creating a new web scraper, it would be good to add some information to the request headers, where we could potentially add a description and our e-mail address so that it’s easier for the website owner to reach us, as Lam Thuy Vo describes in Mining Social Media: Finding Stories in Internet Data:

{
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36",            
    "from": "Your name example@domain.com"
}

However, at the time of writing, I could not find any way to attach this information to the request with the WindowController class.
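One possible workaround (a sketch under assumptions, not something the WindowController API supports) is to fetch the HTML ourselves with the http package, which does accept custom headers, and then parse the response body with universal_html’s parsing library; both package:http and the parseHtmlDocument import are assumptions here, and the user-agent string is a hypothetical placeholder:

```dart
import 'package:http/http.dart' as http;
import 'package:universal_html/parsing.dart';

Future<void> main() async {
  final url = Uri.parse("https://www.expo2020dubai.com/en/understanding-expo/participants/country-pavilions");

  // Unlike WindowController's openUri, package:http lets us attach
  // custom headers to the request.
  final response = await http.get(url, headers: {
    "user-agent": "MyScraper/1.0 (hypothetical example)",
    "from": "Your name example@domain.com",
  });

  // Parse the raw HTML and query it the same way as before.
  final document = parseHtmlDocument(response.body);
  final pavillionsLiElements = document.querySelectorAll("li.search__results-item");
  print(pavillionsLiElements.length);
}
```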

So now I’m curious!

Will you be using Dart for your next Web Scraper? If so, what are you going to use it for? Share it with me on Twitter @gonpalma!

You can check the Github Repository here:

https://github.com/Vanethos/dart-web-scrapper-example

Follow me!

I often share some small insights on Flutter 💙