Elixir: Optimising a web crawler
This is a quick summary of how I optimised my personal web crawler. You can check out the source code HERE.
Preconditions, assumptions… things to know before starting to read
- The intention was to create a crawler to scrape specific cooking blogs.
- Hence, all requests are made to the same host.
- The host the tests were run against was https://www.recetasderechupete.com, which has around 1000 recipes.
- This is the HTTP client module:
```elixir
defmodule Spidey.Content do
  def scan(url) when is_binary(url) do
    try do
      url
      |> get!()
      |> parse_links()
    rescue
      # Timeout, wrong url, etc.
      HTTPoison.Error -> []
      # non-HTML format
      CaseClauseError -> []
    end
  end

  def parse_links(html) when is_binary(html) do
    html
    |> Floki.parse_document!()
    |> Floki.find("*[href]")
    |> Floki.attribute("href")
  end

  def get!(url) when is_binary(url) do
    url
    |> HTTPoison.get!([], timeout: 15_000, recv_timeout: 15_000, follow_redirect: true)
    |> Map.get(:body)
  end
end
```
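To get a feel for what `parse_links/1` returns, here is a quick standalone check of the same Floki pipeline, using `Mix.install` to pull Floki into a script (Elixir 1.12+). The HTML snippet is made up for illustration:

```elixir
# Standalone sketch of the parse_links/1 pipeline from the module above.
Mix.install([{:floki, "~> 0.34"}])

html = ~s(<a href="/recipe/paella">Paella</a> <link rel="stylesheet" href="/style.css">)

links =
  html
  |> Floki.parse_document!()
  |> Floki.find("*[href]")
  |> Floki.attribute("href")

IO.inspect(links)
# => ["/recipe/paella", "/style.css"]
```

Note that `*[href]` matches anything with an `href` attribute, stylesheets included, which is why the filters further down reject static assets.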
1st approach: Fan-out/Fan-in with Tasks
I started off with a fan-out/fan-in approach because it was the quickest way to get started. It turned out to be slow whenever the website had a single slow link in it, though.
This is broadly how it worked:
```elixir
def scan_async([]), do: []

def scan_async(urls) when is_list(urls) do
  urls
  |> Enum.map(fn url -> Task.async(fn -> Spidey.Content.scan(url) end) end)
  |> Enum.map(fn t -> Task.await(t, 30_000) end)
  |> List.flatten()
end
```
Pros:
- Dead simple
Cons:
- The shortest request has to wait for the longest one, since the tasks are awaited in order and the next batch can't start until all awaits return.
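As an aside (this wasn't in my original code), the standard library's `Task.async_stream/3` gives a middle ground: it bounds the number of in-flight tasks via `:max_concurrency`, so slow URLs don't stop idle workers from picking up new ones. A sketch, with the scan function passed in to keep the example self-contained:

```elixir
defmodule FanOut do
  # Bounded fan-out: at most 20 tasks run at once; as each finishes,
  # the next URL is picked up immediately.
  def scan_async(urls, scan_fun) when is_list(urls) do
    urls
    |> Task.async_stream(scan_fun, max_concurrency: 20, timeout: 30_000)
    |> Enum.flat_map(fn {:ok, links} -> links end)
  end
end

FanOut.scan_async(["https://a.com", "https://b.com"], fn url -> [url <> "/1"] end)
# => ["https://a.com/1", "https://b.com/1"]
```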
Using pooled workers with GenServers
Since the first approach was very inefficient concurrency-wise, with the whole batch stalling whenever a slow link was found, I decided to fix this with a pool of workers that operate independently. Initially I tried implementing the pool myself, but ended up picking poolboy for simplicity.
Pros:
- More efficient: one worker doesn’t have to wait for all the other workers if a request takes too long.
Cons:
- Much more complex, compared to the three-liner using tasks.
- Requires either implementing a worker pool or using a 3rd-party lib like poolboy.
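For reference, this is roughly what a poolboy setup looks like. It's a sketch under assumptions rather than the crawler's actual code: the pool name `:crawler_pool`, the `Spidey.Worker` module, and the pool sizes are all mine.

```elixir
defmodule Spidey.Worker do
  use GenServer

  # poolboy calls start_link/1 on the worker module for each pool member.
  def start_link(_args), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:scan, url}, _from, state) do
    {:reply, Spidey.Content.scan(url), state}
  end
end

# In the application's supervision tree:
#
# children = [
#   :poolboy.child_spec(:crawler_pool,
#     name: {:local, :crawler_pool},
#     worker_module: Spidey.Worker,
#     size: 20,          # steady-state workers
#     max_overflow: 5    # extra workers allowed under load
#   )
# ]

# Each URL is then scanned by checking a worker out of the pool:
#
# :poolboy.transaction(:crawler_pool, fn pid ->
#   GenServer.call(pid, {:scan, url}, 30_000)
# end)
```

The key property is that a worker stuck on a slow request only blocks itself; the other workers keep draining the queue.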
Using lazy collections
Once the concurrency solution was in place, I started optimising different parts of the code, starting by how I was handling the crawled data.
To work with the results of each crawled site I was initially using the Enum module. Since there were a lot of iterations for filtering and the like, I decided to switch to lazy enumerables (the Stream module). This shaved off a bloody 20% of the total time (from ~200s to ~160s). Ridiculous.
These are some of the filters; many more are applied to reject static assets, URLs from other domains, etc.:
```elixir
def strip_query_params(urls) do
  Stream.map(urls, fn s -> String.split(s, "?") |> List.first() end)
end

def strip_trailing_slashes(urls) do
  Stream.map(urls, fn s -> String.replace_trailing(s, "/", "") end)
end

def reject_invalid_urls(urls) do
  urls
  |> Stream.reject(&is_nil/1)
  |> Stream.reject(&(&1 == ""))
end
```
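To see why the laziness pays off: each Enum step materialises a full intermediate list, whereas Stream fuses the filters into a single pass that only runs when the pipeline is finally enumerated. A self-contained sketch (the module name is mine; the filter bodies match the ones above):

```elixir
defmodule UrlFilters do
  def strip_query_params(urls),
    do: Stream.map(urls, fn s -> String.split(s, "?") |> List.first() end)

  def strip_trailing_slashes(urls),
    do: Stream.map(urls, fn s -> String.replace_trailing(s, "/", "") end)

  def reject_invalid_urls(urls) do
    urls
    |> Stream.reject(&is_nil/1)
    |> Stream.reject(&(&1 == ""))
  end
end

# Nothing is computed until Enum.to_list/1 forces the stream; rejecting
# nils first keeps the String functions from crashing on bad input.
["https://a.com/post/?p=1", "", nil, "https://a.com/about"]
|> UrlFilters.reject_invalid_urls()
|> UrlFilters.strip_query_params()
|> UrlFilters.strip_trailing_slashes()
|> Enum.to_list()
# => ["https://a.com/post", "https://a.com/about"]
```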
Keeping the TCP connection alive
At this point, another important optimisation was to keep the TCP connection alive. All requests were being made to the same host, so skipping the repeated TCP handshakes was bound to shave some time. Well, this was another big fat 10%: from ~160s to ~140s.
All as simple as telling hackney to use a connection pool:
```elixir
def get!(url) when is_binary(url) do
  HTTPoison.get!(url, [],
    timeout: 15_000,
    recv_timeout: 15_000,
    follow_redirect: true,
    hackney: [pool: :default] # This is the key line.
  )
end
```
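If you want more control than the `:default` pool gives, hackney also lets you run a dedicated, named pool with an explicit size and checkout timeout. A sketch; the pool name `:spidey_pool` and the numbers are mine:

```elixir
# In the application's supervision tree: a named hackney pool sized
# for the crawler's concurrency (these numbers are illustrative).
children = [
  :hackney_pool.child_spec(:spidey_pool, timeout: 15_000, max_connections: 50)
]

# Then point HTTPoison at it instead of :default:
HTTPoison.get!(url, [],
  timeout: 15_000,
  recv_timeout: 15_000,
  follow_redirect: true,
  hackney: [pool: :spidey_pool]
)
```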
The real deal: Domain optimisations
Another optimisation I could have made is to use an ETS table as a queue instead of an Agent wrapping a :queue, but at this stage that would have been peanuts IMO, so I decided the way to go was to optimise the actual solution to the problem. The library is a generic crawler, but I developed it specifically to crawl recipe websites. Taking this into account, I started improving the filters:
- Excluding WordPress links
- Excluding static content
- Excluding RSS feed links
- …
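These filters fit straight into the Stream pipeline from before. An illustrative take (the exact patterns the crawler uses may differ, and the helper names are mine):

```elixir
defmodule RecipeFilters do
  # WordPress plumbing (/wp-admin, /wp-content, /wp-json, ...) rather
  # than actual posts.
  def reject_wordpress_links(urls),
    do: Stream.reject(urls, &String.contains?(&1, "/wp-"))

  # Static assets, matched by file extension.
  def reject_static_assets(urls),
    do: Stream.reject(urls, &String.match?(&1, ~r/\.(css|js|png|jpe?g|gif|svg|ico)$/))

  # RSS feed endpoints.
  def reject_rss_feeds(urls),
    do: Stream.reject(urls, &(String.ends_with?(&1, "/feed") or String.ends_with?(&1, "/feed/")))
end

[
  "https://blog.example/recipes/paella",
  "https://blog.example/wp-content/uploads/img.png",
  "https://blog.example/recipes/feed"
]
|> RecipeFilters.reject_wordpress_links()
|> RecipeFilters.reject_static_assets()
|> RecipeFilters.reject_rss_feeds()
|> Enum.to_list()
# => ["https://blog.example/recipes/paella"]
```

Every URL rejected here is a whole HTTP request that never happens, which is why this class of filter dwarfs the earlier software-level tweaks.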
And with this, I started shaving off actually relevant time. Even better, I learnt a really good lesson: tweaking software for performance matters, but optimising the actual solution is what really makes it or breaks it. Just by excluding WordPress-specific links I shaved 22% off the time the crawler took to crawl a full blog with ~1000 links, and that has nothing to do with the software approach.
Final thoughts
TL;DR: It has been really fun to develop a half-assed worker pool in Elixir, even though I eventually ended up picking a library. I've found it incredibly satisfying to optimise an already existing solution… and I've become VERY aware of how important it is to know what you want. Generic solutions are great, but specific ones kick ass at the specific problem.