
Whois timeout and automatic subnet blocking

For a while now I have had a service set up that grabs all IP ranges owned by certain companies. I currently track two, as seen below, though the second was added only recently. After setting this up it ran fine locally, and twice as a cronjob, but then I suddenly got an email that the cronjob had timed out.

#!/bin/bash

set -e

ROOT="$(dirname "$(readlink -f "$0")")"

out_path="$1"

# Query RADB for all routes originated by an ASN, then split the reply:
# IPv4 ("route:") and IPv6 ("route6:") entries each go through the
# collapse script into their own output file.
fetch_data() {
  whois -h whois.radb.net -- "-i origin $1" | tee \
    >(grep "^route[^6]" | awk '{print $2}' | python3 "$ROOT/collapse_addresses.py" > "$out_path/$2.txt") \
    >(grep "^route6" | awk '{print $2}' | python3 "$ROOT/collapse_addresses.py" > "$out_path/${2}6.txt") \
    >/dev/null
}

fetch_data 'AS32934' 'facebook'
fetch_data 'AS45102' 'alibaba'

After some investigation I found that the whois command itself worked fine; I ran it several times without issues. Running the full script manually, however, caused it to hang until it timed out. My initial hunch was that the piping was at fault: megabytes of data getting stuck somewhere in the middle, filling up the buffers, and stalling the pipeline until the service timed out.
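
As a minimal illustration of that failure mode (a sketch, not part of the actual service; the yes command stands in for any chatty child process): a pipe on Linux holds roughly 64 KiB, and a writer blocks as soon as nothing drains the read end.

from subprocess import Popen, PIPE

# 'yes' writes "y\n" forever; since nothing reads p.stdout, the pipe
# buffer fills up and the child blocks on write() indefinitely.
p = Popen(['yes'], stdout=PIPE)
# p.wait()  # would hang forever: the child is stuck on a full pipe
p.kill()
p.wait()  # reap the killed child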

I already had a Python script used specifically to reduce the ranges into smaller, more manageable lists. The whois [1] query returns a bunch of IP ranges registered to a specific ASN [2], but those ranges can be announced from anywhere in the world. Public IP space is often heavily fragmented: even if a company owns one large block of addresses, that block is usually split into many smaller ranges and distributed to servers all over the world. It is therefore up to me to merge these subnets back into their original ranges, which is what the script below does.

#!/usr/bin/env python3
# Read whitespace-separated networks from stdin and print the collapsed set.
from ipaddress import ip_network, collapse_addresses
from sys import stdin

print('\n'.join(ip.compressed for ip in collapse_addresses(ip_network(ip) for ip in stdin.read().split())))
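
As a quick example of what collapse_addresses does (using made-up documentation networks): two adjacent /25s merge back into the /24 that contains them.

from ipaddress import ip_network, collapse_addresses

# Two adjacent /25 networks collapse into the single /24 containing both.
nets = [ip_network('192.0.2.0/25'), ip_network('192.0.2.128/25')]
print([n.compressed for n in collapse_addresses(nets)])  # ['192.0.2.0/24']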

As I already used Python for part of the task, I decided to merge everything into a single script. That way the data is consumed in a single process as it arrives, so no intermediate pipe can fill up, and the whole thing becomes easier to understand.

#!/usr/bin/env python3
import sys
from subprocess import Popen, PIPE
import re
from ipaddress import ip_network, collapse_addresses

if len(sys.argv) != 2:
  print("Requires an out path", file=sys.stderr)
  sys.exit(1)

out_path = sys.argv[1]

def fetch_data(asn, name):
  # Query RADB for all routes originated by the ASN and read the reply
  # line by line, sorting IPv4 and IPv6 routes into separate lists.
  cmd = ['whois', '-h', 'whois.radb.net', '--', f"-i origin {asn}"]
  route = re.compile(r'^route:\s+(.*)')
  route6 = re.compile(r'^route6:\s+(.*)')
  ips = []
  ips6 = []
  with Popen(cmd, stdout=PIPE, bufsize=0, universal_newlines=True) as p:
    for line in p.stdout:
      m = route.match(line)
      if m:
        ips.append(ip_network(m.group(1)))
      m6 = route6.match(line)
      if m6:
        ips6.append(ip_network(m6.group(1)))
  # Merge adjacent subnets back into larger ranges before writing them out.
  collapse_ranges = lambda l: (ip.compressed for ip in collapse_addresses(l))
  with open(f'{out_path}/{name}.txt', 'w') as f:
    print('\n'.join(collapse_ranges(ips)), file=f)
  with open(f'{out_path}/{name}6.txt', 'w') as f:
    print('\n'.join(collapse_ranges(ips6)), file=f)

fetch_data('AS32934', 'facebook')
fetch_data('AS45102', 'alibaba')

This is definitely not the last change I will make, as both switching to a different whois host and adding backoff/retry on failure are still missing features. I previously ran the script daily, but have now changed it to weekly to reduce the load put on their service. I suspect that most companies buy new IP ranges in bulk, and not very often. For instance, Facebook owns just under 500 thousand IPs, and Alibaba owns just over 2.5 million; compare that to the roughly 3.7 billion public IPv4 addresses in the world.
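
The backoff/retry part could look something like the sketch below (hypothetical, and it assumes fetch_data is changed to raise an exception when whois exits non-zero):

import time

def fetch_with_retry(asn, name, attempts=3, delay=60):
  for attempt in range(attempts):
    try:
      fetch_data(asn, name)  # assumed to raise on whois failure
      return
    except Exception:
      if attempt == attempts - 1:
        raise  # give up after the last attempt
      time.sleep(delay)
      delay *= 2  # double the wait between attempts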

As one can see above, I have prepared for IPv6, but I am currently not using it. I would assume that once I add IPv6 support to my homelab, it will be both easier and harder to avoid companies trying to access it.


1. https://en.wikipedia.org/wiki/WHOIS
2. https://en.wikipedia.org/wiki/Autonomous_system_(Internet)