Index of /datasets/topology/ark/ipv4/itdk/2020-08
Name                                 Last modified     Size  Description
README.txt                           2020-12-11 18:10  16K
itdk-run-20200819-dns-names.txt.bz2  2020-09-06 14:51  22M   IPv4 Routed /24 DNS Names Dataset
itdk-run-20200819.addrs.bz2          2020-09-12 22:19  9.0M  Macroscopic Internet Topology Data Kit (ITDK)
kapar-midar-iff.ifaces.bz2           2020-09-16 19:48  532M
kapar-midar-iff.links.bz2            2020-09-16 19:42  562M
kapar-midar-iff.nodes.as.bz2         2020-09-24 15:08  104M
kapar-midar-iff.nodes.bz2            2020-09-16 19:36  436M
kapar-midar-iff.nodes.geo.bz2        2020-09-23 16:37  213M
midar-iff.ifaces.bz2                 2020-09-11 12:55  533M
midar-iff.links.bz2                  2020-09-11 12:50  570M
midar-iff.nodes.as.bz2               2020-09-25 02:53  105M
midar-iff.nodes.bz2                  2020-09-11 12:45  439M
midar-iff.nodes.geo.bz2              2020-09-23 02:41  212M
=============================================================================
CAIDA's Macroscopic Internet Topology Data Kit
ITDK 2020-08
http://www.caida.org/data/active/internet-topology-data-kit/
=============================================================================
NOTE: The AS assignment files (*.as.bz2) were replaced with corrected
versions on Dec 11, 2020 (the files themselves have a modification
date of Sep 24/25, 2020). This doesn't change any of the AS
assignments that existed in the older files (with modification
dates of Sep 19, 2020); the fix only increases the number of AS
assignments included in the files.
NOTE: The format of the .nodes files has slightly changed.
In ITDK release 2013-04 and earlier, we used addresses in
0.0.0.0/8 instead of 224.0.0.0/3 for non-real addresses.
NOTE: This README contains the full details of data collection that the
ITDK webpage lacks, so you will want to read over this file even
though some text duplicates the webpage (general description,
file formats, and data use terms).
The ITDK contains data about connectivity and routing gathered
from a large cross-section of the global Internet.
At present, this ITDK release consists of
(1) two related IPv4 router-level topologies,
(2) router-to-AS assignments,
(3) geographic location of each router, and
(4) DNS lookups of all observed IP addresses.
We plan to expand this release with other complementary datasets as
they become available (more details are available at the ITDK URL
above).
The two included IPv4 router-level topologies are generated from the same
IPv4-level topology but differ in the accuracy and completeness of the
alias resolution performed to create them. The first topology is
derived from aliases resolved with MIDAR and iffinder, which yield the
highest confidence aliases with few false positives. The second
topology also uses MIDAR and iffinder but further includes aliases
resolved with kapar, which significantly increases the coverage of
aliases but at the cost of false positives (which inflate the size of
routers and decrease the router count). Researchers should choose the
topology to use depending on the relative importance they place on
accuracy vs. comprehensiveness of alias resolution. Choose the most
accurate alias resolution if uncertain about which to use.
----------------------------------------------------------------------
Tools used:
* MIDAR: Monotonic ID-based Alias Resolution
http://www.caida.org/tools/measurement/midar/
http://www.caida.org/workshops/isma/1002/slides/aims1002_yhyun_midar.pdf
* iffinder: Mercator-style common source address alias resolution
http://www.caida.org/tools/measurement/iffinder/
* kapar: analytical alias resolver and topology generator
http://www.caida.org/tools/measurement/kapar/
http://www.caida.org/publications/papers/2010/alias_resolution/
* RouterToAsAssignment: analytical AS ownership resolver
http://www.caida.org/publications/papers/2010/as_assignment/
* qrrs: bulk DNS lookup tool
(in development)
* DDec: DNS Decoding database
http://ddec.caida.org
* MaxMind's free GeoLite City database
http://www.maxmind.com/app/geolitecity
* BordermapIT for AS assignments
http://www.caida.org/publications/papers/2016/bdrmap/
https://www.seas.upenn.edu/~amarder/resources/mapit.pdf
Source datasets:
* IPv4 Routed /24 Topology Dataset
http://www.caida.org/data/active/ipv4_routed_24_topology_dataset.xml
Data collection:
The MIDAR alias resolution run was performed 2020-08-22 to
2020-09-06 on 19 monitors (in 14 countries) using:
* 2.97 million addresses extracted from the IPv4 Routed /24
Topology Dataset ("Ark Routed /24 traces") for the period
2020-08-07 to 2020-08-20. We used 39 cycles of traces
(cycles 8674 to 8712, all from team 1) from 132 monitors
in 48 countries (all active Ark monitors, not just the
subset used for MIDAR).
(The file itdk-run-20200819.addrs.bz2 contains the target addresses
used for the ITDK run.)
When extracting IP addresses from traceroute paths for use as MIDAR
and iffinder targets (see below for details of the iffinder run), we
only include addresses that could potentially be routers; that is, we
only include addresses that appeared as an intermediate hop in some
traceroute path, which means we exclude the responding destination
address from each trace.
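The target selection rule above can be sketched in a few lines of
Python. This is only an illustration, not the code used for the ITDK
run: the representation of a trace as a (hops, destination) pair and
the function name select_targets are assumptions made for this example,
and the addresses are documentation addresses, not dataset values.

```python
def select_targets(traces):
    """Collect addresses seen as intermediate hops.

    traces: iterable of (hops, destination) pairs, where hops is the list
    of responding addresses in path order and destination is the probed
    address (hypothetical data model, assumed for this sketch).
    """
    targets = set()
    for hops, destination in traces:
        for addr in hops:
            # Exclude the responding destination of this trace.
            if addr != destination:
                targets.add(addr)
    return targets


# Made-up traces using documentation addresses.
traces = [
    (["192.0.2.1", "198.51.100.7", "203.0.113.9"], "203.0.113.9"),
    (["192.0.2.1", "203.0.113.9", "198.51.100.20"], "198.51.100.20"),
]
targets = select_targets(traces)
```

Note that 203.0.113.9 is still selected here: it is the destination of
the first trace but an intermediate hop in the second.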
For the kapar alias resolution run, we used the same set of
traces from the Routed /24 Topology Dataset as the MIDAR run
(see description above). These traces contributed the underlying
IP-level topology from which we constructed both router-level
topologies included in this ITDK.
NOTE: Unlike the MIDAR target list, the generated router-level graphs
also contain the responding destinations and Ark monitors as
nodes.
The iffinder alias resolution run was performed on 2020-08-27 during
the MIDAR run using the same target addresses as MIDAR. We
ran iffinder on 112 monitors, a superset of those used for MIDAR, with
each monitor independently probing the full set of iffinder targets
in a per-monitor randomized order.
For AS assignments, we used RIPE and RouteViews BGP tables, RIR
delegations, and PeeringDB.
We use a combination of publicly known Internet eXchange (IX) point
information, DDec hostname mapping, and MaxMind's free GeoLite City
database to provide the geographic location (at city granularity) of
routers in the router-level graph. See the ITDK web page for
further details on our geolocation method.
For details of the DNS names data collection, see the section
below describing the available DNS files and their formats.
----------------------------------------------------------------------
Each router-level topology is provided in two files, one giving the
nodes and another giving the links. There are also files that
assign ASes and geolocation to each node.
IPv4 Router Topology A (accurate alias resolution):
==================================================
midar-iff.nodes
midar-iff.links
Router topology based on aliases discovered by MIDAR and iffinder.
This topology contains fewer aliases, but has a low false positive
rate for aliases.
midar-iff.nodes.as
midar-iff.nodes.geo
IPv4 Router Topology B (comprehensive alias resolution):
=======================================================
kapar-midar-iff.nodes
kapar-midar-iff.links
Router topology based on aliases discovered by MIDAR, iffinder, and kapar.
The addition of the kapar algorithm makes this topology more complete
with respect to aliases, but also gives it a higher false positive
rate for aliases.
kapar-midar-iff.nodes.as
kapar-midar-iff.nodes.geo
File Formats:
============
.nodes
The nodes file lists the set of interfaces that were inferred to
be on each router.
Format: node <node_id>: <i1> <i2> ... <in>
Example: node N33382: 4.71.46.6 192.8.96.6 0.4.233.32
Each line indicates that a node node_id has interfaces i_1 to i_n.
Interface addresses in 224.0.0.0/3 (IANA multicast and reserved space)
are not real addresses. They were artificially generated to identify
potentially unique non-responding interfaces in traceroute paths.
The IPv6 dataset uses IPv6 multicast addresses (FF00::/8) to indicate
non-responding interfaces in traceroute paths.
NOTE: In ITDK release 2013-04 and earlier, we used addresses in
0.0.0.0/8 instead of 224.0.0.0/3 for these non-real addresses.
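As an illustration of the .nodes format, the following Python sketch
parses one record and separates real addresses from the artificially
generated ones. The function names are hypothetical, and the sketch
assumes that any line not starting with "node " can be skipped.

```python
import ipaddress

# Ranges used for non-real (placeholder) addresses, per the notes above.
PLACEHOLDER_NETS = (
    ipaddress.ip_network("224.0.0.0/3"),  # current releases
    ipaddress.ip_network("0.0.0.0/8"),    # release 2013-04 and earlier
)


def parse_node_line(line):
    """Parse 'node N1: a b c' into (node_id, [addresses]); None otherwise."""
    if not line.startswith("node "):
        return None  # assumption: other lines carry no node record
    head, _, rest = line.partition(":")
    node_id = head.split()[1]
    return node_id, rest.split()


def is_placeholder(addr):
    """True if addr falls in one of the non-real address ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PLACEHOLDER_NETS)


node_id, ifaces = parse_node_line(
    "node N33382: 4.71.46.6 192.8.96.6 0.4.233.32")
real = [a for a in ifaces if not is_placeholder(a)]
```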
.links
The links file lists the set of routers and router interfaces
that were inferred to be sharing each link. Note that these are
IP layer links, not physical cables or graph edges. More than
two nodes can share the same IP link if the nodes are all
connected to the same layer 2 switch (POS, ATM, Ethernet, etc).
Format: link <link_id>: <N1>[:<i1>] <N2>[:<i2>] [<N3>[:<i3>] ... <Nm>[:<im>]]
Example: link L104: N242484:211.79.48.158 N1847:211.79.48.157 N5849773
Each line indicates that a link link_id connects nodes N_1 to
N_m. If it is known which router interface is connected to the
link, then the interface address is given after the node ID
separated by a colon (e.g., "N1:1.2.3.4"); otherwise, only the
node ID is given (e.g., "N1").
By joining the node and link data, one can obtain the _known_ and
_inferred_ interfaces of each router. Known interfaces actually
appeared in some traceroute path. Inferred interfaces arise when
we know that some router N_1 connects to a known interface i_2 of
another router N_2, but we never saw an actual interface on the
former router. The interfaces on an IP link are typically
assigned IP addresses from the same prefix, so we assume that
router N_1 must have an inferred interface from the same prefix
as i_2.
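A minimal Python sketch of reading a .links record follows. It splits
each member into a node ID and an optional interface address, then
separates members with a known interface from those whose interface
would have to be inferred. The function names are hypothetical, and the
sketch assumes that lines not starting with "link " can be skipped.

```python
def parse_link_line(line):
    """Parse 'link L1: N1:1.2.3.4 N2' into (link_id, members).

    members is a list of (node_id, address) pairs; address is None when
    only the node ID is given.
    """
    if not line.startswith("link "):
        return None  # assumption: other lines carry no link record
    head, _, rest = line.partition(":")
    link_id = head.split()[1]
    members = []
    for token in rest.split():
        node_id, sep, addr = token.partition(":")
        members.append((node_id, addr if sep else None))
    return link_id, members


def split_known_inferred(members):
    """Separate members with a known interface from those without one."""
    known = [(n, a) for n, a in members if a is not None]
    inferred = [n for n, a in members if a is None]
    return known, inferred


link_id, members = parse_link_line(
    "link L104: N242484:211.79.48.158 N1847:211.79.48.157 N5849773")
known, inferred = split_known_inferred(members)
```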
.nodes.as
The node-AS file assigns an AS to each node found in the nodes
file. We used BordermapIT to infer the owner AS of each node.
Format: node.AS <node_id> <AS> <method>
Example: node.AS N39 17645 election
Each line indicates that the node node_id is owned/operated by
the given AS, as inferred with the given method. There are three
inference methods:
1. single: a router has only a single choice of AS
2. election: multiple ASes are present on a router, and one AS
occurs more frequently than the rest
3. election+degree: multiple ASes are present on a router, but
no AS occurs the most frequently, so the choice is based on
AS degree
Addresses that belong to the address space of an Internet exchange
point (as self-identified in PeeringDB: https://www.peeringdb.com/)
are excluded from the AS analysis, as we don't consider them to be
part of the AS-level topology.
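The .nodes.as format can be read with a short Python sketch such as the
one below. The function name is hypothetical, and the sketch assumes
that the AS field is a plain decimal number, as in the example above.

```python
from collections import Counter


def parse_node_as_line(line):
    """Parse 'node.AS N39 17645 election' into (node_id, asn, method)."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "node.AS":
        return None  # assumption: other lines carry no assignment
    _, node_id, asn, method = fields
    return node_id, int(asn), method


lines = ["node.AS N39 17645 election"]
records = [r for r in map(parse_node_as_line, lines) if r is not None]

# Tally how many nodes were assigned by each inference method.
method_counts = Counter(method for _, _, method in records)
```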
.nodes.geo
The node-geolocation file contains the geographic location of
each node in the nodes file. We first map each interface on a
router to a location. If all interfaces map to the same
location, then we assign that location to the router; otherwise,
we do not assign any location to the router (that is, the router
does not appear in the geolocation file).
Format: node.geo <node_id>: <continent> <country> <region> \
<city> <latitude> <longitude>
Example: node.geo N15: *** US HI Honolulu 21.3267 -157.8167
Each line indicates that the node node_id has the given geographic
location. The fields have the following meanings:
<continent>: currently always "***".
<country>: the two-letter ISO 3166 Country Code along with the
following codes specific to GeoLite City for
uncertain situations:
* A1: anonymous proxy,
* A2: satellite-based Internet provider,
* EU: Europe,
* AP: Asia/Pacific Region, and
* US: includes overseas US military bases.
<region>: for US/Canada, the two-letter ISO-3166-2 code for the
state/province, along with AA, AE, and AP for Armed
Forces America, Europe, and Pacific, respectively;
for outside US/Canada, the two-letter FIPS 10-4 code.
<city>: city or town in ISO-8859-1 encoding (up to 255 characters).
<latitude> and <longitude>: signed floating point numbers.
The above description is derived from the authoritative MaxMind
documentation available at http://www.maxmind.com/app/city#api
Additional references:
* ISO 3166 country codes:
http://www.maxmind.com/app/iso3166
* inside US/Canada ISO-3166-2 state/province codes:
http://www.maxmind.com/app/iso3166_2
* outside US/Canada FIPS 10-4 state/province codes:
http://www.maxmind.com/app/fips10_4
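The following Python sketch parses a .nodes.geo record laid out as in
the example above. It assumes whitespace-separated fields with the
continent, country, and region always present, and it rejoins a city
name that contains spaces. If the actual files use another delimiter or
leave fields empty, the splitting would need to be adapted. The function
name is hypothetical.

```python
def parse_node_geo_line(line):
    """Parse a node.geo record into a dict; None for other lines."""
    tokens = line.split()
    # Need the tag, node ID, three location codes, at least one city
    # token, and the two coordinates.
    if len(tokens) < 8 or tokens[0] != "node.geo":
        return None
    return {
        "node_id": tokens[1].rstrip(":"),
        "continent": tokens[2],
        "country": tokens[3],
        "region": tokens[4],
        # Everything between the region and the coordinates is the city.
        "city": " ".join(tokens[5:-2]),
        "latitude": float(tokens[-2]),
        "longitude": float(tokens[-1]),
    }


geo = parse_node_geo_line(
    "node.geo N15: *** US HI Honolulu 21.3267 -157.8167")
```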
.ifaces
This file provides additional information about all interfaces
included in the provided router-level graphs:
Format: <address> [<node_id>] [<link_id>] [T] [D]
Each of the fields in square brackets may or may not be present.
Example: 1.0.174.107 N34980480 D
Example: 1.0.101.6 N18137917 L537067 T
Example: 1.28.124.57 N45020
Example: 11.3.4.2 N18137965 L537125 T D
Example: 1.0.175.90
<node_id> starts with "N" and identifies the node (alias set) to which
the address belongs. An address may not have a node_id if no aliases
were found.
<link_id> starts with "L" and identifies the link to which the address is
attached, if known. An address will not have a link_id if it was
obtained from a source other than traceroute or appeared only as the
first public address in a traceroute (i.e., the source and all other hops
preceding this address were either private addresses or nonresponsive).
"T" indicates that the address appeared in at least one traceroute as a
transit hop, i.e., preceded by at least one (public or private) address
(including the source) and followed by at least one public address
(including the destination). An address does not qualify as a transit
hop if it was seen only in these situations: it was obtained from a
source other than traceroute; it was the source or destination of a
traceroute; or it was the last responding public address to appear in a
traceroute.
"D" indicates that the address appeared in at least one traceroute as a
responding destination hop.
"T" and "D" are not mutually exclusive -- an address may have been a
transit hop in one traceroute and the destination in another.
An interface address will have "T" but not "L<link_id>" if it appeared
only as the first public address in a traceroute.
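Because every field after the address is optional, a parser for the
.ifaces format has to classify each token by its shape. The Python
sketch below does this; the function name and the dictionary keys are
hypothetical choices made for this example.

```python
def parse_iface_line(line):
    """Parse '<address> [<node_id>] [<link_id>] [T] [D]' into a dict."""
    tokens = line.split()
    if not tokens:
        return None  # blank line
    record = {
        "address": tokens[0],
        "node_id": None,
        "link_id": None,
        "transit": False,
        "destination": False,
    }
    for token in tokens[1:]:
        if token == "T":
            record["transit"] = True
        elif token == "D":
            record["destination"] = True
        elif token.startswith("N"):
            record["node_id"] = token
        elif token.startswith("L"):
            record["link_id"] = token
    return record


full = parse_iface_line("11.3.4.2 N18137965 L537125 T D")
bare = parse_iface_line("1.0.175.90")
```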
DNS Names:
=========
There are two related DNS names datasets, and you should choose the one to
use based on your specific needs:
1. If you would like to know what the DNS names were at about the time
that addresses were observed in the traces of the IPv4 Routed /24
Topology Dataset used for this ITDK, then you should download the
relevant portion of the IPv4 Routed /24 DNS Names Dataset, generated
with qrrs, from
http://www.caida.org/data/active/ipv4_dnsnames_dataset.xml
https://topo-data.caida.org/team-probing/list-7.allpref24/dns-names/
The traces used for this ITDK were collected Aug 7, 2020 to Aug 20,
2020. You should download the DNS names files covering this range,
plus a few days before and after it.
2. On Sep 6, 2020, we performed additional DNS lookups with qrrs of
the 2.97 million MIDAR addresses in order to obtain DNS names
closer in time to the MIDAR and iffinder runs. These more timely
DNS lookups are better for extracting DNS-based ground truth that
can be compared with MIDAR and iffinder results. These DNS results
are available in the file
itdk-run-20200819-dns-names.txt.bz2
Each line contains three entries separated by tabs:
<timestamp> <IP-address> <DNS-name>
where <timestamp> is the timestamp of the lookup.
Please see the README of the IPv4 Routed /24 DNS Names
Dataset for full details about the encoding of special characters
in the <DNS-name> field.
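The DNS names file can be streamed without decompressing it to disk
first. The Python sketch below is one way to do so. The function names
are hypothetical, the sample record is made up for illustration (it is
not taken from the dataset, and its timestamp format is an assumption),
and decoding with errors="replace" is a defensive choice rather than a
documented property of the file.

```python
import bz2


def parse_dns_line(line):
    """Split one tab-separated record into (timestamp, address, name)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None  # skip anything that is not a three-field record
    return tuple(fields)


def iter_dns_names(path):
    """Yield (timestamp, address, name) tuples from a .bz2 names file."""
    # The timestamp is kept as a string; no assumption is made about
    # its format here.
    with bz2.open(path, "rt", encoding="ascii", errors="replace") as fh:
        for line in fh:
            record = parse_dns_line(line)
            if record is not None:
                yield record


# Hypothetical record using a documentation address and example domain.
sample = "1599400000\t192.0.2.1\trouter1.example.net\n"
timestamp, address, name = parse_dns_line(sample)
```

For example, iter_dns_names("itdk-run-20200819-dns-names.txt.bz2")
would iterate over the records of the file listed above.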
=============================================================================
Data Use Terms and Conditions
=============================================================================
See https://topo-data.caida.org/acceptable-use-agreement.pdf
=============================================================================
Acknowledgments
=============================================================================
This product includes GeoLite data created by MaxMind, available from
http://www.maxmind.com/