How to Extract Domain Names from VeriSign’s .COM Zone File

For those of you not familiar with it, the zone file is a file generated daily that lists all of the active domain names and corresponding name servers for a particular Top Level Domain (TLD) like .COM, .NET, etc.

VeriSign manages the .COM zone file, and in the last post I explained how to obtain access to download it. In this tutorial I’ll explain how to parse the zone file to extract just the domain names and strip out everything else.

Working with the zone file

The size of the uncompressed zone file varies by day, but you can expect it to be somewhere around 8.5 GB.

Any attempt you make to open the entire zone file in a text editor will likely crash your computer, so you have to resort to working with it via the command line (the rest of this tutorial assumes you are using a Unix-like machine such as Mac OS X or Linux).

For this tutorial we’re going to work with only the first 100 lines of the zone file because it makes it easy to show you what’s going on step by step. These steps would also work on the entire zone file, though it would take significantly longer to execute each of the commands.
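If you have the full zone file on disk, one way to produce a 100-line sample like the one used here is head. The sketch below is only an illustration: it builds a stand-in file with seq so it runs anywhere, and the file names zone.standin.txt, zone.sample.txt, and com.zone are placeholders, not the names of the real download.

```shell
# Stand-in for the real download so this sketch runs anywhere: 1,000 numbered lines.
# With the real file, set zone_file to the path of your uncompressed zone file.
zone_file="zone.standin.txt"
seq 1 1000 > "$zone_file"

# head stops reading after 100 lines, so it stays fast even on a multi-gigabyte file.
# Real-world equivalent (com.zone is a hypothetical name):
#   head -n 100 com.zone > com.zone.example.txt
head -n 100 "$zone_file" > zone.sample.txt

# Confirm the sample has exactly 100 lines.
wc -l zone.sample.txt
```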

1. Navigate to the location of the zone file

Open a terminal and cd into the folder that contains the zone file. If you’d like to follow along with the 100-line example instead, you can download the truncated zone file here.

After you’re in the folder with the zone file (or zone file example) you should be able to run ls -l and see the zone file:

    $ ls -l
    -rw-r--r--  1 matt  staff  4973 Oct  4 12:06 com.zone.example.txt

You can confirm the number of lines in the zone file by running wc -l:

    $ wc -l com.zone.example.txt
    100 com.zone.example.txt

You can view the contents of the example zone file by opening it up in a text editor (again, don’t try this on the full zone file) or by running cat:

    $ cat com.zone.example.txt

Here’s what the contents of the example zone file look like:

    ; The use of the Data contained in Verisign Inc.'s aggregated
    ; .com, and .net top-level domain zone files (including the checksum
    ; files) is subject to the restrictions described in the access Agreement
    ; with Verisign Inc.

    $ORIGIN COM.
    $TTL 900
    @ IN SOA a.gtld-servers.net. nstld.verisign-grs.com. (
     1349069930 ;serial
     1800 ;refresh every 30 min
     900 ;retry every 15 min
     604800 ;expire after a week
     86400 ;minimum of 15 min
     )
    $TTL 172800
     NS A.GTLD-SERVERS.NET.
     NS G.GTLD-SERVERS.NET.
     NS H.GTLD-SERVERS.NET.
     NS C.GTLD-SERVERS.NET.
     NS I.GTLD-SERVERS.NET.
     NS B.GTLD-SERVERS.NET.
     NS D.GTLD-SERVERS.NET.
     NS L.GTLD-SERVERS.NET.
     NS F.GTLD-SERVERS.NET.
     NS J.GTLD-SERVERS.NET.
     NS K.GTLD-SERVERS.NET.
     NS E.GTLD-SERVERS.NET.
     NS M.GTLD-SERVERS.NET.
    COM. 86400 DNSKEY 256 3 8 AQOl6m3T1q7fQhwJYFfWJO/IvVmtfI2Eg2wX4UR9jcl/qaTiMp+7Kx7baGOsPvZwX4lVGYWif955l4lLh/VnnNJvjDxBWVQcDrH3cHzFAaq9QXZPcEk7UyTOBL1piVpB2dqJzbO2bH9XGFiOXPUj3nhQ7mxvW0bgRiKv9Qah/7NH2w
    COM. 86400 DNSKEY 257 3 8 AQPDzldNmMvZFX4NcNJ0uEnKDg7tmv/F3MyQR0lpBmVcNcsIszxNFxsBfKNW9JYCYqpik8366LE7VbIcNRzfp2h9OO8HRl+H+E08zauK8k7evWEmu/6od+2boggPoiEfGNyvNPaSI7FOIroDsnw/taggzHRX1Z7SOiOiPWPNIwSUyWOZ79VmcQ1GLkC6NlYvG3HwYmynQv6oFwGv/KELSw7ZSdrbTQ0HXvZbqMUI7BaMskmvgm1G7oKZ1YiF7O9ioVNc0+7ASbqmZN7Z98EGU/Qh2K/BgUe8Hs0XVcdPKrtyYnoQHd2ynKPcMMlTEih2/2HDHjRPJ2aywIpKNnv4oPo/
    COM. 86400 NSEC3PARAM 1 0 0 -
    COM. 86400 RRSIG NSEC3PARAM 8 1 86400 20121006041758 20120929030758 47783 COM. chypT55+F8iGvkLS2TVSiqonms7mRNjDe2g560bqSulngD7z4y+4qsz13ZOGY4yrM9lBOfGYtNxpai3Q9TAT9n0X2L/3cJDY/xyA5LFSC0ilRvm+zr41d5TRwuf/GMdj7pfN2w6IoSxYckgPczqWHG2yOpX6EPXuIPW5E2L8GZU=
    COM. RRSIG NS 8 1 172800 20121006041758 20120929030758 47783 COM. hfdySh/hHeA0zNLcbLQMNtRsXcOVKzH7vGED0t8IbkdaOTeuSFi0E8vXMVUJDjK9hlVYsCa4bE5wh5X61pIKkI9SjyCDjUK92ZpG/2+rtHeYWRbREAMpgcZ4FAySSknskHOnkUa4c/0tA9ZOJ0AkNzxztUr+KinlC+Co8rp5aGg=
    COM. 900 RRSIG SOA 8 1 900 20121008053850 20121001042850 47783 COM. odeDdoJS/JVKOMNcdDd4Oh8MnY2DoKobagNU44AKjYE9GuQ3sBgbXmyH3JOrS6a7iBmFexN6UAdLSNcCozOO0Ta51WQFcuJhbvZwhXNrjOH50pkcG7Xw9pzwlOrftj9R7pHwCDEagZp20GGtbGATf946D6CCUJSBmtZ8pqoEu7s=
    COM. 86400 RRSIG DNSKEY 8 1 86400 20121005182533 20120928182033 30909 COM. nPBzPp1A3EBgwjf3IrrYVgVh0YcVqdd6YKQ4CeraP5vK8nIyUMqGMnLc2ykA/BWb8AtAdg6KiOVsXl+4dkkqijccbt8mEzUZ6aD3Gd1IT13K5uDq4tjhxaQTRkloZU1TC4FfRhe5DHQSHzTmOWn9ClqonMa2FeNaf9rlsaNCaWq4fctndbPhuhuN0m9EKSh0So8WhM/5wZqjsie9+S2yBPsxakXWTA3zwxR7y9sqfabfmH+KmrQRF2lCXxhF//of4zp3VLpG9UK1kS/4mQTdm8kNRzfgNgCKo1ejS4uMj5g0rS6n5aZvk8PfeVbBlhnVb3oDRImz/RIhZJ1x0w3kzA
    ENERCONTECHNOLOGIES NS NS1.BIZ.RR
    ENERCONTECHNOLOGIES NS NS2.BIZ.RR
    SELFDRIVECARRENTAL NS NS9.IZP
    SELFDRIVECARRENTAL NS IZA.HOSTING.DIGIWEB.IE.
    SELFDRIVECARRENTAL NS NS8.FOR-SALE-IF-THEPRICE-IS-RIGHT
    NANCYVRAINE NS NS1.IMCONLINE.NET.
    NANCYVRAINE NS NS2.IMCONLINE.NET.
    SELFDRIVECARRENTAL NS NS9.IZP
    SELFDRIVECARRENTAL NS IZA.HOSTING.DIGIWEB.IE.
    SELFDRIVECARRENTAL NS NS8.FOR-SALE-IF-THEPRICE-IS-RIGHT
    WORLDDATASOURCE NS NS01.DOMAINCONTROL
    WORLDDATASOURCE NS NS02.DOMAINCONTROL
    SAUDIPHOTOGRAPHERS NS NS1.R4L
    SAUDIPHOTOGRAPHERS NS NS2.R4L
    MERCKCHOICE NS CBRU.BR.NS.ELS-GMS.ATT.NET.
    MERCKCHOICE NS CMTU.MT.NS.ELS-GMS.ATT.NET.
    ENVIRONMENTALSCHOOLS NS PSNS01.PAULSMITHS.EDU.
    ENVIRONMENTALSCHOOLS NS PSNS02.PAULSMITHS.EDU.
    EASTHAMPTONHOMES NS BUY.INTERNETTRAFFIC
    EASTHAMPTONHOMES NS SELL.INTERNETTRAFFIC
    AMERICASHOMEBUILDER NS BUY.INTERNETTRAFFIC
    AMERICASHOMEBUILDER NS SELL.INTERNETTRAFFIC
    BOVINUS NS C3P0.CBFENTERPRISES
    BOVINUS NS R2D2.CBFENTERPRISES
    CONSTELLATIONCOLLEGE NS NS1.SEDOPARKING
    CONSTELLATIONCOLLEGE NS NS2.SEDOPARKING
    DOCHERTYCONSULTING NS NS1.VERINOTE.NET.
    DOCHERTYCONSULTING NS NS3.VERINOTE.NET.
    SONOMETRICS NS NS35.WORLDNIC
    SONOMETRICS NS NS36.WORLDNIC
    UNLIMITEDDISCOUNTPHONECALLS NS DNS1.NAME-SERVICES
    UNLIMITEDDISCOUNTPHONECALLS NS DNS2.NAME-SERVICES
    UNLIMITEDDISCOUNTPHONECALLS NS DNS3.NAME-SERVICES
    UNLIMITEDDISCOUNTPHONECALLS NS DNS4.NAME-SERVICES
    UNLIMITEDDISCOUNTPHONECALLS NS DNS5.NAME-SERVICES
    FREILAND NS NS1.FABULOUS
    FREILAND NS NS2.FABULOUS
    KUMANET NS UNS01.LOLIPOP.JP.
    KUMANET NS UNS02.LOLIPOP.JP.
    SANGYOSHIEN NS NS55.WORLDNIC
    SANGYOSHIEN NS NS56.WORLDNIC
    JONATHANCHARLESNOVAK NS DNS077.A.REGISTER
    JONATHANCHARLESNOVAK NS DNS030.B.REGISTER
    JONATHANCHARLESNOVAK NS DNS030.C.REGISTER
    JONATHANCHARLESNOVAK NS DNS010.D.REGISTER
    HQSINGAPORE NS NS41.DOMAINCONTROL
    HQSINGAPORE NS NS42.DOMAINCONTROL
    PANASOURCE NS F1G1NS1.DNSPOD.NET.
    PANASOURCE NS F1G1NS2.DNSPOD.NET.
    PRIVATESAUNAS NS NS.BUYDOMAINS
    PRIVATESAUNAS NS THISDOMAINFORSALE
    BARBARASTREISAND NS NS1.LAMEDELEGATION.NET.
    BARBARASTREISAND NS NS2.LAMEDELEGATION.NET.
    MONICAMAGNETTI NS NS21.DOMAINCONTROL
    MONICAMAGNETTI NS NS22.DOMAINCONTROL
    IGUANAWORLD NS DNS1.TNIB.DE.
    IGUANAWORLD NS DNS2.TNIB.DE.
    IGUANAWORLD NS DNS3.TNIB.DE.
    PERFECTDAYSTUDIOS NS NS2.DYNADOT
    PERFECTDAYSTUDIOS NS NS1.DYNADOT
    SVCROSS NS NS1.ZONEEDIT
    SVCROSS NS NS5.ZONEEDIT
    EBEIJING NS NS1.PEER1.NET.
    EBEIJING NS NS2.PEER1.NET.
    NASHSATTERFIELD NS HOME.GIS.NET.

For the purposes of this tutorial you can ignore the first 35 lines or so, which hold comments, the zone’s SOA and NS records, and DNSSEC data. The first domain name the file contains is EnerconTechnologies:

    ENERCONTECHNOLOGIES NS NS1.BIZ.RR
    ENERCONTECHNOLOGIES NS NS2.BIZ.RR
    SELFDRIVECARRENTAL NS NS9.IZP
    SELFDRIVECARRENTAL NS IZA.HOSTING.DIGIWEB.IE.
    SELFDRIVECARRENTAL NS NS8.FOR-SALE-IF-THEPRICE-IS-RIGHT
    …

Notice that there’s one entry for each name server associated with the domain name.
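To see this for yourself, a quick sketch with awk can count the name server lines per domain. The input is a handful of lines copied from the example above into a scratch file (ns.example.txt and ns.counts.txt are made-up names), and the script assumes every delegation line has the layout domain, NS, name server.

```shell
# A few delegation lines copied from the example zone file.
cat > ns.example.txt <<'EOF'
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
SELFDRIVECARRENTAL NS NS9.IZP
SELFDRIVECARRENTAL NS IZA.HOSTING.DIGIWEB.IE.
SELFDRIVECARRENTAL NS NS8.FOR-SALE-IF-THEPRICE-IS-RIGHT
EOF

# Tally the lines per domain (field 1) where the record type (field 2) is NS,
# then sort because awk's "for ... in" order is unspecified.
awk '$2 == "NS" {count[$1]++} END {for (d in count) print d, count[d]}' ns.example.txt \
  | sort > ns.counts.txt

cat ns.counts.txt
```

On this input the result is 2 for ENERCONTECHNOLOGIES and 3 for SELFDRIVECARRENTAL, one per name server line.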

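You can confirm the name servers are correct by running whois on a domain name and checking that they match. Name servers written without a trailing dot in the zone file are relative to the zone’s origin (COM.), so NS1.BIZ.RR stands for NS1.BIZ.RR.COM: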

    $ whois ENERCONTECHNOLOGIES.COM

    Whois Server Version 2.0

    Domain names in the .com and .net domains can now be registered
    with many different competing registrars. Go to http://www.internic.net for detailed information.

       Domain Name: ENERCONTECHNOLOGIES.COM
       Registrar: NETWORK SOLUTIONS, LLC.
       Whois Server: whois.networksolutions.com
       Referral URL: http://www.networksolutions.com/en_US
       Name Server: NS1.BIZ.RR.COM
       Name Server: NS2.BIZ.RR.COM
       Status: clientTransferProhibited
       Updated Date: 03-mar-2012
       Creation Date: 03-mar-1999
       Expiration Date: 03-mar-2022

Extracting just the domain names

In order to extract just the domain names we’ve got to run a series of commands so that all that’s left when we’re done is a list of the domain names. We’ll do this step by step, though you could easily pipe (|) these commands together to achieve the same result.
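For reference, a piped version might look like the sketch below. It chains the three commands covered in the rest of this section (awk, sort -u, and egrep, written here as grep -E) and runs them on a few stand-in lines so it is self-contained; pipe.example.txt and pipe.domains.txt are made-up file names.

```shell
# A few lines in the same layout as the example zone file (stand-in data).
cat > pipe.example.txt <<'EOF'
$TTL 172800
COM. 86400 NSEC3PARAM 1 0 0 -
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
BOVINUS NS C3P0.CBFENTERPRISES
BOVINUS NS R2D2.CBFENTERPRISES
EOF

# First field -> sort and de-duplicate -> keep only valid-looking labels.
awk '{print $1}' pipe.example.txt \
  | sort -u \
  | grep -E '^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$' \
  > pipe.domains.txt

cat pipe.domains.txt
```

The piped form avoids the intermediate files, which matters more on the full zone file than on a 100-line sample.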

Notice that each line with a domain name and name server is formatted consistently: the domain name, then a space, then the record type (NS), then another space and the name server. If we want just the domain names, we can extract everything before the first space. To do this, we use the awk command, which splits each line on whitespace, and send the first field ($1) to first.com.zone.example.txt:

    $ awk '{print $1}' com.zone.example.txt > first.com.zone.example.txt

If you examine the resulting first.com.zone.example.txt you’ll notice a much cleaner output:

    ;
    ;
    ;
    ;

    $ORIGIN
    $TTL
    @
    1349069930
    1800
    900
    604800
    86400
    )
    $TTL
    NS
    NS
    NS
    NS
    NS
    NS
    NS
    NS
    NS
    NS
    NS
    NS
    NS
    COM.
    COM.
    COM.
    COM.
    COM.
    COM.
    COM.
    ENERCONTECHNOLOGIES
    ENERCONTECHNOLOGIES
    SELFDRIVECARRENTAL
    SELFDRIVECARRENTAL
    SELFDRIVECARRENTAL
    NANCYVRAINE
    NANCYVRAINE
    SELFDRIVECARRENTAL
    SELFDRIVECARRENTAL
    SELFDRIVECARRENTAL
    WORLDDATASOURCE
    WORLDDATASOURCE
    SAUDIPHOTOGRAPHERS
    SAUDIPHOTOGRAPHERS
    MERCKCHOICE
    MERCKCHOICE
    ENVIRONMENTALSCHOOLS
    ENVIRONMENTALSCHOOLS
    EASTHAMPTONHOMES
    EASTHAMPTONHOMES
    AMERICASHOMEBUILDER
    AMERICASHOMEBUILDER
    BOVINUS
    BOVINUS
    CONSTELLATIONCOLLEGE
    CONSTELLATIONCOLLEGE
    DOCHERTYCONSULTING
    DOCHERTYCONSULTING
    SONOMETRICS
    SONOMETRICS
    UNLIMITEDDISCOUNTPHONECALLS
    UNLIMITEDDISCOUNTPHONECALLS
    UNLIMITEDDISCOUNTPHONECALLS
    UNLIMITEDDISCOUNTPHONECALLS
    UNLIMITEDDISCOUNTPHONECALLS
    FREILAND
    FREILAND
    KUMANET
    KUMANET
    SANGYOSHIEN
    SANGYOSHIEN
    JONATHANCHARLESNOVAK
    JONATHANCHARLESNOVAK
    JONATHANCHARLESNOVAK
    JONATHANCHARLESNOVAK
    HQSINGAPORE
    HQSINGAPORE
    PANASOURCE
    PANASOURCE
    PRIVATESAUNAS
    PRIVATESAUNAS
    BARBARASTREISAND
    BARBARASTREISAND
    MONICAMAGNETTI
    MONICAMAGNETTI
    IGUANAWORLD
    IGUANAWORLD
    IGUANAWORLD
    PERFECTDAYSTUDIOS
    PERFECTDAYSTUDIOS
    SVCROSS
    SVCROSS
    EBEIJING
    EBEIJING
    NASHSATTERFIELD

Not bad, but there are a lot of duplicates because domain names are listed once for each name server, so let’s clean the file up a bit by removing duplicates and sorting the results alphabetically:

    $ sort -u first.com.zone.example.txt > sorted_and_unique.com.zone.example.txt

Here we use the sort command with the -u switch to sort the file and remove the duplicates.
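Sorting can take a while on the full zone file. A common trick, shown in the sketch below on a tiny made-up input, is to set LC_ALL=C so that sort compares raw bytes instead of applying locale rules, which is often faster. That setting is an optional extra, not something the command above requires, and dup.example.txt and dup.sorted.txt are made-up file names.

```shell
# Tiny input with duplicates, standing in for first.com.zone.example.txt.
cat > dup.example.txt <<'EOF'
SVCROSS
BOVINUS
SVCROSS
EBEIJING
BOVINUS
EOF

# LC_ALL=C applies to this one command only; -u drops repeated lines while sorting.
LC_ALL=C sort -u dup.example.txt > dup.sorted.txt

cat dup.sorted.txt
```

Each name appears once in the result, in byte order: BOVINUS, EBEIJING, SVCROSS.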

The new sorted_and_unique.com.zone.example.txt is looking pretty good:


    $ORIGIN
    $TTL
    )
    1349069930
    1800
    604800
    86400
    900
    ;
    @
    AMERICASHOMEBUILDER
    BARBARASTREISAND
    BOVINUS
    COM.
    CONSTELLATIONCOLLEGE
    DOCHERTYCONSULTING
    EASTHAMPTONHOMES
    EBEIJING
    ENERCONTECHNOLOGIES
    ENVIRONMENTALSCHOOLS
    FREILAND
    HQSINGAPORE
    IGUANAWORLD
    JONATHANCHARLESNOVAK
    KUMANET
    MERCKCHOICE
    MONICAMAGNETTI
    NANCYVRAINE
    NASHSATTERFIELD
    NS
    PANASOURCE
    PERFECTDAYSTUDIOS
    PRIVATESAUNAS
    SANGYOSHIEN
    SAUDIPHOTOGRAPHERS
    SELFDRIVECARRENTAL
    SELFDRIVECARRENTAL
    SONOMETRICS
    SVCROSS
    UNLIMITEDDISCOUNTPHONECALLS
    WORLDDATASOURCE

The last problem is that a number of leftover lines can’t possibly be domain names because they contain invalid characters, such as $ORIGIN (which contains a $) and COM. (which ends in a dot). We’ll use egrep (or grep -E) to extract only the lines that are valid domain names:

    $ egrep '^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$' sorted_and_unique.com.zone.example.txt > domains.com.zone.example.txt

In case you’re curious, that regular expression finds strings that:

  1. Start and end with a letter or a number (we can just look at uppercase letters because that’s how all the domain names in the zone file are formatted)
  2. Contain letters, numbers, or dashes in between
  3. Are 1 to 63 characters in length
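To try the expression on its own, the sketch below feeds it a few hand-picked strings, including 63- and 64-character labels built with printf and tr. The test strings and the file name regex.matches.txt are just for illustration.

```shell
pattern='^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$'

# Build labels of exactly 63 and 64 characters (all A's) to probe the length limit.
label63=$(printf '%063d' 0 | tr 0 A)
label64=$(printf '%064d' 0 | tr 0 A)

# Only BOVINUS, NAME-SERVICES, 4, and the 63-character label should survive:
# the others contain $ or a dot, start or end with a dash, or are too long.
printf '%s\n' 'BOVINUS' 'NAME-SERVICES' '4' '$ORIGIN' 'COM.' '-LEADING' 'TRAILING-' \
    "$label63" "$label64" \
  | grep -E "$pattern" > regex.matches.txt

cat regex.matches.txt
```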

At last, we have a file we can work with:

    1349069930
    1800
    604800
    86400
    900
    AMERICASHOMEBUILDER
    BARBARASTREISAND
    BOVINUS
    CONSTELLATIONCOLLEGE
    DOCHERTYCONSULTING
    EASTHAMPTONHOMES
    EBEIJING
    ENERCONTECHNOLOGIES
    ENVIRONMENTALSCHOOLS
    FREILAND
    HQSINGAPORE
    IGUANAWORLD
    JONATHANCHARLESNOVAK
    KUMANET
    MERCKCHOICE
    MONICAMAGNETTI
    NANCYVRAINE
    NASHSATTERFIELD
    NS
    PANASOURCE
    PERFECTDAYSTUDIOS
    PRIVATESAUNAS
    SANGYOSHIEN
    SAUDIPHOTOGRAPHERS
    SELFDRIVECARRENTAL
    SELFDRIVECARRENTAL
    SONOMETRICS
    SVCROSS
    UNLIMITEDDISCOUNTPHONECALLS
    WORLDDATASOURCE

The one inaccuracy in this list is the numbers at the top, which come from the SOA record at the start of the zone file rather than from registered domain names. The NS entry is likewise a leftover, from the zone’s own name server lines. You could remove these, though the resulting list is more than accurate enough to analyze and use at this point (and all of those numbers do have corresponding .coms except for 1349069930 anyway).
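If you do want to drop the numbers, one option is to stop relying on the first column alone and keep only lines whose second column is NS. The following sketch assumes delegation lines always look like domain, NS, name server, as they do in the example file; a full zone file may contain layouts this does not handle. The stand-in input and the file names nsonly.example.txt and nsonly.domains.txt are made up for the demonstration.

```shell
# Stand-in input: header-style lines plus a few delegations, as in the example file.
cat > nsonly.example.txt <<'EOF'
@ IN SOA a.gtld-servers.net. nstld.verisign-grs.com. (
 1349069930 ;serial
 1800 ;refresh every 30 min
 )
$TTL 172800
 NS A.GTLD-SERVERS.NET.
COM. 86400 NSEC3PARAM 1 0 0 -
KUMANET NS UNS01.LOLIPOP.JP.
KUMANET NS UNS02.LOLIPOP.JP.
FREILAND NS NS1.FABULOUS
EOF

# Print field 1 only when field 2 is exactly "NS", then de-duplicate.
awk '$2 == "NS" {print $1}' nsonly.example.txt | sort -u > nsonly.domains.txt

cat nsonly.domains.txt
```

On the stand-in input this prints FREILAND and KUMANET; the serial number, the zone’s own NS line, and the COM. record are skipped because their second column is not NS.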

Hope you found this tutorial useful. If you have any questions or notice anything that can be done more efficiently, please drop me a note at matt@leandomainsearch.com.

Thanks!
