On a couple of projects I’ve worked on in the past, I needed to determine the location of a computer based on its IP address. There are a few companies that sell data sets containing such information. I used a company called Quova, which was acquired by Neustar in 2010 and renamed, but I still call it Quova.
Every month Quova publishes a new data set that maps all IP addresses to location (because the information changes frequently). The data set is actually a text file with several million lines. Each line has about 29 fields, separated by the ‘|’ character. The first two fields are a start_ip and an end_ip. The remaining 27 fields on the line hold the information that applies to all the IP addresses between the start_ip and the end_ip. Data fields include country, city, state, latitude and longitude, postal code, and so on.
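As a rough illustration of the format, here is a minimal sketch of parsing one line. It assumes the IP bounds are stored as integers and treats the field positions after start_ip and end_ip as hypothetical, since only the first two fields are described above.

```python
def parse_line(line):
    # Fields are separated by the '|' character.
    fields = line.rstrip("\n").split("|")
    start_ip = int(fields[0])   # assumption: IPs stored as integers
    end_ip = int(fields[1])
    attributes = fields[2:]     # country, city, state, lat/long, etc.
    return start_ip, end_ip, attributes

# Hypothetical line with only a few of the ~27 attribute fields shown:
start, end, attrs = parse_line("16777216|16777471|us|ca|san jose")
```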
To find the information associated with a particular target IP address, it’s not really feasible to simply loop through the data set text file one line at a time until you hit the target interval — the data set file is just too big. In principle, a good way to access the Quova information would be to transfer the data into a SQL database, index the start_ip and end_ip columns, and then do a select statement.
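The SQL approach could look something like the following sketch, here using an in-memory SQLite database. The table name, column names, and sample row are all illustrative, not Quova's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Illustrative schema: just two of the ~27 attribute columns.
conn.execute("""CREATE TABLE ip_location (
    start_ip INTEGER, end_ip INTEGER, country TEXT, city TEXT)""")

# Index the start_ip and end_ip columns so range lookups avoid a full scan.
conn.execute("CREATE INDEX idx_start ON ip_location (start_ip)")
conn.execute("CREATE INDEX idx_end ON ip_location (end_ip)")

# Hypothetical data row.
conn.execute("INSERT INTO ip_location VALUES (16777216, 16777471, 'au', 'sydney')")

# Find the interval containing a target IP with a single select.
target = 16777300
row = conn.execute(
    "SELECT country, city FROM ip_location WHERE ? BETWEEN start_ip AND end_ip",
    (target,)).fetchone()
```

With several million rows, the indexes are what make this lookup fast enough to be practical.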
For one project I worked on, I needed to do lookups with data in memory instead of SQL. The problem is that the Quova data is too large to fit into memory on a normal machine. The solution is to load just part of the data file (about 10% would fit on my machine) at any one time, and load a different chunk of data only when necessary. This meant I had to create an index that recorded which lines of the data file covered which IP addresses.