GEO Targeting Made EASY! (For PHP)



KLB
03-19-2007, 08:32 AM
Okay, I've seen many threads with people wanting to easily implement geo targeting on their websites for things like Yahoo. Well, last night I implemented it on my site and I thought I'd share my method with everyone else. The strengths of my method are that it does not require a database server (it works from a CSV file), it only requires adding one function to your site, AND it is FREE.

Step 1) download a current IP "database" from http://software77.net/cgi-bin/ip-country/geo-ip.pl. There is a link to download the database part way down the right hand side of the page.

Step 2) extract the CSV file to a folder on your website.

Step 3) add the following function to your website scripts (I placed it in my "functions.inc" include). I should note that this function is a boiled down version of the script found at http://webnet77.com/scripts/geo-ip/index.html:

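function ipcountrycode($ip){
    // Convert the IP to an unsigned decimal string
    // (ip2long() alone can return negative values)
    $ip=sprintf("%u", ip2long($ip));

    // Open the csv file for reading
    $csvfilename="ip2country.csv"; // Change to proper filename and path for IP dataset.
    $fp = fopen($csvfilename, "r");
    if ($fp === false) {
        return ""; // Data file missing or unreadable
    }

    // Search bounds are byte offsets into the file
    $low = 0;
    fseek($fp, 0, SEEK_END);
    $high = ftell($fp);

    $countrycode = "";
    while ($low <= $high) {
        $mid = (int) floor(($low + $high) / 2);

        // Seek to half way through
        fseek($fp, $mid);

        // Skip the partial line we landed in
        if($mid != 0){
            fgets($fp);
        }

        // Read the next full line
        $ipdata=fgetcsv($fp,100);

        if ($ipdata === false) {
            // Ran off the end of the file: look earlier
            $high = $mid - 1;
        }
        elseif (!isset($ipdata[4])) {
            // Comment or blank line (assumed to sit at the top of the file): look later
            $low = $mid + 1;
        }
        elseif ($ip >= $ipdata[0] && $ip <= $ipdata[1]) {
            // Found the range containing this IP
            $countrycode = $ipdata[4];
            break;
        }
        elseif ($ip > $ipdata[0]) {
            $low = $mid + 1;
        }
        else {
            $high = $mid - 1;
        }
    }
    fclose($fp);
    return $countrycode;
}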

Step 4) Call the above function from the top of the script and store the user's two-letter country code in a string that can be referenced where needed:

$strCountryCode=ipcountrycode($_SERVER['REMOTE_ADDR']);

The ipcountrycode function should really only be called once per page load, with the result stored in a string variable for use throughout the script, to reduce server overhead. To reduce server load further, the aforementioned function call can be replaced with the following code, which stores the country code in a cookie so that it only needs to be looked up once every seven days for users who allow cookies:

// Returns User's country code
//+++++++++++++++++++++++++++++
// Only trust the cookie if it is set and looks like a two-letter code
if(isset($_COOKIE['GeoLocation']) && preg_match('/^[A-Za-z]{2}$/', $_COOKIE['GeoLocation'])){
    $strCountryCode=$_COOKIE['GeoLocation'];
}
else{
    $strCountryCode=ipcountrycode($_SERVER['REMOTE_ADDR']);

    // Sets cookie for geo location.
    $HostDomain=str_replace("http://","",$_SERVER['HTTP_HOST']);
    $SetCookieExpire=time() +604800; //Set cookie for one week
    setcookie ("GeoLocation", $strCountryCode, $SetCookieExpire, "/", $HostDomain, 0) ;
}


There you have it, a very simple and very clean way to get the country code of your users to do things like geo target ads. Oh make sure to periodically download updates to the IP CSV file so that your targeting remains as accurate as possible.

==EDIT==
2007-03-20: Per rpanella's observations and critique, I replaced the function that read the entire file into memory with one that does a binary search on the file without reading the entire file into memory. See post #14 (http://www.websitepublisher.net/forums/showthread.php?p=55944#post55944)

2007-03-23: Fixed some errors that sometimes caused country codes not to be returned. See post #23 (http://www.websitepublisher.net/forums/showthread.php?p=56107#post56107)

bassplaya
03-19-2007, 12:19 PM
Thanks dude, such easy setup rocks. Just curious if it will be good enough for sites with some real traffic.

Personally, I've been using the geoip PHP extension for ages; it's as easy as calling geoip_record_by_name($_SERVER['REMOTE_ADDR']); to get everything. Bit geeky to set up, tho'.

KLB
03-19-2007, 12:39 PM
Well my site gets around 20,000 page views per day and so far today it has gotten over 10,000 according to Google AdSense and I'm not seeing any problems.

Although the data table has some 150,000 rows, whoever created the original record search routine was sharp enough to use a bisection search, which allows the script to only look at a tiny fraction of the actual records.

Think of it this way: the script takes the middle record of the dataset and decides if the IP address is higher or lower than that record. If it is lower, then it takes the record at the halfway point of the lower subset of records and again makes this determination. This continues until it gets to the correct record. This is one of the most efficient methods of finding a matching record and allows the script to only ever need to look at a very small fraction of the total records.
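For readers who want to see the halving idea in isolation, here is a minimal sketch of a bisection search over IP ranges held in an array. The function name findCountry and the tiny $ranges table are made up for illustration; they are not part of the script posted above.

```php
<?php
// Hypothetical in-memory table: [startIp, endIp, countryCode], sorted by startIp.
function findCountry(array $ranges, int $ip): ?string
{
    $low = 0;
    $high = count($ranges) - 1;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        [$start, $end, $code] = $ranges[$mid];
        if ($ip < $start) {
            $high = $mid - 1;   // target lies in the lower half
        } elseif ($ip > $end) {
            $low = $mid + 1;    // target lies in the upper half
        } else {
            return $code;       // start <= ip <= end
        }
    }
    return null; // no range covers this address
}

// Made-up sample data, with a gap between 199 and 300
$ranges = [
    [0, 99, 'AA'],
    [100, 199, 'BB'],
    [300, 399, 'CC'],
];
```

Each pass through the loop discards half of the remaining rows, which is why so few rows are ever inspected.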

Here is what practically happens on a dataset of 100,000 records:
Check #1) 50,000 records eliminated. 50% remaining
Check #2) 25,000 records eliminated. 25% remaining
Check #3) 12,500 records eliminated. 12.5% remaining
Check #4) 6,250 records eliminated. 6.25% remaining
Check #5) 3,125 records eliminated. 3.13% remaining
Check #6) 1,562 records eliminated. 1.56% remaining
Check #7) 781 records eliminated. 0.78% remaining
Check #8) 390 records eliminated. 0.39% remaining
Check #9) 195 records eliminated. 0.20% remaining
Check #10) 98 records eliminated. 0.10% remaining
Check #11) 49 records eliminated. 0.05% remaining
Check #12) 24 records eliminated. 0.02% remaining
Check #13) 12 records eliminated. 0.01% remaining
Check #14) 6 records eliminated. 0.006% remaining
Check #15) 3 records eliminated. 0.003% remaining
Check #16) 2 records eliminated. 0.002% remaining
Check #17) 1 record eliminated, match made.

As you can see it is very very efficient.
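The number of checks in the list above can be reproduced with a few lines of PHP. This is only a sketch: maxProbes is a hypothetical helper that counts how many times a set of records can be halved before nothing is left, which is the worst-case number of probes for a bisection search.

```php
<?php
// Worst-case number of probes a bisection search needs for $records sorted rows.
function maxProbes(int $records): int
{
    $probes = 0;
    while ($records > 0) {
        $records = intdiv($records, 2); // each probe discards half of what is left
        $probes++;
    }
    return $probes;
}

echo maxProbes(100000), "\n"; // 17
echo maxProbes(150000), "\n"; // 18
```

So even for the roughly 150,000-row dataset, the worst case stays under twenty probes.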

Chris
03-19-2007, 01:24 PM
Couldn't you just import the text file into MySQL and access it with 1 query?


select location where ip = '$_SERVER['REMOTE_ADDR']';

KLB
03-19-2007, 01:41 PM
I didn't want to import into MySQL for a couple of reasons. One of them was I wanted to be able to update the dataset by simply overwriting the data file without having to purge and reimport the data. Basically I wanted the ease of updating and I wanted to keep the processing load on my Webserver, not the database server. When I tested it on my own laptop it didn't seem to hit the processor all that hard. I'd actually expect as big a hit from MySQL given the 150,000 records involved.

There isn't a record for every IP address, there are records for ranges and the IP addresses are stored as integers. So you would have to still convert 'REMOTE_ADDR' to an integer and then do a WHERE StartIP<=$REMOTE_ADDR AND EndIP>=$REMOTE_ADDR.
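If someone did want to go the MySQL route, the range lookup described above might be built roughly like this. It is a sketch under assumptions: the table name ip2country and the column CountryCode are invented for illustration (StartIP and EndIP follow the wording of the post), and buildRangeQuery is a hypothetical helper.

```php
<?php
// Builds the range query for one IPv4 address.
// Table name "ip2country" and column "CountryCode" are assumptions for illustration.
function buildRangeQuery(string $ip): ?string
{
    $long = ip2long($ip);
    if ($long === false) {
        return null; // not a valid IPv4 address
    }
    // Unsigned decimal string; it contains digits only, so it is safe to embed.
    $n = sprintf('%u', $long);
    return "SELECT CountryCode FROM ip2country WHERE StartIP <= $n AND EndIP >= $n LIMIT 1";
}

echo buildRangeQuery('1.2.3.4'), "\n";
```

The address has to be converted to an integer before it is compared, because the dataset stores ranges as integers rather than dotted strings.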

As they say there are different ways to skin a cat and given my current web hosting setup, it is preferable to put the load on the web server than it is on the MySQL server.

FPU
03-19-2007, 01:53 PM
What are you using it for Ken?

To deliver a different page based on country or state, please explain?

KLB
03-19-2007, 02:22 PM
I'm using it mostly to geo target ads. For instance I'm turning off my career listings links for countries where they are not appropriate. Basically I figure I can save my users some bandwidth and time by not serving up ads that don't apply to them. This also reduces clutter from useless links for those users.

Eventually I hope to pick up some country-specific banner ads for my top four countries. For instance, I could push out links to Canadian job listings to those in Canada instead of giving them links to U.S. job listings.

For users of Yahoo's Publisher Network such geo targeting would allow them to turn off YPN ads to non-US users, which is necessary to stay in Yahoo's good graces.

Basically the smarter one targets their ads, the more effective those ads become.

rpanella
03-19-2007, 07:17 PM
What's the point of doing a binary search on it if you are already reading all the values into memory? The point of doing a binary search is so that you only have to look up (log n) values, but with this code you are still reading in the entire file, so you are not gaining any efficiency.

I would recommend using Maxmind's free GeoIP database and their PHP class which would be much more efficient and extremely simple to implement: http://www.maxmind.com/app/php

Westech
03-19-2007, 08:03 PM
The point is that you spend fewer cycles searching through the values once they are in memory. It's more efficient to load all sequential values into memory and then do a binary search for the proper value than it is to load all values into memory and then, say, do a sequential search for the proper value.

I do agree with you about using Maxmind, though. It has the same benefits as KLB's (no database required, easy to implement, free, one file to replace to update the data). They release an updated IP file every month. It's pretty efficient, but if you're worried about using it for extremely high traffic applications it has a shared memory option you can invoke to cache the list in memory, and a mod_geoip apache module that you can install to make it run as a native binary (although doing it this way takes away the easy setup benefit).

KLB
03-19-2007, 08:26 PM
The disadvantage of GeoIP is that it is a module that must be installed, which isn't always an option on a shared hosting environment. I did at one point play with GeoIP and even have it installed on my laptop, but I did not like it. The method I've posted is the easiest I've found to implement and does not require one to "install" anything. It is also entirely free.

Now, granted, reading the entire file into memory isn't optimal, but the binary search does at least streamline the process.

What would help is if one could do the binary search without loading the entire file into memory. The method above doesn't seem to cause problems with the 150,000 row dataset, but I wouldn't want to apply it to the IP dataset that was broken down to city level.

By the way, I did update the code some to clear the buffer and close the handle, which on my laptop (which contains an offline version of my site) seemed to reduce processor load.

rpanella
03-19-2007, 09:40 PM
Originally posted by Westech:
The point is that you spend fewer cycles searching through the values once they are in memory. It's more efficient to load all sequential values into memory and then do a binary search for the proper value than it is to load all values into memory and then, say, do a sequential search for the proper value.

You have to load the entire file into memory first, though, so it is still O(n) (the time to search grows linearly with the number of records). A true binary search, which is what is done by Maxmind's class or would be done with a MySQL index if you imported it into MySQL, only has to access about log2(n) records, so the time to search grows logarithmically. log2(150,000) is roughly 17 accesses vs. all 150k with the code above.

My point was that if you are reading the entire array into memory, you might as well check the value as you read it in and stop reading once you find it. To read the whole thing sequentially into memory and then do a binary search makes no sense.
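The "check as you read and stop once you find it" idea can be sketched as follows. This is not code from the thread: scanCountry is a hypothetical function, the sample rows are made up, and it assumes the same column layout the posted function uses (start IP in column 0, end IP in column 1, country code in column 4) with rows sorted by start IP.

```php
<?php
// Sequential scan that stops as soon as the matching range is found.
function scanCountry(string $csvPath, string $ipAddress): ?string
{
    $long = ip2long($ipAddress);
    if ($long === false) {
        return null; // not a valid IPv4 address
    }
    $fp = fopen($csvPath, 'r');
    if ($fp === false) {
        return null;
    }
    $ip = (float) sprintf('%u', $long); // unsigned value, safe on 32-bit builds too
    $code = null;
    while (($line = fgets($fp)) !== false) {
        $row = explode(',', str_replace('"', '', trim($line)));
        if (count($row) < 5 || !is_numeric($row[0])) {
            continue; // blank line or comment
        }
        if ($ip < (float) $row[0]) {
            break; // file is sorted, so no later row can match
        }
        if ($ip <= (float) $row[1]) {
            $code = $row[4];
            break; // found it, stop reading
        }
    }
    fclose($fp);
    return $code;
}

// Tiny made-up dataset written to a temporary file
$path = tempnam(sys_get_temp_dir(), 'geo');
file_put_contents($path, implode("\n", [
    '# comment line',
    '"0","99","iana","0","AA"',
    '"100","199","iana","0","BB"',
    '"300","399","iana","0","CC"',
]) . "\n");
```

On average this still reads half the file, so it only illustrates the point about not loading everything first; the seek-based bisection reads far fewer lines.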



Originally posted by KLB:
The disadvantage of GeoIP is that it is a module that must be installed, which isn't always an option on a shared hosting environment. I did at one point play with GeoIP and even have it installed on my laptop, but I did not like it. The method I've posted is the easiest I've found to implement and does not require one to "install" anything. It is also entirely free.

The Maxmind database is also free, and installation only requires uploading their database file and a PHP file with the class, which you include in any script you need it in. Westech actually wrote an article on it on this site.

KLB
03-19-2007, 10:10 PM
Originally posted by rpanella:
The Maxmind database is also free, and installation only requires uploading their database file and a PHP file with the class, which you include in any script you need it in. Westech actually wrote an article on it on this site.

Maxmind's code is a little complex and quite convoluted for my brain this late at night, but as I dig through the pure PHP include version of their code, it appears that it essentially opens up their file and then reads it into memory, like mine does. If it doesn't, then one should be able to do the binary search as part of the process of reading the file with a heck of a lot less code than what GeoIP does, as all we really need is to turn an IP address into a two-letter country code.

I do not like putting really complicated code that I can not easily follow into my site's scripts. If for no other reason, I want to be able to fix it if something breaks (e.g. version change with PHP).

I do follow your logic of not reading the entire file into memory, and maybe I can accomplish this with something along the lines of what I posted above and without all the clutter of GeoIP.

rpanella
03-19-2007, 10:23 PM
I have not gone through Maxmind's code line by line but I know they use a binary format for their file, so that each record is the same exact size, and you can then calculate and read in any record in the file using fseek(). This means it only needs 17 reads instead of loading the entire file into memory each pageview.
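To illustrate why fixed-size records make fseek() lookups easy, here is a sketch using a made-up 10-byte record layout. It is NOT Maxmind's actual file format; RECORD_SIZE, packRecord and lookupFixed are hypothetical names, and the code assumes a 64-bit PHP build so that unpack('N') yields non-negative integers.

```php
<?php
const RECORD_SIZE = 10; // 4-byte start IP + 4-byte end IP + 2-byte country code

// Encodes one range as a fixed-size binary record (big-endian integers).
function packRecord(int $start, int $end, string $code): string
{
    return pack('NNa2', $start, $end, $code);
}

// Bisection over record numbers; record i always starts at byte i * RECORD_SIZE.
function lookupFixed($fp, int $recordCount, int $ip): ?string
{
    $low = 0;
    $high = $recordCount - 1;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        fseek($fp, $mid * RECORD_SIZE);
        $rec = unpack('Nstart/Nend/a2code', fread($fp, RECORD_SIZE));
        if ($ip < $rec['start']) {
            $high = $mid - 1;
        } elseif ($ip > $rec['end']) {
            $low = $mid + 1;
        } else {
            return $rec['code'];
        }
    }
    return null;
}

// Made-up sample data in a temporary file
$fp = tmpfile();
fwrite($fp, packRecord(0, 99, 'AA') . packRecord(100, 199, 'BB') . packRecord(300, 399, 'CC'));
```

Because every record has the same length, there is no need to skip a partial line after seeking, which is the awkward part of bisecting a CSV file by byte offset.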

All you need to get the countrycode with Maxmind is the geoip.inc file and their database file. All the other files are only if you need cities, regions, and other more specific information.

KLB
03-19-2007, 11:22 PM
Okay I didn't really figure out Maxmind's script (too much extraneous stuff), but when I went to PHP.net and looked up the function fseek() (http://www.php.net/manual/en/function.fseek.php) I found a function someone posted that I was able to merge with the function I originally posted above to create a binary search of the file without reading it into memory. Here it is for your inspection:


function ipcountrycode($ip){
$ip=ip2long($ip);

// IP data set file name (modify to the correct filename)
$csvfilename="ip2country.csv";
// Open the csv file for reading
$fp = fopen($csvfilename, "r");

fseek($fp, 0, SEEK_END);

$low = 0;
$high = ftell($fp);
$found=false;
while ($low <= $high && $found==false) {
$mid = floor(($low + $high) / 2); // C floors for you

//Seek to half way through
fseek($fp, $mid);

// Moves to proper line
if($mid != 0){
$line=fgets($fp);
}

// Read line
$line=fgets($fp);
$line = str_replace("\"", "",$line);
$ipdata = explode(",",$line);
if ($ip >=$ipdata[0] && $ip<=$ipdata[1]){
$found=true;
}
elseif($ip >=$ipdata[0]){
$low = $mid;
}
else {
$high = $mid;
}
}
fclose($fp);
$line="";
return $ipdata[4];
}

KLB
03-20-2007, 07:29 AM
Okay I just got off the phone with my web hosting provider after they temporarily disabled my site this morning and learned some very interesting things from our experiments. Doing a binary search on the file instead of reading the entire file into memory and then doing a binary search in memory actually took much more processing power. Enough in fact for them to deem it necessary to disable my site.

So, it would appear that I need to find a way to more efficiently read the file into memory and then crunch it there if I want to use a flat file.

I'm going to post the original function that read the file into memory and see if any great brains here can help me make the script more efficient.


function ipcountrycode($ip){

$ip=ip2long($ip);
// Open the csv file for reading
$csvfilename="ip2country.csv";
$handle = fopen($csvfilename, "r");

// Load array with start ips
$row = 1;
while (($buffer = fgets($handle, 4096)) !== FALSE) {
$array[$row] = substr($buffer, 1, strpos($buffer, ",") - 1);
$row++;
}

// Locate the row with our ip using bisection
$row_lower = '0';
$row_upper = $row;
while (($row_upper - $row_lower) > 1) {
$row_midpt = (int) (($row_upper + $row_lower) / 2);
if ($ip >= $array[$row_midpt]) {
$row_lower = $row_midpt;
}
else {
$row_upper = $row_midpt;
}
}
// Read the row with our ip
rewind($handle);
$row = 1;
while ($row <= $row_lower) {
$buffer = fgets($handle, 4096);
$row++;
}
fclose($handle);
$buffer = str_replace("\"", "", $buffer);
$ipdata = explode(",", $buffer);
$buffer="";
return $ipdata[4];
}

BusinessGeek
03-20-2007, 07:58 AM
I haven't been programming for years now and my skills are a bit rusty, to say the least, but one thing I can remember from reading binary files is that it can consume a lot of cycles when it's not done properly. I remember I once wrote a program in VB that would seek specific strings using the elimination method you mentioned above, but the larger the file got, the more processing it ate up. Of course, using "Windows", I learned that a much faster way to conduct the search is to simply load the entire file into virtual memory, where it can be read scores faster than through a physical medium.

Looking back on my earlier college days, I remember one of my profs explaining the difference between reading from virtual memory (such as RAM) and physical memory (such as your hard drive). It has a lot to do with distance, really. The closer the data is to the cache memory, the faster the processor can cycle through it. Even though you're still moving the data from the hard drive to the virtual memory by loading it into RAM (which then later moves into cache memory), you're loading it in chunks. However, with a binary seek you're loading tiny portions (more calls to the physical memory are created).

However, the only time that would actually be a problem is when you are processing the function at a tremendous rate. I don't see how that would be a problem here, but I haven't tested out your code either.

KLB
03-20-2007, 12:19 PM
Okay, I've spent half the day cleaning up all of my scripts and changing the order in which they are called to reduce server overhead as much as possible. I don't think this script was a problem per se, but rather that it was the last straw on the camel's back. I'm currently running on a cleaned-up version of the function that does a binary search without reading the file into memory. I seemed to get the best overall results with it. If anyone has suggestions as to ways I could further clean up this script, I'd appreciate feedback.

Here's the latest version of it:

function ipcountrycode($ip){
$ip=ip2long($ip);

$low = 130;

// Open the csv file for reading
$csvfilename="ip2country.csv";
$fp = fopen($csvfilename, "r");
fseek($fp, 75000, SEEK_END);
$high = ftell($fp);

while ($low <= $high) {
$mid = floor(($low + $high) / 2); // C floors for you

//Seek to half way through
fseek($fp, $mid);

// Moves to proper line
//if($mid != 0){
$line=fgets($fp);
//}

// Read line
$line=fgets($fp);
$line = str_replace("\"", "",$line);
$ipdata = explode(",",$line);
if ($ip >=$ipdata[0] && $ip<=$ipdata[1]){
$low=$high+1;
}
elseif($ip >=$ipdata[0]){
$low = $mid + 1;
}
else {
$high = $mid - 1;
}
}
fclose($fp);
$line="";
return $ipdata[4];
}

rpanella
03-20-2007, 03:56 PM
That function looks fine to me. I don't see how it could be any more server intensive than your original function, and it should in fact be less.

Perhaps the added load of you implementing geotargeting over the last few days is why they contacted you, not because you changed to doing a binary search. If this is the case, I would think it might be time to start thinking about upgrading your server.

KLB
03-20-2007, 05:11 PM
Rpanella, I came to the same conclusion as you about the server load. The scripts on my site represent years of accumulation and I decided it was time to strip out some detritus. It is amazing how much code bloat can creep in over time. I've spent most of today reviewing the code in my includes and using performance monitor on my laptop to see how much load my scripts put on Apache as I make little tweaks. I'm starting to see some tremendous improvement in my scripts and I haven't had any more complaints from my web host in regards to my Geo targeting. I do think the tweaks I made in my last version of the script helped as well.

KLB
03-20-2007, 10:27 PM
In my quest to reduce the impact geo location has on the server, I "got smart" and decided to set a cookie that retains this information from page view to page view. This way, if the user allows the cookie, it will eliminate the need to constantly call the geo location function. Of course, the security drawback is that the user can modify the cookie, so I'd recommend being careful how this method is deployed and recognizing this limitation. I figure in the case of geo targeting ads, this isn't really a big deal.

Here is the code to set the cookie and call the function:

// Only trust the cookie if it is set and looks like a two-letter code
if(isset($_COOKIE['GeoLocation']) && preg_match('/^[A-Za-z]{2}$/', $_COOKIE['GeoLocation'])){
    $strCountryCode=$_COOKIE['GeoLocation'];
}
else{
    $strCountryCode=ipcountrycode($_SERVER['REMOTE_ADDR']);

    // Sets cookie for geo location.
    $HostDomain=str_replace("http://","",$_SERVER['HTTP_HOST']);
    $SetCookieExpire=time() +604800; //Set cookie for one week
    setcookie ("GeoLocation", $strCountryCode, $SetCookieExpire, "/", $HostDomain, 0) ;
}

Westech
03-21-2007, 07:38 AM
I like your cookie idea, but as you said, the security issue is a little scary. I wouldn't want to get tossed out of YPN over someone playing around with the cookies. Another idea might be to use a php session variable. I think I'm going to look into this for my sites.
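A session-based variant could look roughly like the sketch below. cachedCountry is a hypothetical helper, not something from the thread; it takes the storage array by reference so that $_SESSION can be passed in after session_start(), and it takes the lookup function as a parameter so the posted ipcountrycode() could be plugged in.

```php
<?php
// Looks the country up once and remembers it in $store (e.g. $_SESSION).
function cachedCountry(array &$store, string $ip, callable $lookup): string
{
    if (!isset($store['GeoLocation'])) {
        $store['GeoLocation'] = (string) $lookup($ip); // only runs on the first call
    }
    return $store['GeoLocation'];
}

// Possible use on a real page (assumes ipcountrycode() from the first post is loaded):
//   session_start();
//   $strCountryCode = cachedCountry($_SESSION, $_SERVER['REMOTE_ADDR'], 'ipcountrycode');
```

Since session data lives on the server, the visitor cannot edit the stored country code the way a cookie value can be edited, at the cost of one lookup per session instead of one per week.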

KLB
03-21-2007, 08:13 AM
The cookie could be made a little more "secure" by simply encoding and decoding the country code. The cookie could contain two parts, the "encrypted" cookie and half the "salt". The other half of the "salt" could be hard coded into the source code. Then all you would need to do is combine the salts to decode the cookie. If the cookie was invalid geo targeting could be established via PHP. This would certainly address 99% of any mischief.
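One way to get a tamper-evident cookie is to sign the value rather than encode it. Note that this swaps in a different technique from the split-salt idea described above: it is a sketch using PHP's built-in hash_hmac() and hash_equals(), with signCountry, verifyCountry and the secret string all being hypothetical names and values.

```php
<?php
// Appends an HMAC so that any change to the country code can be detected.
function signCountry(string $code, string $secret): string
{
    return $code . '.' . hash_hmac('sha256', $code, $secret);
}

// Returns the country code if the signature checks out, otherwise null.
function verifyCountry(string $cookie, string $secret): ?string
{
    $parts = explode('.', $cookie, 2);
    if (count($parts) !== 2) {
        return null;
    }
    [$code, $mac] = $parts;
    if (!preg_match('/^[A-Z]{2}$/', $code)) {
        return null;
    }
    $expected = hash_hmac('sha256', $code, $secret);
    return hash_equals($expected, $mac) ? $code : null;
}

// The secret stays in the source code and is never sent to the browser.
$secret = 'change-me'; // placeholder value
$cookieValue = signCountry('CA', $secret);
```

If verifyCountry() returns null, the page would fall back to the IP lookup, as suggested above for invalid cookies.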

In regards to the YPN concerns, I really don't see this as a realistic concern for most sites out there. At most, I would expect people would modify the cookie to show them as being from some country that is unpopular with advertisers to trick the site into not pushing ads.

KLB
03-23-2007, 09:45 AM
Okay, I found a serious flaw in my function that was causing the country code not to be returned for a large number of IP addresses. Apparently PHP's built-in ip2long() function returns a signed integer, so on 32-bit builds any address from 128.0.0.0 upward comes back as a negative value, which has to be fixed. Since I didn't know this (and apparently neither did the person I based my script on), about 50% of IP addresses were not returning a country code.

The fix is quite simple. All one needs to do is swap out the line of code that converts IP addresses into integers:
// Replace
$ip=ip2long($ip);

//With
$ip=sprintf("%u", ip2long($ip));

I've updated the original code I posted to reflect this fix and added a couple of other minor fixes that were causing some IPs not to return country codes.
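The effect of the sprintf("%u", ...) fix can be seen with a small sketch. ipToUnsigned is a hypothetical wrapper added here for illustration; it also returns null for strings that ip2long() rejects.

```php
<?php
// Converts a dotted IPv4 address to its unsigned decimal form as a string.
function ipToUnsigned(string $ip): ?string
{
    $long = ip2long($ip);
    if ($long === false) {
        return null; // not a valid IPv4 address
    }
    return sprintf('%u', $long); // %u prints the value as unsigned
}

echo ipToUnsigned('192.168.1.1'), "\n"; // 3232235777
echo ipToUnsigned('10.0.0.1'), "\n";    // 167772161
```

The result is a string of digits, which PHP compares numerically against the numeric strings read from the CSV file.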