How to not log personally identifiable information

 
DuckDuckGo doesn't log personally identifiable information (PII). We simply don't save it.

Sometimes I get asked how to implement this privacy policy. It's pretty simple, but I wanted to explicitly spell it out in hope that others can more easily adopt similar practices.

The basic procedure is to go everywhere you log stuff and drop any PII you see being logged. This will probably amount to dropping IP addresses and user agent strings from your Web server logs. For most Web apps, that's all there is to it.

I use nginx (pronounced engine-x). Here's a typical nginx log format, close to the stock combined format:

    log_format  main  '$remote_addr [$time_local] "$request" '
                      '$status $request_time $body_bytes_sent "$http_referer" '
                      '"$http_user_agent"';

The $remote_addr variable is the IP address and the $http_user_agent variable is the user agent, which can also uniquely identify people. You could just remove them, but that might break other log processing software.

Instead, you can just replace them. Here's what I do:

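    # In the server block: blank out the user agent, but flag bots.
    set $user_agent '';
    if ($http_user_agent ~* [\+\(]http) {
        set $user_agent 'Bot';
    }

    # In the http block: log a fixed IP address and the scrubbed user agent.
    log_format  main  '127.0.0.2 [$time_local] "$request" '
                      '$status $request_time $body_bytes_sent "$http_referer" '
                      '"$user_agent"';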

These changes have two effects. First, the logs show 127.0.0.2 as everyone's IP address. Second, the $user_agent variable is logged in place of $http_user_agent; it's blank for everyone except bots, which get logged as 'Bot'. I do that so I can exclude bot traffic from reports. If you really wanted some user agent information, you could reduce it to a short label, such as FF for Firefox.
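Here's a rough sketch of what that kind of bucketing could look like. It's written in Python rather than nginx config, just to show the logic; in nginx itself you would probably reach for something like the map directive. The function name coarse_user_agent and every label other than FF and Bot are made up for illustration, and the token list is an assumption about common browser user agents, not a complete classifier.

```python
import re

# Same bot heuristic as the nginx config above: crawlers usually put a URL
# in their user agent, preceded by "+" or "(".
_BOT_PATTERN = re.compile(r"[+(]http", re.IGNORECASE)

# Hypothetical labels. Order matters: Chrome's user agent also contains
# "Safari/", and Edge's also contains "Chrome/", so check the most
# specific token first.
_BROWSER_TOKENS = (
    ("Firefox/", "FF"),
    ("Edg/", "Edge"),
    ("Chrome/", "Chrome"),
    ("Safari/", "Safari"),
)


def coarse_user_agent(user_agent):
    """Reduce a full user agent string to a short, non-identifying label."""
    if _BOT_PATTERN.search(user_agent):
        return "Bot"
    for token, label in _BROWSER_TOKENS:
        if token in user_agent:
            return label
    return ""  # unknown clients stay blank, as in the config above
```

The point is that only the short label ever gets written to a log. The full user agent string, with its version numbers and platform details, is thrown away.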

For Apache, it looks pretty similar, e.g.

    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

becomes

    LogFormat "127.0.0.2 %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"-\"" combined

Then you're going to want to double check your application logs to make sure you're not writing IP addresses to them either. I honestly haven't used a lot of the modern frameworks, so I can't easily say whether this happens by default or not. 
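One way to do that double check is to scan a log file for anything that looks like an IP address. The following is a minimal sketch, not a complete audit tool. It assumes IPv4 only, treats 127.0.0.2 as the placeholder from the configs above, and the name find_logged_ips is hypothetical. It can also produce false positives, since a four-part version number looks just like an address.

```python
import ipaddress
import re

# Placeholder address the web server configs above write for every request.
PLACEHOLDER_IP = "127.0.0.2"

# Dotted-quad candidates that are not part of a longer dotted number;
# the ipaddress module does the real validation below.
_IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_logged_ips(lines):
    """Return (line_number, address) pairs for IPv4 addresses in lines."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for candidate in _IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # e.g. 999.1.1.1 is not a valid address
            if candidate != PLACEHOLDER_IP:
                hits.append((line_number, candidate))
    return hits
```

You can pass it an open file object, since iterating over a file yields its lines. An empty result is what you're hoping for.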

Yes, it really could be as easy as changing one line in one file. Note that doing so doesn't prevent you from using Google Analytics. DuckDuckGo doesn't use it, but I think dropping PII from your logs is a step in the right direction regardless of whether you also use external analytics software. (I'm still able to use awstats to produce reports like this.)

Also note that even if you don't want to commit to this forever, you can still do it today and start logging sometime in the future when the need arises. You don't even have to change your privacy policy as you'll be doing something more private anyway.

If you have some form of accounts, you'll obviously need to store some PII. However, that doesn't mean you have to store any for the random Web surfer who hits your site.

 

If you have comments, hit me up on Twitter.
I'm the Founder & CEO of DuckDuckGo, the search engine that doesn't track you. I'm also the co-author of Traction, the book that helps you get customer growth. More about me.