Set up a cache proxy with Squid

Today I’m going to explain how to set up a cache proxy within your local network. A cache proxy is a system that stores frequently accessed web objects for fast retrieval. It works well with static content such as HTML pages, CSS stylesheets, JavaScript files, images and, if correctly configured, even downloaded files.

This approach has some advantages:

  • on a congested network you can still open webpages faster, because some content doesn’t need to be retrieved from the internet but comes from a local cache (within your local network);
  • you can add a parental control and/or an antivirus layer to check which pages can be opened from the computers on the network (those properly configured to use the proxy).

Obviously there are some disadvantages, such as the fact that you can’t be sure that the cached objects are fresh (unchanged), so you may encounter odd problems with some websites; you can also run into issues with audio/video content. Most of these problems can be avoided with proper configuration.

Let’s start with the installation and configuration of Squid on a home server running Arch Linux; the procedure is almost the same on other distributions.

First, install the package using your package manager (pacman if you’re using Arch Linux):

pacman -S squid

Then you need to configure Squid. To do so, open /etc/squid/squid.conf and read the comments carefully. There are a lot of options, but you really need to check and change only a few of them:

  • http_port: the port where Squid will listen for requests, usually 3128, but you can change it without problems;
  • http_access: these lines define the access permissions to the proxy. Usually you want to allow access for localhost and localnet and then deny everything else. To do so (it should already be in the default configuration file):
    # Define what is localnet
    acl localnet src 10.0.0.0/8
    acl localnet src 172.16.0.0/12
    acl localnet src 192.168.0.0/16
    acl localnet src fc00::/7
    acl localnet src fe80::/10
    # Enable localhost and localnet
    http_access allow localnet
    http_access allow localhost
    # And finally deny all other access to this proxy
    http_access deny all
  • cache_mgr: the email address of the cache manager, shown on error pages;
  • shutdown_lifetime: how long Squid keeps serving active connections before actually stopping when a shutdown is requested;
  • cache_mem: the amount of RAM used as a buffer for requests; use at least 256-512MB to have decent performance;
  • visible_hostname: the hostname of your server;
  • fqdncache_size: the size of the resolved-domain cache; use at least 1024;
  • maximum_object_size: the maximum size of objects in the cache; set this to at least 10MB, otherwise you’ll only cache small files (no large images, for example);
  • cache_dir: the location of the cache. This parameter is quite complex; it’s defined as:
    cache_dir ufs /var/cache/squid 20000 16 256: first the storage scheme (ufs), then the location of the cache (/var/cache/squid), then the maximum size (20000MB, or ~20GB), then the number of folders at the first level (16) and finally the number of folders at the second level (256). To be honest, you just have to change the maximum size to a serious amount such as 20-100GB. More cache means more files that don’t need to be retrieved from the internet.
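
Putting the directives above together, the relevant part of squid.conf might end up looking like this sketch (the email address and hostname are placeholders, the sizes are just the examples from this section, and the localnet ACL is the one shown above):

```
# Listening port
http_port 3128
# Allow the local network and localhost, deny everyone else
http_access allow localnet
http_access allow localhost
http_access deny all
# Administrative settings (placeholder values)
cache_mgr admin@example.com
visible_hostname homeserver
# Wait 10 seconds before shutting down
shutdown_lifetime 10 seconds
# DNS and cache sizing
fqdncache_size 1024
cache_mem 512 MB
maximum_object_size 10 MB
cache_dir ufs /var/cache/squid 20000 16 256
```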

After this initial configuration, where only two settings directly affect caching efficiency (cache_mem and cache_dir), there are some really important directives, the refresh_pattern rules, that tell Squid what to cache and for how long.

Each refresh_pattern directive uses a regular expression that matches objects by extension and/or name, followed by a minimum lifetime, a percentage and a maximum lifetime; note that both lifetimes are expressed in minutes, not seconds. The percentage decides whether an object whose age falls between the two limits is still fresh, based on how old the object already was when Squid fetched it. For example:

  • 10080 90% 43200: the item is considered fresh if its age is under 10080 minutes (7 days) and stale (it must be revalidated with the origin server) if its age is over 43200 minutes (30 days); in between, it stays fresh as long as its age is below 90% of the time since its last modification, so it remains fresh for a long time;
  • 1440 20% 10080: same as above: under 1440 minutes (1 day) the item is fresh, over 10080 minutes (7 days) it is stale, and in between it stays fresh only while its age is below 20% of the time since its last modification, so it goes stale quickly.

A high percentage means that an object is unlikely to change; a low percentage should be used for items that will probably change often. This is not an exact science: if an element changes (such as a new CSS file or a newer version of a JavaScript file) your browser may still be served the older version, so don’t use overly long lifetimes for content that changes often. Be sure to read the official documentation for a more in-depth explanation.
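
To see how the percentage works in practice, here is a simplified sketch of the freshness check in shell (the variable names are mine; Squid’s real algorithm handles more cases, such as responses without a Last-Modified header):

```shell
#!/bin/sh
# Rule "1440 20% 10080" applied to an object that Squid stored 3000
# minutes ago, and that was last modified 10000 minutes before caching.
age=3000       # minutes since Squid stored the object
lm_age=10000   # minutes between last modification and caching
min=1440; percent=20; max=10080

if [ "$age" -le "$min" ]; then
  state=fresh   # always fresh below the minimum
elif [ "$age" -ge "$max" ]; then
  state=stale   # always stale above the maximum
elif [ $((age * 100)) -lt $((lm_age * percent)) ]; then
  state=fresh   # age is still below 20% of the modification age
else
  state=stale
fi
echo "$state"   # prints "stale": 3000 is more than 20% of 10000
```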

My configuration is:

refresh_pattern ^ftp: 1440 20% 10080
refresh_pattern ^gopher: 1440 0% 1440
refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.(iso|avi|wav|mp3|mp4|mpeg|swf|flv|x-flv)$ 43200 90% 432000 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.(deb|rpm|exe|zip|tar|tgz|ram|rar|bin|ppt|doc|tiff)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.index.(html|htm)$ 0 40% 10080
refresh_pattern -i \.(html|htm|css|js)$ 1440 40% 40320
refresh_pattern . 0 40% 40320

Let’s examine it line by line:

  • FTP objects are fresh under 1440 minutes (1 day) and stale after 10080 (7 days), but they are likely to change (20%);
  • gopher objects are fresh under 1440 and then stale;
  • cgi-bin and query URLs (dynamic scripts such as PHP) are never cached because, you know, they change every time…
  • images are fresh under 10080 (7 days) and stale after 43200 (30 days), and they are unlikely to change (90%);
  • videos are fresh under 43200 (30 days) and stale after 432000 (300 days), and they are unlikely to change (90%);
  • archives are fresh under 10080 and stale after 43200, and they are unlikely to change (90%);
  • index pages have no guaranteed freshness window (minimum 0); depending on the 40% factor they can stay fresh for up to 10080 minutes;
  • other HTML pages, CSS and JavaScript files are fresh under 1440 minutes and stale after 40320 (28 days);
  • everything else also has no guaranteed freshness window and is stale after 40320 at the latest.

For special cases, such as Windows Update archives, you can find ready-made lines on the internet. Keep in mind that the first line that matches is used, so put the more specific rules before the generic ones.
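
For instance, a line often circulated for caching Windows Update downloads looks like the one below; it comes from community examples, so verify it against a source you trust before adding it:

```
refresh_pattern -i windowsupdate.com/.*\.(cab|exe|msi|msu|psf)$ 43200 100% 43200 reload-into-ims
```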

Finally, enable and start the daemon. On Arch Linux, which uses systemd, this can be accomplished with these two commands:

systemctl enable squid
systemctl start squid
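
Before the first start it’s also worth validating the configuration and creating the cache folders, since with a ufs cache_dir Squid needs the directory structure to exist:

```
# Check squid.conf for syntax errors
squid -k parse
# Create the cache directory structure (run once, before the first start)
squid -z
```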

A few final considerations:

  • install a tool like Webmin on your server, so you can check Squid’s statistics and see the cache-hit percentage;
  • remember that the browser cache may alter the statistics, since objects found there are retrieved locally and never reach the Squid cache; for testing purposes disable the browser cache, then set it to a low amount (SSD disks will benefit and you save space);
  • the more computers use the cache, the fresher the cache stays and the higher the cache-hit percentage you can expect;
  • to avoid overkill, Squid should be used on networks that have at least 2-3 computers; otherwise the only benefit is that you can have a huge cache (gigabytes instead of megabytes);
  • the cache-hit percentage should be at least 15-20%, but don’t expect values like 80-90%, because HTTPS traffic is never cached (and it’s better this way, since caching it requires intercepting encrypted connections, which is better avoided) and because not all objects are cacheable (such as PHP pages).
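
If you’d rather check the hit ratio from the command line instead of Webmin, Squid’s cache manager can be queried with squidclient (shipped with the squid package):

```
# Overall statistics, including "Hits as % of all requests"
squidclient mgr:info | grep -i hits
```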

Next time I’ll show you how to configure and install an antivirus layer using ClamAV. As always, if you have any questions feel free to contact me using the comments below 🙂