No browser caching for __data file assets?


(Nic Hubbard) #1

I regularly update a lot of PDF files on the www.puc.edu website. I do the update, then let the person know that it has been updated. Then they inevitably check the file and email me back to say that the file is NOT updated. So I check it in Matrix and yes, it has a recent upload date, and the file is fine and updated when I preview it.


I suspect that there is some annoying caching going on with our __data files, and am wondering how I can prevent browsers from caching these? I just want users to ALWAYS be able to get the most recent __data files. We use Squid.



Here is the header for one of the PDF files:


    http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf
    
    GET /__data/assets/pdf_file/0008/67778/Assessment-Report.pdf HTTP/1.1
    Host: www.puc.edu
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:15.0) Gecko/20100101 Firefox/15.0
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip, deflate
    Connection: keep-alive
    Referer: http://www.puc.edu/academics/departments/business-administration-economics/accreditation
    Cookie: __utma=187764009.1655779419.1338400422.1347045986.1347048139.62; __utmz=187764009.1345149796.44.5.utmcsr=Marketing|utmccn=Homepage%20Image%20Link:%20Biology%20Department|utmcmd=email; nmstat=133883177802471644; SQ_SYSTEM_SESSION=m7lehhak2ng0jup9jl39ef3je1; __utmc=187764009; __utmb=187764009.11.10.1347048139
    
    HTTP/1.0 200 OK
    Last-Modified: Thu, 08 Dec 2011 21:23:02 GMT
    Accept-Ranges: bytes
    Content-Length: 350021
    Content-Type: application/pdf
    Date: Wed, 05 Sep 2012 21:58:22 GMT
    Server: Apache/2.2.16 (Debian)
    Etag: "15955f-55745-4b39b45f79980"
    Age: 166902
    X-Cache: HIT from proxy.puc.edu
    X-Cache-Lookup: HIT from proxy.puc.edu:80
    Via: 1.1 www.puc.edu:80 (squid/2.7.STABLE9)
    Connection: keep-alive
    ----------------------------------------------------------


Why is the last modified date so old, even though the file HAS been updated? Cleared the Squid cache and nothing changed.

(David Schoen) #2

[quote]


Why is the last modified date so old, even though the file HAS been updated? Cleared the Squid cache and nothing changed.

[/quote]



I suspect there's something wrong with the ETag behaviour between Squid and Apache in your environment.



Apache should be generating an ETag based on the inode, mtime and file size, see: http://httpd.apache.org/docs/2.2/mod/core.html#fileetag
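
As a rough illustration of how to sanity-check such an ETag: assuming Apache 2.2's default FileETag setting, the three hex fields are inode, size and mtime, so the middle field should decode to the Content-Length. A small sketch (the function name `etag_size` is made up for this example):

```shell
# Print the size field (bytes) from an Apache 2.2 style ETag: "inode-size-mtime", all hex.
etag_size() {
    etag=$(printf '%s' "$1" | tr -d '"')            # strip the surrounding quotes
    size_hex=$(printf '%s' "$etag" | cut -d- -f2)   # second dash-separated field
    echo $((0x$size_hex))                           # hex -> decimal
}

etag_size '"15955f-55745-4b39b45f79980"'
```

For the ETag in the headers posted above this prints 350021, which matches the Content-Length of the response.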



That bit looks ok, but assuming the file genuinely changed recently then the age header ("Age: 166902" = ~2 days) should be accordingly lower.



The fact that the age header is so large implies either:


  • the file was not actually updated (e.g. maybe it was first uploaded with a different filename within the same asset; if that's the case then you need to look into lowering the cache on the pages linking to the PDF)
  • Squid is not checking the ETag with Apache when requests come in



Cheers,
Dave

(Nic Hubbard) #3

[quote]
The fact that the age header is so large implies either:


  • the file was not actually updated (e.g. maybe it was first uploaded with a different filename within the same asset; if that's the case then you need to look into lowering the cache on the pages linking to the PDF)
  • Squid is not checking the ETag with Apache when requests come in

[/quote]

I am sure that I have changed the file.

At this point, what do I do? Is there a setting that I need to check in Squid? Squiz set it up, so I would have assumed all would have been taken care of.

EDIT: Just checked the file after the weekend, and the last-modified header had finally updated. Why has this taken so long? Is there something we need to do in the Squid conf?

(Nic Hubbard) #4

David, any help on this? Anyone else?


(Rhulse) #5

[quote]
David, any help on this? Anyone else?

[/quote]

Hi Nic!



To solve this you'll need to do a bit of detective work.



Set up a private folder in Matrix and upload a PDF to it. Make it live.



On your server do a curl to get the headers only direct from Matrix (pre squid).



Then get the headers via squid



Compare.



Then update the file and repeat, comparing the result.
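
The comparison above could be scripted roughly like this. It is only a sketch: `cache_headers` is an illustrative name, and the direct (pre-Squid) address depends on where Apache listens on your server, which is not stated in this thread.

```shell
# Print only the caching-related headers from a saved header dump, sorted so
# that two dumps can be compared line by line.
cache_headers() {
    tr -d '\r' < "$1" \
        | grep -iE '^(Last-Modified|ETag|Age|Cache-Control|Expires|X-Cache):' \
        | sort
}
```

Save each response with something like `curl -sI <url> > direct.txt` (straight to Apache) and `curl -sI <url> > squid.txt` (through Squid), then compare `cache_headers direct.txt` with `cache_headers squid.txt` before and after updating the file.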



Probably, somewhere, there is a header sent from Matrix to tell Squid to not go back and get the file.



This assumes that the name of the replacement file is the same as the existing file and that Matrix a) allows you to do this and b) overwrites the old file.

(I cannot recall what the behaviour is - it has been sooooo long).



Also, is there a trigger that clears Squid for updated assets? Does this work with static assets? Would it?





Cheers,



Richard


(Nic Hubbard) #6

[quote]


Also, is there a trigger that clears Squid for updated assets? Does this work with static assets? Would it?



[/quote]



Thanks Richard! I will get working on that comparison.



I could set up a trigger, but I manually cleared the PDF cache and I still got the old headers.


(David Schoen) #7

Hi Nic,


I've just tested this on another system and it's not behaving as I recalled (sorry!), here's some clarification.



Here's a simplistic method for testing…



Create a simple file in __data on the server with touch:

    
    $ touch data/public/cache_test.txt


Then from your laptop/desktop watch how the cache object behaves:
    
    $ watch -d 'curl -sv http://example.com/__data/cache_test.txt 2>&1'


The important headers are Age, ETag and X-Cache:
    
    < ETag: "12d04a5-0-4c9751684c840"
    [...]
    < Age: 65
    < X-Cache: HIT from example.com


Also watch apache's log to see what requests are coming in:
    
    [root@foo ~]# tail -f /var/log/httpd/access_log | grep cache_test.txt
    127.0.0.1 - - [12/Sep/2012:09:10:50 +1000] "GET /__data/cache_test.txt HTTP/1.0" 304 - "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
    127.0.0.1 - - [12/Sep/2012:09:12:13 +1000] "GET /__data/cache_test.txt HTTP/1.0" 304 - "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
    127.0.0.1 - - [12/Sep/2012:09:13:54 +1000] "GET /__data/cache_test.txt HTTP/1.0" 304 - "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
    127.0.0.1 - - [12/Sep/2012:09:15:54 +1000] "GET /__data/cache_test.txt HTTP/1.0" 304 - "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
    127.0.0.1 - - [12/Sep/2012:09:18:18 +1000] "GET /__data/cache_test.txt HTTP/1.0" 304 - "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
    127.0.0.1 - - [12/Sep/2012:09:21:13 +1000] "GET /__data/cache_test.txt HTTP/1.0" 304 - "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"


On this (fairly stock) set up the Age header increases for a while and then Squid performs an If-Modified-Since request to Apache with the ETag it has cached previously. If that still matches the ETag Apache knows of, Apache responds with "304 Not Modified" (shown in the log above); otherwise it responds with the entire file again and a "200 OK".

What you want is for Squid to hold on to any object that has a valid ETag (anything under __data) basically indefinitely, but with relatively frequent If-Modified-Since checks. This ensures that Squid can spool large files to low bandwidth users without tying up Apache processes, while still allowing the files to be refreshed fairly often.

What seems to be happening in the stock set up above is that as the file remains stable Squid gradually gains the confidence to hold on to it longer and longer, which really wasn't what I was expecting.
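
One possible explanation for the growing intervals, offered as an assumption rather than something confirmed on this system: when a response carries no Expires or Cache-Control max-age, Squid falls back to its refresh_pattern rules, which treat an object as fresh for a percentage of the time since it was last modified, clamped between a minimum and a maximum. The stock rule `refresh_pattern . 0 20% 4320` would make each interval about 1.2 times the previous one, which is roughly what the 304 timestamps in the log above show (83 s, 101 s, 120 s, 144 s, 175 s). A minimal sketch of that calculation, where the function name and the sample numbers are illustrative:

```shell
# Simplified model of Squid's heuristic freshness lifetime, in seconds.
# $1 = seconds between Last-Modified and the moment Squid fetched or revalidated the object
# $2 = refresh_pattern min (minutes), $3 = percent, $4 = max (minutes)
heuristic_lifetime() {
    lifetime=$(( $1 * $3 / 100 ))
    min=$(( $2 * 60 ))
    max=$(( $4 * 60 ))
    if [ "$lifetime" -lt "$min" ]; then lifetime=$min; fi
    if [ "$lifetime" -gt "$max" ]; then lifetime=$max; fi
    echo "$lifetime"
}

# A file touched 10 minutes ago is rechecked after about 2 minutes (prints 120)...
heuristic_lifetime 600 0 20 4320
# ...while a file last modified 272 days ago hits the 3-day maximum (prints 259200).
heuristic_lifetime 23500800 0 20 4320
```

If this is what is happening, the roughly 272 days between Last-Modified (08 Dec 2011) and Date (05 Sep 2012) in the headers from the first post would let Squid serve the cached PDF for up to three days without revalidating, which would fit "Age: 166902".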

If I find a good fix, I'll make sure to post back here, but maybe the debug methodology and a little more understanding will be enough for someone else to post up a solution.


p.s. the Matrix trigger action "Squid Clear Cache" can be used to send the PURGE request to Squid as rhulse was referring to, which essentially tells Squid to drop the cache object regardless of the ETag or the age (if you can get it to fire in the right spots).


Cheers,
Dave

Edit: trimming an IP.

(Nic Hubbard) #8

[quote]


p.s. the Matrix trigger action "Squid Clear Cache" can be used to send the PURGE request to Squid as rhulse was referring to, which essentially tells Squid to drop the cache object regardless of the ETag or the age (if you can get it to fire in the right spots).



[/quote]



Thanks for the great info.



I used the normal "Clear Squid Cache" tool, and then watched the Live HTTP Headers in Firefox, and I didn't see the last-modified header get changed…found that strange, and it returned the same old file to me…


(David Schoen) #9

[quote]
Thanks for the great info.



I used the normal "Clear Squid Cache" tool, and then watched the Live HTTP Headers in Firefox, and I didn't see the last-modified header get changed…found that strange, and it returned the same old file to me…

[/quote]



The Last-Modified header should be the mtime (modified time) of the file on disk for assets coming from __data. It won't always change when the "Clear Squid Cache" tool is run (e.g. the Squid cache may get cleared for a file that has not changed); the header to watch for is the "Age: " one.



The Age header shows how long (in seconds) an object that Squid has served as a HIT has been in its cache. If you're getting an object out of Squid cache, its age should keep increasing as you send in subsequent requests. If you do a purge of that object it should start counting from 0 again.
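
A small helper for pulling out just that header, as a sketch; `age_of` is an illustrative name, not an existing tool:

```shell
# Read response headers on stdin and print the Age value, or "none" if the
# response carries no Age header (e.g. it did not come out of a cache).
age_of() {
    tr -d '\r' | awk -F': *' '
        tolower($1) == "age" { print $2; found = 1 }
        END { if (!found) print "none" }
    '
}
```

For example, `curl -s -o /dev/null -D - http://example.com/__data/cache_test.txt | age_of` can be run before and after a purge to see whether the counter starts again.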



You can do a purge manually (to unit test the components in your set up, rather than just testing the end-to-end process, which can make it much harder to isolate the failure) by looking in data/private/logs/error.log for what purge command is being used:

    
    [root@foo~]# grep PURGE /var/www/matrix/data/private/logs/error.log | tail
    [13-Sep-2012 09:18:17] Running: /usr/sbin/squidclient -h proxy.example.com -p 80  -m PURGE http://example.com/__data/1234/12345/blah.pdf


If you run this command by hand it should return either a 404 or a 200, e.g. for an object not in cache:
    
    HTTP/1.0 404 Not Found
    Server: squid
    Date: Wed, 12 Sep 2012 23:58:50 GMT
    Content-Length: 0


or for an object currently in cache:
    
    HTTP/1.0 200 OK
    Server: squid
    Date: Wed, 12 Sep 2012 23:58:50 GMT
    Content-Length: 0


anything else implies a configuration problem between the way Matrix is being told to clear Squid cache and how Squid is accepting the purge requests.

It's also worth noting that Firefox can hold on to objects as well, and it's not very clear from Firebug whether a response has come from the Firefox cache or the Squid cache (you have to see whether the line is grey or black). I find it much easier just to use curl: it won't cache on the client side, so you can test the server-side caching independently of the browser and then resolve browser-side issues once the server is working as expected.


Cheers,
Dave

(Nic Hubbard) #10

[quote]
The last-modified header should be the mtime (modified time) of the file on disk for assets coming from __data, this won't always change when the "Clear Squid Cache" tool is run (e.g. Squid cache may get cleared on a file that has not changed), the header to watch for is the "Age: " one.



The age header represents how old (in seconds) an object that Squid has successfully HIT for is in its cache. If you're getting an object out of Squid cache, its age should keep increasing as you send in subsequent requests - if you do a purge of that object it should start counting from 0 again.



[/quote]



You are correct that I can see the age getting higher when using curl a few times.



But, I JUST updated a PDF, and the age is not reset. (using the same filename) Therefore, I get the old file.



I guess I am still confused as to what setting I need to change. I realize there is a problem, but is there a clear answer and steps to take to fix Squid?



Do we have our squid settings wrong?







Should hostname be proxy.puc.edu which is what I get as the X-Cache-Lookup in the headers?


(Rwahyudi) #11

Hi Nic,


It is possible that squidclient is not clearing the right asset.



Try the following :





Go to proxy server and run :

    tail -f /var/log/squid/access.log | grep PURGE


Go to matrix server and clear the cache for that asset.

Go back to proxy server and you should see something along the line of
    [14/Sep/2012:09:23:23 +1000] [TCP_MISS:200] - >  "PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf" - 90b in 000ms referrer: -


Pay attention to the URL. Squid uses the URL as the unique identifier of its cache objects.

If you don't get any PURGE request at all, then there is something wrong with the Clear Squid Cache tool.

Try clearing the cache manually from command line. Go to matrix server and manually run :
    /usr/bin/squidclient/squidclient -h proxy.puc.edu  -p 80 -m PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf


Check squid log files and see if you get anything.

(Nic Hubbard) #12

[quote]
Try clearing the cache manually from command line. Go to matrix server and manually run :

    /usr/bin/squidclient/squidclient -h proxy.puc.edu  -p 80 -m PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf


Check squid log files and see if you get anything.
[/quote]

I don't see any PURGE requests in the logs.

When trying to manually purge, I get the following back:

    HTTP/1.1 501 Method Not Implemented
    Date: Fri, 14 Sep 2012 19:03:11 GMT
    Server: Apache/2.2.16 (Debian)
    Allow: GET,HEAD,POST,OPTIONS
    Vary: Accept-Encoding
    Content-Length: 339
    Connection: close
    Content-Type: text/html; charset=iso-8859-1
    
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>501 Method Not Implemented</title>
    </head><body>
    <h1>Method Not Implemented</h1>
    <p>PURGE to /__data/assets/pdf_file/0008/67778/Assessment-Report.pdf not supported.<br />
    </p>
    <hr>
    <address>Apache/2.2.16 (Debian) Server at www.puc.edu Port 80</address>
    </body></html>

Ideas?

(David Schoen) #13

[quote]
I don't see any PURGE requests in the logs.



When trying to manually purge, I get the following back:


    HTTP/1.1 501 Method Not Implemented
    Date: Fri, 14 Sep 2012 19:03:11 GMT
    Server: Apache/2.2.16 (Debian)
    Allow: GET,HEAD,POST,OPTIONS
    Vary: Accept-Encoding
    Content-Length: 339
    Connection: close
    Content-Type: text/html; charset=iso-8859-1
    
    
    
    501 Method Not Implemented
    
    Method Not Implemented
    
    PURGE to /__data/assets/pdf_file/0008/67778/Assessment-Report.pdf not supported.
    
    Apache/2.2.16 (Debian) Server at www.puc.edu Port 80


Ideas?
[/quote]

Hey Nic,

This is happening because squidclient is trying to connect to Apache with the arguments you've provided, but only Squid can accept the PURGE method.

The hostname and port for the Clear Squid Cache Preferences settings need to be those of your Squid proxy. If Squid and Matrix are on different hosts, you'll need to put in the hostname or IP of the Squid host (if you're using a hostname, make sure it resolves correctly on the Matrix host).

So if Squid is listening on squid.example.com on port 80 then to test it on the command line, you would use:
    
    /usr/bin/squidclient -h squid.example.com  -p 80 -m PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf


Responses of 200 or 404 are ok (as per post #9 - http://forums.squizsuite.net/index.php?showtopic=10009&view=findpost&p=46684), anything else is a problem.

The most common reason why you would get something other than 200/404 when actually connecting to Squid is that the purge acls have not been set up in squid.conf:
    
    acl purgehosts src 127.0.0.1/32
    [...]
    acl PURGE method PURGE
    [...]
    http_access allow PURGE purgehosts
    http_access deny PURGE


Note: you will have to change the first "acl purgehosts ..." line to include any hosts you want to do PURGEs from, 127.0.0.1 is only going to work if both Apache and Squid live on the same host.

So for example if I try to do a PURGE on your server from here I get rejected with 403 because I'm not in the acls allowing it (which is a good thing!):
    
    $ /usr/bin/squidclient -h www.puc.edu  -p 80 -m PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf
    HTTP/1.0 403 Forbidden
    Server: squid/2.7.STABLE9
    [...]


You want to make sure only your Matrix host can do the PURGE requests.
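
The status codes seen so far in this thread can be summarised in a small helper. This is only a sketch of the interpretation given above; `purge_verdict` is an illustrative name and the wording of the messages is an assumption:

```shell
# Map the HTTP status code of a PURGE response to a short diagnosis.
purge_verdict() {
    case "$1" in
        200) echo "purged: object was in the cache" ;;
        404) echo "ok: object was not in the cache" ;;
        403) echo "denied: this host is not allowed by the purge acls" ;;
        501) echo "wrong target: the request reached a server that does not implement PURGE" ;;
        *)   echo "unexpected: check the host, port and Squid configuration" ;;
    esac
}

purge_verdict 501
```

The code itself is the second field of the first line that squidclient prints, e.g. the 403 in "HTTP/1.0 403 Forbidden".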

Once you have it working on the command line, configure Matrix's Clear Squid Cache Preferences settings and then test the Squid Clear Cache Tool.

If the manual tool works, then you can set up the triggers to automate it.


Cheers,
Dave


Edit: typos + purposefully failing squidclient command.

(Nic Hubbard) #14

[quote]
This is happening because squidclient is trying to connect to Apache with the arguments you've provided, but only Squid can accept the PURGE method.



The hostname and port for the Clear Squid Cache Preferences settings need to be that of your Squid proxy, if they are on different hosts you'll need to put in the hostname or IP of that (if you're using a hostname, make sure it resolves correctly on the Matrix host).



So if Squid is listening on squid.example.com on port 80 then to test it on the command line, you would use:

    
    /usr/bin/squidclient -h squid.example.com  -p 80 -m PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf

[/quote]



What exactly is the visible_hostname? Just whatever we want? Right now it is proxy.puc.edu which isn't even a real hostname.



I checked our Clear Squid Cache Prefs in Matrix and copied exactly what is there.



I am trying:


    /usr/bin/squidclient -h 127.0.0.1 -p 80 -m PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf


Strangely, still seeing the 501 error.

Should I post my squid.conf?

(Nic Hubbard) #15

David, any more help on this? Please see my post above.


(David Schoen) #16

[quote]
What exactly is the visible_hostname? Just whatever we want? Right now it is proxy.puc.edu which isn't even a real hostname.



I checked our Clear Squid Cache Prefs in Matrix and copied exactly what is there.



I am trying:


    /usr/bin/squidclient -h 127.0.0.1 -p 80 -m PURGE http://www.puc.edu/__data/assets/pdf_file/0008/67778/Assessment-Report.pdf


Strangely, still seeing the 501 error.

Should I post my squid.conf?
[/quote]

The squid.conf might be useful, but what would be more useful is knowing whether Squid is running on the same host as Apache or not. The 501 error you posted earlier came from Apache, not Squid, so Matrix is configured to talk to Squid on either the wrong host or the wrong port (i.e. it's talking to Apache instead of Squid). It's probably easiest to get the squidclient command working first before trying to get the settings in Matrix correct (which are definitely incorrect at the moment).

visible_hostname can be set to whatever you want.

I can't tell if your squid.conf is appropriate until you adjust the squidclient arguments to purge against Squid, rather than apache.

You can use netstat to see where squid is actually listening:
    
    [root@foo-proxy ~]# netstat -natpl | egrep 'LISTEN.*squid'
    tcp        0      0 :80             0.0.0.0:*                   LISTEN      2403/(squid)        
    tcp        0      0 :443            0.0.0.0:*                   LISTEN      2403/(squid) 


p.s. you mentioned earlier that Squiz set up Squid initially - are you able to engage Squiz Support again to audit which bit of config is currently out of sync?


Cheers,
Dave

Edit: mentioning visible_hostname.

(Nic Hubbard) #17

Thanks Dave. I think I will contact support about this and see if they missed a step when they installed Squid about a year ago.