Blocking bots


(Andrew Harris) #1

One of our site admins has asked a question about blocking access from search engines. For business reasons the site is public; they just don't want it indexed.

 

They've used robots.txt correctly, and Google has obeyed, but some other search engines haven't. On an Apache server we could play with .htaccess and probably block them that way, but I read this article https://yoast.com/prevent-site-being-indexed/ with interest, and wondered if it was possible/sensible to implement this header in a parse file, in a similar way to what I've done with mime-types.

<MySource_PRINT id_name="__global__" var="content_type" content_type="application/vnd.google-earth.kml+xml" />
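
For reference, the header that article describes is just an HTTP response header, something like:

X-Robots-Tag: noindex, nofollow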

Alternatively, is it possible to 'deny' a user group based on certain parameters (e.g. user agent) when you have already granted public read? Or is that cart before horse?

 

Sorry, being a bit lazy here, asking before trying stuff out myself. I'm under pressure for quick answers and was just hoping someone had already travelled this path.


(Andrew Harris) #2

Actually, not so lazy - I did try the header thing, but couldn't get anything working ;-)


(Joel Porgand) #3

Pretty sure you'll have to set that up in Apache. 
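
Something like this in .htaccess should do it, assuming mod_headers and mod_rewrite are available (the bot names are just placeholders):

# Ask all crawlers not to index anything served from here (needs mod_headers)
Header set X-Robots-Tag "noindex, nofollow"

# Refuse requests outright from specific user agents (needs mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilCrawler) [NC]
RewriteRule .* - [F,L]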


(Nic Hubbard) #4

I have a feeling that this is never going to be completely possible no matter how hard you try.


(Anthony) #5

I thought there was a meta tag you could include in the page's <head> that some search engines obey? Something like this:

<meta name="robots" content="noindex">


(Benjamin Pearson) #6

Was the robots.txt set up properly? Can you check that the syntax is correct and that the file is public and accessible?
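
For blocking everything it should just be:

User-agent: *
Disallow: /

and it needs to sit at the site root, e.g. http://example.com/robots.txt.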


(Andrew Harris) #7

As I said in the original post… "They've used robots.txt correctly, and Google has obeyed". It's the spiders which don't obey (robots.txt or meta tags) that they're hoping to keep out.

 

If it was on a standard web server, I'd tell 'em to use .htaccess to 'deny' whatever user-agent strings they wanted to block (along the lines of the snippet below). I just wondered if there was any way of doing this in Matrix.
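
Untested here, and "BadBot" is a placeholder, but on Apache 2.4 it would be roughly:

# Flag the unwanted crawler by its user-agent string (needs mod_setenvif)
SetEnvIfNoCase User-Agent "BadBot" bad_bot

# Allow everyone except flagged requests
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>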

I haven't tested conditional membership of a 'denied' user group. I don't know if it's efficient/recommended, or even possible once you've granted public access, but this seems to be my only remaining option. I don't want to be adding to the server-level .htaccess on a site-by-site basis, especially when the reasons aren't at all critical.

 

To be honest, if it can't be done, I'm not going to lose sleep.

Thanks for the suggestions.


(Kequi) #8

Maybe use an IF statement in the parse code:

 

http://manuals.matrix.squizsuite.net/designs/chapters/show-if-design-area#user-agent 
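
Going from memory, the parse-file version would look roughly like this (the exact SET names need checking against that manual page, and "BadBot" is a placeholder):

<MySource_AREA id_name="bot_check" design_area="show_if">
  <MySource_SET name="condition" value="user_agent" />
  <MySource_SET name="condition_user_agent" value="BadBot" />
  <MySource_THEN>
    <!-- what the matching user agent sees -->
  </MySource_THEN>
  <MySource_ELSE>
    <!-- normal page content -->
  </MySource_ELSE>
</MySource_AREA>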

 

It's probably an impossible task though - if it's publicly accessible on 'tinternet, someone's going to be able to get access.


(Andrew Harris) #9

@Karl - yeah, thanks. That's actually not a bad idea.

I realise it's a terrible use case. As you say, impossible, really.


(Bart Banda) #10

If your site is also cached by a proxy such as Squid, there won't be much you can do in Matrix either, as the bots will probably hit your Squid-cached pages rather than Matrix, so no conditional logic you put in Matrix will have much effect.


(Andrew Harris) #11

ahh - good point, Bart.

Thanks.