Excluding type 2 linked assets from sitemap.xml


(Evan Wills) #1

We are using an asset listing to generate our sitemap.xml, which works fine except that Link "Type 2" URLs are showing up in the sitemap. The problem is that when I change "Direct Links Only" to "Yes", the number of URLs drops from 9,673 to 54 (losing 9,619 URLs), which I know is far fewer than the number of Link "Type 1" assets we have on the site.

 

Is there a way of excluding Type 2 assets from a listing without setting "Direct Links Only" to "Yes"?

 

I've had a look through paint layout conditional keywords and keyword modifiers but none of these seem to offer any solution. My current solution is to create a recursively nested asset listing but that's going to really hurt when it's generated (if I can get it to work).


(Evan Wills) #2

Update

 

The recursive asset listing works... Sort of...

 

When listing fewer than a couple of thousand URLs it renders. However, once I get over a certain limit (I haven't worked out how many), I get a "504 Gateway Time-out".

 

For anyone who wants to try this at home:

 

I'm using two asset listings.

The first one just uses static root nodes and a custom design to deliver the "application/xml" MIME type. The "Default Format" has two <DIV>s: the first renders the XML for the URL, and the second nests the second listing.

The second asset listing uses the same static root nodes but with "Dynamic Parameters" - SESSION Variable Name: list_current_asset_id. The "Default Format" has a linked XML <DIV> from the first asset listing and a nested content <DIV> nesting the listing itself.

 

Because we don't want to list certain types of assets but do want to list their children, I'm using custom formats for those assets but not rendering the XML for the URL, only nesting the listing itself. (if that makes any sense at all)
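
The idea behind the two nested listings can be sketched outside Matrix as a plain recursive walk. This is only an illustration of the logic, not Matrix configuration: the `Asset` class, the `SKIP_TYPES` set, the type codes and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical stand-in for an asset in the tree."""
    url: str
    type_code: str
    children: list["Asset"] = field(default_factory=list)


# Hypothetical: asset types whose own URL is left out of the sitemap,
# while their children are still visited (the "custom format" case).
SKIP_TYPES = {"folder"}


def collect_urls(asset, skip_types=SKIP_TYPES):
    """Return the URLs to list, depth first, starting at `asset`."""
    urls = []
    if asset.type_code not in skip_types:
        urls.append(asset.url)  # render the XML for this URL
    for child in asset.children:
        # Nest the listing again, with the child as the current asset.
        urls.extend(collect_urls(child, skip_types))
    return urls


site = Asset("https://example.com/", "page_standard", [
    Asset("https://example.com/docs", "folder", [
        Asset("https://example.com/docs/intro", "page_standard"),
    ]),
    Asset("https://example.com/about", "page_standard"),
])

print(collect_urls(site))
```

Every level of the tree triggers another nested listing, which is why the cost grows with the number of assets rather than staying at one query.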


(Tim Davison) #3

You should be able to do this with one asset listing. In the details section there should be link types to limit it by. Default is TYPE1+2 for asset listings. Just change this to TYPE1 only and that should do what you want.

 

Direct links only will be limiting you to a single parent-child set, which I suppose you could nest recursively in theory, but I'm not sure the 'list_current_asset_id' value updates itself like that, i.e. when it exits one recursive scope I don't think it returns to the enclosing context.


(Evan Wills) #4

Hi Tim

 

When I turn on Type 1 only, I lose all the grandchild assets (and their descendants) of the root node(s).

The recursive "list_current_asset_id" technique works when I'm listing a subsection of my site (say 3,000 assets) but times out when I try to list my whole site.


(Tim Davison) #5

That's odd.  Make sure direct links only is turned off.  It should follow all TYPE1 links and include the children.

 

Or are there TYPE1's under TYPE2's?  If that's the case then no, they won't be followed. But how would you represent them in the sitemap anyway?

 

E.g.

Root
  |- a (TYPE1)
  |  |- a-a (TYPE1)
  |  |- a-b (TYPE1)
  |  |- a-c (TYPE2)
  |
  |- b (TYPE2)
  |  |- b-a (TYPE1)
  |  |- b-b (TYPE1)
  |  |- b-c (TYPE2)

I would only expect a, a-a, and a-b to be listed.  Nothing under b should be listed.  If you are trying to get the TYPE1's under b to show up, then yeah, you'll need some kind of recursion as you are doing.
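
To make the difference concrete, here is a small sketch of the example tree above with two traversals: one that follows TYPE1 links only, and one that descends through every link but lists only TYPE1-linked assets. It models the expectation described here, not how Matrix is actually implemented; the `TREE` dictionary and both function names are made up for illustration.

```python
# The example tree: parent -> list of (child, link type).
TREE = {
    "Root": [("a", 1), ("b", 2)],
    "a": [("a-a", 1), ("a-b", 1), ("a-c", 2)],
    "b": [("b-a", 1), ("b-b", 1), ("b-c", 2)],
}


def follow_type1_only(node, tree=TREE):
    """List descendants reachable through TYPE1 links only."""
    listed = []
    for child, link_type in tree.get(node, []):
        if link_type == 1:
            listed.append(child)
            listed.extend(follow_type1_only(child, tree))
    return listed


def type1_anywhere(node, tree=TREE):
    """Descend through every link, but list only TYPE1-linked assets."""
    listed = []
    for child, link_type in tree.get(node, []):
        if link_type == 1:
            listed.append(child)
        listed.extend(type1_anywhere(child, tree))
    return listed


print(follow_type1_only("Root"))  # ['a', 'a-a', 'a-b']
print(type1_anywhere("Root"))     # ['a', 'a-a', 'a-b', 'b-a', 'b-b']
```

The second traversal is the one that needs recursion, because it has to pass through b without listing it.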

 

I was considering if there was a keyword so you could display everything and just hide TYPE2's, or use some if/else to show or hide it but I can't see any keywords for the type linking.

 

And the timeout, I reckon that's just processing of the recursion taking a long time.  Can check /_performance but I reckon you're over the limit.  Can possibly increase the timeout limit for your system, let the system cache handle public performance.


(Evan Wills) #6

Hi Tim

 

Trouble is that when you turn off "Direct Links Only", you lose link type filtering, so you get a, a-a, a-b, a-c, b, b-a & b-c.

 

Hopefully one of the Squiz crew will chime in with some advice.

 

PS: Thanks for your input though.


(Tim Davison) #7

"Trouble is that when you turn off 'Direct Links Only', you lose link type filtering"

 

Hmm, I did not know that.  Useful piece of information I'm filing in my head for the future.


(Bart Banda) #8

What's the reason for not using a sitemap asset to generate the sitemap? It's no surprise that recursively nesting an asset listing with that many pages times out; it would need a big web/PHP memory limit set in order to work.

 

By the way, we are adding a new feature soon that will allow you to sort by tree order without having direct links turned on.


(Evan Wills) #9

Hi Bart.

 

No. It's definitely no surprise that the recursive listing doesn't work.

We're having issues with our sitemap.xml (generated by an asset listing), which we submit to search engines: link type 2 assets are being shown in the sitemap and thus exposed to Google Search.

 

I was poking at the Sitemap asset to see if I could get around the problem. But that's for rendering a visual sitemap and so won't do what I need.

 

Do you have any suggestions?

 

Tree order stuff sounds good. I've been rocking the conditional keywords for the last week or so. (Since I read your comment on squizmap.) Thanks for that tip. It's changed my life and opened up some really cool possibilities.


(Bart Banda) #10

Hmm, I see. Asset listing would need to do it then, but is there a reason why you need to have the sort order as the tree order actually? Google doesn't really care about the order of the links in the sitemap.xml AFAIK so maybe you could try and just list everything that you need indexed by Google?

 

Alternatively, if you guys are using Funnelback, it could probably also generate that xml file from its index.

 

Out of curiosity, how big is the whole site that you need to have in the xml file? How many pages? Do they all actually need to be in the xml sitemap? Or could you just have the ones that provide links to other pages on them so that the crawlers can get to those non-sitemap links anyway?


(Evan Wills) #11

Hi Bart

 

Sorry. Tree order stuff is relevant for other things I'm doing but not to this use-case. So, no there's no reason we need tree order.

 

As for how big: I think it's somewhere around 15,000 URLs. And no, probably not many of them need to be in the sitemap. If we were only to include pages in the sitemap that listed other pages, then we'd need to work out a way of specifying which pages should (or should not) be in the sitemap.


(Bart Banda) #12

Yeah, that's tricky, and if there is no common pattern you can rely on, that makes it even harder. With a site that big you may not even need a sitemap if you can confirm that spiders can get to all pages that you want indexed; as long as you have a good website implementation, it shouldn't be an issue. Your other option is probably only Funnelback, or a third-party sitemap XML builder that can do it for you.
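
For the external-builder route, a minimal sketch of what such a tool does is below, assuming you already have the list of URLs you want indexed (from a crawl, an export, or a search index). The `build_sitemap` function and the example URLs are hypothetical; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as UTF-8 bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url  # ElementTree escapes '&' and '<' on output
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical URLs; in practice these would be the pages you want indexed.
    pages = ["https://example.com/", "https://example.com/about"]
    with open("sitemap.xml", "wb") as handle:
        handle.write(build_sitemap(pages))
```

The protocol allows up to 50,000 URLs per sitemap file, so a site of around 15,000 URLs fits in a single file.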


(Evan Wills) #13

Hi Bart

 

Thanks for all that input. I suspected we'd need an outside option. I'll see what we can do with Funnelback.