Sitemap Generator – Must-Follow Patterns
Since version 0.95, Sitemap Generator lets you set “must-follow patterns”. A must-follow pattern looks like this: *blog/*
In the above example we tell the crawler to crawl only pages that include the text “blog/” in the URL. Pages like “http://wonderwebware.com/blog/index.html” or “/blog/index.html” will be added to the sitemap, but pages that do not include the text “blog/” in the URL will be skipped. The asterisk ( * ) tells the crawler that anything or nothing in its place fits the pattern, so if the pattern is *blog/*, all of the following will be spidered:
Note that you can also write the pattern in a different form, for example:
but in this case, if a link in a given HTML page looks like “/nofollow/program.html”, it will not be crawled (because the must-follow pattern requires the full domain URL to be present in the link). So, to crawl all pages from the /nofollow/ folder, you should use a different pattern:
Now, no matter how the link is written in the HTML page, if it contains the text “nofollow/” it will be added to the sitemap.
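The difference between the two forms can be sketched in a few lines of Python. This is only an illustration of the wildcard rule described above, not Sitemap Generator's actual code: the function name `matches_must_follow`, the full-domain pattern, and the sample links are all assumptions made for the example, and the sketch treats the match as case-sensitive, which the program may or may not do.

```python
import re

def matches_must_follow(pattern: str, link: str) -> bool:
    """Return True if the link fits the pattern ('*' = anything or nothing)."""
    # Escape each literal piece, then let every asterisk match any text.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, link) is not None

# Hypothetical links, written the two ways they may appear in a page.
absolute_link = "http://www.mysite.com/nofollow/program.html"
relative_link = "/nofollow/program.html"

# Assumed example of a pattern that spells out the full domain.
domain_pattern = "http://www.mysite.com/nofollow/*"
# Pattern that only requires the text "nofollow/" somewhere in the link.
loose_pattern = "*nofollow/*"

for link in (absolute_link, relative_link):
    print(link,
          matches_must_follow(domain_pattern, link),
          matches_must_follow(loose_pattern, link))
```

In this sketch the full-domain pattern accepts only the absolute link, while the pattern with asterisks on both sides accepts both ways of writing it.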
IMPORTANT NOTE: If you set a must-follow pattern, you must add a start page that fits the pattern to the “start-pages” box. For example, if we have the must-follow pattern below:
we must add a page that fits that pattern to the Start Pages box, in our example the main /blog/ page:
Otherwise we will get nothing. If we just leave the default start page (www.mysite.com), no link will be followed, because the very first page visited by the spider will not fit the must-follow pattern.
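To see why the start page matters, here is a small Python sketch of a crawl over a made-up site. It assumes, as the note above describes, that a page which does not fit the pattern is skipped and its links are never read. The `SITE` dictionary, the `crawl` function, and the page addresses are hypothetical and only stand in for real pages; this is not how Sitemap Generator is implemented internally.

```python
import re
from collections import deque

def matches_must_follow(pattern, link):
    """Wildcard match: '*' stands for anything or nothing."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, link) is not None

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "http://www.mysite.com/": ["http://www.mysite.com/blog/",
                               "http://www.mysite.com/about.html"],
    "http://www.mysite.com/blog/": ["http://www.mysite.com/blog/post1.html",
                                    "http://www.mysite.com/"],
    "http://www.mysite.com/blog/post1.html": ["http://www.mysite.com/blog/"],
    "http://www.mysite.com/about.html": [],
}

def crawl(start_pages, pattern):
    """Breadth-first crawl that only visits pages fitting the pattern."""
    sitemap = []
    queue = deque(start_pages)
    seen = set(start_pages)
    while queue:
        page = queue.popleft()
        if not matches_must_follow(pattern, page):
            continue  # page is skipped, so its links are never read
        sitemap.append(page)
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sitemap

# Default start page does not fit the pattern: nothing is collected.
print(crawl(["http://www.mysite.com/"], "*blog/*"))
# Start page inside /blog/ fits the pattern: the blog pages are collected.
print(crawl(["http://www.mysite.com/blog/"], "*blog/*"))
```

With the default start page the first call returns an empty list, because the crawl stops at the very first page. Starting from the /blog/ page, the second call collects the blog pages and skips the home page.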