The other day I needed to get the URLs for all pages in my blog for some PowerShell scripting I wanted to do. Like most websites, this blog has a sitemap, and I wanted to use that as a source.
As I could not find any existing PowerShell scripts on the web that I could use, I just wrote one myself.
Now I'd like to share this script with you.
At the end of the article, you will find the link to the source code.
Basic usage
Just execute the script with a URL to an XML Sitemap.
.\ConvertFrom-SiteMap.ps1 -Url https://blog.hompus.nl/sitemap.xml
Which results in something like this:
loc        : https://blog.hompus.nl/
lastmod    : 2017-02-10T23:57:52+00:00
changefreq : daily
priority   : 1.0

loc        : https://blog.hompus.nl/category/azure/
lastmod    : 2017-02-08T17:43:41+00:00
changefreq : weekly
priority   : 0.3

loc        : https://blog.hompus.nl/category/csharp/
lastmod    : 2016-12-14T08:53:46+00:00
changefreq : weekly
priority   : 0.3
Displaying only the first 3 entries
In its XML tag definitions section, the Sitemap protocol states that the lastmod, changefreq and priority tags are optional, so they may be missing when you query a different sitemap.
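To illustrate what a missing optional tag looks like in PowerShell, here is a minimal sketch that parses a small inline sitemap instead of calling the script. The sample XML, the example.com URLs and the variable names are made up for this illustration, and this is not necessarily how ConvertFrom-SiteMap.ps1 handles it internally. The fallback of 0.5 is the default priority named in the Sitemap protocol.

```powershell
# Hypothetical sample sitemap: the second entry only has the required <loc> tag.
[xml]$sitemap = @'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2017-02-10T23:57:52+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
  </url>
</urlset>
'@

$invariant = [cultureinfo]::InvariantCulture

$entries = foreach ($url in $sitemap.urlset.url) {
    [pscustomobject]@{
        loc        = $url.loc
        # A missing tag comes back as $null (without strict mode), so guard before parsing.
        lastmod    = if ($url.lastmod) { [datetimeoffset]::Parse($url.lastmod, $invariant) } else { $null }
        changefreq = $url.changefreq
        # Fall back to the protocol's default priority of 0.5 when the tag is absent.
        priority   = if ($url.priority) { [double]::Parse($url.priority, $invariant) } else { 0.5 }
    }
}

$entries | Format-List
```

With typed values like these, later sorting and filtering on lastmod or priority compares dates and numbers rather than text.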
Now you can do basic PowerShell manipulation of the result set like sorting, selecting, filtering and formatting. For example:
.\ConvertFrom-SiteMap.ps1 -Url https://blog.hompus.nl/sitemap.xml | Sort-Object priority -Descending | Select-Object priority, changefreq, loc -First 5 | Where-Object { $_.priority -ge 0.3 } | Format-Table
Outputs:
priority changefreq loc
-------- ---------- ---
1.0      daily      https://blog.hompus.nl/
0.6      weekly     https://blog.hompus.nl/about/
0.6      weekly     https://blog.hompus.nl/archives/
0.3      weekly     https://blog.hompus.nl/category/windows/
0.3      weekly     https://blog.hompus.nl/category/visual-studio/
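A small variation on the pipeline above, as a sketch: it assumes the priority values arrive as text (which is common for values read from XML) and casts them to double so that sorting and filtering compare numbers. The sample objects below are hypothetical stand-ins for the script's output, so the snippet runs without a network call.

```powershell
# Hypothetical sample data shaped like the script's output; priority is text here.
$entries = @(
    [pscustomobject]@{ loc = 'https://example.com/a/'; changefreq = 'weekly';  priority = '0.3' }
    [pscustomobject]@{ loc = 'https://example.com/';   changefreq = 'daily';   priority = '1.0' }
    [pscustomobject]@{ loc = 'https://example.com/b/'; changefreq = 'monthly'; priority = '0.1' }
    [pscustomobject]@{ loc = 'https://example.com/c/'; changefreq = 'weekly';  priority = '0.6' }
)

$top = $entries |
    Where-Object { [double]$_.priority -ge 0.3 } |        # filter first, on numeric values
    Sort-Object { [double]$_.priority } -Descending |     # then sort numerically
    Select-Object priority, changefreq, loc -First 2      # finally keep the top entries

$top | Format-Table
```

Placing Where-Object before Select-Object means the filter sees every entry, not only the first few that were selected.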
Ignoring Sitemap Index entries
A sitemap can also be a Sitemap Index File. The file then contains links to other sitemap files.
By default, the script will follow these links. If you don’t want this to happen, you can set the NoFollow switch.
.\ConvertFrom-SiteMap.ps1 -Url https://blog.hompus.nl/sitemap.xml -NoFollow
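For readers curious how a Sitemap Index File differs from a regular sitemap, here is a rough sketch that checks the root element of a parsed document. In the Sitemap protocol an index uses a sitemapindex root with sitemap entries, while a regular sitemap uses urlset with url entries. The inline XML, the example.com URLs and the variable names are invented for this illustration, and the actual script may detect this differently.

```powershell
# Hypothetical sample Sitemap Index File with two child sitemaps.
[xml]$doc = @'
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>
'@

# The root element name tells the two document types apart.
$isIndex = $doc.DocumentElement.LocalName -eq 'sitemapindex'

if ($isIndex) {
    # Collect the child sitemap URLs; following them means requesting each one in turn.
    $childSitemaps = @($doc.sitemapindex.sitemap | ForEach-Object { $_.loc })
    $childSitemaps
}
else {
    # A regular sitemap: list the page URLs directly.
    $doc.urlset.url | ForEach-Object { $_.loc }
}
```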
Source code
The complete source code is available as a GitHub Gist: ConvertFrom-SiteMap.ps1.