A Comprehensive Guide From Semalt To Sitemap In XML Format (sitemap.xml)
Many articles and guides have been written about sitemap.xml over the years, but despite the wealth of knowledge at our fingertips, I still encounter errors in this area. So I decided to collect and systematize what is known. If you have your own website and are not sure whether your sitemap file was created correctly, I invite you to keep reading.
What exactly is sitemap.xml?
Sitemap.xml, i.e. a sitemap, is a file that should contain a list of pages that are important to us within the site. It is created so that indexing robots can more easily reach resources that are important to us, especially those newly created and those that are difficult to access due to the structure of our website or the way of internal linking.
According to Google's guidelines, a single sitemap file should not exceed 50 MB (uncompressed) or 50,000 URLs. The sitemap file should be created in XML format.
Did you know?
XML is an acronym for Extensible Markup Language. XML files are used to transmit data in a structured way because they are platform-independent, which is why this format is so popular and universal.
As a website administrator, blogger or e-commerce owner, you may be tempted to include every subpage in the sitemap. Why is this not a good choice? Think of pages like terms and conditions, which every page links to with rel="nofollow", or pages marked noindex. You will learn more about which URLs should be included in your sitemap later in this article.
What data does sitemap.xml consist of?
As mentioned earlier, the XML format allows data to be presented in a systematic way. A sitemap consists mainly of tags, each with a specific role. Using this format ensures that everyone submits URL information in the same way, so that crawlers can easily read it. Below you will find information about the 3 most important tags, without which the sitemap cannot exist.
Important! The sitemap.xml file should be UTF-8 encoded!
The most important tags in sitemap.xml
<urlset> Encloses the entire file and references the current protocol standard. It is the opening and closing element of every sitemap.xml file and contains all the other tags.
<url> The parent tag of each URL entry that we want Google crawlers to find. Each <url> tag must contain exactly one <loc> tag for the record to be valid. It can be enriched with additional, optional tags, which I will describe later.
<loc> Short for location, this tag indicates the address of a given subpage. It should contain the full URL, including the HTTP/HTTPS protocol.
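To make the three required tags more concrete, here is a minimal sketch in Python that builds a sitemap from a list of addresses. The function name build_sitemap and the example addresses are my own assumptions for illustration; only the tag names and the namespace come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a UTF-8 encoded sitemap (bytes) with one <url>/<loc> pair per address."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = address  # full address, including the protocol
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical addresses, used only for illustration.
sitemap_bytes = build_sitemap([
    "https://your-domain-name.com/",
    "https://your-domain-name.com/blog/",
])
print(sitemap_bytes.decode("utf-8"))
```

The result is a single <urlset> element wrapping one <url> entry per address, each with exactly one <loc>.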
Optional tags in sitemap.xml
<lastmod> Gives the date on which the content of a given subpage was last modified, so robots know whether the content has changed since the last scan. In <lastmod> we use the W3C Datetime format, which allows you to omit the time and enter only the date in the form YYYY-MM-DD.
<priority> A tag that is supposed to indicate to crawling robots which subpages are the most important for us and should be indexed first. Values range from 0.0 to 1.0, and the default priority for a subpage is 0.5.
<changefreq> A tag specifying how often a given subpage changes. In principle, this element was meant to help crawlers decide how often to scan a given subpage, in line with how often it changes.
Valid values that may be in <changefreq>:
- always - documents that change each time they are opened;
- hourly - changes every hour;
- daily - changes every day;
- weekly - changes every week;
- monthly - changes every month;
- yearly - changes every year;
- never - never changed.
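The optional tags described above can be sketched in Python as follows. This is only an illustration under my own assumptions: the function build_url_entry and the example address are hypothetical, and the validation simply mirrors the value ranges listed in this section.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The seven values listed above for <changefreq>.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def build_url_entry(address, lastmod=None, changefreq=None, priority=None):
    """Build one <url> element; optional tags are added only when given."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = address
    if lastmod is not None:
        # date.isoformat() yields YYYY-MM-DD, a valid W3C Datetime value.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    if changefreq is not None:
        if changefreq not in VALID_CHANGEFREQ:
            raise ValueError(f"invalid changefreq: {changefreq!r}")
        ET.SubElement(url, "changefreq").text = changefreq
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError("priority must be between 0.0 and 1.0")
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return url


# Hypothetical entry, used only for illustration.
entry = build_url_entry(
    "https://your-domain-name.com/blog/",
    lastmod=date(2024, 1, 15),
    changefreq="weekly",
    priority=0.8,
)
print(ET.tostring(entry, encoding="unicode"))
```

Rejecting values outside the allowed set early is a design choice of this sketch; it keeps a typo such as "dayly" from ending up in the published file.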
What URLs should be in the sitemap?
As I mentioned at the beginning of the article, not all URLs should be included in our sitemap. I very often encounter incorrect configurations of this element, which may negatively affect the crawl budget of the website. So, let's make sure that only valuable subpages are included in the sitemap. These include primarily:
- Pages generating response code 200
- Pages not blocked in robots.txt
- Canonical links
- Valuable pages for users
- Pages that are not password protected or that are difficult to access
Mainly, depending on the type of page, these will be:
- Home page
- Categories and product pages
- Blog entries
- Blog categories
- FAQ pages
- Static (information) pages
What URLs should not be included in the sitemap?
Above, I have included information on what URLs should be included in the sitemap. It is equally important to know which addresses we should avoid when creating a sitemap. These are primarily:
- Redirect URLs
- 4xx and 5xx error pages
- Pages blocked in robots.txt
- Pages tagged with noindex
- Pages of low value for users (terms and conditions, privacy policies)
- Pagination pages
- Search results pages
- Pages with filtering/sorting parameters
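The two lists above can be turned into a simple filter. The sketch below is in Python and rests on assumptions of mine: the Page record and its fields are hypothetical (in practice the data would come from a crawl or from the CMS), and treating every URL with a query string as a filtering, sorting or search page is a deliberate simplification.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Page:
    """Hypothetical crawl record; the field names are assumptions for this sketch."""
    url: str
    status_code: int
    canonical_url: str
    noindex: bool = False
    blocked_by_robots: bool = False


def belongs_in_sitemap(page):
    """Apply the inclusion and exclusion rules listed above to one page."""
    if page.status_code != 200:          # drops redirects and 4xx/5xx errors
        return False
    if page.blocked_by_robots or page.noindex:
        return False
    if page.canonical_url != page.url:   # keep only canonical addresses
        return False
    if urlsplit(page.url).query:         # simplification: any parameter is excluded
        return False
    return True


base = "https://your-domain-name.com"
pages = [
    Page(f"{base}/", 200, f"{base}/"),
    Page(f"{base}/old-offer", 301, f"{base}/old-offer"),
    Page(f"{base}/shoes?sort=price", 200, f"{base}/shoes"),
    Page(f"{base}/cart", 200, f"{base}/cart", noindex=True),
    Page(f"{base}/blog/first-post", 200, f"{base}/blog/first-post"),
]
sitemap_urls = [page.url for page in pages if belongs_in_sitemap(page)]
print(sitemap_urls)
```

Whether a page is valuable for users cannot be decided automatically, so that judgement still has to be made by hand.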
How to generate a sitemap? - The most popular methods
Depending on how large a website we have and what content management system (CMS) we use, sitemap generation can be done using free tools (sitemap.xml generators) or built-in/additional tools/plugins.
How to generate sitemap.xml for WordPress?
Let's start with the most popular content management system. The fastest and easiest way to create a sitemap is to use the Yoast SEO plugin. It creates a sitemap for us automatically; we only select the appropriate settings and decide which resources should be included in it. The plugin is very intuitive and easy to use. In addition, its basic version has enough options for most webmasters.
Creating XML sitemaps for CMSs that do not have built-in functionality
If the content management system you use does not have built-in functionality or an additional module that you can use to create a sitemap, it does not mean that you have to create it manually. There are several tools, both free and paid, that you can use.
The free version of an online sitemap generator usually has some limitations. In most cases, there is a cap on the number of pages that can be put in sitemap.xml, typically 500 URLs. The paid versions of these generators do not impose such limits. I recommend sitemap generators especially to owners of small websites, where the aforementioned limit will not be a problem.
Note: Manually created sitemaps are not updated automatically. Remember to update them after adding new products, entries or pages.
Where to put the sitemap.xml file?
The generated sitemap.xml file is usually located in the root directory of the website it concerns and is available at https://your-domain-name.com/sitemap.xml. However, this is not a necessary requirement.
Both the name and the path to the sitemap may differ depending on whether the sitemap was added manually or if we used built-in solutions.
To make it easier for crawlers to get to your sitemap, it's a good idea to include the path to your sitemap in your robots.txt file.
All you have to do is add the following line to robots.txt: Sitemap: https://your-domain-name.com/sitemap.xml
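If you want to check that a sitemap address can actually be read from a robots.txt file, Python's standard library can parse it. The sketch below is an illustration only: the robots.txt content is a made-up example, and the /cart/ rule is an assumption of mine.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for illustration.
robots_txt = """\
User-agent: *
Disallow: /cart/

Sitemap: https://your-domain-name.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() (Python 3.8+) returns the declared sitemap addresses,
# or None when the file declares none.
declared_sitemaps = parser.site_maps()
print(declared_sitemaps)
```

For a live site, the same parser can load the file over the network with set_url() and read() instead of parse().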
The most common sitemap.xml types
Sitemaps are not always the same - depending on the type and size of your site, we may need different types of sitemaps, which will be discussed below. Be sure to find out what a sitemap index is, when it is worth choosing an image map and how our sitemap relates to Google News.
The most common type is a standard XML sitemap that links to web pages within our site. Its most common name is sitemap.xml.
Sitemap-index.xml summary map
A sitemap index is nothing more than a sitemap that lists other sitemaps. It is used for very large sites, where a single sitemap would exceed 50 MB or 50,000 URLs. Such maps should be divided into several smaller ones, which are then linked from the sitemap index.
Also, the previously mentioned Yoast SEO plugin creates a sitemap index for various types of pages. Thanks to it, a separate sitemap is created for pages, blog entries, blog categories or the author.
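Splitting a large list of addresses and linking the parts from a sitemap index can be sketched in Python like this. The helper names, the product addresses and the sitemap-1.xml naming scheme are assumptions of mine. The sketch also checks only the URL limit; the 50 MB limit would need a separate check on the size of each generated file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # URL limit per sitemap file


def split_into_chunks(urls, size=MAX_URLS_PER_FILE):
    """Split the full URL list into lists of at most `size` addresses."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index (bytes) that links to each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for address in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = address
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Hypothetical site with 120,000 product pages.
all_urls = [f"https://your-domain-name.com/product/{n}" for n in range(120_000)]
chunks = split_into_chunks(all_urls)

# Hypothetical naming scheme for the child sitemap files.
child_sitemaps = [
    f"https://your-domain-name.com/sitemap-{i}.xml"
    for i in range(1, len(chunks) + 1)
]
index_bytes = build_sitemap_index(child_sitemaps)
print(len(chunks), "child sitemaps")
```

Each chunk would then be written to its own file as an ordinary sitemap, and only the index would be submitted.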
Site map with image files and videos
If you want your image files to appear in Google's image search, you can increase your chances by creating a dedicated sitemap containing links to image files. Crawlers usually have no problem finding and indexing image files, but settings such as lazy loading may make it difficult for them. Creating a map of graphic files is very simple.
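As an illustration, here is a minimal Python sketch of an image sitemap that uses Google's image namespace with the <image:image> and <image:loc> tags. The function name, the shape of the input and the example addresses are my own assumptions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Serialize sitemap tags without a prefix and image tags as image:...
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_image_sitemap(pages):
    """pages maps a page address to the list of image addresses shown on it."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical page with two images, used only for illustration.
image_bytes = build_image_sitemap({
    "https://your-domain-name.com/product/red-shoes": [
        "https://your-domain-name.com/img/red-shoes-front.jpg",
        "https://your-domain-name.com/img/red-shoes-side.jpg",
    ],
})
print(image_bytes.decode("utf-8"))
```

Each <url> entry still carries the address of the page itself in <loc>; the images are listed inside it.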
Note: Older image sitemaps also contained tags such as <image:caption>, <image:geo_location>, <image:title> and <image:license>. They have been removed from the documentation and there is no need to assign them to each of the graphic files.
Sitemaps for articles in Google News
I don't think I need to explain to anyone what Google News is. It has become a source of information for a huge number of users, which is why every online publisher wants to be there. The sitemap with news articles should contain links only to articles that are no more than 2 days old. Older articles should be removed from the sitemap.
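The two-day rule can be sketched in Python as follows. The sketch uses Google's news namespace with the <news:publication>, <news:publication_date> and <news:title> tags; the function names, the article records and the publication name "Example News" are assumptions of mine.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"
MAX_AGE = timedelta(days=2)  # the age limit mentioned above

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("news", NEWS_NS)


def recent_articles(articles, now):
    """Keep only the articles published within the last two days."""
    return [a for a in articles if now - a["published"] <= MAX_AGE]


def build_news_sitemap(articles, publication_name, language, now):
    """Return a news sitemap (bytes) for the recent articles only."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in recent_articles(articles, now):
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["url"]
        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        publication = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(publication, f"{{{NEWS_NS}}}name").text = publication_name
        ET.SubElement(publication, f"{{{NEWS_NS}}}language").text = language
        published = article["published"].isoformat()
        ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = published
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Fixed "current" time and made-up articles, used only for illustration.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
articles = [
    {"url": "https://your-domain-name.com/news/fresh-story",
     "title": "Fresh story", "published": now - timedelta(hours=5)},
    {"url": "https://your-domain-name.com/news/old-story",
     "title": "Old story", "published": now - timedelta(days=3)},
]
news_bytes = build_news_sitemap(articles, "Example News", "en", now)
print(news_bytes.decode("utf-8"))
```

Regenerating the file on a schedule, with the current time passed as now, removes older articles from the sitemap automatically.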
How to submit sitemap.xml in Google Search Console?
We do not create sitemaps for ourselves or for users, but for crawlers. I recommend publishing your sitemap and submitting it in Google Search Console so that Google's robots can easily reach it. You are probably already using Google Search Console, which is, after all, a basic analytical tool, much like Semalt's Dedicated SEO Dashboard, which is also a very powerful SEO analysis tool.
If you've already set up a Google Search Console for your site, you're ready to submit a sitemap.
- Step 1: Go to the "Sitemaps" tab in the side menu.
- Step 2: Enter the path to your sitemap, usually sitemap.xml or sitemap-index.xml
- Step 3: Verify the status of the sitemap after it has been submitted. Once the sitemap is submitted, you will see the submission date, the date it was last read, its status and the number of URLs detected.
Note: If the status is "Couldn't fetch" instead of "Success", please submit the sitemap again. If the error persists, check if the file is available at the specified address.
Why is sitemap.xml so important from an SEO point of view?
The introduction of sitemaps in 2005 was a major step towards better indexation of website resources by search engines. Webmasters who wanted the best results for their sites in search engines quickly adopted this new solution. Over the years, search engines and their crawlers have evolved and have become better and better at finding resources on our sites.
Checking and optimizing the sitemap has also become a basic element of SEO audits. Personally, I think that with the right page structure and good internal linking, robots will have no problem indexing our subpages, although recently they have been doing it much more slowly.
Creating a sitemap is a relatively quick and simple activity that makes it easier for robots to find pages that are difficult to access (e.g. so-called orphan pages). This will not improve the ranking of your website, but it can help robots discover pages faster. Remember to submit your sitemap in Google Search Console.
The sitemap is one of the basic elements of website optimization. You can create a sitemap using the CMS and its built-in functions, or using publicly available tools that you can find by searching for "sitemap generator". Generating it and adding it in GSC will make it easier for robots to reach all the subpages that we have included in the map and want indexed. Including a sitemap is especially important for large websites. If you do not know whether the CMS you use can generate XML sitemaps, I invite you to leave a comment under the article.