A Breakdown of HTML Usage Across ~8 Million Pages (& What It Means for Modern SEO)

Not long ago, my colleagues and I at Advanced Web Ranking came up with an HTML study based on about 8 million index pages gathered from the top twenty Google results for more than 30 million keywords.

We wrote about the markup results and how the top twenty Google results pages implement them, then went even further and obtained HTML usage insights on them.

What does this have to do with SEO?

The way HTML is written dictates what users see and how search engines interpret web pages. A valid, well-formatted HTML page also reduces possible misinterpretation (of structured data, metadata, language, or encoding) by search engines.

This is intended to be a technical SEO audit, something we wanted to do from the beginning: a breakdown of HTML usage and how the results relate to modern SEO techniques and best practices.

In this article, we’re going to address things like meta tags that Google understands, JSON-LD structured data, language detection, headings usage, social links & meta distribution, AMP, and more.

Meta tags that Google understands

When talking about the main search engines as traffic sources, sadly it’s just Google and the rest, with DuckDuckGo gaining traction lately and Bing almost nonexistent.

Thus, in this section we’ll be focusing solely on the meta tags that Google lists in the Search Console Help Center.

Pie chart showing the total numbers for the meta tags that Google understands, described in detail in the sections below.

<meta name="description" content="…">

The meta description is a ~150 character snippet that summarizes a page’s content. Search engines show the meta description in the search results when the searched phrase is contained in the description.



<meta name="description" content="*">
<meta name="description" content="">
<meta name="description">


At the extremes, we found 685,341 meta elements with content shorter than 30 characters and 1,293,842 elements with content longer than 160 characters.
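To make the three tag patterns and the two length extremes concrete, here is a minimal Python sketch of how a single page could be bucketed. It is an illustration only, not the tooling used in the study: the class and function names are hypothetical, and the 30 and 160 character thresholds are simply the ones quoted above.

```python
from html.parser import HTMLParser


class DescriptionFinder(HTMLParser):
    """Records the first <meta name="description"> tag found on a page."""

    def __init__(self):
        super().__init__()
        self.found = False
        self.content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "description" and not self.found:
            self.found = True
            # Stays None when the content attribute is absent or has no value.
            self.content = attributes.get("content")


def classify_description(html, short=30, long=160):
    """Return a label describing the page's meta description (hypothetical helper)."""
    finder = DescriptionFinder()
    finder.feed(html)
    finder.close()
    if not finder.found:
        return "no description tag"
    if finder.content is None:
        return "no content attribute"
    text = finder.content.strip()
    if not text:
        return "empty content"
    if len(text) < short:
        return "shorter than %d characters" % short
    if len(text) > long:
        return "longer than %d characters" % long
    return "within range"
```

Calling classify_description() on each index page and tallying the labels would give a breakdown of the same shape as the one above.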


The title is technically not a meta tag, but it’s used together with meta name="description".

This is one of the two most important HTML tags when it comes to SEO. It’s also a must according to W3C, meaning no page is valid with a missing title tag.

Research suggests that if you keep your titles under a reasonable 60 characters, you can expect them to be rendered properly in the SERPs. In the past, there were signs that Google’s search results title length was extended, but it wasn’t a permanent change.

Considering all the above, from the full 6,263,396 titles we found, 1,846,642 title tags appear to be too long (more than 60 characters) and 1,985,020 titles had lengths considered too short (under 30 characters).

Pie chart showing the title tag length distribution, with a length less than 30 chars being 31.7% and a length greater than 60 chars being about 29.5%.

A title being too short shouldn’t be a problem; after all, it’s a subjective thing depending on the website’s business. Meaning can be expressed with fewer words, but it’s definitely a sign of wasted optimization opportunity.





missing <title> tag

Another interesting thing is that, among the sites ranking on page 1–2 of Google, 351,516 (~5% of the total 7.5M) are using the same text for the title and h1 on their index pages.

Also, did you know that with HTML5 you only need to specify the HTML5 doctype and a title in order to have a perfectly valid page?

<!DOCTYPE html>
<title>…</title>

<meta name="robots|googlebot">

“These meta tags can control the behavior of search engine crawling and indexing. The robots meta tag applies to all search engines, while the “googlebot” meta tag is specific to Google.”
– Meta tags that Google understands



<meta name="robots" content="..., ...">
<meta name="googlebot" content="..., ...">

HTML snippet with a meta robots and its content parameters.

So the robots meta directives provide instructions to search engines on how to crawl and index a page’s content. Leaving aside the googlebot meta count, which is kind of low, we were curious to see the most frequent robots parameters, considering that a huge misconception is that you have to add a robots meta tag in your HTML’s head. Here’s the top 5:



<meta name="robots" content="index,follow">
<meta name="robots" content="index">
<meta name="robots" content="noodp">
<meta name="robots" content="all">
<meta name="robots" content="nofollow">
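A page with no robots meta tag is treated as index, follow by default, which is why values such as "index,follow" or "all" change nothing. The short Python sketch below shows one way to split a robots content value into directives and check whether it actually restricts indexing. The function names are hypothetical and only the noindex and none directives are considered, so treat it as an illustration rather than a full parser.

```python
def parse_robots(content):
    """Split a robots meta content value into a set of lowercase directives."""
    return {part.strip().lower() for part in content.split(",") if part.strip()}


def is_indexable(directives):
    """True unless the directives ask engines not to index the page.

    "none" is shorthand for "noindex, nofollow"; an empty set means no
    restriction was given, which is the same as the default behavior.
    """
    return "noindex" not in directives and "none" not in directives


def is_redundant(directives):
    """True when every directive only restates the default behavior."""
    return bool(directives) and directives <= {"index", "follow", "all"}
```

For example, is_redundant(parse_robots("index,follow")) is true, while is_indexable(parse_robots("noindex,follow")) is false.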


<meta name="google" content="nositelinkssearchbox">

“When users search for your site, Google Search results sometimes display a search box specific to your site, along with other direct links to your site. This meta tag tells Google not to show the sitelinks search box.”
– Meta tags that Google understands



<meta name="google" content="nositelinkssearchbox">

Unsurprisingly, not many websites choose to explicitly tell Google not to show a sitelinks search box when their site appears in the search results.

<meta name="google" content="notranslate">

“This meta tag tells Google that you don’t want us to provide a translation for this page.” – Meta tags that Google understands

There may be situations where providing your content to a much larger group of users is not desired. Just as it says in the Google support answer above, this meta tag tells Google that you don’t want them to provide a translation for this page.

<meta name="google" content="notranslate">

<meta name="google-site-verification" content="…">

“You can use this tag on the top-level page of your site to verify ownership for Search Console.”
– Meta tags that Google understands



<meta name="google-site-verification" content="...">

While we’re on the subject, did you know that if you’re a verified owner of a Google Analytics property, Google will now automatically verify that same website in Search Console?

<meta charset="…">

“This defines the page’s content type and character set.”
– Meta tags that Google understands

This is basically one of the good meta tags. It defines the page’s content type and character set. Considering the table below, we noticed that just about half of the index pages we analyzed define a meta charset.



<meta charset="..." >


<meta http-equiv="refresh" content="…;url=…">

“This meta tag sends the user to a new URL after a certain amount of time and is sometimes used as a simple form of redirection.”
– Meta tags that Google understands

It’s preferable to redirect your site using a 301 redirect rather than a meta refresh, especially when we assume that 30x redirects don’t lose PageRank and the W3C recommends that this tag not be used. Google is not a fan either, recommending you use a server-side 301 redirect instead.

<meta http-equiv="refresh" content="...;url=...">

From the total 7.5M index pages we parsed, we found 7,167 pages that are using the above redirect method. Authors do not always have control over server-side technologies and apparently they use this technique in order to enable redirects on the client side.

Also, using Workers is a cutting-edge alternative in order to overcome issues when working with legacy tech stacks and platform limitations.
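The content value of a meta refresh packs a delay in seconds and an optional target URL into one string, so spotting client-side redirects means pulling the two apart. Here is a rough Python sketch of that step. The regular expression is a simplification of what browsers accept and the function name is hypothetical, so it should be read as an approximation rather than a spec-complete parser.

```python
import re

# delay, then an optional ";" or "," separator, an optional "url=" prefix, and the target
REFRESH_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$",
    re.IGNORECASE | re.DOTALL,
)


def parse_meta_refresh(content):
    """Return (delay_seconds, url_or_None), or None if the value is not understood."""
    match = REFRESH_RE.match(content)
    if not match:
        return None
    delay = float(match.group(1))
    # Drop surrounding whitespace and optional quotes around the URL.
    url = (match.group(2) or "").strip().strip("'\"") or None
    return delay, url
```

A value with a URL and a delay of zero behaves like a redirect, whereas a value without a URL only reloads the current page.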

<meta name="viewport" content="…">

“This tag tells the browser how to render a page on a mobile device. Presence of this tag indicates to Google that the page is mobile-friendly.”
– Meta tags that Google understands



<meta name="viewport" content="...">

Starting July 1, 2019, Google enabled mobile-first indexing by default for all new sites. Lighthouse checks whether there’s a meta name="viewport" tag in the head of the document, so this meta should be on every webpage, no matter what framework or CMS you’re using.

Considering the above, we would have expected more websites than the 4,992,791 out of 7.5 million index pages analyzed to use a valid meta name="viewport" in their head sections.

Designing mobile-friendly sites ensures that your pages perform well on all devices, so make sure your web page is mobile-friendly.

<meta name="rating" content="…" />

“Labels a page as containing adult content, to signal that it be filtered by SafeSearch results.”
– Meta tags that Google understands



<meta name="rating" content="..." />

This tag is used to denote the maturity rating of content. It was not added to the list of meta tags that Google understands until recently. Check out this article by Kate Morris on how to tag adult content.

JSON-LD structured data

Structured data is a standardized format for providing information about a page and classifying the page content. The format of structured data can be Microdata, RDFa, or JSON-LD, all of which help Google understand the content of your site and trigger special search result features for your pages.

While having a conversation with the awesome Dan Shure, he came up with a good idea to look for structured data, such as the organization’s logo, in search results and in the Knowledge Graph.

In this section, we’ll be using JSON-LD (JavaScript Object Notation for Linked Data) only in order to gather structured data info. This is what Google recommends anyway for providing clues about the meaning of a web page.

Some useful bits on this:

  • At Google I/O 2019, it was announced that the structured data testing tool will be superseded by the rich results testing tool.
  • Now Googlebot indexes web pages using the latest Chromium rather than the old Chrome 41, which means you can mitigate the SEO issues you may have had in the past, with structured data support as well.
  • Jason Barnard had an interesting talk at SMX London 2019 on how Google Search ranking works and, according to his theory, there are seven ranking factors we can count on; structured data is definitely one of them.
  • Builtvisible’s guide on Microdata, JSON-LD, & Schema.org contains everything you need to know about using structured data on your website.
  • Here’s an awesome guide to JSON-LD for beginners by Alexis Sanders.
  • Last but not least, there are lots of articles, presentations, and posts to dive into on the official JSON for Linking Data website.
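For readers who want to see what gathering JSON-LD from a page can look like, here is a minimal Python sketch that collects every script block of type application/ld+json and lists the @type values it declares. It is not the study’s actual pipeline: the class and function names are hypothetical, invalid JSON is silently skipped, and nested structures such as @graph are deliberately not followed.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def jsonld_types(html):
    """Return the top-level @type values declared in a page's JSON-LD blocks."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    types = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # skip blocks that are not valid JSON
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "@type" in item:
                value = item["@type"]
                types.extend(value if isinstance(value, list) else [value])
    return types
```

Mapping the returned types (WebSite, Organization, and so on) to the search features Google documents would be a separate step that this sketch does not attempt.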

Advanced Web Ranking’s HTML study relies on analyzing index pages only. What’s interesting is that even though it’s not stated in the guidelines, Google doesn’t seem to care about structured data on index pages, as stated in a Stack Overflow answer by Gary Illyes several years ago. Yet, on JSON-LD structured data types that Google understands, we found a total of 2,727,045 features:

Pie chart showing the structured data types that Google understands, with Sitelinks searchbox being 49.7%, the highest value.











  • Corporate contact
  • Critic review
  • Employer aggregate rating
  • Fact check
  • FAQ page
  • Job posting
  • Local business
  • Q&A page
  • Review snippet
  • Sitelinks searchbox
  • Social profile
  • Software app
  • Subscription and paywalled content





The rel=canonical element, often called the “canonical link,” is an HTML element that helps webmasters prevent duplicate content issues. It does this by specifying the “canonical URL,” the “preferred” version of a web page.

<link rel=canonical href="*">

meta name="keywords"

It’s not new that <meta name="keywords"> is obsolete and Google doesn’t use it anymore. It also appears as if <meta name="keywords"> is a spam signal for most of the search engines.

“While the main search engines don’t use meta keywords for ranking, they’re very useful for onsite search engines like Solr.”
– JP Sherman on why this obsolete meta might still be useful nowadays.

<meta name="keywords" content="*">
<meta name="keywords" content="">
<meta name="keywords">



Within 7.5 million pages, h1 (59.6%) and h2 (58.9%) are among the twenty-eight elements used on the most pages. Still, after gathering all the headings, we found that h3 is the heading with the largest number of appearances: 29,565,562 h3s out of 70,428,376 total headings found.

Random facts:

  • The h1–h6 elements represent the six levels of section headings. Here are the full stats on headings usage, but we found 23,116 h7s and 7,276 h8s too. That’s a funny thing because plenty of people don’t even use h6s very often.
  • There are 3,046,879 pages with missing h1 tags, and within the remaining 4,502,255 pages the h1 usage frequency is 2.6, with a total of 11,675,565 h1 elements.
  • While there are 6,263,396 pages with a valid title, as seen above, only 4,502,255 of them are using an h1 within the body of their content.
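The heading figures above come down to counting start tags per level. As a rough illustration (hypothetical names, not the study’s code), the Python sketch below tallies every h-plus-number tag on a page and reports the ones outside the h1–h6 range, such as the h7s and h8s mentioned above.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Matches h1, h2, ... and also nonstandard levels such as h7 or h8, but not hr or header.
HEADING_RE = re.compile(r"^h[1-9]\d*$")


class HeadingCounter(HTMLParser):
    """Counts heading start tags by tag name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if HEADING_RE.match(tag):
            self.counts[tag] += 1


def heading_counts(html):
    """Return (counts per heading tag, sorted list of nonstandard heading tags)."""
    counter = HeadingCounter()
    counter.feed(html)
    counter.close()
    counts = dict(counter.counts)
    nonstandard = sorted(tag for tag in counts if int(tag[1:]) > 6)
    return counts, nonstandard
```

A page is missing its h1 when "h1" is absent from the returned counts. The 2.6 usage frequency quoted above is simply 11,675,565 h1 elements divided by the 4,502,255 pages that have at least one.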

Missing alt tags

This eternal SEO and accessibility issue still seems to be common after analyzing this set of data. From the total of 669,591,743 images, almost 90% are missing the alt attribute or use it with a blank value.

Pie chart showing the img tag alt attribute distribution, with missing alt being predominant: 81.7% of a total of about 670 million images we found.

img alt="*"
img alt=""
img w/ missing alt
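The three buckets above can be reproduced on any page with a few lines of Python. This is a minimal sketch with hypothetical names. Note that it treats a whitespace-only alt value as blank, which is an assumption of the sketch, not something the study states.

```python
from collections import Counter
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Sorts every <img> into one of three buckets based on its alt attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if "alt" not in attributes:
            self.counts["missing"] += 1
        elif not (attributes["alt"] or "").strip():
            # alt="" or a bare alt attribute with no value
            self.counts["blank"] += 1
        else:
            self.counts["present"] += 1


def alt_audit(html):
    """Return how many images have a filled, blank, or missing alt attribute."""
    audit = AltAudit()
    audit.feed(html)
    audit.close()
    return {key: audit.counts[key] for key in ("present", "blank", "missing")}
```

Keep in mind that a blank alt is legitimate for purely decorative images, so only the missing bucket is unambiguously a problem.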


Language detection

According to the specs, the language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways.

The part we’re interested in here is about “assisting search engines.”

“The HTML lang attribute is used to identify the language of text content on the web. This information helps search engines return language specific results, and it is also used by screen readers that switch language profiles to provide the correct accent and pronunciation.”
– Léonie Watson

A while ago, John Mueller said Google ignores the HTML lang attribute and recommended the use of link hreflang instead. The Google Search Console documentation states that Google uses hreflang tags to match the user’s language preference to the right variation of your pages.

Bar chart showing that 65% of the 7.5 million index pages use the lang attribute on the html element, while 21.6% use at least one link hreflang.

Of the 7.5 million index pages that we were able to look into, 4,903,665 use the lang attribute on the html element. That’s about 65%!

When it comes to the hreflang attribute, which suggests the existence of a multilingual website, we found about 1,631,602 pages, meaning around 21.6% of index pages use at least one link rel="alternate" href="*" hreflang="*" element.
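Both signals discussed here live in the markup, so checking a page for them is straightforward. The Python sketch below (hypothetical names, illustration only) reads the lang attribute of the html element and collects every link rel="alternate" that carries both an hreflang and an href.

```python
from html.parser import HTMLParser


class LanguageSignals(HTMLParser):
    """Captures the html lang attribute and any hreflang alternates."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.hreflangs = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attributes.get("lang")
        elif tag == "link":
            # rel holds a space-separated list of tokens
            rels = (attributes.get("rel") or "").lower().split()
            hreflang = attributes.get("hreflang")
            href = attributes.get("href")
            if "alternate" in rels and hreflang and href:
                self.hreflangs[hreflang] = href


def language_signals(html):
    """Return (value of <html lang>, mapping of hreflang code to URL)."""
    signals = LanguageSignals()
    signals.feed(html)
    signals.close()
    return signals.html_lang, signals.hreflangs
```

A page counts toward the lang figure when the first value is not None, and toward the hreflang figure when the mapping is not empty.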

Google Tag Manager

From the beginning, Google Analytics’ main task was to generate reports and statistics about your website. But if you want to group certain pages together to see how people are navigating through that funnel, you need a unique Google Analytics tag. This is where things get complicated.

Google Tag Manager makes it easier to:

  • Manage this mess of tags by letting you define custom rules for when and on what user actions your tags should fire
  • Change your tags whenever you want without actually changing the source code of your website, which sometimes can be a headache due to slow release cycles
  • Use other analytics/marketing tools with GTM, again without touching the website’s source code

We searched for *googletagmanager.com/gtm.js references and saw that about 345,979 pages are using Google Tag Manager.


“Nofollow” provides a way for webmasters to tell search engines “don’t follow links on this page” or “don’t follow this specific link.”

Google does not follow these links and likewise does not transfer equity. Considering this, we were curious about rel="nofollow" numbers. We found a total of 12,828,286 rel="nofollow" links within 7.5 million index pages, with a computed average of 1.69 rel="nofollow" per page.

Last month, Google announced two new link attribute values that should be used in order to mark the nofollow property of a link: rel="sponsored" and rel="ugc". I’d recommend you go read Cyrus Shepard’s article on how Google’s nofollow, sponsored, & ugc links impact SEO to learn why Google changed nofollow, the ranking impact of nofollow links, and more.

A table showing how Google’s nofollow, sponsored, and UGC link attributes impact SEO, from Cyrus Shepard’s article.

We went a bit further and looked up these new link attribute values, finding 278 rel="sponsored" and 123 rel="ugc". To make sure we had the relevant data for these queries, we updated the index pages data set specifically two weeks after the Google announcement on this matter. Then, using Moz authority metrics, we sorted out the top URLs we found that use at least one of the rel="sponsored" or rel="ugc" pair:

  • https://www.seroundtable.com/
  • https://letsencrypt.org/
  • https://www.newsbomb.gr/
  • https://thehackernews.com/
  • https://www.ccn.com/
  • https://www.chip.pl/
  • https://www.gamereactor.se/
  • https://www.tribes.co.uk/
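Counting these link attribute values on a page can be sketched in Python as follows. The names are hypothetical and this is not the study’s tooling. The rel attribute is a list of tokens, so the sketch splits it before matching, and it also accepts commas as separators, which is an assumption made here for leniency.

```python
from collections import Counter
from html.parser import HTMLParser

TRACKED = ("nofollow", "sponsored", "ugc")


class RelCounter(HTMLParser):
    """Counts <a> elements carrying each tracked rel value."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        rel = dict(attrs).get("rel") or ""
        # Tokens are compared case-insensitively; commas are treated like spaces.
        tokens = set(rel.lower().replace(",", " ").split())
        for token in TRACKED:
            if token in tokens:
                self.counts[token] += 1


def count_rel_values(html):
    """Return the number of links using nofollow, sponsored, and ugc."""
    counter = RelCounter()
    counter.feed(html)
    counter.close()
    return {token: counter.counts[token] for token in TRACKED}
```

Because the values can be combined, a single link such as rel="ugc nofollow" is counted once under each value it carries.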


Accelerated Mobile Pages (AMP) are a Google initiative that aims to speed up the mobile web. Many publishers are making their content available in the AMP format in parallel.

To let Google and other platforms know about it, you need to link AMP and non-AMP pages together.

Within the millions of pages we looked at, we found only 24,807 non-AMP pages referencing their AMP version using rel=amphtml.

We wanted to know how shareable or social a website is nowadays, so knowing that Josh Buchea made an awesome list with everything that could go in the head of your webpage, we extracted the social sections from there and got the following numbers:

Facebook Open Graph

Bar chart showing the Facebook Open Graph meta tags distribution, described in detail in the table below.

meta property="fb:app_id" content="*"
meta property="og:url" content="*"
meta property="og:type" content="*"
meta property="og:title" content="*"
meta property="og:image" content="*"
meta property="og:image:alt" content="*"
meta property="og:description" content="*"
meta property="og:site_name" content="*"
meta property="og:locale" content="*"
meta property="article:author" content="*"


Twitter card

Bar chart showing the Twitter Card meta tags distribution, described in detail in the table below.

meta name="twitter:card" content="*"
meta name="twitter:site" content="*"
meta name="twitter:creator" content="*"
meta name="twitter:url" content="*"
meta name="twitter:title" content="*"
meta name="twitter:description" content="*"
meta name="twitter:image" content="*"
meta name="twitter:image:alt" content="*"


And speaking of links, we grabbed all of them that were pointing to the most popular social networks.

Pie chart showing the external social links distribution, described in detail in the table below.



<a href*="facebook.com">


<a href*="twitter.com">


<a href*="linkedin.com">


<a href*="plus.google.com">


Apparently there are lots of websites that still link to their Google+ profiles, which is probably an oversight considering the not-so-recent Google+ shutdown.

According to Google, using rel=prev/next is not an indexing signal anymore, as announced earlier this year:

“As we evaluated our indexing signals, we decided to retire rel=prev/next. Studies show that users love single-page content, aim for that when possible, but multi-part is also fine for Google Search.”
– Tweeted by Google Webmasters

However, in case it matters to you, Bing says it uses them as hints for page discovery and site structure understanding.

“We’re using these (like most markup) as hints for page discovery and site structure understanding. At this point, we’re not merging pages together in the index based on these and we’re not using prev/next in the ranking model.”
– Frédéric Dubut from Bing

Nevertheless, here are the usage stats we found while looking at millions of index pages:

<link rel="prev" href="*">
<link rel="next" href="*">


That’s just about it!

Knowing how the average web page looks using data from about 8 million index pages can give us a clearer idea of trends and help us visualize common usage of HTML when it comes to modern and emerging SEO techniques. But this is a never-ending saga: while there are lots of numbers and stats to explore, there are still lots of questions that need answering:

  • We know how structured data is used in the wild now. How will it evolve and how much structured data will be considered enough?
  • Should we expect AMP usage to increase somewhere in the future?
  • How will rel="sponsored" and rel="ugc" change the way we write HTML on a daily basis? When coding external links, besides the target="_blank" and rel="noopener" combo, we now have to consider the rel="sponsored" and rel="ugc" combinations as well.
  • Will we ever learn to always add alt attribute values for images that have a purpose beyond decoration?
  • How many more meta tags or attributes will we have to add to a web page to please the search engines? Did we really need the newly announced data-nosnippet HTML attribute? What’s next, data-allowsnippet?

There are other things we would have liked to address as well, like “time-to-first-byte” (TTFB) values, which correlate highly with ranking; I’d highly recommend HTTP Archive for that. They periodically crawl the top sites on the web and record detailed information about almost everything. According to the latest info, they’ve analyzed 4,565,694 unique websites, with full Lighthouse scores and with particular technologies like jQuery or WordPress recorded for the whole data set. Huge props to Rick Viscomi, who does an amazing job as its “steward,” as he likes to call himself.

Performing this large-scale study was a fun ride. We learned a lot and we hope you found the above numbers as interesting as we did. If there’s a tag or attribute in particular you want to see the numbers for, please let me know in the comments below.

Once again, check out the full HTML study results and let me know what you think!
