Technical SEO Audits with Screaming Frog and Search Console

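A client once sent me a site that had been "SEO optimised by a professional agency for 18 months." Rankings flat. Traffic down year on year. The agency's report ran to 47 pages and included a section on "brand voice consistency." What it didn't include was the 3,400 pages returning a 200 status code with a noindex directive sitting in the meta tags. Three thousand four hundred pages. Gone. Invisible. The agency had never actually crawled the site.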

Key takeaway: A Screaming Frog crawl cross-referenced with Search Console data still finds most technical SEO problems on any site; the method matters more than exotic tooling.

I fixed it in a week. With Screaming Frog and Google Search Console.

That's the thing about technical SEO: it rewards people who actually look at the data, not people who only talk about it. Honestly, for 90% of the sites I audit through Seahawk, I don't need Ahrefs, Semrush, or any of the big platforms to find the problems that are genuinely hurting performance. Two tools. One process. Here it is.

---

Set up Screaming Frog properly before you crawl anything

Most people open Screaming Frog, paste in a URL, and hit Start. That's fine for a 50-page blog. On anything larger, you'll wait 40 minutes to get the wrong data.

Configuration matters more than crawl speed

First thing I do: go to Configuration > Spider and make sure I'm crawling the correct protocol. If the site is on HTTPS (it should be), I'm starting from the canonical HTTPS homepage. I also turn off crawling of certain file types, PDFs, images, videos, unless I specifically want to audit those. It halves the crawl time.

Then I set Configuration > Respect Canonical Tags to off. Counter-intuitive, I know. But I want to see every canonicalised URL so I can audit whether the canonicalisation is actually correct. If Screaming Frog skips canonicalised pages, you'll never know they exist.

One more: under Configuration > Custom Extraction, I set up an extraction rule to pull the raw <title> and meta description directly from the HTML source. Why? Because some WordPress sites, particularly ones running Yoast alongside a page builder, output two title tags. Screaming Frog's default column only shows you the first one. The extraction rule shows you everything.
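
If you want to spot-check a single page outside Screaming Frog, the same duplicate-title problem can be caught with a few lines of Python over the raw HTML. This is a minimal sketch using only the standard library; `TitleCollector` and `find_titles` are hypothetical names of my own, and it counts every <title> element it sees, including any inside inline SVG.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of every <title> element in the document."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open title.
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def find_titles(html):
    """Return the stripped text of each <title> tag, in source order."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


if __name__ == "__main__":
    sample = "<head><title>Oak Tables</title><title>Oak Tables | Shop</title></head>"
    titles = find_titles(sample)
    if len(titles) > 1:
        print("Multiple title tags:", titles)
```

Any page that returns more than one entry is worth opening by hand to see which plugin or template is writing the extra tag.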

---

First pass: what I look for in the crawl data

Once the crawl finishes, I don't start with broken links. Everyone starts with broken links. I start with the Response Codes tab and filter for 3xx redirects.

Back in 2021, Seahawk took on an e-commerce client, a mid-sized furniture retailer with around 8,000 URLs. Their dev team had been patching redirects ad hoc for two years. We found 19 redirect chains, some of them four hops long. Page A redirected to Page B, which redirected to Page C, which redirected to Page D. Google says it follows up to 10 hops, but in practice, anything beyond two hops wastes crawl budget and dilutes link equity. We collapsed everything to single-hop redirects. That alone, no content changes, no link building, moved three category pages from page 3 to page 1 within six weeks.
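
Collapsing chains is mechanical once you have the redirects as source and target pairs. Here is a minimal sketch of that step, assuming you have already turned your 3xx export into a plain dictionary; the function names and the sample paths are illustrative, not something Screaming Frog produces in this shape.

```python
def trace_redirects(start, redirects, max_hops=10):
    """Follow `start` through the redirect map.

    Returns (path, is_loop), where path lists every URL visited in order.
    Tracing stops after max_hops hops even if the chain continues.
    """
    path = [start]
    seen = {start}
    current = start
    while current in redirects and len(path) - 1 < max_hops:
        current = redirects[current]
        path.append(current)
        if current in seen:
            return path, True  # came back to a URL already visited
        seen.add(current)
    return path, False


def collapse(redirects):
    """Map each redirecting URL straight to its final destination.

    URLs caught in a loop are left out, since they have no final destination.
    """
    flat = {}
    for source in redirects:
        path, is_loop = trace_redirects(source, redirects)
        if not is_loop:
            flat[source] = path[-1]
    return flat


if __name__ == "__main__":
    redirects = {"/a": "/b", "/b": "/c", "/c": "/d", "/x": "/y", "/y": "/x"}
    for source in redirects:
        path, is_loop = trace_redirects(source, redirects)
        hops = len(path) - 1
        if is_loop or hops > 1:
            print("LOOP" if is_loop else f"{hops} hops", " -> ".join(path))
    print(collapse(redirects))
```

The output of `collapse` is the single-hop redirect map to hand to the developers; anything reported as a loop needs a human decision about where it should end up.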

The order I work through the tabs

Response Codes → 3xx: redirect chains and loops
Response Codes → 4xx: broken pages (filter by inlinks to prioritise)
Indexability → Non-Indexable: noindex, canonicals pointing elsewhere, blocked by robots.txt
Page Titles: missing, duplicated, over 60 characters
Meta Description: missing or duplicated (not a ranking factor, but click-through matters)
H1: missing, duplicated, or more than one per page
Images → Missing Alt Text: a quick win, especially for product sites
Directives → Canonical: check these match the actual indexable URL

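The order is deliberate. I start with structural problems (redirects, broken pages) and then move to on-page problems. Fixing one broken redirect chain helps every page in that chain. Fixing one missing meta description helps one page.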

---

Layering in Search Console: where it gets interesting

Screaming Frog tells you what's on the site. Search Console tells you what Google thinks is on the site. The gap between those two datasets is where the real problems live.

Open Coverage (or Indexing → Pages in the newer interface). You're looking at four things:

Error: pages Google tried to index and couldn't
Valid with warnings: often "Submitted URL not selected as canonical," which is a mess you need to untangle
Excluded: pages Google chose not to index (crawled but not indexed, noindexed, etc.)
Valid: pages Google has indexed

The Excluded category is disproportionately underused. Most people ignore it. I go straight there. Filter by "Crawled - currently not indexed." Google is saying: I found this page, I read it, and I decided it wasn't worth indexing. That's almost always a thin content problem. Or it's a page that's genuinely fine but is too similar to another page, a classic issue with faceted navigation or tag archives.

Matching GSC exclusions against your Screaming Frog crawl

Export your Screaming Frog crawl as a CSV. Export the "Excluded" URLs from Search Console. Load both into Google Sheets and run a VLOOKUP. Any URL that appears in your Screaming Frog crawl and in the GSC excluded list is a priority investigation.

I know people write Python scripts for this. You don't need one. A VLOOKUP in Sheets takes four minutes and gives you the same answer.
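
For anyone who would rather script it anyway, or whose exports are too big for Sheets to handle comfortably, the same lookup is a set intersection. A minimal sketch, assuming the crawl export has an `Address` column and the Search Console export has a `URL` column; check the headers in your own files, and treat the file names as placeholders.

```python
import csv


def read_urls(path, column):
    """Read one column of a CSV export into a set of URL strings."""
    # utf-8-sig strips the byte-order mark that some exports start with.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return {
            row[column].strip()
            for row in csv.DictReader(handle)
            if row.get(column)
        }


def crawled_but_excluded(crawl_csv, gsc_csv, crawl_col="Address", gsc_col="URL"):
    """Return the URLs present in both the crawl and the GSC excluded list."""
    crawled = read_urls(crawl_csv, crawl_col)
    excluded = read_urls(gsc_csv, gsc_col)
    return sorted(crawled & excluded)


if __name__ == "__main__":
    # Placeholder file names; point these at your own exports.
    for url in crawled_but_excluded("internal_all.csv", "gsc_excluded.csv"):
        print(url)
```

The match is exact, so a trailing slash or an http/https mismatch between the two files will hide a URL, just as it would in a VLOOKUP.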

---

Crawl budget: it only matters if your site is actually big

Right, let's be honest. If your site has fewer than 1,000 pages, crawl budget is not your problem. You can stop worrying about it.

But once you pass roughly 10,000 URLs, and plenty of WooCommerce or Magento stores get there from product variants and filter URLs alone, crawl budget starts to matter. The Google Search Central documentation on crawl budget is actually one of the clearer things they've written. Worth reading properly.

You have two levers in Search Console: the Crawl Stats report and the URL Inspection tool. Crawl Stats shows you Google's crawl activity over 90 days: pages crawled per day, response times, response codes. If you see a spike in 404s on a specific date, that's a deployment that went wrong. If the average response time is above 2 seconds, your server is the problem, not your SEO.

---

Internal linking: the thing agencies always overlook

I've audited more than a hundred sites at Seahawk where the client had spent real money on link building, guest posts, and digital PR, yet had orphaned pages that no internal link pointed to. Google can't prioritise what it can't find through your site structure.

In Screaming Frog, filter the crawl by Inlinks = 0. Any page with zero internal links is an orphan. Cross-reference it against Search Console's indexed pages. If the page is indexed but has no internal links, it means Google found it through an XML sitemap or an external backlink. That's fragile. Give it an internal link from a relevant page and you're giving Google a structural signal that this page matters.
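
The orphan check can be sketched in a few lines once both exports are loaded. This assumes each crawl row is a dictionary with `Address` and `Inlinks` keys, as you would get from reading a crawl CSV with `csv.DictReader`, and that the indexed URLs from Search Console are already in a set; `find_orphans` and the column names are assumptions to adapt to your own files.

```python
def find_orphans(crawl_rows, indexed_urls):
    """Split pages with zero inlinks by whether Google has indexed them."""
    orphans = [row["Address"] for row in crawl_rows if int(row["Inlinks"]) == 0]
    return {
        # Indexed with no internal links: found via sitemap or backlink only.
        "indexed_orphans": sorted(u for u in orphans if u in indexed_urls),
        # Neither linked nor indexed: decide whether the page should exist.
        "unindexed_orphans": sorted(u for u in orphans if u not in indexed_urls),
    }


if __name__ == "__main__":
    rows = [
        {"Address": "https://example.com/", "Inlinks": "42"},
        {"Address": "https://example.com/old-guide/", "Inlinks": "0"},
        {"Address": "https://example.com/landing-2019/", "Inlinks": "0"},
    ]
    indexed = {"https://example.com/", "https://example.com/old-guide/"}
    print(find_orphans(rows, indexed))
```

The indexed orphans are the fragile ones described above, so they go to the top of the internal linking to-do list.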

A few internal linking problems I keep noticing

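Paginated pages that link to product or article pages, while those pages never link back to the category page
Blog posts published in 2019 that no newer content has ever linked to
Pages with dozens of internal inlinks but very little traffic in GSC, which usually means the problem is the page itself, not the links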

---

Core Web Vitals: read the data, don't panic

Search Console has a Core Web Vitals report. It pulls from real-user Chrome UX Report data, which is field data, actual users on actual devices, not a lab simulation. This is more meaningful than what you'd get from a one-off Lighthouse run.

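The report groups URLs into "Good," "Needs improvement," and "Poor" based on LCP, FID (now replaced by INP), and CLS. Don't try to fix everything at once. Sort by the "Poor" group and look at which URL pattern has the most failing pages. It's usually a single template: every product page failing CLS, or every category page with a slow LCP. Fix the template and you fix hundreds of pages at once.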
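
Finding the worst URL pattern can be done by eye, but if you have a long list of failing URLs, a quick tally helps. The sketch below uses the first path segment as a rough stand-in for the template, which is an assumption that only holds on sites where URLs are organised that way (for example /product/... and /category/...); `template_of` and `worst_templates` are illustrative names.

```python
from collections import Counter
from urllib.parse import urlparse


def template_of(url):
    """Use the first path segment as a rough proxy for the page template."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return "/" + segments[0] + "/" if segments else "/"


def worst_templates(poor_urls):
    """Count failing URLs per template, most affected first."""
    return Counter(template_of(url) for url in poor_urls).most_common()


if __name__ == "__main__":
    poor = [
        "https://example.com/product/oak-table",
        "https://example.com/product/pine-chair",
        "https://example.com/category/tables",
    ]
    for template, count in worst_templates(poor):
        print(f"{count:>5}  {template}")
```

Whichever pattern tops the list is the template to open in DevTools first.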

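One thing I learned the hard way: CLS problems on sites with ads or cookie banners almost always come from elements injected above the fold after the initial paint. Screaming Frog won't catch this. You need to look at the actual page. In Chrome DevTools, open the Rendering panel and enable Layout Shift Regions.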

---

Robots.txt and sitemap checks (takes 10 minutes, saves weeks)

Go to yourdomain.com/robots.txt. Read every line. I have seen, with my own eyes, a live production site with Disallow: / in the robots.txt. Not a staging site. Production. A seven-year-old business. Their developer had copied the staging robots.txt during a migration and never checked it. They had been essentially invisible to Google for four months before they noticed.
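
Reading the file by eye is still the first step, but a small check like the one below can run after every deployment so that a copied staging file gets caught within minutes, not months. It is a minimal sketch built on Python's standard `urllib.robotparser`; the `blocked_urls` helper and the sample URLs are assumptions of mine, and the standard-library parser does not reproduce every detail of how Googlebot matches rules (wildcards in particular), so treat it as a smoke test, not a verdict.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that this robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    staging_copy = "User-agent: *\nDisallow: /\n"
    must_be_crawlable = [
        "https://example.com/",
        "https://example.com/products/",
    ]
    blocked = blocked_urls(staging_copy, must_be_crawlable)
    if blocked:
        print("robots.txt blocks pages that must be crawlable:", blocked)
```

Feed it the live file's text and a short list of pages that must never be blocked, such as the homepage and the main category pages.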

In Search Console, go to Sitemaps. Check what's been submitted. Check the last time Google fetched it. If the sitemap hasn't been fetched in over a week, something is broken. Also check the submitted URL count against the indexed URL count: if you've submitted 4,000 URLs and only 1,200 are indexed, that's a conversation you need to have about content quality, not about technical fixes.

---

Frequently asked questions

Do I need to buy the paid version of Screaming Frog?

The free version is capped at 500 URLs. For sites larger than that (and most sites worth auditing are), you need the paid licence. It's £259 per year as of writing. That's about the price of a single hour of agency time. Buy it.

How often should I run a technical audit?

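For active sites that publish regularly or whose products change often, I recommend quarterly. For smaller sites that are updated less often, twice a year is fine. Running one audit and calling it "done" is like changing the oil in your car once and expecting it to run forever.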

Screaming Frog shows a 200 status but GSC says the page isn't indexed. Why?

It's almost always one of three things: a noindex meta tag, a noindex HTTP header, or a canonical tag pointing somewhere else. Run the URL through Search Console's URL Inspection tool and it will tell you exactly what Google found. This tool is underrated. It shows the last crawled version of the page, including the rendered HTML, which catches JavaScript-injected noindex tags that a basic HTTP request wouldn't see.
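
Those three checks can be sketched against the raw response for a quick first pass before opening URL Inspection. This is a minimal illustration, assuming you already have the page's HTML as a string and its response headers as a dictionary; `IndexSignals` and `why_not_indexed` are hypothetical names, and because it reads only the raw HTML it will miss exactly the JavaScript-injected tags that URL Inspection catches.

```python
from html.parser import HTMLParser


class IndexSignals(HTMLParser):
    """Collects meta robots directives and the canonical link from raw HTML."""

    def __init__(self):
        super().__init__()
        self.meta_robots = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attributes.get("name", "").lower() in ("robots", "googlebot"):
            self.meta_robots.append(attributes.get("content", "").lower())
        elif tag == "link" and "canonical" in attributes.get("rel", "").lower().split():
            self.canonical = attributes.get("href")


def why_not_indexed(url, html, headers):
    """Return the indexing blockers visible in the raw HTML and headers."""
    signals = IndexSignals()
    signals.feed(html)
    signals.close()
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    reasons = []
    if any("noindex" in content for content in signals.meta_robots):
        reasons.append("noindex meta tag")
    if "noindex" in lowered.get("x-robots-tag", "").lower():
        reasons.append("noindex HTTP header")
    # Ignore a trailing-slash difference when comparing the canonical.
    if signals.canonical and signals.canonical.rstrip("/") != url.rstrip("/"):
        reasons.append("canonical points to " + signals.canonical)
    return reasons


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, follow"></head>'
    print(why_not_indexed("https://example.com/page/", page, {}))
```

An empty list means none of the three is visible in the raw response, which is the cue to check the rendered HTML in URL Inspection.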

What about JavaScript-rendered sites?

Screaming Frog has a JavaScript rendering mode under Configuration > Spider > Rendering. Turn it on for JS-heavy sites. It's slower, significantly slower, but it's the only way to catch issues with content or links that are injected by JavaScript after the initial HTML loads. For a React or Next.js site, always crawl in JS rendering mode.

Is Google Search Console enough for keyword research?

For finding the queries your existing pages rank for, yes, it's excellent. For discovering new keyword opportunities, no, you'll need something else. But that's out of scope for a technical audit.

---

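Two tools. One spreadsheet. A few hours. That really is all it takes. The expensive platforms have their place, and I'm not against them, but I've seen too many site owners assume that spending more means finding more. The problems are almost always in the fundamentals. Someone just has to actually look.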