Diff - 71596413d42c3db7b343a8112535eac9e5a9ef2b^! - chromium.googlesource.com/website

commit	71596413d42c3db7b343a8112535eac9e5a9ef2b
author	Dirk Pranke <dpranke@google.com>	Sat Nov 06 19:32:14 2021 -0700
committer	chromium-website-scoped@luci-project-accounts.iam.gserviceaccount.com <chromium-website-scoped@luci-project-accounts.iam.gserviceaccount.com>	Tue Nov 09 17:53:33 2021 +0000
tree	a55e6d3c455149512474de70537750b8d5868cea
parent	7aa01375fea113c6a05f99320d0de97d89e51188

Scripts for crawling the website.

This lands a script that acts as a simple web crawler.

It'll crawl http://www.chromium.org and recursively fetch
all of the content on the site, rewriting any URLs to point
to the local copies of the pages.

The script uses a number of hard-coded values:
  - the list of URLs to use as starting points (in paths_to_crawl.txt)
  - a list of URLs to skip (because they don't exist)
  - alternate ways of writing equivalent URLs (e.g. /foo and
    /a/chromium.org/dev/foo).
and so it is not a general-purpose crawler, but it would not be
too hard to generalize it a bit more to become one.
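
The hard-coded skip list and URL equivalences described above can
be sketched roughly as follows. This is a minimal illustration, not
the code in this change: `ALTERNATE_PREFIXES`, `URLS_TO_SKIP`,
`canonical_path` and `should_crawl` are hypothetical names, and the
skip entry is made up; only the /foo vs. /a/chromium.org/dev/foo
equivalence comes from the description.

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration only.
ALTERNATE_PREFIXES = ['/a/chromium.org/dev']  # /a/chromium.org/dev/foo == /foo
URLS_TO_SKIP = {'/does-not-exist'}            # made-up entry


def canonical_path(url):
    """Return the path of `url` with any known alternate prefix removed."""
    path = urlsplit(url).path or '/'
    for prefix in ALTERNATE_PREFIXES:
        if path == prefix or path.startswith(prefix + '/'):
            path = path[len(prefix):] or '/'
            break
    return path


def should_crawl(url, seen):
    """Return True the first time a crawlable canonical path is seen."""
    path = canonical_path(url)
    if path in URLS_TO_SKIP or path in seen:
        return False
    seen.add(path)
    return True
```

With this sketch, /foo and /a/chromium.org/dev/foo collapse to the
same canonical path, so only the first one encountered is fetched.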

It also supports a `--prefix` option to rewrite all of the URLs
so that they start at an offset from the root directory of the
web server, e.g., instead of fetching /Home from ./home, it'll fetch
it from $prefix/home. This is useful if you want to embed the
crawled content as part of a larger site.
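
A rough sketch of that rewrite, under stated assumptions:
`rewrite_url` and `SITE_HOST` are hypothetical names, and the
lower-casing of the path is inferred only from the /Home -> home
example above. The actual script may differ.

```python
from urllib.parse import urlsplit

SITE_HOST = 'www.chromium.org'  # host named in the description


def rewrite_url(url, prefix=''):
    """Map a site URL to a local path, optionally nested under `prefix`.

    Links to other hosts and non-absolute paths are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc not in ('', SITE_HOST) or not parts.path.startswith('/'):
        return url
    # Assumption: local copies use lower-cased paths (/Home -> /home).
    local = prefix.rstrip('/') + parts.path.lower()
    if parts.query:
        local += '?' + parts.query
    if parts.fragment:
        local += '#' + parts.fragment
    return local
```

For example, `rewrite_url('/Home', '/old')` yields `/old/home`, while
a link to another host is left as-is.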

In the context of the chromium.org migration, this crawler and the
prefix option will allow us to nest the original content inside
the new content, in order to be able to easily compare the two.

Bug: 1267643
Change-Id: I9b917f4a1dba795bcda8286611e1d631d7d27518
Reviewed-on: https://chromium-review.googlesource.com/c/website/+/3266881
Commit-Queue: Dirk Pranke <dpranke@google.com>
Auto-Submit: Dirk Pranke <dpranke@google.com>
Reviewed-by: Struan Shrimpton <sshrimp@google.com>

diff --git a/scripts/common.py b/scripts/common.py
index f5c8e3d..a911f15 100644
--- a/scripts/common.py
+++ b/scripts/common.py

@@ -83,7 +83,7 @@
         return page + '/index' + ext
     if os.path.exists(top + page):
         return page
-    if os.path.exists(top + page + ext):
+    if ext and os.path.exists(top + page + ext):
         return page + ext
     return page
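
The effect of the added `ext and` guard can be illustrated with a
simplified stand-in for the surrounding function. `resolve_page` and
`demo` are hypothetical names, and the idea that `ext` may be None or
empty is an assumption; the hunk itself does not say which case
motivated the change.

```python
import os
import tempfile


def resolve_page(top, page, ext):
    """Simplified stand-in for the lookup shown in the hunk above."""
    if os.path.exists(top + page):
        return page
    # Without the `ext and` guard, ext=None would raise a TypeError in
    # the string concatenation, and ext='' would repeat the check above.
    if ext and os.path.exists(top + page + ext):
        return page + ext
    return page


def demo():
    """Resolve '/foo' against a temp dir that contains only foo.md."""
    with tempfile.TemporaryDirectory() as top:
        open(os.path.join(top, 'foo.md'), 'w').close()
        return [resolve_page(top, '/foo', ext) for ext in ('.md', '', None)]
```

In this sketch the extension is appended only when one is supplied
and the corresponding file exists; otherwise the page is returned
unchanged.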