SEO and sitemap

Two concerns share one chapter because they draw on the same sources of truth: the content tree, public_origin, and the supported-language list. Every page emits per-language metadata in <head> for crawlers fetching that URL, and a background writer drops a sitemap.xml that agrees with what those crawlers see.

What gets emitted

The SEO block lives in templates/base/base.html. Items 1 through 6 sit inside a single -[ if canonical ]- guard, so that group either renders or vanishes as a whole; the JSON-LD script (item 7) has its own guard. In order of emission:

  1. <link rel="canonical"> — the bare URL for this resource.
  2. <link rel="alternate" hreflang="…"> — one per supported language, plus x-default. Driven by the alternates template var.
  3. <meta property="og:type"> (website), og:title, og:description, og:url (same value as canonical), og:site_name, og:locale.
  4. <meta property="og:locale:alternate"> — one per alternate, filtered by alt["is_alt"] so the current language isn’t duplicated.
  5. <meta name="twitter:card"> (summary_large_image), twitter:title, twitter:description.
  6. <meta property="og:image"> and <meta name="twitter:image">, both gated on -[ if og_image ]- so a resource with no usable image just omits the pair instead of emitting an empty attribute.
  7. <script type="application/ld+json"> carrying a BreadcrumbList, gated on -[ if breadcrumb_jsonld ]-.

Conditional emission

The single -[ if canonical ]- block wraps everything from canonical through twitter:image. When public_origin is unset, canonical_url in src/content.rs returns an empty string, so the whole block is suppressed rather than emitting tags with garbage URLs. og:image / twitter:image have their own inner -[ if og_image ]- because a page may have a valid canonical without a usable image. The JSON-LD <script> is a separate outer guard — breadcrumbs work fine without a public origin, just with relative item URLs.

Per-page values

All of these live in src/content.rs:

  • canonical_url(path) returns format!("{}{}", origin, path), or "" when origin is empty. The empty return is the off-switch for the entire SEO block.
  • og_image_for(header, resource_path, lang) — precedence: Header.banner for the current language, then Header.icon if absolute (/ or http-prefixed), then default_og_image. Relative values are resolved against public_origin; if no absolute URL can be formed, returns None and the template suppresses the tag.
  • hreflang_alternates(path, current_lang) — returns an empty list when public_origin is empty. Otherwise emits one entry per supported language. The first entry in SUPPORT_LANGS is treated as the default language and gets the bare URL <origin><path>; every other language gets <origin>/<code><path>. Each entry carries is_alt: code != current_lang so the og:locale:alternate loop can skip the active language. Finally, an x-default entry pointing at the bare URL is appended.
  • breadcrumb_jsonld(breadcrumb) — flattens a path_value()-shaped array into {"@context": "https://schema.org", "@type": "BreadcrumbList", "itemListElement": [...]} with position, name, and item per step.
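The URL rules above can be sketched roughly as follows. This is a minimal illustration, not the code in src/content.rs: the origin is passed as an explicit parameter instead of being read from config, the Alternate struct and its field names are hypothetical, the SUPPORT_LANGS contents are placeholder values, and is_alt = false on the x-default entry is an assumption the text does not confirm.

```rust
/// Placeholder language list (assumed values); the first entry is the default language.
const SUPPORT_LANGS: &[&str] = &["en", "ko", "ja"];

/// Hypothetical shape of one hreflang entry handed to the template.
#[derive(Debug, Clone, PartialEq)]
pub struct Alternate {
    pub hreflang: String,
    pub href: String,
    pub is_alt: bool,
}

/// Joins origin and path. An empty origin yields "", which switches the SEO block off.
pub fn canonical_url(origin: &str, path: &str) -> String {
    if origin.is_empty() {
        String::new()
    } else {
        format!("{}{}", origin, path)
    }
}

/// One entry per supported language, then an x-default entry on the bare URL.
pub fn hreflang_alternates(origin: &str, path: &str, current_lang: &str) -> Vec<Alternate> {
    if origin.is_empty() {
        return Vec::new();
    }
    let bare = format!("{}{}", origin, path);
    let mut out = Vec::with_capacity(SUPPORT_LANGS.len() + 1);
    for (i, code) in SUPPORT_LANGS.iter().enumerate() {
        // The default language gets the bare URL; all others get a /<code> prefix.
        let href = if i == 0 {
            bare.clone()
        } else {
            format!("{}/{}{}", origin, code, path)
        };
        out.push(Alternate {
            hreflang: (*code).to_string(),
            href,
            is_alt: *code != current_lang,
        });
    }
    out.push(Alternate {
        hreflang: "x-default".to_string(),
        href: bare,
        // Assumption: x-default is not a locale, so it is kept out of og:locale:alternate.
        is_alt: false,
    });
    out
}
```

With these placeholder languages, hreflang_alternates("https://example.com", "/docs", "ko") yields four entries: en on the bare URL, ko and ja under their prefixes, and x-default on the bare URL again, with only the ko entry carrying is_alt = false among the languages.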

Escape pipeline

html_escape covers &, <, >, ", '. It runs over every template-bound string (canonical, alternates’ href, og:image, breadcrumb names) before akari sees it — akari does not auto-escape.
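A five-character escaper of this kind can be as small as the sketch below. The character set comes from the paragraph above; the specific entities chosen (for example &#39; for the apostrophe) are an assumption, and the real html_escape may differ.

```rust
/// Escapes the five HTML-significant characters named above.
/// Entity spellings are an assumption, not taken from the project source.
pub fn html_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            other => out.push(other),
        }
    }
    out
}
```

Because each input character is visited once and & is matched on the input only, the entities written to the output are never escaped a second time within a single call.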

breadcrumb_jsonld does its own escaping with a nested json_escape that additionally handles <, >, &, and U+2028 / U+2029. The HTML chars matter because the JSON-LD body sits inside <script type="application/ld+json">: a breadcrumb name containing </script> would otherwise break out of the script element. U+2028 / U+2029 are line separators that historically break inline JS strings, so escaping them is belt-and-suspenders against future inline-JS rendering. The breadcrumb_html_escape copy used for the visible breadcrumb <ol> is kept separate; passing the HTML-escaped copy into breadcrumb_jsonld would double-escape.
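The script-safe escaping can be illustrated with the sketch below. It is an approximation under stated assumptions: json_escape is written as a top-level function rather than nested, the input is a hypothetical slice of (name, url) pairs rather than the path_value()-shaped array, the \uXXXX spellings are one common choice, and the ListItem type on each step is assumed from schema.org convention.

```rust
/// JSON string escaping that also neutralises <, >, & and U+2028 / U+2029,
/// so the result is safe inside a <script> element.
pub fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '<' => out.push_str("\\u003c"),
            '>' => out.push_str("\\u003e"),
            '&' => out.push_str("\\u0026"),
            '\u{2028}' => out.push_str("\\u2028"),
            '\u{2029}' => out.push_str("\\u2029"),
            // Remaining control characters must be escaped in JSON strings.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a BreadcrumbList from (name, url) steps; positions start at 1.
/// The slice-of-pairs input is a simplification for this sketch.
pub fn breadcrumb_jsonld(steps: &[(&str, &str)]) -> String {
    let items: Vec<String> = steps
        .iter()
        .enumerate()
        .map(|(i, step)| {
            format!(
                "{{\"@type\":\"ListItem\",\"position\":{},\"name\":\"{}\",\"item\":\"{}\"}}",
                i + 1,
                json_escape(step.0),
                json_escape(step.1)
            )
        })
        .collect();
    format!(
        "{{\"@context\":\"https://schema.org\",\"@type\":\"BreadcrumbList\",\"itemListElement\":[{}]}}",
        items.join(",")
    )
}
```

Feeding this function raw (not HTML-escaped) names is what keeps the output free of double escaping: a name such as </script> comes out as \u003c/script\u003e, which a JSON parser reads back as the original text.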

Sitemap writer

src/routes/sitemap.rs runs a background tokio task spawned from main.rs via routes::sitemap::spawn_writer(). The task:

  1. Loads the previous programfiles/op/sitemap.xml (the SITEMAP_FILE const) into the in-process CACHE, so the public endpoint always has something to serve during the first build.
  2. Runs refresh() immediately, then loops every REGEN_INTERVAL_SECS (3600 seconds = one hour).

CACHE is Lazy<RwLock<Arc<String>>> so the /sitemap.xml endpoint can read-lock briefly, clone the inner Arc, drop the lock, and respond. Under a request burst the per-request cost is bounded to one Arc clone plus one String clone for the body — no filesystem read, no content-tree walk.
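The read path described here might look like the following sketch. It uses std's OnceLock in place of Lazy so that it compiles without external crates; the store and snapshot function names are hypothetical, and the handler that turns the snapshot into a response body is left out.

```rust
use std::sync::{Arc, OnceLock, RwLock};

// Stand-in for the Lazy<RwLock<Arc<String>>> described above (OnceLock is a swap-in).
static CACHE: OnceLock<RwLock<Arc<String>>> = OnceLock::new();

fn cache() -> &'static RwLock<Arc<String>> {
    CACHE.get_or_init(|| RwLock::new(Arc::new(String::new())))
}

/// Writer side: replace the cached body with a freshly built one.
pub fn store(xml: String) {
    // A poisoned lock still holds usable data, so recover the guard instead of panicking.
    let mut guard = cache().write().unwrap_or_else(|e| e.into_inner());
    *guard = Arc::new(xml);
}

/// Reader side: hold the read lock only long enough to clone the Arc.
pub fn snapshot() -> Arc<String> {
    let guard = cache().read().unwrap_or_else(|e| e.into_inner());
    Arc::clone(&*guard)
    // The guard drops here; the caller keeps its own reference to the body.
}
```

A snapshot taken before a refresh keeps pointing at the old body, so an in-flight response is never affected by the writer swapping in a new Arc.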

Atomic write

persist writes to SITEMAP_FILE.tmp and then renames over the real path. Readers (either the writer’s own preload on next restart, or an admin running cat) never see a partially-written XML file.
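A minimal sketch of the write-then-rename step, using only std::fs. The signature is an assumption (the real persist presumably works from the SITEMAP_FILE const rather than a path argument), and error handling is reduced to returning the io::Result.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Writes `xml` to "<path>.tmp" and then renames it over `path`.
/// The temp file sits next to the target so both are on the same filesystem.
pub fn persist(path: &Path, xml: &str) -> io::Result<()> {
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    // Readers of `path` never observe this intermediate file.
    fs::write(&tmp, xml)?;
    // The rename replaces the old file in one step.
    fs::rename(&tmp, path)
}
```

Keeping the temporary file in the same directory matters: a rename across filesystems is not a single step and can fail outright.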

render_url_entry

Each <url> block contains a <loc> for the default-language URL and one <xhtml:link rel="alternate" hreflang="…"> per supported language, followed by an x-default link. Default-lang URLs are bare; non-default URLs carry the /<code> prefix. This mirrors hreflang_alternates so that what crawlers see in a page’s <head> agrees with what they see in the sitemap. Paths come from collect_paths, which walks programfiles/content/ and emits every folder that has a ctx.json — except type = "link" folders, which 302-redirect off-site and aren’t worth indexing.
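The shape of one entry can be sketched as below. Several details are assumptions: the origin is passed explicitly, SUPPORT_LANGS holds placeholder values, the indentation is arbitrary, and the xml_escape helper is hypothetical (the text does not say how the real writer escapes URLs).

```rust
/// Placeholder language list (assumed values); the first entry is the default language.
const SUPPORT_LANGS: &[&str] = &["en", "ko", "ja"];

/// Hypothetical helper: escapes characters that are unsafe in XML text and attributes.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
        .replace('\'', "&apos;")
}

/// Renders one <url> block: <loc>, one alternate per language, then x-default.
pub fn render_url_entry(origin: &str, path: &str) -> String {
    let bare = xml_escape(&format!("{}{}", origin, path));
    let mut out = String::new();
    out.push_str("  <url>\n");
    out.push_str(&format!("    <loc>{}</loc>\n", bare));
    for (i, code) in SUPPORT_LANGS.iter().enumerate() {
        // Same rule as the page head: bare URL for the default language, /<code> otherwise.
        let href = if i == 0 {
            bare.clone()
        } else {
            xml_escape(&format!("{}/{}{}", origin, code, path))
        };
        out.push_str(&format!(
            "    <xhtml:link rel=\"alternate\" hreflang=\"{}\" href=\"{}\"/>\n",
            code, href
        ));
    }
    out.push_str(&format!(
        "    <xhtml:link rel=\"alternate\" hreflang=\"x-default\" href=\"{}\"/>\n",
        bare
    ));
    out.push_str("  </url>\n");
    out
}
```

The xhtml:link elements only validate if the enclosing urlset declares the xhtml namespace, which is the job of whatever wraps these entries.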

robots.txt

programfiles/op/robots.txt is checked-in static content, not generated. It’s a minimal allow-all (User-agent: * / Disallow:) plus a Sitemap: line pointing at /sitemap.xml on the running origin. Update it by hand when the public origin changes; it’s served as-is.

Config keys consumed

  • public_origin — gates the whole SEO block, drives og:url and canonical, and prefixes every sitemap entry. When unset, both the per-page SEO block and the sitemap body go away (build_xml returns an empty string).
  • default_og_image — fallback used by og_image_for when no banner / absolute icon is configured.

See Language system for the language-prefix scheme used by hreflang and sitemap alternates, Backup writer for the parallel writer / atomic-write pattern, and Common: Header and LangDict for how og:image, name, and desc resolve per language.