Automatic Parsing

Requirements

I found an awesome browser extension called Furiganator which can insert Furigana (Ruby text) into webpages, with the following features:

  • Very fast and reliable.
  • Furigana in Hiragana, Katakana or Romaji.
  • Option to show Furigana only on mouse over.
  • Option to make the furigana text unselectable.
  • Option to use color-tags to easily identify the kanji level or grade.
  • Multiple color themes.

Its Licenses, Certificates and Acknowledgements section describes the technologies it uses.

Now I want to implement the same features in our minimal-mistakes-plus plugin with the same technologies. Specifically,

  • It should be very fast and reliable.
  • Use the option furigana: true/false in the front matter to turn the whole furigana parsing process on (true) or off (false or unset).
  • It can automatically add Furigana in Hiragana, Katakana or Romaji to normal text that lacks the furigana pattern, using the same technologies as Furiganator, depending on the option furigana_mode, which

    • applies to the whole document when specified in the front matter:

      furigana_mode: hiragana/katakana/romaji
    • applies to only the paragraphs marked by the attribute syntax:

      {: furigana_mode="hiragana/katakana/romaji"}

    The attribute syntax takes precedence if both apply to a paragraph.

    If the option is unset, default to hiragana.

  • Use the option furigana_selectable: true/false in the front matter to make the furigana text selectable (true) or unselectable (false or unset).
  • Use the option furigana_mouse_over: true/false to only show Furigana on mouse over (true) or to show it all the time (false or unset).
  • Option to use color-tags to easily identify the kanji level or grade (Do not implement).
  • Multiple color themes (Do not implement).
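
Taken together, the options above might be combined like this in a post's front matter (values are illustrative):

```yaml
---
title: "Example post"
furigana: true            # master switch for the whole furigana parsing process
furigana_mode: katakana   # hiragana (default) / katakana / romaji
furigana_selectable: true # furigana text can be selected
furigana_mouse_over: true # show furigana only on mouse over
---
```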

Kramdown attribute syntax

These options can be applied per paragraph with the Kramdown attribute syntax:

{: furigana_mode="hiragana/katakana/romaji" furigana_selectable="true/false" furigana_mouse_over="true/false"}

The attribute syntax overrides the same option in the front matter.

For the options furigana_selectable and furigana_mouse_over:

  • If specified in the front matter, add a toolbar just above the table of contents (`class="toc"`) and inside the sidebar (`class="sidebar__right sticky"`). Then add a small toggle icon (with different looks for on and off) for each option inside the toolbar so that they can be toggled on or off for the whole page.
  • If specified per paragraph in Kramdown attribute syntax, add the same toggle icon for each option just above the marked paragraph so that they can be toggled on or off only for that paragraph.

The HTML of the sidebar:

<aside class="sidebar__right sticky">
  <nav class="toc" aria-label="Table of contents">
    <header><h4 class="nav__title">
        <i class="fas fa-file-alt"></i> On this page</h4>
    </header>
  </nav>
</aside>

Use different icons for furigana_selectable and furigana_mouse_over, and each icon should still look different when on and off. Make the `frgntr-global-toolbar` prettier and more nicely integrated with the theme.

Use the following layout for the toolbar above the TOC:

Tools        <-- Title of the entire toolbar section, background color same as the "nav__title" of the TOC
Furigana     <-- Title of the Furigana section, background color same as current
icon1 icon2  <-- The icons, smaller, more compact, no background color

# Light dim border between the sections.

Integration with Existing Furigana Parsing

For the existing furigana parsing process dealing with the pattern:

(X|Y|Z:X1|Y1|Z1)

its processing logic should be kept intact, with these additional changes:

  • Now it belongs to the whole furigana parsing process which can be turned on or off using the option furigana: true/false as mentioned above.
  • If it’s turned off, convert all furigana patterns to normal texts without the pattern in the output, that is, (X|Y|Z:X1|Y1|Z1) becomes XYZ.
  • The pattern takes precedence over the automatic parsing, that is, a pattern should not be sent for automatic processing.
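
The "turned off" behavior can be sketched as a single text transformation (a minimal JavaScript sketch for illustration; the plugin itself implements this in Ruby, and the names here are illustrative):

```javascript
// Hedged sketch: collapse manual furigana patterns to plain text when the
// whole parsing process is turned off, i.e. (X|Y|Z:X1|Y1|Z1) -> XYZ.
const FURIGANA_PATTERN = /\(([^|():]+(?:\|[^|():]+)*):([^|():]+(?:\|[^|():]+)*)\)/g;

function stripFuriganaPatterns(text) {
  // Keep only the base segments (before the colon), joined without separators.
  return text.replace(FURIGANA_PATTERN, (_, bases) => bases.split("|").join(""));
}
```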

Example Outputs

Here is an example of how the Furiganator extension converts a piece of text to HTML when “Make furigana unselectable” and “Only show furigana on mouse hover” are turned on:

本日はお時間をいただき、ありがとうございます。私はマカオにあるマカオ理工大学を4年制の学士課程で卒業しました。3ヶ月後、鶴ヶ崎さんと会う。

HTML Output:

<ruby furiganated="true">本日<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">ほんじつ</rt>
  <rp>)</rp>
</ruby>はお<ruby furiganated="true">時間<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">じかん</rt>
  <rp>)</rp>
</ruby>をいただき、ありがとうございます。<ruby furiganated="true">私<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">わたし</rt>
  <rp>)</rp>
</ruby>はマカオにあるマカオ<ruby furiganated="true">理工<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">りこう</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">大学<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">だいがく</rt>
  <rp>)</rp>
</ruby>を4<ruby furiganated="true">年<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">とし</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">制<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">せい</rt>
  <rp>)</rp>
</ruby>の<ruby furiganated="true">学士<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">がくし</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">課程<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">かてい</rt>
  <rp>)</rp>
</ruby>で<ruby furiganated="true">卒業<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">そつぎょう</rt>
  <rp>)</rp>
</ruby>しました。3<ruby furiganated="true">ヶ月<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">かげつ</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">後<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">ご</rt>
  <rp>)</rp>
</ruby>、<ruby furiganated="true">鶴<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">つる</rt>
  <rp>)</rp>
</ruby>ヶ<ruby furiganated="true">崎<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">さき</rt>
  <rp>)</rp>
</ruby>さんと<ruby furiganated="true">会<rp>(</rp>
  <rt class="frgntr_hidden frgntr_unselectable ">あ</rt>
  <rp>)</rp>
</ruby>う。

When “Make furigana unselectable” and “Only show furigana on mouse hover” are both turned off:

<ruby furiganated="true">本日<rp>(</rp>
  <rt class="">ほんじつ</rt>
  <rp>)</rp>
</ruby>はお<ruby furiganated="true">時間<rp>(</rp>
  <rt class="">じかん</rt>
  <rp>)</rp>
</ruby>をいただき、ありがとうございます。<ruby furiganated="true">私<rp>(</rp>
  <rt class="">わたし</rt>
  <rp>)</rp>
</ruby>はマカオにあるマカオ<ruby furiganated="true">理工<rp>(</rp>
  <rt class="">りこう</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">大学<rp>(</rp>
  <rt class="">だいがく</rt>
  <rp>)</rp>
</ruby>を4<ruby furiganated="true">年<rp>(</rp>
  <rt class="">とし</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">制<rp>(</rp>
  <rt class="">せい</rt>
  <rp>)</rp>
</ruby>の<ruby furiganated="true">学士<rp>(</rp>
  <rt class="">がくし</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">課程<rp>(</rp>
  <rt class="">かてい</rt>
  <rp>)</rp>
</ruby>で<ruby furiganated="true">卒業<rp>(</rp>
  <rt class="">そつぎょう</rt>
  <rp>)</rp>
</ruby>しました。3<ruby furiganated="true">ヶ月<rp>(</rp>
  <rt class="">かげつ</rt>
  <rp>)</rp>
</ruby><ruby furiganated="true">後<rp>(</rp>
  <rt class="">ご</rt>
  <rp>)</rp>
</ruby>、<ruby furiganated="true">鶴<rp>(</rp>
  <rt class="">つる</rt>
  <rp>)</rp>
</ruby>ヶ<ruby furiganated="true">崎<rp>(</rp>
  <rt class="">さき</rt>
  <rp>)</rp>
</ruby>さんと<ruby furiganated="true">会<rp>(</rp>
  <rt class="">あ</rt>
  <rp>)</rp>
</ruby>う。

Implementation

To achieve a seamless, lightning-fast, and precise automatic Furigana injection using the exact same technologies as Furiganator, we need to architect a Ruby-to-Node.js bridge.

Spawning a new Node.js process for every single paragraph or file would be extremely slow because Kuromoji’s dictionaries take hundreds of milliseconds to load. Instead, the solution uses a persistent Node.js worker that boots up once per Jekyll build, loads the dictionary, and sits idly waiting for arrays of text jobs over stdin.

Here is the complete implementation with the required comments and explanations.

1. Install Node.js Dependencies

Navigate to your plugin’s directory and initialize the Node dependencies. Kuroshiro and its analyzer are strictly required for the Node worker.

cd lib/minimal_mistakes_plus
npm init -y
npm install kuroshiro kuroshiro-analyzer-kuromoji

2. The Node.js Worker (kuroshiro_worker.js)

Create a new file lib/minimal_mistakes_plus/kuroshiro_worker.js. This is the high-performance daemon that parses Japanese morphology via Kuromoji and converts it to HTML Ruby text.

3. The Updated Ruby File (furigana.rb)

Overwrite your existing lib/minimal_mistakes_plus/furigana.rb.

This code acts as the director. It handles global/local settings, strips out manual syntax when toggled off, acts as an isolation barrier for `<code>`/`<pre>` tags so Kuroshiro doesn’t parse code blocks, and streams strings back and forth with the Node worker.

4. GitHub Actions

You should commit the following files:

kuroshiro_worker.js
package.json
package-lock.json

You must not commit the node_modules folder. It is already included in your .gitignore file, which is the correct standard practice.

However, because you are not committing node_modules, there is a hidden trap for GitHub Actions: when your workflow pulls your plugin as a Ruby Gem to build the site, the Node.js dependencies will be missing, causing the Jekyll build to crash when it tries to spawn the worker.

To solve this without bloating your Git repository or requiring users to modify their GitHub Actions YAML files, you can add a tiny “auto-install” hook to your Ruby script. GitHub Actions (ubuntu-latest) has Node.js pre-installed, so your plugin can simply run npm install automatically on the very first run.

--- lib/minimal_mistakes_plus/furigana.rb
+++ lib/minimal_mistakes_plus/furigana.rb
@@ -20,6 +20,12 @@
           @disabled = true
           return
         end
+
+        # Auto-install Node dependencies if missing (e.g., in GitHub Actions or fresh gem install)
+        unless Dir.exist?(File.join(File.dirname(__FILE__), 'node_modules'))
+          Jekyll.logger.info "Furigana:", "Installing Node.js dependencies. This may take a moment..."
+          system("npm install", chdir: File.dirname(__FILE__))
+        end
 
         @stdin, @stdout, @stderr, @wait_thr = Open3.popen3("node #{script_path}")

Bug Fixes

Characters not found in the dictionary

Some characters are parsed incorrectly, such as 箇 and 个:

<p>Some characters are parsed incorrectly, such as <ruby furiganated="true">箇<rp>(</rp><rt class="frgntr_unselectable">箇</rt><rp>)</rp></ruby> or <ruby furiganated="true">个<rp>(</rp><rt class="frgntr_unselectable">个</rt><rp>)</rp></ruby>:</p>

whose Furigana is added as itself. Is this caused by the kuroshiro backend or our own code?

This happens because of the underlying Kuromoji dictionary used by the Kuroshiro backend.

When Kuromoji encounters a single kanji character or variant that it doesn’t have a specific contextual reading for (or treats as an unknown word), it defaults to returning the surface character itself as the reading. Kuroshiro then sees a kanji character, sees a reading attached to it, and faithfully wraps it in a <ruby> tag, resulting in <ruby>箇...<rt>箇</rt>...</ruby>.

We can easily fix this by filtering the HTML output. We’ll add a cleanup step to strip the <ruby> tags whenever the base text perfectly matches the reading text. We need to do this on both the Ruby backend (for the static build) and the JavaScript frontend (for the live converter).
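
On the frontend side, that cleanup step might look like this (a hedged sketch only; the function name and regex are illustrative, matching the Furiganator-style markup shown earlier, and the Ruby backend would apply the equivalent substitution):

```javascript
// Hedged sketch: remove <ruby> wrappers whose reading is identical to the
// base text (e.g. <ruby>箇<rt>箇</rt></ruby> -> 箇), keeping all others.
const IDENTITY_RUBY =
  /<ruby[^>]*>([^<]*)<rp>\(<\/rp>\s*<rt[^>]*>([^<]*)<\/rt>\s*<rp>\)<\/rp>\s*<\/ruby>/g;

function stripIdentityRuby(html) {
  return html.replace(IDENTITY_RUBY, (match, base, reading) =>
    base === reading ? base : match
  );
}
```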

Furigana Converter

Now I want to be able to add an input box in a post, with which you can paste Japanese text into it and the page can then instantly display Furigana for the text just below the box. The parsing process should utilize the same mechanism we’ve implemented in kuroshiro_worker.js and furigana.rb.

For example,

# The input box, whose width is as wide as the post content, height is 3 lines
# high by default and expands automatically as the content grows, no scroll bar.
{% include furigana-converter placeholder="Enter your contents..." %}

The text with furigana goes here...

Because Jekyll is a static site generator, furigana.rb and kuroshiro_worker.js only run on your computer during the build phase. Once the site is published, it becomes purely static HTML and cannot send text back to the Ruby script.

To achieve instant, live parsing inside the browser, we need to load the exact same Kuroshiro and Kuromoji engines directly into the client via a CDN. The logic will be identical to your backend worker, and it will even detect and apply your global visibility/selectability toggles.

Create a new file at _includes/furigana-converter.html with the following contents (see the source file in the repository).

Then add to your post:

{% include furigana-converter.html placeholder="Enter your contents..." %}

I see the message “Downloading dictionary (first time only, ~18MB)…” the first time I click the input box after every page refresh. Is it actually downloading the libraries on every refresh? Can we cache them so they are only re-downloaded on a force refresh, not on a normal one?

Actually, your browser is already caching those files automatically.

The CDN we are using (cdn.jsdelivr.net) sends aggressive caching headers. On your first visit, the browser downloads the 18MB dictionary. On subsequent normal refreshes, the browser fetches those files directly from its local disk cache, which requires zero internet bandwidth.

The delay you are experiencing is the initialization phase. Every time you refresh the page, the browser’s JavaScript memory is wiped clean. When you click the input box, the Kuromoji engine has to read the cached files, decompress them, and rebuild its complex text-parsing data structures (a double-array trie) in your computer’s RAM. This computation takes a second or two, even on fast devices.

Because it’s impossible to persist instantiated JavaScript memory states across page reloads without deeply rewriting the library to support IndexedDB serialization, this brief loading phase is unavoidable.

To prevent this confusion, we should update the wording in the status message to reflect what is actually happening:

--- lib/minimal_mistakes_plus/furigana-converter.rb
+++ lib/minimal_mistakes_plus/furigana-converter.rb
@@ -66,7 +66,7 @@
             kuroshiroInitializing = true;
             if (statusEl) {
                 statusEl.style.display = 'block';
-                statusEl.innerText = "Downloading dictionary (first time only, ~18MB)...";
+                statusEl.innerText = "Loading dictionary engine...";
             }
 
             try {

Bug Fixes

Kuromoji Resources Not Found

On the GitHub Actions webpage when clicking on the Furigana Converter input box, the whole page freezes. Errors reporting kuromoji resources 404 (Not Found) can be seen in the console, as shown in the screenshot. It works fine when running locally. I think it’s because the CDN links are wrong, such as https://josephtesfaye.com/cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/base.dat.gz. How does this happen? Fix it.

Gemini (Pro):

This is a deep, notorious quirk caused by a combination of Kuromoji’s internal code and the browser’s URL resolution rules. Here is exactly what is happening:

  1. The path.join Bug: Inside the Kuromoji library, the loader combines your dictPath with the filename using a browser polyfill of Node’s path.join(). This polyfill strictly normalizes paths and strips out double slashes. It takes “https://cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/” and turns it into “https:/cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/base.dat.gz” (with only one slash after https:).
  2. Why it worked locally (HTTP): Your local server runs on http://. When the browser’s XMLHttpRequest tries to fetch https:/cdn..., it sees that the scheme (https) doesn’t match your local server (http). Because it’s a cross-scheme request, the browser safely auto-corrects the missing slash and fetches the dictionary from the external CDN.
  3. Why it failed in Actions (HTTPS): GitHub Pages runs on https://. When the browser sees the request to https:/cdn..., the schemes match. Under the URL specification, a matching scheme followed by a single slash is parsed as a path on the current origin, so the browser appends it to your site, resulting in the 404 error at https://josephtesfaye.com/cdn.jsdelivr.net/…!
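
The failure mode can be reproduced without the library (toy reimplementations for illustration only; neither function is the library's actual code):

```javascript
// Toy model of the polyfill's normalization: collapsing duplicate slashes
// also destroys the "//" that belongs to the URL scheme.
function naiveJoin(...parts) {
  return parts.join("/").replace(/\/{2,}/g, "/");
}

// A common workaround (an assumption, not necessarily what the repository
// shipped): repair a scheme followed by a single slash before fetching.
function repairSchemeSlash(url) {
  return url.replace(/^(https?):\/(?!\/)/, "$1://");
}
```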

Translator

Requirements

Now I also want Furigana Converter to be able to show translation (in English or a selected language) for the input text. The source text (with furigana) and translation are shown in pairs sentence by sentence, with the translation on a new line just below the source.

Similarly, these options are used in the front matter:

translate: true/false
translate_targets: [English, Chinese]

  • translate: Toggle the translation on (true) or off (false or unset).
  • translate_targets: Specify the languages into which the converter can translate.

When translation is toggled on, show an icon in a TRANSLATION row just above the FURIGANA row in the toolbar. When the icon is clicked a dropdown list of the translate_targets values is shown in the same order, and the icon changes to correspond to the selected language (i.e., each language has its own icon). You can select none of the languages, in which case the translation is toggled off and the icon changes to a corresponding look. The first value in translate_targets is the selected language by default, not OFF.

But before we start coding I want to discuss the feasibility of this feature. At first thought, I want the translation to work offline. The question is, is there a mature offline translator which gives you reliable translations? Would it be too large and too reliant on computational power to be loaded via CDN and hence impractical for my purpose? If we have to use online translators such as Google, Microsoft, DeepL, etc, do they provide APIs for third-party usage? Finally, can we use AI services from Google, OpenAI, etc.? But I want to keep the translation feature lightweight and free (no cost).

Discussion 1

This is an excellent architectural question. Adding live translation to a purely static site (like Jekyll on GitHub Pages) introduces a very specific set of constraints because you do not have a backend server to process data or hide secrets.

Here is a breakdown of the feasibility of each approach, balancing your requirements for reliability, cost, and payload size.

The Offline Approach (Client-Side AI)

It is actually possible to run translations entirely offline in the browser using WebAssembly and libraries like Hugging Face’s transformers.js.

  • How it works: The browser downloads a quantized translation model (like Opus-MT/MarianMT or a tiny LLM) via CDN and runs the neural network directly on the user’s CPU/GPU.
  • The Problem (Size): These models are inherently heavy. A specialized translation model (e.g., Japanese to English) is typically 70MB to 150MB. If you want Japanese to Chinese as well, that is another 70MB+ download.
  • Verdict: While it is completely free and private, it drastically violates the “lightweight” requirement. It will freeze the browser tab during initialization and consume significant mobile data.

Traditional Cloud APIs (DeepL, Google Translate, Microsoft)

These services provide the most reliable, fastest, and most accurate translations. They all offer APIs and generous free tiers (e.g., DeepL offers 500,000 characters per month for free).

  • The Problem (Security): To use these APIs, you must send an API key with your request. Because your site is static HTML/JS, you would have to hardcode your API key into your JavaScript file.
  • The Risk: Bots constantly scrape GitHub and public websites for exposed API keys. Within hours, someone could steal your key, blow through your free tier limit, and either lock your account or charge your credit card.
  • Verdict: Never put API keys in client-side JavaScript. This makes direct API integration impossible for a static site without a proxy.

AI Services (OpenAI, Google Gemini, Anthropic)

Using LLMs provides fantastic, context-aware translations (often better than traditional translators for Japanese). Google Gemini, for instance, has a very generous free tier.

  • The Problem (Security): Exactly the same as the traditional APIs. Exposing an OpenAI or Gemini API key in your client-side JavaScript is a massive security risk. Both Google and OpenAI actively scan GitHub and will automatically revoke your key the moment they detect it in your public repository.
  • Verdict: Unusable directly from the browser for public static sites.

Free, Keyless Web APIs (The “Hack” Approach)

There are community-hosted proxy APIs (like Lingva Translate or public instances of LibreTranslate) that attempt to scrape Google Translate or run open-source models for free without requiring API keys.

  • The Problem (Reliability & CORS): These free instances are constantly overloaded, frequently go offline, and often suffer from CORS (Cross-Origin Resource Sharing) blocks, meaning your browser will refuse to load the data for security reasons. Furthermore, if you try to hit Google Translate’s web endpoints directly via JS, Google will block you with a 429 Too Many Requests or CORS error.
  • Verdict: Too unreliable for a polished feature.

Since a purely static frontend cannot safely hold API keys, and local browser models are too large, you have two realistic paths forward:

Path A: The Serverless Proxy (Most Reliable & Free)

You can easily spin up a free “Serverless Function” using Cloudflare Workers, Vercel, or Netlify.

  1. You create an account on Cloudflare (free) or Vercel (free).
  2. You write a 20-line JavaScript function that holds your Google Gemini or DeepL API key securely on their servers.
  3. Your Furigana Converter sends the Japanese text to your serverless function URL.
  4. The function talks to Gemini/DeepL, gets the translation, and sends it back to your site.

  • Pros: 100% secure, highly reliable, and remains entirely free (using Cloudflare’s free tier + Gemini/DeepL’s free tier). Keeps your frontend incredibly lightweight.
  • Cons: Requires setting up an external account (Cloudflare/Vercel) outside of your Jekyll repository.

Path B: The Browser Native API (The Future, but Experimental)

Google Chrome has recently started building an AI translation model directly into the browser itself (window.translation API).

  • Pros: Zero network requests, instant, completely free, zero download (it’s built into Chrome).
  • Cons: It is currently an experimental feature. It only works on recent versions of Chrome/Edge, requires users to have certain flags enabled, and does not work on Safari or Firefox yet.

If you want a robust, production-ready feature today that stays free and lightweight, Path A (Serverless Proxy with a free Gemini or DeepL API key) is the standard architectural best practice. We can easily write the proxy code and the frontend Jekyll integration if you are open to using Cloudflare Workers or Vercel.

Discussion 2

Is there non-LLM or non-AI solution for local translation, like a command-line tool or something whose output can be utilized by other apps? What are the best ones?

To answer this, we need to clarify a quick technical distinction: truly “non-AI” translation is called Rule-Based Machine Translation (RBMT). RBMT relies entirely on hardcoded grammatical dictionaries. For structurally identical languages (like Spanish to Catalan), it works okay. For Japanese to English, RBMT produces completely unintelligible gibberish.

However, if by “non-LLM” you mean you want to avoid massive, resource-heavy models (like ChatGPT or LLaMA) and instead want a lightweight, dedicated translation engine that runs locally on your CPU via the command line, there are excellent options.

Here are the best command-line translation tools whose outputs can be easily piped into other applications or parsed by a Ruby script.

Argos Translate (True Local Offline)

Argos Translate is an open-source offline translation engine built on OpenNMT (Neural Machine Translation). It is exactly what you are looking for: it uses small, highly optimized models specifically trained for single language pairs (about 30MB to 50MB per language), requiring no GPU and running entirely offline.

  • How it works: It is a Python package that installs a command-line utility.
  • CLI Usage: `argos-translate --from ja --to en "こんにちは"`
  • Integration: You can easily call it from your Jekyll Ruby plugins using Open3.capture2, capturing the standard output directly into your HTML generation process.

Translate-Shell (The API Wrapper)

If you want the high accuracy of Google Translate, DeepL, or Bing without dealing with API keys or cloud accounts, translate-shell (formerly Google Translate CLI) is the gold standard for command-line users.

  • How it works: It is a pure shell script (Awk) that acts as a command-line interface for public web translators. It scrapes the web endpoints invisibly.
  • CLI Usage: `trans -b ja:en "本日"` (the `-b` flag returns “brief” output, which is just the raw translated string, perfect for scripts).
  • Integration: Extremely fast and native to Unix environments. The only downside is that if your build script sends thousands of requests in a few seconds, the IP might temporarily be rate-limited by the engines.

Bergamot / Marian NMT (The Browser-Native Engine)

Marian NMT is the engine written in pure C++ that powers Mozilla Firefox’s new local, offline translation feature. It is incredibly fast and designed specifically for consumer CPUs rather than server GPUs.

  • How it works: While it has a command-line interface, its real power is that it is compiled to WebAssembly.
  • Integration: This is actually what you would use if you wanted to embed an offline translator directly into a webpage without a server. However, as discussed previously, the WASM binaries and language packs still require the user to download around 30MB-50MB of files when they visit the page.

The Architectural Reality Check for Jekyll

While these command-line tools are perfect for local text processing, we need to address how they fit into your specific Jekyll environment:

Scenario A: Build-Time Translation (Works perfectly)

If you are translating the static content of your posts (e.g., adding translate: true to the front matter and having a Ruby plugin parse the Japanese text and generate English text below it), these CLI tools are the perfect solution. Your furigana.rb script can pass the Japanese strings to argos-translate via the shell, and bake the English text permanently into the final HTML during the build phase.

Scenario B: Live Converter Input Box (Will not work)

If your goal is for the Live Furigana Converter to translate text that a user pastes into the browser after the site has been published, a local CLI tool cannot help you:

  1. GitHub Pages only hosts static HTML/JS; it cannot run your local Mac command-line tools.
  2. The user’s browser cannot execute shell scripts.

For a live, in-browser input box to generate translations instantly on a static site, you are strictly limited to either injecting a heavy WebAssembly model into the client’s browser, or passing the text to an external cloud proxy via JavaScript.

Discussion 3

Does translate-shell require an API key? If not, why do we need a key in the previous discussion above using Google or Microsoft Translate? Can we go by without a key, but just with what’s provided to the public for free? I remember seeing some websites using Google Translate to do multi-language support for themselves. It seems to be a widget or something. Can we use it also? Can we use it to translate just a certain area instead of the whole page?

Why translate-shell Doesn’t Need a Key

translate-shell bypasses the need for an API key by acting as a web scraper. Instead of communicating with Google’s official developer API, it constructs an HTTP request that mimics a normal user typing text into the public translate.google.com website. It then intercepts and parses the raw JSON responses that Google’s frontend sends back to the browser.

Because it pretends to be regular web traffic, Google provides the translation for free, just as it would if you manually visited the site.

Why We Can’t Replicate This in the Browser

If translate-shell can scrape the public website for free, you might logically wonder why our client-side JavaScript can’t just do the exact same thing.

The barrier is CORS (Cross-Origin Resource Sharing). Modern browsers enforce strict security policies to prevent malicious scripts from interacting with other websites behind the scenes. If your JavaScript on josephtesfaye.com tries to silently fetch data from translate.google.com, the browser will instantly block the request because the origins do not match. translate-shell runs in your terminal, which does not enforce browser security policies like CORS, giving it total freedom to scrape data.

The Google Translate Widget

The widget you are remembering is the Google Translate Website Translator. While it seems like a perfect fit, there are two major issues with using it for your specific use case:

  1. It is Deprecated: Google officially retired this tool for new websites in 2019. While legacy scripts still float around the internet and technically function, it is unsupported, unmaintained, and could be completely shut down by Google at any time.
  2. Architectural Incompatibility: The widget is designed to forcefully mutate the DOM and replace the text in place. It does not provide an API to simply “return” the translated string back to your code. Because your goal is to pair the source text and translation sentence by sentence (placing the English on a new line below the Japanese), the widget will fail. It would simply overwrite your Japanese text with English text rather than giving you the raw string to format yourself.

(Note: You can technically restrict the widget to a certain area by wrapping the rest of the page in a notranslate CSS class, but it still suffers from the DOM-mutating behavior mentioned above.)

The Best Approach Forward

To achieve your exact layout—a clean, instant translation returned as a string that we can format underneath the Furigana—without exposing API keys or battling CORS blocks, we need a proxy.

The fastest, most efficient, and best approach is to create a Cloudflare Worker. It is a 100% free, serverless function that takes minutes to set up.

  1. Your frontend sends the Japanese text to your Worker URL.
  2. The Worker (which safely holds a free Gemini or DeepL API key on Cloudflare’s secure servers) requests the translation.
  3. The Worker returns the pure translated string back to your Jekyll site for formatting.

Would you like me to walk you through writing the tiny Cloudflare Worker script and integrating it into our Live Converter logic?

Discussion 4

Yes, show me how to do it step by step with Google Translate and Cloudflare Worker, and start making changes in minimal-mistakes-plus.

This is the perfect scenario for a Cloudflare Worker. We will create a tiny serverless proxy that takes your Japanese text, forwards it to Google Translate’s public, keyless web endpoint (the same one used by translate-shell and browser extensions), and returns the raw string to your static Jekyll site.

This entirely bypasses browser CORS restrictions and requires zero API keys, keeping your site 100% free and lightweight.

Here is the step-by-step implementation.

Step 1: Create the Serverless Proxy (Cloudflare Worker)

  1. Go to Cloudflare Workers and sign up for a free account (if you don’t have one).
  2. Click Create Application -> Create Worker.
  3. Name it something like furigana-translator and deploy it.
  4. Click Edit code and replace the default code with this exact script (see the same commit as these contents):
  5. Click Deploy. Note the URL it gives you (e.g., https://furigana-translator.<your-username>.workers.dev). You will need this for Step 3.
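
The committed script is the authoritative version; for orientation, a minimal sketch of such a keyless proxy might look like the following. The gtx endpoint is the same keyless Google Translate web endpoint used by translate-shell; the request/response shape (`{"text": …, "target": …}` in, `{"translated": …}` out) and the helper names are illustrative assumptions.

```javascript
// Rough sketch of a keyless translation proxy (illustrative, not the
// committed script). Expects POST bodies like {"text": "...", "target": "en"}.

// Build the keyless Google Translate web endpoint URL (the same gtx
// endpoint used by translate-shell and browser extensions).
function buildTranslateUrl(text, target) {
  const params = new URLSearchParams({
    client: "gtx",   // keyless web client
    sl: "ja",        // source language: Japanese
    tl: target,      // target language, e.g. "en"
    dt: "t",         // request translation segments only
    q: text,
  });
  return `https://translate.googleapis.com/translate_a/single?${params}`;
}

async function handleRequest(request) {
  // CORS headers so the static Jekyll site can call the Worker directly.
  const cors = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Headers": "Content-Type",
  };
  if (request.method === "OPTIONS") {
    return new Response(null, { headers: cors });
  }
  const { text, target = "en" } = await request.json();
  const res = await fetch(buildTranslateUrl(text, target));
  const data = await res.json();
  // gtx nests translated chunks as data[0][i][0]; join them back together.
  const translated = data[0].map((chunk) => chunk[0]).join("");
  return new Response(JSON.stringify({ translated }), {
    headers: { "Content-Type": "application/json", ...cors },
  });
}

// Worker module entry point (uncomment when deploying):
// export default { fetch: handleRequest };
```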

Step 2: Add the UI Dropdown (furigana.rb)

We need to read your front-matter variables (translate and translate_targets) and inject a globe icon next to the Furigana settings in the toolbar.

For the code changes, see the same commit as these contents.

Step 3: Implement Sentence Alignment (furigana-converter.rb)

To correctly pair the Japanese with the translation, the JavaScript needs to check if translation is active. If it is, we split the text into sentences, convert each to Furigana, fetch the translation for each from your Cloudflare Worker, and stack them.

(*Important:* Make sure to replace =YOUR_CLOUDFLARE_WORKER_URL= on line =202= with your actual URL).

For the code changes, see the same commit as these contents.
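
Conceptually, the generated JavaScript does something like the sketch below (illustrative, not the committed code; `splitSentences`, `renderAligned`, and `workerUrl` are placeholder names, and the Furigana conversion step is elided):

```javascript
// Illustrative sketch of the per-sentence alignment loop.

// Split Japanese text into sentences, keeping the closing punctuation.
function splitSentences(text) {
  return text.match(/[^。!?]+[。!?]?/g) || [];
}

// For each sentence: render the (Furigana-annotated) Japanese, then fetch
// and stack its English translation from the Cloudflare Worker underneath.
async function renderAligned(text, container, workerUrl) {
  for (const sentence of splitSentences(text)) {
    const jp = document.createElement("p");
    jp.textContent = sentence;             // Furigana markup would go here
    container.appendChild(jp);

    const res = await fetch(workerUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: sentence, target: "en" }),
    });
    const { translated } = await res.json();
    const en = document.createElement("p");
    en.textContent = translated;           // English stacked underneath
    container.appendChild(en);
  }
}
```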

Discussion 5

In Step 1 I chose to build from a GitHub repo named translate-worker, which is currently empty. Show me how to complete this setup step by step.

Connecting your Cloudflare Worker to a GitHub repository is actually the best way to do this. It sets up automatic CI/CD, meaning any future changes you make to the proxy code in your repository will automatically deploy to Cloudflare.

Since your repository is currently empty, Cloudflare doesn’t know what to deploy yet. We need to populate it with the standard file structure for a Cloudflare Worker. You can do this by cloning the repo to your local machine or just creating the files directly on the GitHub website.

Here is exactly how to complete the setup step-by-step.

Step 1: Add the Configuration File

Cloudflare looks for a configuration file called wrangler.toml in the root of your repository to understand how to build your worker.

Create a file named ==wrangler.toml== in the root folder of your translate-worker repository and paste this inside:

name = "translate-worker"
main = "src/index.js"
compatibility_date = "2024-04-06"

Step 2: Add the Proxy Code

Next, create a folder named ==src==. Inside that folder, create a file named ==index.js==.

Paste the proxy code we discussed earlier into ==src/index.js==:

For the code changes, see the same commit as these contents.

Step 3: Push to GitHub

Commit these two files (wrangler.toml and src/index.js) and push them to your main branch on GitHub. Your repository structure should now look exactly like this:

translate-worker/
├── wrangler.toml
└── src/
    └── index.js

Step 4: Complete Cloudflare Setup

  1. Go back to the Cloudflare dashboard screen shown in your screenshot.
  2. Click the Begin setup button.
  3. Because you have a wrangler.toml file in the repository, Cloudflare should automatically detect the settings.
  4. If it asks you for a Build Command, leave it blank.
  5. If it asks for the Entry point or Main file, ensure it points to src/index.js.
  6. Click Save and Deploy.

Cloudflare will take a few seconds to build and deploy. Once it finishes, it will give you a public URL (e.g., https://translate-worker.<your-username>.workers.dev). You will take that specific URL and paste it into line 209 of your ==furigana-converter.rb== file.

Discussion 6

Besides, I want to improve the sentence-breaking logic. For an input like the above, it should remain a single line in the output, instead of the three lines the current logic produces. To be more specific, if a sentence is no more than 5 characters (excluding punctuation), continue the line with the next sentence. For emojis, always treat them as part of the sentence before them and never break them onto a separate line.
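
A sketch of that joining rule might look like this (illustrative, not the shipped code; the five-character threshold and the punctuation set follow the description above, and the emoji handling relies on the Unicode Emoji_Presentation property):

```javascript
// Punctuation excluded from the 5-character count (assumed set).
const PUNCT = /[。、!?…・「」()\s]/gu;

function contentLength(sentence) {
  return sentence.replace(PUNCT, "").length;
}

// Glue a leading emoji run back onto the previous sentence so emojis
// never start a new line on their own.
function attachLeadingEmojis(sentences) {
  const out = [];
  for (const s of sentences) {
    const m = s.match(/^\p{Emoji_Presentation}+/u);
    if (m && out.length) {
      out[out.length - 1] += m[0];
      const rest = s.slice(m[0].length);
      if (rest) out.push(rest);
    } else {
      out.push(s);
    }
  }
  return out;
}

// Merge sentences of <= 5 content characters into the following sentence.
function joinShortSentences(sentences) {
  const out = [];
  let buffer = "";
  for (const s of attachLeadingEmojis(sentences)) {
    buffer += s;
    if (contentLength(buffer) > 5) {
      out.push(buffer);
      buffer = "";
    }
  }
  if (buffer) {
    // A short trailing fragment joins the previous full sentence.
    if (out.length) out[out.length - 1] += buffer;
    else out.push(buffer);
  }
  return out;
}
```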

Discussion 7: How to prevent Too Many Requests error

The root cause of this error is rate limiting (Status 429: Too Many Requests), not your browser.

In our previous JavaScript logic, the Live Converter was designed to split the Japanese text into individual sentences and translate them one by one. If you paste a paragraph with 5 sentences, your browser fires 5 simultaneous POST requests to your Cloudflare Worker. The Worker then instantly fires 5 simultaneous requests to Google Translate from the exact same Cloudflare IP address. Google’s automated defenses flag this burst of traffic as a scraper, temporarily ban the IP, and return an HTML warning page instead of JSON—which crashes the Worker.

To make this completely bulletproof, we need to implement a Fallback Waterfall in your Cloudflare Worker. If Google blocks the primary endpoint, the Worker will instantly and invisibly switch to a secondary endpoint, and then a third community API if necessary.

Replace the entire contents of your index.js file in your translate-worker repository with this highly resilient version:
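
The committed index.js carries the full script; its waterfall shape is roughly the following sketch. Only the primary gtx endpoint is shown concretely; the secondary and tertiary endpoints, and the function names, are placeholders to substitute with the committed version.

```javascript
// Illustrative sketch of the Fallback Waterfall (see the committed
// index.js for the real endpoint list).

// Try each attempt in order; the first one that succeeds wins.
async function waterfall(attempts) {
  let lastError;
  for (const attempt of attempts) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;       // fall through to the next endpoint
    }
  }
  throw lastError;           // every endpoint failed
}

async function translate(text, target) {
  return waterfall([
    // Primary: keyless Google Translate web endpoint (gtx).
    async () => {
      const res = await fetch(
        "https://translate.googleapis.com/translate_a/single" +
          `?client=gtx&sl=ja&tl=${target}&dt=t&q=${encodeURIComponent(text)}`
      );
      if (!res.ok) throw new Error(`gtx blocked: ${res.status}`);
      const data = await res.json(); // throws if Google returns an HTML ban page
      return data[0].map((chunk) => chunk[0]).join("");
    },
    // Secondary and tertiary endpoints would be appended here
    // in the same shape.
  ]);
}
```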

What about making changes to furigana_live.js so that only one request is sent to the Cloudflare Worker, and hence Google Translate, no matter how many sentences there are in the input? Then we do the sentence segmentation based on both the source text and the response. To do accurate segmentation, we can do it like this:

Before sending for translation, we normalize the source text:

A sentence is delimited by and only by these punctuation symbols 。!?. If none of them is found for a segment, complete it with the symbol. Remove newlines, if any. Remove spaces between CJK characters (including the CJK punctuation symbols).

For accurate segmentation of the translation to pair with the source text segment by segment, we insert a neutral symbol like @ between the segments of the source text, so that the translation also contains these symbols and hence the same segments.
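
Put together, the normalization and pairing could be sketched like this (illustrative; the 。 completing symbol, the CJK character ranges, and the function names are assumptions based on the description above):

```javascript
// Remove newlines and spaces between CJK characters
// (including CJK punctuation and full-width forms).
function normalize(text) {
  const cjk = "\\u3000-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef";
  return text
    .replace(/\n+/g, "")
    .replace(new RegExp(`([${cjk}])\\s+(?=[${cjk}])`, "g"), "$1");
}

// Split into segments on 。!? (keeping the mark); a segment without any
// of these marks is completed with 。(assumed to be "the symbol" above).
function segment(text) {
  return (normalize(text).match(/[^。!?]+[。!?]?/g) || [])
    .map((s) => (/[。!?]$/.test(s) ? s : s + "。"));
}

// Join with "@" for the single translation request; the translation is
// split on the same "@" to pair segments one-to-one with the source.
function buildRequest(text) {
  return segment(text).join("@");
}

function pairSegments(sourceSegments, translatedText) {
  const translatedSegments = translatedText.split("@").map((s) => s.trim());
  return sourceSegments.map((src, i) => ({
    source: src,
    translation: translatedSegments[i] || "",
  }));
}
```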

Meanwhile, keep the fallback waterfall logic:

To make this completely bulletproof, we need to implement a Fallback Waterfall in your Cloudflare Worker. If Google blocks the primary endpoint, the Worker will instantly and invisibly switch to a secondary endpoint, and then a third community API if necessary.

How can you test index.js locally without deploying to Cloudflare?

The best and officially supported way to test Cloudflare Workers locally is by using Wrangler, the official Cloudflare CLI. It spins up a local server on your machine that perfectly simulates the Cloudflare environment.

Since you already have your wrangler.toml and src/index.js set up, you are perfectly positioned to do this.

Here is how you can test your worker locally step-by-step:

Step 1: Install Wrangler

You will need Node.js installed on your computer. Open your terminal, navigate to your translate-worker directory (where your wrangler.toml lives), and run:

npm install -g wrangler

Alternatively, you can just use npx wrangler in the next step without installing it globally.

Step 2: Start the Local Development Server

In the same terminal directory, run the following command:

wrangler dev

Wrangler will read your wrangler.toml, compile your index.js, and start a local server. It will usually output a local URL for you to use, typically: http://localhost:8787

Bonus: This server supports hot-reloading. If you make changes to your index.js and save the file, the local server will instantly update without needing a restart!

Step 3: Test the API Locally

You can test if it’s working using a standard terminal curl command, or by temporarily pointing your Jekyll site to the local server.

Option A: Test via Terminal (cURL)

While wrangler dev is running in one terminal window, open a new terminal window and send a test POST request:

curl -X POST http://localhost:8787/ \
-H "Content-Type: application/json" \
-d '{"text": "こんにちは", "target": "en"}'

You should instantly see a JSON response like {"translated":"Hello"}.

Option B: Test via Your Frontend (=furigana_live.js=)

If you want to test the entire flow inside your browser:

  1. Open your assets/js/furigana_live.js file.
  2. Find the fetch call (around line 208) and temporarily change the URL from your Cloudflare production URL to your localhost:

    -const res = await fetch("https://translate-worker.josephtesfaye022.workers.dev", {
    +const res = await fetch("http://localhost:8787", {
  3. Run your Jekyll site locally.
  4. Type Japanese into the Live Converter on your local site. Your browser will send the request to your local Wrangler server instead of Cloudflare.

Just remember to change the URL back to your .workers.dev address before you commit and push to GitHub!

Tests

  • This should not produce a 500 Internal Server Error:

    依頼データ抽出終了の処理についてですが、ログメッセージを出力する際、common/message.pyに新しいメッセージを定義すべきでしょうか?

Tasks

In the following example, we need some fixes or changes:

  • The line 本チャットで依頼(レビュー者は誰!?) is currently segmented, which it shouldn’t be. For content in a pair of brackets (), if the segmentation marks 。!? appear before the closing bracket, the segmentation should be done after the closing bracket so that the bracketed content doesn’t break away onto the next segment. Likewise, a line like 本チャットで依頼(レビュー者は誰!?まだわからない。) shouldn’t be segmented either, even though there are multiple segmentation marks in between, because we want to keep the brackets in the same segment. Note that if there are no segmentation marks in the brackets there should be no segmentation either, as in 以下にPh1(現在はPh2)の開発時.
  • If a line starts with one of these symbols *#※■, it is most likely intended to stand on its own line. So we should always treat such a line as its own segment; it never joins the previous or next segment.
  • Write tests!
  • Currently the translation process blocks the furigana process, that is, the furiganized text isn’t shown until the translation is completed. We should prioritize the furigana process. Don’t let the translation process block it.
  • I want to make a history of the inputs and outputs. Each input and its output form a record in the history, and you can cycle through the records or jump to one.
  • You can download or save a record or multiple records, and paste them to a post in Org or Markdown.
  • I want to be able to hide or show the translations with a control in the toolbar. The control has three states:

    • Show: Show all translations
    • Hide: Hide all translations without affecting other elements, say, the source texts stay in place so that you know the translations are still there but just hidden.
    • Collapse: Hide all translations by collapsing the spaces between the source texts as if there’s nothing between them.
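
The bracket and standalone-line rules from the tasks above could be sketched as follows (illustrative; the real implementation should be verified with tests against the actual inputs, and both full-width and half-width parentheses are handled as an assumption):

```javascript
const MARKS = "。!?";
const STANDALONE = /^[*#※■]/;

// Split one line on 。!?, never breaking inside brackets; a mark seen
// inside brackets defers the split until just after the closing bracket.
function segmentLine(line) {
  const segments = [];
  let current = "";
  let depth = 0;             // bracket nesting depth
  let pendingMark = false;   // saw a mark inside the current brackets
  for (const ch of line) {
    current += ch;
    if (ch === "(" || ch === "(") {
      depth++;
    } else if (ch === ")" || ch === ")") {
      depth = Math.max(0, depth - 1);
      if (depth === 0 && pendingMark) {
        segments.push(current);
        current = "";
        pendingMark = false;
      }
    } else if (MARKS.includes(ch)) {
      if (depth === 0) {
        segments.push(current);
        current = "";
      } else {
        pendingMark = true;
      }
    }
  }
  if (current) segments.push(current);
  return segments;
}

// Lines beginning with *#※■ always stand alone as one segment.
function segmentText(text) {
  return text.split("\n").flatMap((line) =>
    STANDALONE.test(line) ? [line] : segmentLine(line)
  );
}
```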

Updated: