Your Phone Is a Free Always-On LLM Server: Wiring Gemma 4 on a Pixel into n8n, Home Assistant, and Beyond

Most homelabs already own the cheapest 24/7 ML peer they'll ever have. It's sitting on a charging pad right now. A Pixel 9 Pro running OlliteRT with Gemma 4 E4B is a functional phone LLM server — zero cost per call, sub-3W overnight draw, OpenAI-compatible API accessible over Tailscale. This post is the complete wiring guide: keeping the inference process alive through Android Doze, exposing a stable endpoint to your NAS, and plugging it into n8n and Home Assistant for real work.

The Stack

Before anything else, here's what's actually running:

  • OlliteRT — Android app that runs GGUF models via a llama.cpp Android port. Exposes an OpenAI-compatible HTTP server on port 11434. Supports Gemma 4 E4B natively.
  • Gemma 4 E4B — Google's model with roughly 4 billion effective parameters (the "E" in E4B stands for effective parameter count, not a quantisation format). Fits comfortably in a Pixel 9 Pro's 16 GB RAM. The multimodal variant handles vision tasks; the text-only variant is marginally faster for pure-text work.
  • Automate (LlamaLab) — Android automation app. You'll use it to run a background HTTP listener that detects when Android Doze has killed OlliteRT and relaunches it. This use case isn't in the docs — you build it yourself.
  • Tailscale — Zero-config mesh VPN. No port forwarding, no dynamic DNS, no reverse-proxy cert management. Free tier covers up to 100 devices.
  • n8n — Running on the NAS via Docker Compose. Orchestrates every workflow that calls the phone endpoint.

Exposing the Phone LLM Server via Tailscale

Install OlliteRT from the Play Store, pull Gemma 4 E4B from within the app, and start the built-in server. By default it binds to localhost:11434, which is unreachable from any other device. Go to OlliteRT settings → Server → Listen Address and set it to 0.0.0.0. I only found this setting through a GitHub issue thread; the tooltip on the settings screen doesn't mention it.

Then install Tailscale on both the phone and your NAS. On Synology it runs as a Package Centre install. Once both devices are enrolled, note the phone's Tailscale IP (100.x.x.x) and verify from the NAS:

curl http://100.x.x.x:11434/api/tags

If you get a response listing available models, the network path works. If you get connection refused, the listen address change didn't stick — force-stop OlliteRT and reopen it.
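The same check can be scripted so it runs from cron or any other scheduler on the NAS. This is a minimal Python sketch, not OlliteRT's documented client: it assumes the /api/tags response uses the Ollama-style shape {"models": [{"name": ...}]}, which the post does not confirm, and the function names are illustrative.

```python
import json
import urllib.request


def model_names(tags_body: str) -> list[str]:
    """Extract model names from an /api/tags response body.

    Assumes an Ollama-style payload: {"models": [{"name": "..."}]}.
    """
    data = json.loads(tags_body)
    return [m["name"] for m in data.get("models", []) if "name" in m]


def list_models(host: str, port: int = 11434, timeout: float = 5.0) -> list[str]:
    """Fetch /api/tags from the phone and return the advertised model names."""
    url = f"http://{host}:{port}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(resp.read().decode("utf-8"))
```

Calling list_models with the phone's Tailscale IP should return a non-empty list when the path works; a refused connection surfaces as an exception from urllib.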

Surviving Android Doze: The Automate Watchdog

Here's the part the OlliteRT README glosses over. Android Doze will kill the OlliteRT process after 20–30 minutes of screen-off idle, even with battery optimisation disabled for the app. You need an external watchdog.

The solution is an Automate flow running a lightweight HTTP server on port 11435. Every n8n workflow hits this wake endpoint first. The flow checks whether OlliteRT is responding; if not, it fires an Android intent to relaunch it.

Build this flow in Automate:

Trigger: HTTP Request received on port 11435
  → HTTP Request: GET http://localhost:11434/api/tags
    → If status == 200:
        HTTP Response 200 {"status":"ok"}
    → Else:
        Shell: am start -n com.oliteai.ollitert/.MainActivity
        Wait: 5 seconds
        HTTP Response 503 {"status":"waking"}

Automate needs its background execution permission granted, and the flow needs the "Keep running" toggle enabled. On Pixel, long-press the Automate thumbnail in recents and pin it; this meaningfully reduces kill probability. Also add a boot trigger flow so Automate relaunches itself after a phone reboot. Without it, a midnight reboot (system update, etc.) silently takes everything offline until morning.

One thing that tripped me up for longer than I'd like to admit: OlliteRT takes 8–12 seconds to reload the model after a cold relaunch, longer than the flow's 5-second wait. If the n8n probe fires once and proceeds immediately, the actual LLM call hits a 503. That is why the waking branch answers 503 rather than 200: n8n only retries a node whose request fails, so a 200 here would let the workflow continue while the model is still loading. Enable n8n's built-in retry on the probe HTTP node with a 5-second interval and at least 3 attempts.
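For callers outside n8n, the retry behaviour is easy to reproduce. The sketch below is one way to do it, under the assumption that anything other than a 200 from the probe means "not ready yet"; wait_until_ready and its parameters are illustrative names, and the probe is passed in as a callable so the loop can be exercised without a phone.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], int],
    attempts: int = 3,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns HTTP 200 or the attempts run out.

    probe is any callable that returns an HTTP status code, or raises
    OSError when the phone is unreachable. Returns True once a 200 is seen.
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # a refused or timed-out connection counts as a failed probe
        if attempt < attempts:
            sleep(interval_s)  # give the model time to finish loading
    return False
```

With the defaults this waits at most two intervals between three probes, which mirrors the n8n settings described above.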

The n8n Integration: Probe-and-Wake + Concurrency Guard

Every workflow that calls the phone needs two things upstream of the actual LLM call.

Probe-and-wake node — HTTP Request node, placed first in every LLM workflow:

Method: GET
URL: http://{{ $env.PHONE_TS_IP }}:11435/
Retry On Fail: true
Max Tries: 3
Wait Between Tries (ms): 5000

Concurrency guard — Gemma 4 E4B on OlliteRT is single-stream. Simultaneous calls don't hard-fail, but they serialise internally and you occasionally get truncated output when three requests land at once. Set the workflow's concurrency to 1 in Workflow Settings → Execution settings → Concurrency. For webhook-triggered workflows (email arrival, Immich events), also enable queue execution mode.
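If you also call the phone from your own scripts, the same single-stream rule applies there. A minimal Python sketch of the idea, assuming all callers live in one process (it does not coordinate with n8n or with other processes); call_serialised is an illustrative name:

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# One lock per process: only one request may be in flight to the phone.
_phone_lock = threading.Lock()


def call_serialised(fn: Callable[[], T]) -> T:
    """Run fn() while holding the lock so phone calls never overlap."""
    with _phone_lock:
        return fn()
```

Wrap each LLM request in call_serialised and concurrent threads queue up behind the lock instead of hitting the phone at the same time.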

The actual LLM call is a standard OpenAI-compatible POST:

POST http://{{ $env.PHONE_TS_IP }}:11434/v1/chat/completions
Content-Type: application/json

{
  "model": "gemma4:e4b",
  "messages": [
    {"role": "system", "content": "{{ $json.systemPrompt }}"},
    {"role": "user", "content": "{{ $json.userContent }}"}
  ],
  "temperature": 0.2
}
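One caveat with templating values straight into a JSON string: a prompt that contains double quotes or newlines produces an invalid body. Building the payload as an object and serialising it avoids that. Here is a minimal Python sketch of the same request, assuming OlliteRT returns the standard OpenAI response shape (choices[0].message.content); the function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(system_prompt: str, user_content: str,
                       model: str = "gemma4:e4b",
                       temperature: float = 0.2) -> bytes:
    """Serialise the chat payload; json.dumps escapes quotes and newlines."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "temperature": temperature,
    }
    return json.dumps(payload).encode("utf-8")


def chat(host: str, system_prompt: str, user_content: str,
         port: int = 11434, timeout: float = 120.0) -> str:
    """POST to /v1/chat/completions and return the first choice's text."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=build_chat_request(system_prompt, user_content),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

Inside n8n the equivalent move is to assemble the body as an object in a preceding node instead of interpolating strings into raw JSON.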

What This Layer Is Actually Good For

I've been running this against a Synology DS923+ for three months. Everything below runs on the phone without touching NAS CPU.

Immich photo captioning and AI tagging. Immich's smart search is embedding-based, not generative, so it can't label photos against a custom taxonomy. A nightly n8n workflow pulls untagged assets from the Immich API, base64-encodes each JPEG, and posts it to OlliteRT's vision endpoint with a prompt requesting tags from a fixed list (people, food, travel, outdoor, document, receipt, screenshot, pet). Tags write back via the Immich API. Throughput is roughly 300–400 photos per hour overnight, so a first pass over a 40k library needs about 100–133 hours, or roughly two weeks of eight-hour nights. Slow, but the NAS CPU stays idle and it costs nothing.
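The two phone-facing pieces of that workflow, building the vision request and keeping only tags from the fixed list, might look like the sketch below. It assumes OlliteRT accepts the OpenAI-style image_url content part with a base64 data URL, which the post does not spell out, and the prompt wording and function names are illustrative. The Immich API calls are left out.

```python
import base64

ALLOWED_TAGS = ("people", "food", "travel", "outdoor",
                "document", "receipt", "screenshot", "pet")


def build_vision_payload(jpeg_bytes: bytes, model: str = "gemma4:e4b") -> dict:
    """Build an OpenAI-style chat payload carrying one JPEG as a data URL."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    data_url = "data:image/jpeg;base64," + encoded
    prompt = ("Tag this photo. Reply with a comma-separated list using only: "
              + ", ".join(ALLOWED_TAGS))
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "temperature": 0.2,
    }


def filter_tags(model_reply: str) -> list[str]:
    """Keep only tags from the fixed list, lower-cased and de-duplicated."""
    kept: list[str] = []
    for raw in model_reply.replace("\n", ",").split(","):
        tag = raw.strip().strip(".").lower()
        if tag in ALLOWED_TAGS and tag not in kept:
            kept.append(tag)
    return kept
```

Filtering on the way back matters because the model will occasionally invent a tag that is not in the list; dropping it is safer than writing it to the library.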

Email triage and inbox summary. Morning workflow pulls 24 hours of Gmail threads, summarises each to a single sentence, and classifies by urgency: action-needed, FYI, or noise. Output is a Markdown digest pushed to Telegram. Gemma 4 handles this reliably — straightforward text work well within the model's comfort zone.

RSS digest. Fifty RSS items go through one n8n execution: each item gets a one-line summary and a relevance score against a fixed interest profile. The result lands in Linkding at 06:00 as a read-later list. I stopped manually skimming feeds the week I turned this on.

Receipt and transaction categorisation. A banking webhook fires on every posted transaction. The LLM classifies the merchant against a fixed category list (groceries, fuel, utilities, dining, subscriptions, etc.) and writes the result to Notion. Real-world accuracy is around 92% — good enough to review weekly rather than categorise every entry manually.
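Because the model replies in free text, it helps to map its answer back onto the fixed list before writing anything to Notion. A minimal sketch, assuming a shortened category list and a hypothetical "uncategorised" bucket for anything ambiguous (neither is taken from my actual workflow):

```python
CATEGORIES = ("groceries", "fuel", "utilities", "dining", "subscriptions")
FALLBACK = "uncategorised"  # hypothetical bucket for the weekly review


def normalise_category(model_reply: str) -> str:
    """Map a free-text model reply onto the fixed list, else the fallback."""
    cleaned = model_reply.strip().strip("\"'.").lower()
    if cleaned in CATEGORIES:
        return cleaned
    # Tolerate replies such as "Category: Dining" by scanning for a known name.
    matches = [c for c in CATEGORIES if c in cleaned]
    return matches[0] if len(matches) == 1 else FALLBACK
```

Anything that lands in the fallback bucket is exactly what the weekly review is for.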

Voice memo summarisation. Audio shares from the phone's recorder hit an n8n webhook → Whisper on the NAS for transcription → OlliteRT for structured extraction (title, action items, key points). Whisper handles the heavy lifting; the phone only processes clean text on the back end.

Home Assistant LLM intents. HA's conversation agent accepts a custom OpenAI-compatible endpoint. Point it at the phone's Tailscale IP. Commands that don't match built-in intents route to Gemma 4, which returns structured intent JSON. Response latency is 1–3 seconds on a Pixel 9 Pro — acceptable for voice control of lights and switches.

Form-to-structured-JSON. Any n8n workflow that ingests messy natural-language input — support messages, feedback forms, Slack forwards — can pass it through the phone endpoint for structured extraction before routing. The system prompt defines the JSON schema; the model fills it. No additional service, no API bill.
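Small models sometimes wrap the JSON in a sentence or a code fence, so it is worth validating the reply before routing on it. The sketch below shows one way to do that; the key names passed in are whatever your schema defines, and extract_json is an illustrative helper, not part of n8n or OlliteRT.

```python
import json
import re


def extract_json(model_reply: str, required_keys: set[str]) -> dict:
    """Parse the JSON object in a model reply and check required keys.

    Raises ValueError if no object is found, parsing fails, or keys are missing.
    """
    # Grab everything from the first "{" to the last "}" to skip surrounding prose.
    match = re.search(r"\{.*\}", model_reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in reply")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

A ValueError is a natural point to retry the call once or send the item to a manual-review branch.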

Limitations You Should Know Before Committing

This is not a replacement for a dedicated inference box. The context window on Gemma 4 E4B is 8k tokens — fine for the tasks above, a hard wall for anything requiring large document summarisation. Throughput is single-stream. Sustained load heats the phone noticeably (though it handles it fine on a Pixel 9 Pro). Phone upgrades reset Tailscale enrolment, so a new device needs a fresh setup.

For parallel inference, sub-second vision latency, or context windows above 32k, you want a proper GPU node. But for async overnight queues — the ones where you just need a capable model chewing through a backlog while you sleep — a charging Pixel is the right tool and you already own it.

The Takeaway

The self-hosted LLM inference layer most homelabs are missing doesn't require new hardware. An Android phone on Tailscale, with OlliteRT serving Gemma 4 E4B and an Automate watchdog keeping it alive through Doze, is a working OpenAI-compatible endpoint. Wire it into n8n with a probe-and-wake pre-step and a concurrency=1 guard, and you have free inference for every background task your NAS runs: photo tagging, email triage, receipt classification, voice memo summarisation, Home Assistant intents, and anything else that needs cheap LLM calls and can wait its turn in a single-stream queue. Total hardware cost: a USB-C cable.
