Tenant Migration Stuck on Advisory Lock

Alert / Symptom

User-visible symptom: requests for a specific tenant hang for ~10s and then fail with a tenant-migration timeout (finsta.tenant-migration.request-filter-timeout, default 10s). The on-demand request-filter migration path can’t acquire the Flyway lock for that tenant’s schema.

Secondary signal: the scheduled migration loop log line Migrate [N] tenants to latest version […] stops advancing — pending count stays flat in startup logs across pod restarts.

This runbook applies when the lock holder is a finsta JVM that is alive but hung (long GC pause, native deadlock, blocked on a slow query). A crashed pod releases the lock automatically when its TCP connection drops.

Background

SchemaRepository runs tenant migrations with Flyway’s postgresql.transactional.lock=false (see issue #1222 and plans/2026-05-07_1222__flyway-transactional-lock-multitenant-investigation.md). This swaps Flyway’s coordination lock from pg_advisory_xact_lock (transaction-scoped) to pg_advisory_lock (session-scoped).

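Tradeoff: the session-scoped lock survives until the holding TCP connection actually closes. The PostgreSQL idle_in_transaction_session_timeout does not apply because the lock connection is no longer in a transaction. A hung-but-alive JVM can therefore hold the lock until the pod is killed. TCP keepalives reap the connection only once the peer stops answering probes (typically minutes after the pod or node disappears); while the pod's kernel is still up it keeps acknowledging them, so keepalives alone do not free a lock held by a hung JVM.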

Impact

Requests for the affected tenant hang and time out on the request-filter migration path.
The scheduled migration loop on every instance blocks when it picks the same tenant.
Other tenants are unaffected — the advisory lock key is per-schema (LOCK_MAGIC_NUM + qualified-table-name.hashCode()).

Diagnose

Find the stuck advisory lock and the connection holding it:

-- advisory locks only; for bigint keys, classid/objid hold the high/low 32 bits of the key
select l.pid,
       l.classid,
       l.objid,
       l.objsubid,
       l.granted,
       a.application_name,
       a.client_addr,
       a.state,
       a.query_start,
       now() - a.state_change as held_for,
       a.query
from   pg_locks l
join   pg_stat_activity a on a.pid = l.pid
where  l.locktype = 'advisory'
order  by held_for desc nulls last;

A granted = true row with state = 'idle' (not idle in transaction) and application_name matching the finsta migrator (e.g. Flyway or the configured ApplicationName), held for many minutes, is the stuck lock.
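
To tie a lock row back to a tenant, the key formula from the Impact section can be reproduced offline. The sketch below is illustrative only. It assumes LOCK_MAGIC_NUM is the ASCII bytes of "Flyway" packed into a long (0x466C79776179) and that the hashed name is the double-quoted schema.table of the tenant's Flyway history table; both should be checked against the Flyway version finsta ships. The schema and table names are hypothetical. PostgreSQL shows a bigint advisory key in pg_locks as classid (high 32 bits), objid (low 32 bits), and objsubid = 1.

```java
// Sketch only: LOCK_MAGIC_NUM and the quoted table name are assumptions,
// not values read from finsta or Flyway at runtime.
final class AdvisoryLockKey {
    // Assumed value: the ASCII bytes of "Flyway" packed into a long.
    static final long LOCK_MAGIC_NUM = 0x466C79776179L;

    /** Key as the runbook describes it: magic number plus String.hashCode() of the qualified table name. */
    static long lockKey(String qualifiedTableName) {
        return LOCK_MAGIC_NUM + qualifiedTableName.hashCode();
    }

    /** High 32 bits of the key; pg_locks shows this as classid for bigint advisory locks. */
    static long classId(long key) {
        return (key >>> 32) & 0xFFFFFFFFL;
    }

    /** Low 32 bits of the key; pg_locks shows this as objid. */
    static long objId(long key) {
        return key & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // Hypothetical tenant schema and history table; pass the real name as the first argument.
        String table = args.length > 0 ? args[0] : "\"tenant_acme\".\"flyway_schema_history\"";
        long key = lockKey(table);
        System.out.printf("key=%d classid=%d objid=%d objsubid=1%n", key, classId(key), objId(key));
    }
}
```

Compare the printed objid with the objid of the stuck row. If no tenant matches, the quoting or magic-number assumption probably differs from the deployed Flyway, and the row should be identified by application_name and client_addr instead.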

Identify the finsta pod owning that connection by client_addr and the matching k8s pod IP:

kubectl --context <context> -n <namespace> get pods -o wide | grep <client_addr>

Confirm the pod is hung and not making progress: check liveness probe status, recent log activity, and JVM thread state (kubectl exec … -- jstack 1, if available).

Mitigate

Preferred: restart the holding pod. The TCP connection drops, PostgreSQL releases the session lock, and the next migration attempt for that tenant proceeds.

kubectl --context <context> -n <namespace> delete pod <finsta-pod>

If restarting the pod is not possible (rare), terminate the specific backend in PostgreSQL:

-- only after confirming the pid is the stuck advisory lock holder
select pg_terminate_backend(<pid>);

pg_cancel_backend is not sufficient — it cancels the current statement but does not close the session, so the advisory lock stays held.

Prevent

Keep k8s liveness probes on finsta tight enough that a hung pod is killed within ~2 minutes.
Ensure JDBC tcpKeepAlive=true and PostgreSQL server-side tcp_keepalives_idle / tcp_keepalives_interval are short enough to reap dead TCP sessions in single-digit minutes.
If stuck-lock incidents become recurring rather than one-off, revisit the transactionalLock decision in SchemaRepository.tenantFluentConfiguration. Flipping it back shortens stuck-lock recovery, but it reintroduces the autovacuum-stall regression that issue #1222 was fixing.
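
As a rough check on the keepalive settings, the time the server needs to notice a vanished peer is approximately tcp_keepalives_idle + tcp_keepalives_interval × tcp_keepalives_count. A minimal sketch of that arithmetic follows; the tuned values are illustrative, not finsta's actual configuration.

```java
// Sketch: estimates how long PostgreSQL keeps a session whose peer has vanished.
final class KeepaliveWindow {
    /** Approximate detection time in seconds: idle + interval * count. */
    static long detectionSeconds(int idleSeconds, int intervalSeconds, int probeCount) {
        return idleSeconds + (long) intervalSeconds * probeCount;
    }

    public static void main(String[] args) {
        // Common Linux kernel defaults, which apply when the PostgreSQL settings are left at 0.
        long osDefault = detectionSeconds(7200, 75, 9);
        // Illustrative tuned values (assumption, not a recommendation from the source).
        long tuned = detectionSeconds(60, 10, 6);
        System.out.printf("OS defaults: %d s (~%d min)%n", osDefault, osDefault / 60);
        System.out.printf("Tuned:       %d s (~%d min)%n", tuned, tuned / 60);
    }
}
```

This estimate covers only peers that stop answering probes, such as a killed pod or a lost node. A hung JVM on a live kernel still answers keepalives, so the liveness probe remains the control that bounds recovery time in that case.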

References

Issue #1222 — Tenant migration performance: reduce per-tenant Flyway overhead at scale
plans/2026-05-07_1222__flyway-transactional-lock-multitenant-investigation.md — root-cause investigation
tritt.finsta.domain.tenant.SchemaRepository — where the lock mode is configured