Tenant Migration Stuck on Advisory Lock
Alert / Symptom
User-visible symptom: requests for a specific tenant hang for ~10s and then fail with a tenant-migration timeout (finsta.tenant-migration.request-filter-timeout, default 10s).
The on-demand request-filter migration path can’t acquire the Flyway lock for that tenant’s schema.
Secondary signal: the scheduled migration loop log line Migrate [N] tenants to latest version […] stops advancing — pending count stays flat in startup logs across pod restarts.
This runbook applies when the lock holder is a finsta JVM that is alive but hung (long GC pause, native deadlock, blocked on a slow query). A crashed pod releases the lock automatically when its TCP connection drops.
Background
SchemaRepository runs tenant migrations with Flyway’s postgresql.transactional.lock=false (see issue #1222 and plans/2026-05-07_1222__flyway-transactional-lock-multitenant-investigation.md).
This swaps Flyway’s coordination lock from pg_advisory_xact_lock (transaction-scoped) to pg_advisory_lock (session-scoped).
Tradeoff: the session-scoped lock survives until the holding TCP connection actually closes.
The PostgreSQL idle_in_transaction_session_timeout does not apply because the lock connection is no longer in a transaction.
A hung-but-alive JVM can therefore hold the lock until its connection actually closes: either the pod is killed, or, if the peer has become unreachable, TCP keepalives reap the dead connection (typically minutes).
Impact
- Requests for the affected tenant hang and time out on the request-filter migration path.
- The scheduled migration loop on every instance blocks when it picks the same tenant.
- Other tenants are unaffected: the advisory lock key is per-schema (LOCK_MAGIC_NUM + qualified-table-name.hashCode()).
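To match a pg_locks row to a specific tenant, it can help to compute the expected lock key offline. The following is a minimal sketch, not Flyway's actual code. It assumes that LOCK_MAGIC_NUM is 0x466C79776179 (the ASCII bytes of "Flyway") and that the discriminator is Java's String.hashCode() of the qualified history-table name. Both assumptions, and the exact quoting of the table name, should be checked against the Flyway version in use. The function names are hypothetical.

```python
# Assumption: Flyway's LOCK_MAGIC_NUM is the ASCII bytes of "Flyway".
LOCK_MAGIC_NUM = 0x466C79776179


def java_string_hash(s: str) -> int:
    """Reproduce Java's String.hashCode() (signed 32-bit, over UTF-16 code units)."""
    h = 0
    data = s.encode("utf-16-be")
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        h = (31 * h + unit) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to Java's signed int range.
    return h - 0x100000000 if h >= 0x80000000 else h


def flyway_lock_key(qualified_table_name: str) -> int:
    """Expected bigint advisory lock key for one tenant's history table."""
    return LOCK_MAGIC_NUM + java_string_hash(qualified_table_name)


def pg_locks_columns(key: int) -> tuple[int, int, int]:
    """Split a bigint key the way pg_locks shows it: (classid, objid, objsubid)."""
    classid = (key >> 32) & 0xFFFFFFFF  # high-order 32 bits
    objid = key & 0xFFFFFFFF            # low-order 32 bits
    return classid, objid, 1            # objsubid is 1 for bigint keys


if __name__ == "__main__":
    # Hypothetical table name; the real quoting depends on the Flyway version.
    key = flyway_lock_key('"tenant_a"."flyway_schema_history"')
    print(key, pg_locks_columns(key))
```

Because the hash is a signed 32-bit value added to the constant, the resulting classid stays within one of 18028 (0x466C), so advisory locks far outside that range are unlikely to be Flyway's under these assumptions.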
Diagnose
Find the stuck advisory lock and the connection holding it:
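-- List advisory locks together with the session that holds or awaits each one.
select l.pid,
       l.classid,   -- high 32 bits of a bigint advisory lock key
       l.objid,     -- low 32 bits of a bigint advisory lock key
       l.objsubid,  -- 1 for bigint keys, 2 for (int, int) keys
       l.granted,
       a.application_name,
       a.client_addr,
       a.state,
       a.query_start,
       now() - a.state_change as held_for,  -- time since the session last changed state
       a.query
from pg_locks l
join pg_stat_activity a on a.pid = l.pid
where l.locktype = 'advisory'
order by held_for desc nulls last;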
A granted = true row with state = 'idle' (not idle in transaction) and application_name matching the finsta migrator (e.g. Flyway or the configured ApplicationName), held for many minutes, is the stuck lock.
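The criteria above can be applied mechanically to the query output. This is a minimal sketch that assumes the rows have been loaded as dictionaries keyed by column name, with held_for as a datetime.timedelta (as a dict-style cursor in a PostgreSQL driver would typically return). The function name and the 5-minute threshold are hypothetical stand-ins for "many minutes".

```python
from datetime import timedelta


def find_stuck_lock_holders(rows, min_held=timedelta(minutes=5)):
    """Return rows that look like a stuck session-scoped advisory lock holder.

    A candidate is granted, idle (not 'idle in transaction'), and has been
    in that state for at least min_held.
    """
    return [
        r for r in rows
        if r["granted"]
        and r["state"] == "idle"
        and r["held_for"] is not None
        and r["held_for"] >= min_held
    ]


if __name__ == "__main__":
    # Illustrative rows only; pids and durations are made up.
    sample = [
        {"pid": 101, "granted": True, "state": "idle", "held_for": timedelta(minutes=12)},
        {"pid": 102, "granted": False, "state": "active", "held_for": timedelta(seconds=8)},
    ]
    for row in find_stuck_lock_holders(sample):
        print(row["pid"], row["held_for"])
```

The filter only narrows the candidates. The application_name and client_addr of each remaining row still need to be checked by hand before acting on it.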
Identify the finsta pod owning that connection by client_addr and the matching k8s pod IP:
kubectl --context <context> -n <namespace> get pods -o wide | grep <client_addr>
Confirm the pod is hung and not making progress: check liveness probe status, recent log activity, and JVM thread state (kubectl exec … -- jstack 1 if available).
Mitigate
Preferred: restart the holding pod. The TCP connection drops, PostgreSQL releases the session lock, and the next migration attempt for that tenant proceeds.
kubectl --context <context> -n <namespace> delete pod <finsta-pod>
If restarting the pod is not possible (rare), terminate the specific backend in PostgreSQL:
-- only after confirming the pid is the stuck advisory lock holder
select pg_terminate_backend(<pid>);
pg_cancel_backend is not sufficient — it cancels the current statement but does not close the session, so the advisory lock stays held.
Prevent
- Keep k8s liveness probes on finsta tight enough that a hung pod is killed within ~2 minutes.
- Ensure JDBC tcpKeepAlive=true and PostgreSQL server-side tcp_keepalives_idle / tcp_keepalives_interval are short enough to reap dead TCP sessions in single-digit minutes.
- If stuck-lock incidents become recurring rather than one-off, revisit the transactionalLock decision in SchemaRepository.tenantFluentConfiguration: flipping it back trades stuck-lock recovery time for the autovacuum-stall regression that issue #1222 was fixing.
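To check whether a given set of keepalive settings meets the single-digit-minutes target, the approximate worst-case detection time can be worked out as idle time plus interval times probe count. The sketch below assumes the standard TCP keepalive model. The probe count corresponds to PostgreSQL's tcp_keepalives_count setting, which this runbook does not otherwise mention, and the function name is hypothetical. Keepalives only detect an unreachable peer; they do not help while the remote kernel is still answering probes.

```python
def keepalive_reap_seconds(idle_s: int, interval_s: int, count: int) -> int:
    """Approximate worst-case seconds before a dead TCP peer is detected.

    idle_s     -> tcp_keepalives_idle: quiet time before the first probe
    interval_s -> tcp_keepalives_interval: gap between unanswered probes
    count      -> tcp_keepalives_count: unanswered probes before giving up
    """
    return idle_s + interval_s * count


if __name__ == "__main__":
    # Linux kernel defaults (7200s idle, 75s interval, 9 probes): over 2 hours.
    print(keepalive_reap_seconds(7200, 75, 9))
    # Example tightened values (assumed, not taken from the finsta config): 2 minutes.
    print(keepalive_reap_seconds(60, 10, 6))
```

A PostgreSQL value of 0 for any of these settings means the operating system default is used, so leaving them unset on Linux gives the first, multi-hour figure.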