Distributed Systems & Reliability Engineer
Glydways
Who we are:
Glydways is reimagining what public transit can be. We believe that mobility is the gateway to opportunity—connecting people to housing, education, employment, commerce, and care. By making transportation more accessible, affordable, and sustainable, we empower communities to thrive and unlock economic and social prosperity.
Our mission is to revolutionize transit with a solution that delivers high capacity, exceptional user experiences, unmatched affordability, and minimal environmental impact.
The Glydways system is a groundbreaking network of carbon-neutral, interconnected transit pathways powered by standardized autonomous vehicles on dedicated roadways. Operating 24/7 with on-demand access, it offers personalized and efficient mobility—without the burden of heavy upfront infrastructure costs or ongoing taxpayer subsidies.
With Glydways, we’re building more than a transportation system; we’re creating a future where everyone, everywhere, has the freedom to move.
About the Role:
This role focuses on making Glydways' centralized planning system highly reliable, available, and restart-safe in real-world operations. You will own the design and implementation of high-availability patterns, state replication and recovery, and robust observability for our real-time planning and coordination services. You will work closely with autonomy, ops, and product teams to harden the existing behavior, ensure safe failover under faults, and drive down flaky and unsafe production states over time. This is a senior engineering role for someone who wants to be the technical owner of a reliable distributed system in production.
Responsibilities:
- Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
- Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
- Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
- Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
- Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
- Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
- Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
- Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
- Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
- Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.
Knowledge, Skills and Abilities:
- Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
- Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections), as well as distributed databases with internal or external message queues.
- Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
- Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
- Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
- Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence), and debugging of production states.
- Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
- Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
- Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
- Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.
Glydways provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.