Max Chandler

At a glance

Senior SRE (5+ years) with a PhD in Computer Science, specialising in reliability, DNS, caching, secrets, and Kubernetes-based Go services. Experienced in managing high-scale, customer-facing systems handling thousands of requests per second, with deep expertise in on-call operations and incident management. Looking for a role that prioritises systems engineering, large-scale services, and programming over operational work.

Areas of Expertise

Experienced with

Experience

Apple

2021-present Senior Site Reliability Engineer: Shazam User-facing services

Areas of Focus

My primary focus is supporting our frontend APIs and GPU recognition infrastructure, covering preparation for big events, security, load balancing, rate limiting, logging, tracing, metrics, SLOs and alerting. Although this role centres on our user-facing services, I also continue to do platform-related work and led our migration from CentOS to Debian.

As part of this role, I regularly work with a variety of teams across Apple to help expand our reach. I help and advise other teams on policy, incident management, postmortems, system design, and API reviews, and help find contacts for our projects in the wider business. My main focus in the SRE team is to help our infrastructure scale reliably, raise risks to senior management, and identify areas we can automate or improve as the business grows, for example:

More recently, I spent three months on rotation as a Go software developer with our recognition infrastructure team in San Diego, developing a new API and Envoy-based routing for a greenfield project. As part of this, I worked closely with our developers and research team to define new core business logic, implement cross-region routing with Envoy for automatic failover, and upgrade our tracing libraries to be more detailed and to support OpenTelemetry.

2019-2021 Site Reliability Engineer: Shazam Platform team

In this role, my focus was on platform development and on-call for platform-related issues. The role was broad-ranging: maintaining legacy systems, creating and managing our fleet of 120+ GKE Kubernetes clusters across 8+ regions, and running our self-hosted software, including GitLab, Grafana, Prometheus, Vault, Splunk and more.

Areas of Focus

During my time in this role, I redesigned how we manage our Kubernetes clusters, Vault, and Thanos metrics infrastructure to handle our growing workloads. I deployed and managed external-secrets, external-dns, Goldilocks for VPA recommendations, gatekeeper and cert-manager to improve and simplify infrastructure management. I was also responsible for the Google Cloud organisation as a whole, including the creation and management of our networks, DNS, proxies and cross-cloud VPNs to our AWS infrastructure.

I led a variety of projects, including: moving our workloads to dynamic credentials with Vault; creating our own CAs and migrating our services to cert-manager to eliminate certificate rotation toil; refactoring the structure of our GCP org to improve ease of management and security; implementing org policies to enforce security best practices; and working with our GPU infrastructure team to implement tracing and metrics in C++ and Go for monitoring application start times and detecting hardware issues.

As part of this role, I also helped train others on technical aspects of the systems we use, and developed tools and dashboards to help developers understand and troubleshoot their workloads. I hosted security-focused hack days to widen security knowledge across our teams. We also regularly hosted interns and individuals on career experiences, mentoring them and providing projects that gave them experience of SRE work.

Significant projects:

Cardiff University

2015-20 PhD Computer Science

New Methods in Quantification and RF Pulse Optimisation for Magnetic Resonance Spectroscopy

PhD in Computer Science, specialising in quantum control and distributed computing for MRI/NMR. Developed a parallel computing framework and applied several deep learning methods for spectral analysis. To meet the compute needs of the project, I self-funded a 12-node blade server and managed it myself as an HPC cluster in the university data centre. During this time, I also tutored undergraduates, including on a year-long group project focused on developing a product in a simulated professional team environment. I taught Python, Java, C and web development, and served as a postgraduate student representative.

Skills Gained

Education

2015-20 Cardiff University PhD Computer Science (supervisor: Frank Langbein)

2012-15 Cardiff University BSc Computer Science (First class honours)

Hobbies

Awards

2015 Engineering and Physical Sciences Research Council (EPSRC) full postgraduate scholarship for PhD

2012 Cardiff University Research Opportunities Placement (CUROP) undergraduate summer research project stipend

References

Available upon request.