Max Chandler

hello@ma.ax | MaxChandler | Google Scholar | ma.ax

At a glance

Senior SRE (5+ years) with a PhD in Computer Science. Specialised in reliability, DNS, caching, secrets, and Kubernetes-based Go services. Experienced in managing high-scale, customer-facing systems handling thousands of requests per second, with deep expertise in on-call operations and incident management. Looking for a role focused on systems engineering, large-scale services, and programming over operational work.

Areas of Expertise

GCP & AWS
Monitoring, Logging & Tracing
Distributed Systems
On-call & Incident Management
Kubernetes
Security

Experienced with

Golang, Python & Bash
Prometheus & Thanos
Fastly CDN & Varnish
Helm, Helmfile & Tanka
Terraform & Terragrunt
Puppet & Packer
Splunk & Fluentd
WAF & Rate limiting
Kubernetes services

Experience

Apple

2021-present Senior Site Reliability Engineer: Shazam User-facing services

Areas of Focus

Golang microservices
Kong, Google Cloud Armor (WAF), ingress-nginx
Fastly CDN, Varnish, Redis
Monitoring: Prometheus, Thanos
IaC: Terraform & Terragrunt

The primary focus is on supporting our frontend APIs and GPU recognition infrastructure, preparing for big events, security, load balancing, rate limiting, logging, tracing, metrics, SLOs and alerting. Although this role focuses on our user-facing services, I also continue to do platform-related work and led our migration from CentOS to Debian.

As part of this role I regularly work with a variety of teams across Apple to help expand our reach. I help and advise other teams with policy, incident management, postmortems, system design, API reviews, and finding contacts for our projects in the wider business. My main focus in the SRE team is to help our infrastructure scale reliably, raise risks to senior management and identify areas we can automate or improve as the business grows, for example:

Led a cost and CO₂ reduction project, significantly reducing spend through new data collection and visualisation, benchmarking, and optimisation.
Designed and implemented the foundation for a new cloud provider, configuring how Shazam manages our platform infrastructure, permissions, networking, certificates, DNS, and load balancing.
Improved our centralised monitoring services to scale to more than 10k nodes, 100k pods & 120+ clusters across restrictive cloud boundaries, and improved overall fault tolerance.
Extended, updated and migrated a legacy Go service serving ~5k RPS and a 15B-row MySQL database to support new features without downtime.
Owned, migrated, and maintained our distributed secrets, dynamic credentials and certificate services (Vault and cert-manager among others).
Redesigned and implemented our distributed tracing for our greenfield and legacy recognition infrastructure.
Led the migration to multi-arch container builds, and multi-arch GKE clusters to improve performance and to speed up local development environments.
Detected API abuse and implemented rate limiting for the first time in Shazam while working with external cloud providers to mitigate impact.
Migrated all of our Puppet 6 IaC to Puppet 7 and from CentOS to Debian.

More recently, I spent 3 months as a Go software developer on rotation with our recognition infrastructure team in San Diego, developing a new API and Envoy-based routing for a greenfield project. As part of this, I worked closely with our developers and research team to define new core business logic, implement cross-region routing with Envoy for automatic failover, and upgrade our tracing libraries to be more detailed and support OpenTelemetry.

2019-2021 Site Reliability Engineer: Shazam Platform team

In this role my focus was on platform development and on-call for platform-related issues. The role was broad-ranging: maintaining legacy systems, creating and managing our fleet of 120+ GKE Kubernetes clusters across 8+ regions, and managing our self-hosted software: GitLab, Grafana, Prometheus, Vault, Splunk and more.

Areas of Focus

GKE managed Kubernetes
IaC: Terraform & Terragrunt
Monitoring: Prometheus, Thanos
Alerting: Alertmanager & PagerDuty
Secrets: HashiCorp Vault

During my time I redesigned how we manage our Kubernetes clusters, Vault and Thanos metrics infrastructure to handle our growing workloads. I also deployed and managed external-secrets, external-dns, Goldilocks for VPA recommendations, gatekeeper and cert-manager to improve and simplify infrastructure management. In this role, I was also responsible for the Google Cloud organisation as a whole, and for the creation and management of our networks, DNS, proxies and cross-cloud VPNs over to our AWS infrastructure.

I led a variety of projects, including: moving our workloads to dynamic credentials with Vault; creating our own CAs and migrating our services to use cert-manager to eliminate certificate rotation toil; refactoring the structure of our GCP org to improve ease of management and security; implementing org policies to enforce security best practices; and working with our GPU infrastructure team to implement tracing and metrics in C++ and Go for monitoring application start times and detecting hardware issues.

As part of this role I also helped to train others on technical aspects of the systems we use, and developed tools and dashboards to help developers understand and troubleshoot their workloads. I also hosted security-focused hack days to widen security knowledge across our teams. We also regularly hosted interns and individuals on career experiences, mentoring them and providing projects that gave them experience of SRE work.

Significant projects:

Redesigned and migrated our Kubernetes infrastructure management to a new Terraform and Terragrunt structure.
Reduced Kubernetes cluster creation time from days to hours through process optimisation and automation.
Created an automated process to detect and alert on future K8s upgrade issues with Prometheus metrics, Pluto and Kubent to report outdated resources.
Created our DNS policy, and reduced our public DNS record footprint by 70% (8.5k -> 2.5k) to improve security, and implemented automated creation & deletion of records in K8s with external-dns.
Developed an automated DNS inventory system that alerts us to misconfigurations, to put an end to DNS-related bug bounty reports.
Redesigned our incident management process from scratch with our engineering managers and helped developer teams adopt the new process.
Migrated Splunk infrastructure from 30+ hand-created VMs into a single horizontally scalable K8s pod.
Improved our security posture by rolling out GKE workload identity, creating a bi-weekly security review process, and setting up Anchore to scan all container builds to detect vulnerabilities before they reach production.

Cardiff University

2015-20 PhD Computer Science

New Methods in Quantification and RF Pulse Optimisation for Magnetic Resonance Spectroscopy

PhD in Computer Science, specialising in quantum control & distributed computing for MRI/NMR. Developed a parallel computing framework and applied several deep learning methods for spectral analysis. To help with the compute needs of the project, I self-funded a 12-node blade server and managed it myself as an HPC cluster in the university data centre. During this time, I also tutored undergraduates, including on a year-long group project focused on developing a product in a simulated professional team environment. I also taught Python, Java, C and web development, and served as a postgraduate student representative.

Skills Gained

Research
Teaching & tutoring
Distributed computing
Optimal control
Linux system administration
Python & Matlab

Education

2015-20 Cardiff University PhD Computer Science (supervisor: Frank Langbein)

2012-15 Cardiff University BSc Computer Science (First class honours)

Hobbies

Ultra marathon running
Mountaineering
Cycling
Coffee
Podcasts
Photography
Climbing
Homelab (ma.ax)

Awards

2015 Engineering and Physical Sciences Research Council (EPSRC) full postgraduate scholarship for PhD

2012 Cardiff University Research Opportunities Placement (CUROP) undergraduate summer research project stipend

References

Available upon request.