Oncall shift should be Tuesday to Tuesday
Summary
Normally when developers, SWEs, SREs, or IT staff are oncall for a project, the shift runs from Monday of one week to Monday of the following week. This is sub-optimal. Instead it should run Tuesday to Tuesday. This is a zero-cost improvement to everyone’s quality of life, and it makes the schedule more accurate.
Background
Stuff goes wrong with software. There are bugs, unexpected surges in traffic, and edge cases you didn’t consider. But websites need to be up 24/7, cron jobs need to run on the weekend, and backend servers need to be up to support both. So someone needs to get paged to look into it when something goes wrong. Additionally, depending on the size of your company, you have ad hoc questions coming in about the system. What this means is that if you have a live system and a team of developers, every week one of them has the primary job of making sure the system stays up and that the other developers don’t get bothered. This is called oncall/on call/on-call. Whatever they were working on is on pause while they handle the above issues.
Additionally all production systems have ‘scut’ work that needs to be done.
- Library dependencies need to be upgraded
- Servers need to be rebooted or patched
- Once-a-week tasks that aren’t worth automating
- Config changes
- High-priority requests
Depending on how the team is organized some of this will be part of the oncall and some of this will be handled in your normal task workflow.
As an aside, this isn’t as terrible as I’ve made it sound. I’ve been oncall on 8 to 10 teams, and while I’ve had a dozen pages in the middle of the night, they mainly happened when I worked at a startup. Most places take after-hours paging pretty seriously.
In a launched system paging should be infrequent
The last important thing to know about oncalls for software systems is that there’s normally a handoff meeting, either with the entire team or just the current oncall and the next one. So if Alice was oncall and Bob is next, at minimum the two of them will meet for 30 minutes to discuss any issues still ongoing from last week and anything else Bob might need to know.
Proposal
With that background, a seemingly minor point is when the oncall shift starts. Since the work week starts on a Monday, people usually schedule it Monday to Monday. It’s better to have the oncall run Tuesday to Tuesday instead, and to have the handoff meeting on Tuesday as well.
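If your schedule is generated by a script rather than by hand, the change is small. Here is a minimal sketch, assuming a simple round-robin team and one-week shifts; the function name `tuesday_rotation` and the example dates are mine, not from any particular paging tool.

```python
from datetime import date, timedelta
from itertools import cycle, islice

TUESDAY = 1  # date.weekday() numbering: Monday is 0

def tuesday_rotation(team, first_day, weeks):
    """Yield (person, start, end) for weekly shifts that hand off on Tuesday."""
    # Move forward to the first Tuesday on or after first_day.
    start = first_day + timedelta(days=(TUESDAY - first_day.weekday()) % 7)
    for person in islice(cycle(team), weeks):
        end = start + timedelta(days=7)  # the following Tuesday, i.e. the handoff
        yield person, start, end
        start = end

# Hypothetical two-person team; 2024-05-20 is a Monday, so shifts begin 2024-05-21.
schedule = list(tuesday_rotation(["Alice", "Bob"], date(2024, 5, 20), 3))
for person, start, end in schedule:
    print(f"{person}: {start} to {end}")
```

Each shift’s end date is the next shift’s start date, which is also the day of the handoff meeting.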
Reasons
Better for Holidays
In the U.S. three-day weekends are fairly common, and in 2024 seven [1] fell on a Monday. By making the oncall switch happen on Tuesday instead of Monday, only one person has their three-day weekend interrupted instead of two. Additionally, you don’t need to move the oncall handoff meeting to Tuesday every time there is a three-day weekend.
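To make the “one person instead of two” claim concrete, here is a rough sketch that counts how many different shifts cover a Saturday-to-Monday long weekend. It assumes the handoff takes effect at the start of the handoff day (in practice the outgoing oncall is still on until the meeting), and the helper names `shift_start` and `people_interrupted` are my own.

```python
from datetime import date, timedelta

MONDAY, TUESDAY = 0, 1  # date.weekday() numbering

# The seven 2024 Monday holidays from the footnote, with dates entered by hand.
MONDAY_HOLIDAYS_2024 = [
    date(2024, 1, 1),    # New Year's Day
    date(2024, 1, 15),   # MLK Jr Day
    date(2024, 2, 19),   # Presidents' Day
    date(2024, 5, 27),   # Memorial Day
    date(2024, 9, 2),    # Labor Day
    date(2024, 10, 14),  # Columbus Day
    date(2024, 11, 11),  # Veterans Day
]

def shift_start(day, handoff_weekday):
    """Return the date on which the shift covering `day` began."""
    return day - timedelta(days=(day.weekday() - handoff_weekday) % 7)

def people_interrupted(holiday_monday, handoff_weekday):
    """Count the distinct shifts that cover Saturday, Sunday, and the holiday Monday."""
    weekend = [holiday_monday - timedelta(days=offset) for offset in (2, 1, 0)]
    return len({shift_start(day, handoff_weekday) for day in weekend})

for holiday in MONDAY_HOLIDAYS_2024:
    print(holiday,
          people_interrupted(holiday, MONDAY),   # 2: outgoing and incoming oncall
          people_interrupted(holiday, TUESDAY))  # 1: the same person all weekend
```

With a Monday handoff, every one of these weekends is split across two people. With a Tuesday handoff, a single shift covers all three days.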
Better followup of weekend issues
When a major issue does happen over the weekend, there are usually two parts to handling it:
- getting the problem fixed
- making sure it doesn’t happen again
So when a major issue happens over the weekend, only the first part happens during the weekend. The second part involves following up with other teams, creating new alarms, and updating the runbook, and all of that usually happens during the week. The oncall is going to spend at minimum their Monday doing that, so it’s better if the schedule reflects it. It also means that smaller issues are much more likely to be handled by the current oncall instead of being handed off to the next one, since they have a full day to deal with anything minor that came up over the weekend.
Better for weekly tasks
Similarly, if the oncall has to prep for the handoff meeting, they are going to be spending Monday morning doing that. So it’s better to have the schedule reflect reality. Additionally, it gives the oncall a set time to take on any scut work that needs to be done on a weekly basis.
Counter arguments
Our sprint starts on Monday
Highly scrum/agile-focused people have brought up that sprints start on Monday and that starting the oncall on Tuesday makes sprint planning harder. My counterargument is that a single day’s difference is well within the variance you should expect of your sprint planning.
I’ve had Python package dependencies that took longer than a day to fix.
More diplomatically, the prior oncall is likely doing oncall work on Monday no matter what. This just makes the schedule reflect reality better. You can also adjust the sprint to end on a Tuesday.
Rest of org does Monday to Monday
Within a large enough team or product there may be multiple oncall rotations. When those rotations change on different days, it can be confusing during cross-team launches or major issues. I have seen both of these happen, but in effect it means looping in two oncalls from the other team instead of one, which isn’t a large burden. Additionally, once I explained the arguments here to the other team, they also switched to Tuesday to Tuesday. So share this article with the other teams to avoid the mismatch.
-
[1] New Year’s Day, MLK Jr Day, Presidents’ Day, Memorial Day, Labor Day, Columbus Day, and Veterans Day