Oncall shift should be Tuesday to Tuesday
Summary
Normally, when developers, SWEs, SREs, or IT staff are oncall for a project, the shift runs from Monday of one week to Monday of the following week. This is suboptimal. Instead, it should run Tuesday to Tuesday. This is a zero-cost improvement to everyone’s quality of life, and it improves schedule accuracy.
Background
Stuff goes wrong with software. There are bugs, unexpected surges in traffic, and edge cases you didn’t consider. But websites need to be up 24/7, cron jobs need to run on the weekend, and backend servers need to be up to support both. So someone needs to get paged to look into it when something goes wrong. Additionally, depending on the size of your company, you have ad hoc questions coming in about the system. What this means is that if you have a live system and a team of developers, every week one of them has the primary job of making sure the system stays up and that the other developers don’t get bothered. This is called oncall/on call/on-call. Whatever they were working on is on pause while they handle the above issues.
Additionally, all production systems have ‘scut’ work that needs to be done:
- Library dependencies need to be upgraded
- Servers need to be rebooted/patched
- Once-a-week tasks that aren’t worth automating
- Config changes
- High priority requests
Depending on how the team is organized, some of this will be part of the oncall and some of it will be handled in your normal task workflow.
As an aside, this isn’t as terrible as I’ve made it sound. I’ve been oncall on 8 to 10 teams, and while I’ve had a dozen pages in the middle of the night, they were mainly when I worked at a startup. Most places take after-hours paging pretty seriously.
In a launched system, paging should be infrequent.
The last important thing to know about oncalls for software systems is that there’s normally a handoff meeting, either with the entire team or with just the current oncall and the next one. So if Alice was oncall and Bob is next, at minimum the two of them will meet for 30 minutes to discuss any issues still ongoing from last week and anything else Bob might need to know about.
Proposal
With that background, a seemingly minor point is when the oncall shift starts. Since the work week starts on a Monday, people usually schedule it Monday to Monday. It’s better to instead have the oncall run Tuesday to Tuesday and also hold the handoff meeting on Tuesday.
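As a rough illustration of the proposal, here is a minimal Python sketch that lays out a weekly rotation with Tuesday handoffs. The function names (`first_handoff_on_or_after`, `build_rotation`) and the simple round-robin order are assumptions made for this example; a real team would configure this in whatever paging tool it already uses.

```python
from datetime import date, timedelta
from itertools import cycle

TUESDAY = 1  # date.weekday() numbering: Monday=0, Tuesday=1


def first_handoff_on_or_after(day, handoff_weekday=TUESDAY):
    """Return the first date on or after `day` that falls on the handoff weekday."""
    return day + timedelta(days=(handoff_weekday - day.weekday()) % 7)


def build_rotation(engineers, start, weeks, handoff_weekday=TUESDAY):
    """Return (engineer, shift_start, shift_end) tuples for a round-robin rotation.

    Each shift lasts exactly one week and ends on the day the next one begins,
    which is also when the handoff meeting would be held.
    """
    handoff = first_handoff_on_or_after(start, handoff_weekday)
    schedule = []
    for _, engineer in zip(range(weeks), cycle(engineers)):
        schedule.append((engineer, handoff, handoff + timedelta(days=7)))
        handoff += timedelta(days=7)
    return schedule


if __name__ == "__main__":
    for engineer, begin, end in build_rotation(["Alice", "Bob"], date(2024, 1, 1), 4):
        print(f"{engineer}: {begin:%a %Y-%m-%d} to {end:%a %Y-%m-%d}")
```

Switching an existing Monday rotation over is then a one-time change to the handoff weekday; the shift length and the order of people stay the same.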
Reasons
Better for Holidays
In the U.S., three-day weekends are fairly common, and in 2024 seven[1] holidays fell on a Monday. By making the oncall switch happen on Tuesday instead of Monday, only one person has their three-day weekend interrupted instead of two. Additionally, you don’t need to move the oncall handoff meeting to Tuesday every three-day weekend.
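The holiday arithmetic above can be checked with a short Python sketch. It counts how many distinct oncall shifts overlap the Saturday, Sunday, and Monday of each 2024 Monday holiday under a Monday handoff versus a Tuesday handoff. The helper names are made up for this example, and it assumes the simplification that a shift changes hands at the start of the handoff day rather than at a mid-day meeting.

```python
from datetime import date, timedelta

# The seven 2024 U.S. holidays that fell on a Monday.
MONDAY_HOLIDAYS_2024 = [
    date(2024, 1, 1),    # New Year's Day
    date(2024, 1, 15),   # MLK Jr Day
    date(2024, 2, 19),   # Presidents' Day
    date(2024, 5, 27),   # Memorial Day
    date(2024, 9, 2),    # Labor Day
    date(2024, 10, 14),  # Columbus Day
    date(2024, 11, 11),  # Veterans Day
]

MONDAY, TUESDAY = 0, 1  # date.weekday() numbering


def shift_start(day, handoff_weekday):
    """Return the start date of the weekly shift that covers `day`."""
    days_since_handoff = (day.weekday() - handoff_weekday) % 7
    return day - timedelta(days=days_since_handoff)


def oncalls_touched(holiday_monday, handoff_weekday):
    """Count distinct shifts covering the Sat/Sun/Mon of a three-day weekend."""
    weekend = [holiday_monday - timedelta(days=offset) for offset in (2, 1, 0)]
    return len({shift_start(day, handoff_weekday) for day in weekend})


if __name__ == "__main__":
    for name, weekday in (("Monday", MONDAY), ("Tuesday", TUESDAY)):
        total = sum(oncalls_touched(h, weekday) for h in MONDAY_HOLIDAYS_2024)
        print(f"{name} handoff: {total} interrupted three-day weekends in 2024")
```

Under these assumptions, a Monday handoff splits every one of those long weekends between two people, while a Tuesday handoff keeps each one within a single shift.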
Better followup of weekend issues
When a major issue does happen over the weekend, there are usually two parts to handling it:
- getting the problem fixed
- making sure it doesn’t happen again
So when a major issue happens over the weekend, only the first part, the fix, happens during the weekend. Making sure it doesn’t happen again involves following up with other teams, creating new alarms, and updating the runbook, and all of that usually happens during the week. The oncall is going to spend at minimum their Monday doing that, so it’s better if the schedule reflects that. It also means that smaller issues are much more likely to be handled by the current oncall instead of being handed off to the next one, since they have a full day to deal with any minor issues that came up over the weekend.
Better for weekly tasks
Similarly, if the oncall has to prep for the handoff meeting, they are going to spend Monday morning doing that. So it’s better to have the schedule reflect reality. Additionally, it gives the oncall a set time to take on any scut work that needs to be done on a weekly basis.
Counter arguments
Our sprint starts on Monday
Highly scrum/agile-focused people have brought up that sprints start on Monday and that starting the oncall on Tuesday makes sprint planning harder. My counterargument is that a single day’s difference is well within the variance you should expect from your sprint planning.
I’ve had Python package dependencies that took longer than a day to fix.
More diplomatically, the prior oncall is likely doing oncall work on Monday no matter what. This just makes the schedule reflect reality better. You can also adjust the sprint to end on a Tuesday.
Rest of org does Monday to Monday
Within a large enough team or product there may be multiple oncall rotations. When those rotations change on different days, it can be confusing during cross-team launches or major issues. I have seen both of these happen, but in effect it means looping in two oncalls from the other team instead of one, which isn’t a large burden. Additionally, once I explained the arguments here to the other team, they also switched to Tuesday to Tuesday. So share this article with them to avoid the mismatch.
Alternatives
2024-01-04 updates: Reddit and HN comments presented some alternatives that are worth discussing.
Critical services
There are certain services that are high enough priority that even five minutes of downtime can lose millions of dollars and earn you a very unflattering Bloomberg article about your stock price. Think Google homepage, AWS services, Starlink networking, etc. Different companies call them things like Critical Systems, Tier 0, or Priority 0. In my experience working with teams supporting Critical Services, they all seem to have custom oncall setups, normally 8 or 12 hour shifts across multiple time zones. This article doesn’t apply to them because their oncall is less about handling a page and more about hovering over a computer for 12 hours.
24 hour oncall
Two commenters brought up that oncall should only be 24 hours, with no handoff meeting. Both appeared to be SREs and were primarily thinking about Critical Systems. For regular systems this would only work as long as paging and ad hoc questions, even during business hours, were very rare. Otherwise the developer would not be able to work on their normal project uninterrupted. It would also require that none of the ‘scut’ work takes more than 24 hours, otherwise it would need to be handed off. I personally wouldn’t like this because you would never have a multiweek stretch where you weren’t oncall.
Oncall should be Wednesday/Thursday/Friday to Wednesday/Thursday/Friday
A few people brought up that they do oncall handoff on Wednesday, Thursday, or Friday. Wednesday and Thursday are fine. I settled on Tuesday because it is an easier mental shift for people used to handing off on Monday. Friday is bad for two reasons. First, if any details were missed in the handoff, the new oncall cannot ask the old oncall until Monday. Second, it also requires moving the handoff for holidays that land on Friday.
-
[1] New Year’s Day, MLK Jr Day, Presidents’ Day, Memorial Day, Labor Day, Columbus Day, and Veterans Day