June 2008
33 posts
Monday morning downtime
We’re seeing some issues around our load balancer and a few hung machines. Service will be slow until these come back up to full speed.
Update (11:35a): We’re still working through this problem.
Saturday night downtime
The site has been impacted by a database problem (unrelated to the @ replies work we performed today). We’re working right now on recovery.
Update (10:14p): And we’re back. Total downtime was less than 15 minutes.
@ Replies back on the web
The team worked today to bring the @ replies tab back to the web site. We’re watching the impact very closely and plan to turn the API methods back on this Sunday.
@ Replies status
We’re still working on bringing the Replies tab and feed back. Unfortunately it will remain off for tonight as we continue to work on the stability of this feature.
Friday Morning
We’re currently experiencing massive slowness and downtime across the web site. Our data center has been notified and is looking into it.
Update (9:53a): Our load balancer is overloaded. We’re working to bring web machines back up to service requests now. You’ll be seeing whales for a bit.
Update (10:20a): Web services are coming back online and we’re starting to serve...
@ Replies update
We’re still working on the restoration of viewing replies (both via the API and on the web). This functionality will remain off tonight as we continue to work through the underlying problem.
In the meantime, we recommend checking out Summize. You can run queries like “to:username” or “@username” to see replies directed at you.
Thursday Morning
This morning we’re seeing slow page loads and many a whale. We’re deploying several fixes soon which we believe will ease the load from the more onerous queries and will allow us to bring back the replies tab.
Update: While we work on this problem, replies will be disabled via the API as well as on the web.
Replies tab disabled
The replies tab remains disabled today as we rework some of the queries that were causing problems yesterday. This has also been reflected in the sidebar of the status blog for the web features.
One way you can see replies directed to you is to search on Summize. You can search for “to:username” to see all updates directed at you.
Tuesday morning
The site has been slower this morning than in the past few days. Some users are also seeing error pages.
Update (6:50p): We’re still working through this issue.
Update (7:30p): We’ve identified some bad queries relating to favorites and replies. We’re disabling those pages until we put a fix in place.
Friday report
As you can see from our public Pingdom report, we had 3 minutes of downtime about an hour ago. Despite this, our situation is largely improved today. We’ve had our lowest page load times since last Sunday and our uptime for the week has been 99.9%.
Our latency is still not as low as it should be (> 1sec is not good) and folks were running into error pages a lot yesterday. There’s...
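For a rough sense of what an uptime percentage means in minutes of downtime, here is a minimal sketch. The function name is hypothetical, and it assumes the 99.9% figure covers a full seven-day window, which the post does not spell out.

```python
def downtime_minutes(uptime_percent: float, window_hours: float) -> float:
    """Minutes of downtime implied by an uptime percentage over a window.

    Illustrative helper only; not part of any monitoring tool's API.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    window_minutes = window_hours * 60
    return (100 - uptime_percent) / 100 * window_minutes


# Assumption: "the week" means 7 full days (168 hours).
week_hours = 7 * 24
print(round(downtime_minutes(99.9, week_hours), 1))
```

Under that assumption, 99.9% over a week works out to roughly ten minutes of downtime in total.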
Over capacity errors
As reflected in the sidebar of this blog, we’re seeing a lot of over capacity errors and long load times. We’re working on this problem.
Update: We’re still investigating this issue. We are temporarily reducing the API rate limit to 20 requests per hour in order to help address the latency issues we’re seeing.
Status blog updates
We moved status information about key systems to the sidebar of the blog. You can hover over the individual components to get the latest information from each item’s alt text. (We’re going to be making additional refinements to the display of this information.)
Yesterday we deployed a fix for the Facebook application problem. Should be working better now.
Site slowness and error pages
We’re seeing some increased site slowness this evening. This can also manifest itself as error pages. We’re working to resolve this.
End of Week update
Status of systems is listed below. While we still have a number of systems that need attention, our public uptime report shows that we’ve had 99.6% uptime since last Saturday (including the Steve Jobs keynote on Monday). We’ve got a lot more work to do but we feel we’re making progress.
web features: all systems ok
SMS: all systems ok (we are investigating some problems with...
Odd whales
We’re seeing a number of whales pop up around the site, especially on profile pages. We’re aware of the issue and working on it now.
Update: site back up and mostly whale free.
Thursday status update
Still working on some improvements to deal with spiky load in the morning hours. Here’s a summary of our systems:
web features: all systems ok
SMS: all systems ok (we are investigating some problems with error messages being sent)
user restoration/deletion: all systems ok
person search: all systems ok
pagination: most users will be able to paginate back through older updates on the...
Wednesday update on service status
We’ve seen increased load the past couple mornings, but the site has been largely stable. We do have several services that we are still working to restore, so I wanted to update you on the status of each:
web features: those disabled during the Steve Jobs keynote were all restored on Monday, either during the keynote or shortly after it ended
user restoration/deletion: this service was restored during...
Experiencing a network problem
There is an issue at our data-center that is affecting the service. We’re working to resolve it as quickly as possible.
Update: this has been resolved.
Bringing a few features back
We’re still capping the API at 10 requests per hour during the Stevenote, but we’ve just brought the replies tab back. Other updates here as we restore functionality. We expect the site to run a bit slow for a while, but we think we can bring back some of these features without significantly impacting performance.
Update (10:22a): sidebar pictures have been restored
10:28a: public...
Some elements of the sidebar temporarily disabled
We’ll be turning these back on later this afternoon.
Update: The replies, everyone and archive tabs have also been temporarily disabled and will be restored later this afternoon.
The API request limit has been temporarily dropped to 10 per hour. Please configure your clients not to pull more frequently than once every 6 minutes.
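For client authors, the relationship between an hourly request limit and a safe polling interval can be sketched as below. This is only an illustration: the function name is hypothetical and it assumes requests are spaced evenly across the hour.

```python
def min_poll_interval_seconds(requests_per_hour: int) -> float:
    """Shortest even spacing between requests that stays within an hourly limit.

    Illustrative helper only; not part of the Twitter API or any client library.
    """
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return 3600 / requests_per_hour


# A limit of 10 requests per hour means one request every 360 seconds (6 minutes).
interval = min_poll_interval_seconds(10)
print(f"poll no more often than every {interval / 60:.0f} minutes")
```

A client that reads the limit from its own configuration could call this once at startup and sleep for at least that long between polls.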
Sunday night update
User account restoration and deletion is currently disabled. We expect these services to be brought back tomorrow afternoon.
In addition, IM is still offline as we work to bring it back in a more stable form. The API rate limit remains at 30 requests per hour.
Some users are still reporting some timeline inconsistency issues and we’re working to track those down.
DB fail over tests success
We tested an evolution of our process for recovering from a crashed database today.
There was no visible impact on users or in any of our monitoring mechanisms, and we were very happy to beat our expected times for this operation.
Testing our DB fail-over practice
For the next hour or so we will be testing our DB failover practice to ensure it works as planned.
The site may be unreachable for a few minutes during this testing, although we don’t expect any appreciable effect on end users.
Track via SMS restored; timeline oddities
Track via SMS has been restored. As a reminder, you can check to see what terms you are tracking by texting track to Twitter. To track a new term, send, for example, track iphone. You can read more about track in Help.
Some users are seeing inconsistencies in either their timelines or the timelines of their friends. From time to time, some older updates may appear to be missing. These updates have not...
Looking a bit better after a rough patch
Since a deploy this afternoon, site latency has been looking a lot better. We’re still vulnerable to particular database problems and there are additional enhancements we want to make. We’ll be working on these issues tonight and tomorrow.
IM continues to be down but we’ve also made progress on that front; we’ve tested a lot more of the technology we need in order to...
Friday morning DB problem
We’re experiencing a database problem similar to the one that affected the site last night. Working on the recovery now.
Lost a database
We just lost a database about 5 minutes ago and this has severely impacted the site.
We’re working on recovering from this now.
With friends tab and feeds
Recently we made a change to remove the With Friends tab from user profiles. We did this after finding out that this tab was both relatively rarely accessed and computationally expensive for us to serve.
At the same time we removed access to the feeds for this page. It’s still possible for users to receive their own With Friends timelines, but authentication is required. What...
Things looking up -- mostly
In the last 36hrs, we’ve had 99.4% uptime (according to Pingdom), which is not where we want to be, but is a heck of a lot better than the couple days before. At the same time, average page response time has been substantially reduced. And the number of updates and other key metrics are significantly higher. (We had more visits to the site than ever in our history yesterday.)
So some of the...
Monday update
With the exception of some spiky load this morning, today has been largely stable. We believe the changes we made over the weekend have had a positive effect on the overall stability of the service. Next up: the restoration of IM - this is still our top priority to fix.
Databases back online
Services may be slow or inconsistent for the next short while as things come back to life. You may intermittently see “Something is technically wrong” pages.
Database maintenance
We’re going to make a change to the databases which will take the service down for a bit. We’ll keep you updated here. Thanks for your patience!
Sunday! Sunday! Sunday!
We’re gathering today to make some minor changes on our slave databases (shortening a few tables) to help speed queries. You may notice some extra slowness while we bring the modifications into production. We’ll get speedy again after the databases warm to the new queries. This is an effort to reduce overall web and API load time and clear the area so we can bring IM services online...