Why Does Quora Use MySQL?

Schemas allow the data to persist in a typed manner across many versions of the application as it is developed, they serve as documentation, and they prevent a lot of bugs. And SQL lets you move the computation to the data as necessary, rather than having to fetch a ton of data and post-process it in the application everywhere. I think the "NoSQL" fad will end when someone finally implements a distributed relational database with relaxed semantics.
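A quick illustration of "moving the computation to the data": instead of fetching every row and aggregating in application code, let SQL aggregate where the data lives. This sketch uses an in-memory SQLite database and made-up table/column names purely for illustration.

```python
# Compare app-side post-processing against doing the work in SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answers (user_id INTEGER, upvotes INTEGER)")
conn.executemany(
    "INSERT INTO answers VALUES (?, ?)",
    [(1, 10), (1, 5), (2, 7), (2, 1), (3, 0)],
)

# Anti-pattern: pull all rows back and aggregate in the application.
rows = conn.execute("SELECT user_id, upvotes FROM answers").fetchall()
totals = {}
for user_id, upvotes in rows:
    totals[user_id] = totals.get(user_id, 0) + upvotes

# Better: one query, computed by the database engine itself.
sql_totals = dict(
    conn.execute("SELECT user_id, SUM(upvotes) FROM answers GROUP BY user_id")
)

assert totals == sql_totals  # both yield {1: 15, 2: 8, 3: 0}
```

With a real MySQL server the second form also avoids shipping every row over the network.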

Thank you so much for this post, and for satisfying some of my curiosity. Really impressive breakdown of their stack. I thought it was PHP-driven, considering the Facebook legacy. I hope they open-source their search. Thanks, Phil, for this comment. I finally went back to it and broke it down. This is brilliant and inspiring.

Scaling up is getting easier but setting up this kind of architecture requires a lot of learning and time. They planned for scaling and performance before they built their tools and architecture. They understand MySQL very well and how they can use it as the basis for a scalable data-store.

They also seem to know when the limitations of this architecture will be reached, and I have no doubt that they have plans in place for migrating the architecture, where necessary, to different technologies.

That said, they will likely encounter some unique scaling challenges that others have not encountered before, due to the unique way that they have created Quora. FriendFeed is the closest website structure I can think of off the top of my head, but I expect Quora to take the scaling to a whole new level.

LiveNode and webnode2 show that they are not afraid of building their own solution where one does not exist. I have to smile a little at the surprised reactions to the scale-at-the-application thing. This was commonly used in the industry long before shared caches came around. Each node could have its own local database, even MySQL. Batch processes running during the service's less active hours redistribute subscriber data if needed. The web browser keeps its connection with the server alive using HTTP keep-alive.

It taught me a lot. Great article. Any thoughts?

The Search-Box
Only the questions, topic labels, user names and post titles are indexed and served up to the search-box.

Speedy Queries
Did I mention that the search-box is fast?

Nginx
Behind the load-balancer, Nginx is used as a reverse-proxy server onto the web-servers.

Pylons And Paste
Pylons, a lightweight web framework, is used as their main web-server behind Nginx.

Thrift
Thrift is used for communications between backend systems.

Tornado
The Tornado web framework is used for live updating.

Long Polling (Comet)
Quora does not display just static web pages.

Git
Git is used for version control.

Resources
Quora.

I look forward to more! Alex Toulemonde. See you soon. Phil Whelan. Good point, Alex. The WHEN will have to wait for another post. Stay tuned. Jon Forrest. Horia Dragomir. Scala is too young. Adam mentioned that speed and lack of type checking are Python's weaknesses, but they already knew that, and the language is pretty good. They think Ruby and Python are very close, but they have Python experience and no Ruby experience, so Python wins. They are using Python 2.

Another advantage of using Python is that Python's data structures map well onto JSON, and the code is highly readable. There are also many libraries, debuggers and reloaders. Communication between the Quora browser client and the server mainly uses the JSON format, so this was also an important factor.
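The dict-to-JSON mapping is nearly one-to-one, which keeps payload handling trivial on both ends. The structure below is a hypothetical example, not Quora's actual wire format.

```python
# Python dicts/lists serialize to JSON and back without any translation layer.
import json

update = {
    "type": "new_answer",            # hypothetical field names
    "question_id": 42,
    "author": "adam",
    "tags": ["scaling", "mysql"],
    "anonymous": False,
}

payload = json.dumps(update)          # dict -> JSON string for the wire
assert json.loads(payload) == update  # JSON string -> identical dict
```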

Rather than an IDE, most developers use the Emacs text editor. Obviously this is a personal choice, and as the team grows it may change. They also mentioned PyPy, a project that makes Python faster.

Thrift is used for communication between back-end systems; Facebook also uses this technology. It is mainly used when you want to keep data in memory between requests while keeping your Python code stateless. The alternative is writing a Python package that wraps a C library, which involves some reference-counting-based memory management.

That requires someone who knows the Python internals, whereas writing a Thrift interface is very simple. This approach also isolates failures, so a crash in the service does not crash the Python code. The Tornado web framework is used for real-time updates. This is their Comet server, used to handle the large number of network connections required for long polling and push updates. Quora displays more than static pages.

When other people submit new content, such as questions, answers or comments, each open page must be updated. As Adam D'Angelo said, the best technique for this situation is "long polling".

In traditional "polling", the browser repeatedly asks the server: "Are there any updates?" This mode uses the browser as the driver, which is a drawback because the browser does not know how long to wait before polling again. If the browser polls too frequently, it puts excessive load on the server.

If it polls too infrequently, new content sits on the server until the next poll, and the user does not see updates in time.

The session between the client and the server looks the same; the difference is who waits. With polling, the client waits before making the next request; with long polling, the server waits before responding.

The server holds the connection open for a long period, such as 60 seconds, while waiting for an update to arrive. When an update comes, the server can respond immediately. When the client receives the response, it immediately initiates a new request, and the server again delays its response until there is an update to return or the timeout expires with nothing to send.
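The request-parking behaviour described above can be sketched with asyncio primitives standing in for a real Comet server such as Tornado; the channel/publish names are my own, not Quora's.

```python
# A minimal long-polling sketch: a request is parked on an event and
# answered the moment an update is published, or after a timeout.
import asyncio

class Channel:
    def __init__(self):
        self.updates = []
        self.event = asyncio.Event()

    async def long_poll(self, timeout=60):
        """Hold the 'connection' open until an update arrives or time runs out."""
        try:
            await asyncio.wait_for(self.event.wait(), timeout)
            return list(self.updates)  # update arrived: respond immediately
        except asyncio.TimeoutError:
            return []                  # timed out: client simply re-polls

    def publish(self, update):
        self.updates.append(update)
        self.event.set()               # wake every parked long-poll request

async def main():
    chan = Channel()
    # Client "polls" once; the server parks the request until a publish
    # happens 0.1 seconds later, then responds straight away.
    poll = asyncio.create_task(chan.long_poll(timeout=5))
    await asyncio.sleep(0.1)
    chan.publish("new answer on question 42")
    return await poll

result = asyncio.run(main())
```

The client gets its answer 0.1 seconds in, not after the full timeout, which is the whole point of letting the server control the timing.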

The advantage of long polling is that it reduces the number of front-end/back-end round trips and lets the server control the timing, so a content update may be delivered within milliseconds. This also makes it suitable for chat applications, or anything else that really needs real-time updates.

However, the disadvantage of this model is that it produces a large number of open TCP connections on the server side. Consider that Quora is an application with millions of users.

Note that if a user opens multiple Quora pages in their browser, the connection count multiplies. The good news is that there are technologies designed specifically for long polling. They let waiting connections consume very little memory, because a connection that is merely waiting does not need a full set of resources. For example, Nginx is a small event-driven server in which each connection consumes very little memory.

Each Nginx worker process is single-threaded and handles many connections concurrently through an event loop, which means it can easily scale to thousands of concurrent connections. Is there any way to do this without having the client constantly poll the server for updates?

Adam D'Angelo, Quora, Sep 29: "There is no reliable way to do this without having the client poll the server. However, you can make the server stall its responses (50 seconds is a safe bet) and then complete them when a message is ready for the client."

"If you have a specialized server that uses epoll or kqueue, you should be able to hold on the order of k users per server, depending on how many messages are going." In an answer to the Quora question "When Adam D'Angelo says 'partition your data at the application level', what exactly does he mean?":

The most basic recommendation is to partition the data only as needed. You can actually get pretty far on a single MySQL database without ever having to worry about partitioning at the application level. You can "scale up" to a machine with lots of cores and tons of RAM, plus a replica.

If you have a layer of memcached servers in front of the databases (which are easy to scale out), then the database basically only has to worry about writes. You can also use S3 or some other distributed hash table to move the largest objects out of rows in the database. There's no need to burden yourself with making a system scale 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.
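The caching-layer idea reduces to a read-through pattern: reads hit memcached first and only fall back to the database on a miss. A sketch with plain dicts standing in for memcached and MySQL, and an invented key scheme:

```python
# Read-through cache: the database only sees the first read (and writes).
cache = {}                                   # stand-in for memcached
db = {"q:1": "Why does Quora use MySQL?"}    # stand-in for MySQL
db_reads = 0

def get(key):
    global db_reads
    if key in cache:          # cache hit: database untouched
        return cache[key]
    db_reads += 1             # cache miss: one database read...
    value = db.get(key)
    cache[key] = value        # ...then populate the cache for next time
    return value

get("q:1")   # miss -> reads the database
get("q:1")   # hit  -> served from memcached
assert db_reads == 1
```

With real memcached the same shape applies, plus invalidating or updating the cached entry on writes so readers don't see stale rows.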


