Programming for database replication

Historically, data was often stored on large individual computers. Today it's common for databases to be copied or spread across multiple servers, with benefits including cost reduction, redundancy, and scalability. This typically adds at least a small amount of complexity to the applications that need to access these databases.

Web Applications With Master/Slave Replication

Special considerations apply when programming for master/slave database replication. The master database(s) receive data updates directly from the application. The slave database(s) receive data updates only from the master(s). All statements that alter the database in any way, including insert, update, delete, and replace SQL statements, must go to a master database. The changes are then sent to the slaves, usually after a small delay. For very busy databases, a delay of a second or more means data written during a page load may not yet be available on a slave within that same page load.

There are multiple techniques for handling the issues associated with web applications using master/slave replication. All of them require database connectivity and queries to pass through a common set of code, typically one application module or class. Some of these methods overlap in functionality or can be combined.
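
As a starting point, the sketch below shows what such a common data-access layer might look like in Python with the MySQLdb driver. The class name, connection arguments, and method names are illustrative assumptions, not a prescribed API; the techniques in the rest of this article are all refinements of where its read queries are sent.

    import MySQLdb  # assumed driver; any DB-API module follows the same pattern

    class Database:
        """Single choke point for every application query."""

        def __init__(self, master, slaves):
            # `master` is a dict of MySQLdb.connect() arguments;
            # `slaves` is a list of such dicts.
            self.master_args = master
            self.slave_args = slaves
            self._conns = {}  # opened lazily, keyed by role

        def _conn(self, role, args):
            # Open each connection only when it is first needed.
            if role not in self._conns:
                self._conns[role] = MySQLdb.connect(**args)
            return self._conns[role]

        def execute_write(self, sql, params=()):
            # insert/update/delete/replace must always go to a master.
            cur = self._conn("master", self.master_args).cursor()
            cur.execute(sql, params)
            return cur

        def execute_read(self, sql, params=()):
            # select statements may go to a slave; the sections below
            # refine how that slave is chosen.
            cur = self._conn("slave", self.slave_args[0]).cursor()
            cur.execute(sql, params)
            return cur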

Connections Based on Page

Often specific web pages perform database updates while others never update any data. In this scenario the pages that may update data can be set to always connect only to the master server, while the rest of the site connects only to slave servers. To optimize the number of database connections from web servers to database servers, it's usually best that the two types of page (those that perform updates and those that don't) be handled by separate web servers. This way only a subset of the web server pool opens connections to any given database.
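
A minimal sketch of this page-based routing follows, assuming hypothetical page paths and hostnames and the MySQLdb driver from the earlier sketch:

    import MySQLdb

    # Hypothetical set of pages known to perform updates.
    WRITER_PAGES = {"/account/edit", "/checkout", "/comment/post"}

    def open_connection(page_path):
        """Connect to the master only for pages that may write."""
        if page_path in WRITER_PAGES:
            host = "master.db.example.com"   # placeholder hostname
        else:
            host = "slave1.db.example.com"   # placeholder hostname
        return MySQLdb.connect(host=host, user="app", passwd="secret", db="site")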

Benefits:

  • Simplicity
  • Never more than one database connection per page execution
  • Separation of database connection pools between sets of web servers

Limitations:

  • If pages that perform updates are hit very often, the master may take a heavy load while the slaves aren't fully used.
  • Users who perform updates may not see their changes immediately. For example, if after a data update a user is redirected to a read-only page containing that data, any small delay in replication may keep the new data from appearing there right away.

Functional Offloading

Taking connections based on page to the next level, subsections of a site or specific functionality may be best served by dedicated slaves, taking load off of the other slaves and the master servers. One of the most common scenarios where this applies is reporting. Application reports rarely need to be up to the second; a delay of a few seconds or more is unimportant. Yet the queries required for reporting are often complicated and expensive to execute. Dedicating slaves to reporting or other less time-sensitive functions can therefore yield a significant overall performance improvement.

There are a few methods to achieve this separation of functionality. The web site pages which handle this special functionality can be directed to use only a subset of slave databases, and those pages can be placed on separate web servers to segment database connection pooling. Alternatively, the specialized functionality can be built as a separate application server or component, which is then called from any web server that requires it. This method increases modularity and allows different front-ends to share the same features.
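
A sketch of the first method, with a slave pool reserved for reporting (pool membership and hostnames are illustrative assumptions):

    import MySQLdb

    # Placeholder pools: general read traffic vs. dedicated reporting slaves.
    GENERAL_SLAVES = ["slave1.db.example.com", "slave2.db.example.com"]
    REPORT_SLAVES = ["report1.db.example.com"]

    def read_connection(purpose="general"):
        """Expensive, non-urgent report queries get their own slaves."""
        pool = REPORT_SLAVES if purpose == "reporting" else GENERAL_SLAVES
        host = pool[0]  # see Slave Rotation below for spreading load in a pool
        return MySQLdb.connect(host=host, user="app", passwd="secret", db="site")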

Benefits:

  • Critical functionality isn't impaired by heavy use of secondary functionality.
  • Database slaves and web servers used only for reading can easily be added to improve performance.
  • This setup can be used in non-web specific scenarios, such as rich client applications, making it very helpful in an environment with mixed applications.
  • The offloaded functionality can be built with technologies that are most efficient at the task, sometimes different than those used for the front-end.

Dynamic Server Selection

When all queries are passed through a common module, library, or class, select statements can all be sent to slaves while other statements are sent to masters. This requires two database connections, though the actual number can sometimes be lowered by opening connections only as needed. A typical issue with this method arises when a later select statement needs data that was updated earlier in the same page load: the new data may not have reached the slave during the page execution. It's therefore often desirable to send all queries to the master once any update has been made.
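
The sketch below extends the Database class from the first example with this routing, including the stick-to-master-after-a-write behavior. The verb parsing is deliberately naive and only illustrative:

    class RoutingDatabase(Database):
        WRITE_VERBS = ("insert", "update", "delete", "replace")

        def __init__(self, master, slaves):
            super().__init__(master, slaves)
            self._wrote = False  # once true, everything goes to the master

        def execute(self, sql, params=()):
            verb = sql.lstrip().split(None, 1)[0].lower()
            if verb in self.WRITE_VERBS:
                self._wrote = True
            if self._wrote or verb != "select":
                return self.execute_write(sql, params)
            return self.execute_read(sql, params)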

In high-traffic scenarios this method can cause a performance hit: too many database connections from an individual server. The number of open connections from each web server to each database is critical to performance, and this setup can double the number of connections from each web server. It's therefore often best used in combination with load balancers and careful control over which pages are served by each web server.

Benefits:

  • Each database is used only as needed
  • Application logic can be written without regard to any database replication or particular connections.

Limitations:

  • Multiple database connections per page execution
  • Excessive database connections from a web server can cause more performance issues than query load
  • Some select statements still go to the master database

Slave Rotation

Sophisticated setups can use a load balancer to automatically choose which slave server to connect to, but a dedicated load balancer is not an absolute requirement for distributing load across slaves. A pseudo-random number can be used to pick a slave. This is often not ideal, because database server load is not considered and query distribution cannot be tuned to need, but for smaller setups it can be a useful quick solution.
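
A minimal sketch of pseudo-random rotation (hostnames are placeholders):

    import random

    SLAVES = ["slave1.db.example.com", "slave2.db.example.com",
              "slave3.db.example.com"]

    def pick_slave():
        # Pseudo-random choice: simple, but blind to each server's load.
        return random.choice(SLAVES)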

Benefits:

  • Simple implementation
  • Different database server pools can be used per web server and the database usage within them easily rotated at the application level.

Limitations:

  • Database load is ignored. A proper load balancer is needed to compensate.
  • In a LAMP environment, connections between PHP and MySQL are typically persistent. One server instance connecting to different databases on separate page loads will therefore leave multiple connections open. It helps if connections are dropped as soon as they're no longer needed.

Multi-Master Replication

In an environment with multiple master relational databases, each master must be able to take updates while also syncing correctly with the others. One common concern is that unique identifiers must not conflict across master databases. Server features such as table-level auto-increment therefore cannot be used unless each table is updated on only one master database, or each server is given a distinct range of available auto-increment values.

One solution is for each database server or client to generate a UUID (universally unique identifier, also called a GUID). A UUID is a 128-bit identifier, usually represented as a string, that is statistically guaranteed never to collide with a UUID generated on a different computer. Each master database, or each client connecting to it, can therefore generate a UUID and use it as a table's primary key without much concern for replication conflicts. The downsides are that every primary key is then a string, which can hurt performance on some systems, and that generating massive numbers of UUIDs carries its own processing cost.
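
In Python this is a one-liner with the standard uuid module; the table and column names below are hypothetical:

    import uuid

    def insert_order(cursor, total):
        # uuid4() values are random; collisions across masters are not a
        # realistic concern, so any master may assign keys independently.
        new_id = str(uuid.uuid4())  # 36-character string primary key
        cursor.execute("INSERT INTO orders (id, total) VALUES (%s, %s)",
                       (new_id, total))
        return new_id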

Another solution, which works with any relational database server, is to assign each master database its own range of unique integers. For servers without built-in GUID generation features, one method is to create a GUID table which stores a server identifier and the next available integer ID. As the application requests an ID for a new record, the stored value for the connected master is incremented. For example, master1 can start at 1 and master2 at 1000000; as IDs are requested on master1, its value counts upward, and those IDs are guaranteed to be unique when replicated to the other masters. The only maintenance concern is the eventual exhaustion of a range: when master1 reaches 999999, its range must move to start at 2000000 and master2's to start at 3000000. Whenever any master reaches its limit, each server's value is advanced by the pre-determined range size. A simple algorithm for ID retrieval, update, and maintenance can be written as a stored procedure or in common database code at the application level. To prevent conflicts the application must always obtain IDs for new records from this table and use them in its insert statements.
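
A sketch of the application-level version, assuming a hypothetical table such as guid(server CHAR(16), next_id BIGINT) and a server that supports SELECT ... FOR UPDATE row locking:

    def next_id(conn, server_name):
        """Hand out the next unique ID for this master, under a row lock."""
        cur = conn.cursor()
        # FOR UPDATE holds the row until commit, so two concurrent requests
        # on the same master can't receive the same value.
        cur.execute("SELECT next_id FROM guid WHERE server = %s FOR UPDATE",
                    (server_name,))
        new_id = cur.fetchone()[0]
        cur.execute("UPDATE guid SET next_id = next_id + 1 WHERE server = %s",
                    (server_name,))
        conn.commit()
        return new_id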

Benefits:

  • Large volumes of updates can be spread across multiple databases.
  • Master databases can be segmented by functionality, improving stability. If one master dedicated to specific functionality goes down, only that functionality is blocked.

Limitations:

  • Data consistency is harder to maintain across multiple masters than across slaves. A slave can simply be cleared and re-synced from its master, while doing the same to a master would lose data.
