Sunday, June 29, 2014

Clustering and Load Balancer

Clustering and Load Balancer:

A cluster is a group of application servers that transparently run your J2EE application as if it were a single entity.
Clustering means you run a program on several machines (nodes). One reason why you want to do this is: Load balancing.
If you have too much load/work to do for a single machine you can use a cluster of machines instead. A load balancer
then can distribute the load over the nodes in the cluster.
More and more mission-critical and large scale applications are now running on Java 2, Enterprise Edition (J2EE).
Those mission-critical applications such as banking and billing ask for more high availability (HA), while those
large scale systems such as Google and Yahoo ask for more scalability. The importance of high availability and
scalability in today's increasingly inter-connected world can be proved by a well known incident: a 22-hour service
outage of eBay in June 1999, caused an interruption of around 2.3 million auctions, and made a 9.2 percent drop in
eBay's stock value.

Clustering Setup can be done either at the request level or session level. Request level means that each request may go
to a different node - this is ideal since the traffic would be balanced across all nodes, and if a node goes down, the
user has no idea. Unfortunately this requires session replication between all nodes, not just of HttpSession, but ANY
session state.

Session level clustering means if your application is one that requires a login or other forms of session-state, and one
or more your Server nodes goes down, on their next request, the user will be asked to log in again, since they will hit a
different node which does not have any stored session data for the user.

This is still an improvement on a non-clustered environment where, if your node goes down, you have no application at all!
And we still get the benefits of load balancing, which allows us to scale our application horizontally across many machines.

Basic Terminology:

In some large-scale systems, it is hard to predict the number and behavior of end users. Scalability refers to a system’s
ability to support fast increasing numbers of users. The intuitive way to scale up the number of concurrent sessions
handled by a server is to add resources (memory, CPU or hard disk) to it. Clustering is an alternative way to resolve the
scalability issue. It allows a group of servers to share the heavy tasks, and operate as a single server logically.

High Availability:
The single-server’s solution (add memory and CPU) to scalability is not a robust one because of its single point of failure.
Those mission-critical applications such as banking and billing cannot tolerate service outage even for one single minute.
It is required that those services are accessible with reasonable/predictable response times at any time. Clustering is a
solution to achieve this kind of high availability by providing redundant servers in the cluster in case one server fails
to provide service.

Load balancing:
Load balancing is one of the key technologies behind clustering, which is a way to obtain high availability and better
performance by dispatching incoming requests to different servers. A load balancer can be anything from a simple Servlet
or Plug-in (a Linux box using ipchains to do the work, for example), to expensive hardware with an SSL accelerator embedded
in it. In addition to dispatching requests, a load balancer should perform some other important tasks such as
“session stickiness” to have a user session live entirely on one server and “health check” (or “heartbeat”) to
prevent dispatching requests to a failing server. Sometimes the load balancer will participant in the “Failover” process,
which will be mentioned later.

Fault Tolerance:
Highly available data is not necessarily strictly correct data. In a J2EE cluster, when a server instance fails, the service
is still available, because new requests can be handled by other redundant server instances in the cluster. But the requests
which are in processing in the failed server when the server is failing may not get the correct data, whereas a fault tolerant
service always guarantees strictly correct behavior despite a certain number of faults.

Failover is another key technology behind clustering to achieve fault tolerance. By choosing another node in the cluster, the
process will continue when the original node fails. Failing over to another node can be coded explicitly or performed
automatically by the underlying platform which transparently reroutes communication to another server.

Idempotent methods:
Pronounced “i-dim-po-tent”, these are methods that can be called repeatedly with the same arguments and achieve the same results.
These methods shouldn’t impact the state of the system and can be called repeatedly without worry of altering the system.
For example, “getUsername()” method is an idempotent one, while “deleteFile()” method isn’t. Idempotency is an important concept
when discussing HTTP Session failover and EJB failover.

Issues in implementing Clusters:
1) Static Variables: When an application needs to share a state among many objects the most popular solution is to store the
state in a static variable. Many Java applications do it, and so do many J2EE applications -- and that's where the problem is.
This approach works absolutely fine on a single server, but fails miserably in a cluster. Each node in the cluster would maintain
its own copy of the static variable, thereby creating as many different values for the state as the number of nodes in the cluster.
Hence if one node updates the value, other nodes will not get updated value. Solution is to use a kind of ClusterNotifyManager which
will notify all clusters if a common such variables are changed. ClusterNotifyManager will work for all other kind of updates too.
Ex: a Database parameter like Application Version, SessionTimeout, UserExpiryTime etc which is used by the whole application is
changed from User Interface, then it needs to be communicated to all the server nodes. Cluster Notification Manager can be designed
as an application running on a IP and Port and all other Server Nodes are listening to that Port as an handler. An event change
triggers the nodes and reloads their parameter values.

2) You may need to write a design pattern which will keep syncing of some "less write, more read" variables among all Server nodes
in an interval basis. As each Node keeps a cache of application variables they may not update if above mentioned "point 1" fails.