Wednesday, April 8, 2015

Databases for HA and DR

While participating in a recent bid, responding to an RFP, I had to wear my ops hat, and it was a refreshing change from my dev endeavours of the recent past.

One of the things I was required to do was to come up with a solution that includes HA and DR. When considering the database options for HA and DR, I had to brush up on the basics of database replication and its options, and also discuss and validate my thinking with senior tech architects who specialize in ops. Following is a brief of the learnings that came out of the experience.

For the basics of replication, the Wikipedia article on replication is very good. But first, some background.

Background: why is database synchronization required?
Scalability in the middle tier (for example, the app servers) is easy to achieve through horizontal scaling: all you need is multiple app servers, all symmetrical and fronted by a load balancer. But in the database tier, achieving scalability is a lot tougher. Especially for data-centric applications like OLTP, the database is the single most stressed tier. Though scale-up is an option, it does not provide for good high availability. Hence we have to think of 2 or more copies of the database, to provide scalability as well as high availability.

Multiple databases can be a tricky option. We can have a cluster of databases, all symmetric masters, fronted by a load balancer, but synchronizing data across the databases is a huge overhead and has potential for failures, such as deadlocks, during synchronization.

But to consider other options, such as a single master with multiple read-only slaves, one must appreciate that not all data access is equal. If, in our application, we can visualize data access as being divided into read-only and read-write, we may find that only a small percentage of the application's data access is read-write, say 30%. In such cases, having a single master database which caters to read-write access and multiple slave databases which cater to read-only access can be resorted to.

In any case, having multiple copies of the database and synchronizing data across those copies is something you will inevitably have to deal with, whether for database load balancing, high availability (HA) or, on a more remote note, for disaster recovery (DR).

So what are the options for database synchronization?
There are 3 broad options for replication:

  • Storage based replication
  • File based replication
  • Database replication: this is usually supported natively by the specific database


Storage based replication
Active (real-time) storage replication is usually implemented by distributing updates of a block device to several physical hard disks. This way, any file system supported by the operating system can be replicated without modification, as the file system code works on a level above the block device driver layer. It is implemented either in hardware (in a disk array controller) or in software (in a device driver).

When storage replication is done across locally connected disks it is known as disk mirroring. Replication can also be extended across a computer network, so that the disks can be located in physically distant locations. For replication, latency is the key factor, because it determines either how far apart the sites can be or the type of replication that can be employed.

The basic options here are synchronous replication, which guarantees no data loss at the expense of reduced performance, and asynchronous replication, wherein the remote storage is updated asynchronously and hence zero data loss cannot be guaranteed.


File based replication
File-based replication replicates files at a logical level rather than at the storage block level. There are many different ways of performing this. Unlike storage-level replication, these solutions almost exclusively rely on software.

File-level replication solutions yield a few benefits. Firstly, because data is captured at the file level, the solution can make an informed decision on whether to replicate, based on the location and type of the file. Hence, unlike block-level storage replication, where a whole volume needs to be replicated, file replication products can exclude temporary files or parts of a filesystem that hold no business value. This can substantially reduce the amount of data sent from the source machine as well as decrease the storage burden on the destination machine.

On the negative side, as this is a software-only solution, it requires implementation and maintenance at the operating system level, and it uses some of the machine's processing power (CPU).

File-based replication can be done by a kernel driver that intercepts calls to the filesystem functions, by filesystem journal replication, or by batch replication, wherein the source and destination file systems are monitored for changes.

Database replication
Database replication can be used on many database management systems, usually with a master/slave relationship between the original and the copies. The master logs the updates, which then ripple through to the slaves. Each slave acknowledges an update, confirming that it has been received successfully, which allows the master to send (and potentially re-send, until successfully applied) subsequent updates.

Database replication also becomes difficult as it scales up, to support either a larger number of databases or increasing distances and latencies between remote slaves.

Some common aspects to consider while choosing database replication:
  • Do you wish to cater to HA only, or do you want load balancing as well?
  • What is the latency for replication? Are the servers participating in replication local or remote (another site)?
  • Are you ready to contend with multiple masters or is single master good enough?
  • Can you split your database access into read-write and read-only queries?

The actual process of database replication usually involves shipping database transaction logs from master to slave. This shipping can be done as a file-based transfer of logs, or the logs can be streamed from master to slaves. Some databases support specialized transaction logs that exist solely for the purpose of replication and are hence optimized for it.

If you wish for only HA or redundancy, then you can have slave(s) in warm standby mode. In this case, the master sends all updates to the slave, but the slave is not used even for read-only access; it is only an up-to-date standby. For HA, using shared storage for an active master and an inactive standby can also be resorted to, but the shared storage then becomes the SPOF (single point of failure).

If you intend to have a single master and multiple read-only slaves, either your application must be smart enough to split database access into read-write connections and read-only connections, or you can rely on middleware: some databases like PostgreSQL have third-party middleware libraries which bifurcate the client's database access, sending read-write requests to the master and balancing read-only requests across multiple slaves.
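If the read-write / read-only split is handled in the application tier, one common approach in the Java/Spring world is a routing datasource that picks the master or a slave per operation. Below is a minimal sketch assuming Spring's AbstractRoutingDataSource; the masterDataSource and slaveDataSource beans and the ReadOnlyContext helper are hypothetical names used only for illustration.

import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

// Hypothetical thread-local flag saying whether the current operation is read-only;
// the application (or an AOP aspect around service methods) would set this.
class ReadOnlyContext {
    private static final ThreadLocal<Boolean> READ_ONLY = ThreadLocal.withInitial(() -> false);
    static void setReadOnly(boolean readOnly) { READ_ONLY.set(readOnly); }
    static boolean isReadOnly() { return READ_ONLY.get(); }
}

// Routes read-only work to the slave and read-write work to the master.
class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {

    @Override
    protected Object determineCurrentLookupKey() {
        return ReadOnlyContext.isReadOnly() ? "slave" : "master";
    }

    // Wires the actual master/slave datasources (hypothetical beans) into the router.
    static DataSource build(DataSource masterDataSource, DataSource slaveDataSource) {
        ReadWriteRoutingDataSource routing = new ReadWriteRoutingDataSource();
        Map<Object, Object> targets = new HashMap<>();
        targets.put("master", masterDataSource);
        targets.put("slave", slaveDataSource);
        routing.setTargetDataSources(targets);
        routing.setDefaultTargetDataSource(masterDataSource);
        routing.afterPropertiesSet();
        return routing;
    }
}

The application (or middleware like the PostgreSQL libraries mentioned above) then only has to mark each unit of work as read-only or read-write; the routing layer takes care of sending it to the right copy of the database.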

If your application needs to load balance even the read-write requests, then you might think of going in for multiple masters. Many databases refer to this as database clustering or multi-master support. Here, read-write commands are sent to any one of the masters, and the masters are synchronized through mechanisms like two-phase commit. Multi-master replication has challenges such as reduced performance and an increased probability of transactional conflicts and related issues like deadlocks.

In addition, databases can support features like the ability to load balance read-only queries by allowing parallel execution of a single query across multiple read-only slaves.

Synchronous replication implies that synchronization across the slaves is guaranteed, at the cost of commit performance against the master.

For DR, remoteness introduces high latency; hence log shipping or asynchronous streaming of logs to a remote warm standby is acceptable.

Wednesday, March 18, 2015

Agile - Tips From Live Projects

I recently had a chance to learn about a live agile project as part of an agile training programme.
Below are some of the tips I picked up, while interacting with the key members on the team.

The Big Picture
The overall release is divided into several phases, as mentioned below.
Notice that SIT and UAT are run independently, towards the end of the release.
This is time consuming but does reduce the risk of bugs in the release.
This is more relevant for software which needs extremely thorough testing.

  1. Pre foundation Phase (1 week) 
  2. Foundation Phase (2 weeks) 
  3. Sprint 1 (2 weeks) 
  4. Sprint 2 
  5. Sprint 3 
  6. Sprint 4 
  7. Sprint 5 Final / Hardening Sprint
  8. SIT Test (1 month) 
  9. UAT Test (1 month) 
  10. Release 


During the foundation phase, high-level estimation is done using a technique called T-shirt sizing
(S, M, L, XL, XXL, XXXL). This helps in deciding the scope of the sprints.

Plan for sprints to progressively gain velocity.

Balancing R&D and delivery
Instead of running 2-week Spikes like Sprints, only Spike Tasks are run, to balance R&D efforts with delivering usable functionality.

1 Story = Spike Task1 + Task2 + Task3....

Each sprint includes a sensible mix of technical stories and functional stories.
Stubs are used for technical components planned for future technical stories.


Estimation 
Stories are estimated in story points using estimation poker.
Tasks are always estimated / tracked in hours spent.
There is a provision in the tool to estimate task hours and also enter actual hours.
Over several successive sprints, a good average estimate of 1 Story point = xyz hours crystallizes.


Sample Story Status: 
New
Analyzed
Approved
Dev WIP
Dev Blocked
Dev Done
Test WIP
Test Blocked
Test Done

Dev Done includes code review
Test Done includes a show-n-tell, given by the tester to convince the BA.



Very succinct definition of done

Sample DOD for dev complete:

  • Impact Analysis document done
  • Code complete
  • Code reviewed
  • NFR, performance, security tests pass
  • Regression tests pass
  • All known bugs raised by test team resolved

About Code and CI

Release-wise branch in SVN (version control)
No branches per sprint
No parallel sprints
SVN checkin format:
<Release>_<Module>_<Story>_<Task/Defect>: Description
Automated regression testing in place


Other Takeaways: 
1.
Very detailed acceptance criteria: Yes/No questions, fill-in-the-blanks, unambiguous answers.
Is the xxx panel visible, as seen in the logical diagram? Yes/No
Does the table/grid contain 13 rows? Yes/No
The quality of a story is determined by how detailed its acceptance criteria are.

 2.
Story statuses like "Test Blocked" are acted on immediately: if the tester cannot test a story, he calls up the dev right away or writes an email.
All blockers get mentioned in the standup.

 3.
The testing team is always lightly loaded at the start of a sprint and overloaded towards the end. To reduce this pain point...
Keep stories very fine-grained, e.g. each story should add up to only a few hours of dev tasks.
This way, dev always has something testable for the tester throughout the sprint, and idle time for testers (waiting for testable functionality) is reduced.


Friday, February 6, 2015

Rapid development of REST webservices with persistence

Development of modern business applications has to become faster and easier if we are to meet business delivery demands spanning ever shorter cycles. We are also now required to deliver software continuously and incrementally, in an agile manner.

Jump-starting the development process with the right tools and technologies is the need of the hour.

Rapid development using currently popular technology has always been in vogue. Right now, exposing middle-tier services as REST APIs is all the rage. So how do we develop REST services with persistence in a rapid manner, with the least effort?

I had written an earlier post on Grails 2.3 providing out-of-the-box support for REST CRUD by writing only a simple Java bean (DTO), using the Grails framework. But that would mean adopting an entirely new framework like Grails only to exploit its REST CRUD functionality. The engineer in me thought that was an unoptimized use of Grails. Why shouldn't we have this out-of-the-box REST CRUD using Spring and Hibernate, the technologies we already know best :-)

Well, more recently, advancements made in the Spring Data JPA framework make it very easy to develop CRUD REST services out of the box, writing only the entity classes with JPA annotations.
Well here is how...

Effectively, we will see how to write only a simple Java bean, annotated with JPA annotations for persistence, and, using Spring Data JPA and Spring Boot, automatically generate REST CRUD services for our bean / entity. All this using very little code :-)



Create a typical Maven Java project (not a web project),
e.g. using maven-archetype-quickstart.

We will use Spring Boot along with Spring Data JPA and Spring Data REST.

Use a pom.xml similar to below

https://github.com/ganeshghag/RestSpringDataJpa/blob/master/pom.xml



As per Spring Data JPA, create a persistent entity class, say Person, as a POJO and annotate it with JPA annotations:

package com.ghag.rnd.rest.domain;

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;

@Entity
public class Person {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private long id;

    private String firstName;
    private String lastName;
    private String email;
    private String address;
    private String mobile;
    private Integer employeeId;

    // getters and setters go here
}

Again, as per the Spring Data JPA framework, create an interface which tells the framework to create Hibernate JPA based CRUD functionality for the entity class Person defined above:

package com.ghag.rnd.rest.repository;

import java.util.List;
import org.springframework.data.repository.PagingAndSortingRepository;
import org.springframework.data.repository.query.Param;
import org.springframework.data.rest.core.annotation.RepositoryRestResource;
import com.ghag.rnd.rest.domain.Person;

@RepositoryRestResource(collectionResourceRel = "people", path = "people")
public interface PersonRepository extends PagingAndSortingRepository<Person, Long> {

    List<Person> findByLastName(@Param("name") String name);

}

Methods in the interface, like findByLastName, impart filtering capability on the entity queries, by entity attributes such as lastName.
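As a quick illustration, with Spring Data REST's default conventions such repository query methods are exposed under the resource's /search path, so the finder above should be reachable with a call like the following (assuming the default local setup):

curl -i "http://localhost:8080/people/search/findByLastName?name=Ghag"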


The annotation below
@RepositoryRestResource(collectionResourceRel = "people", path = "people")
ensures that the CRUD REST services for the Person entity are auto-magically generated for us. This is the main feature I wanted to highlight in this article.


That's it. Now, using Spring Boot and its sensible default configurations, you can just write a class similar to the one below and you should be able to start up the application with an embedded Tomcat server.

package com.ghag.rnd.rest;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Import;
import org.springframework.data.jpa.repository.config.EnableJpaRepositories;
import org.springframework.data.rest.webmvc.config.RepositoryRestMvcConfiguration;

@Configuration
@EnableJpaRepositories
@Import(RepositoryRestMvcConfiguration.class)
@EnableAutoConfiguration
@ComponentScan  // important and required
public class Application {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}


The application will expose the CRUD REST services for the persistent entity Person at
http://server:port/people

You can easily test the services using curl commands like the following:
curl -i -X POST -H "Content-Type:application/json" -d @insert.cmd http://localhost:8080/people
where the file insert.cmd contains the JSON for the Person object, like this:
{
"firstName" : "Ganesh",
"lastName" : "Ghag",
"email":"some@some.com",
"address":"flower valley, thane",
"mobile":"3662626262",
"employeeId":8373
}
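
To verify the insert, you can similarly list the collection resource with a GET (a sample invocation, assuming the default local setup):

curl -i http://localhost:8080/people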

In order to customize the application, writing custom REST controllers is as easy as writing the POJO below:

package com.ghag.rnd.rest.controllers;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import com.ghag.rnd.rest.repository.PersonRepository;

@RestController
@RequestMapping("/custom")
public class MyCustomController {

    @Autowired
    PersonRepository personRepository;

    @RequestMapping(value = "/sayHello/{input}", method = RequestMethod.GET)
    public String sayHello(@PathVariable String input) {

        System.out.println("findall=" + personRepository.findAll());
        personRepository.deleteAll();
        return "Hello!" + input;
    }
}


To further enhance the CRUD application, we can easily
  • use the above application as a microservice, using Spring Boot as is
  • export the above application as a WAR and deploy it on a traditional app server
  • use various logging options for debugging
  • use Spring-based validations
  • use Spring Security
  • switch the CRUD database from the current default H2 to one of our choice, using the Spring Boot config file application.properties (see the sketch below)
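
As an example of the last point, a minimal application.properties sketch for switching to, say, a local MySQL instance might look like the following (the URL, credentials and schema name are placeholders, and the corresponding JDBC driver dependency must be added to the pom):

spring.datasource.url=jdbc:mysql://localhost:3306/mydb
spring.datasource.username=dbuser
spring.datasource.password=dbpass
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.hibernate.ddl-auto=update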


Please refer to the Spring Boot and Spring Data JPA documentation for accomplishing the above-mentioned enhancements.
http://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/
http://docs.spring.io/spring-data/jpa/docs/current/reference/html/


By using Hibernate Tools reverse engineering on an existing database, auto-generating JPA-annotated entities is a no-brainer, and combined with the above approach to generating REST CRUD services, it means you can expose a database as REST services with less than a day's work.

Tuesday, September 23, 2014

ORM Tips

First off, ORM is tougher than writing declarative SQL. Make sure you have experienced developers who are willing to "really understand" the data model, rather than developers who just want to "code-break-fix".


When you look at any persistent entity's attributes, think of them as falling into the following 3 categories:

  • "flat" attributes, which are stored in the table that the parent entity is mapped to
  • "referred" attributes, which merely refer to other entities whose life cycle is independent of the parent entity; typically entities shown in dropdowns, lookups, master data entities, etc.
  • "cascade" attributes, which refer to entities that may have CRUD cascades with respect to the parent entity

The more cascade attributes there are, the more complex persisting the entity becomes
(this can be used for estimation).
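
As a rough illustration of the three categories, here is a sketch with hypothetical entity names (Customer and OrderLine are assumed to be mapped elsewhere):

import java.util.List;
import javax.persistence.CascadeType;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

@Entity
public class PurchaseOrder {

    @Id
    @GeneratedValue
    private Long id;

    // "flat" attribute: stored in the PurchaseOrder table itself
    private String orderNumber;

    // "referred" attribute: master data with its own life cycle, no cascade
    @ManyToOne
    private Customer customer;

    // "cascade" attribute: children whose CRUD follows the parent
    @OneToMany(cascade = CascadeType.ALL, orphanRemoval = true)
    private List<OrderLine> lines;

    // getters and setters go here
}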

Inserts are easy; updates are tricky. Allocate enough time for testing entity updates with different combinations of "cascade" attributes (null, existing, and new entities).

Complex cross-entity (read-only) queries need not go through the ORM (caution: there are risks of stale / out-of-sync data if this is not done properly).

The service code is not tested until all lazy initialization has been exercised. Developers declare the code as working too soon; only when the service layer is invoked from the controllers, where the JSON serializers try to access the entire depth of the object graph, does the code start throwing data access exceptions. Working with shallow Hibernate proxies often lulls us into a temporary false sense of security.
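
One way to surface this early is a test that does what the controller layer will eventually do: serialize the full object graph. A minimal sketch, assuming Jackson on the classpath; OrderService and its findOrder method are hypothetical names used only for illustration.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.Test;

public class OrderGraphSerializationTest {

    // Hypothetical service under test
    private final OrderService orderService = new OrderService();

    @Test
    public void serializingTheFullGraphExercisesLazyAssociations() throws Exception {
        // Fetch the entity the same way a controller would
        Object order = orderService.findOrder(1L);

        // Walking the full depth of the graph triggers lazy loading; an unfetched
        // association fails here (e.g. LazyInitializationException) instead of in production.
        new ObjectMapper().writeValueAsString(order);
    }
}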

To translate entity objects into VO / TO objects and back, use object-oriented code rather than procedural code: let each VO / TO take care of its own mapping to the corresponding entity object.
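
A small sketch of the idea, with a hypothetical PersonVO mapping itself to a hypothetical Person entity (getters and setters omitted for brevity):

public class PersonVO {

    private String firstName;
    private String lastName;

    // Each VO knows how to build itself from its entity...
    public static PersonVO fromEntity(Person entity) {
        PersonVO vo = new PersonVO();
        vo.firstName = entity.getFirstName();
        vo.lastName = entity.getLastName();
        return vo;
    }

    // ...and how to produce the corresponding entity.
    public Person toEntity() {
        Person entity = new Person();
        entity.setFirstName(firstName);
        entity.setLastName(lastName);
        return entity;
    }
}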


To test REST services you can use simple HTML with jQuery, but better still are JUnit tests which use Spring's RestTemplate as a REST client to make the REST service calls. This kind of test is very authentic (it exercises the controllers realistically), and more importantly, the test code is strongly typed, so if your VO changes, the compiler will break your test and catch your attention; this makes refactoring smooth.
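
A minimal sketch of such a test, assuming a hypothetical PersonVO (with an id plus getters and setters) and a locally running service exposing /people:

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.springframework.web.client.RestTemplate;

public class PersonRestServiceTest {

    private final RestTemplate restTemplate = new RestTemplate();

    // Assumed local endpoint for the sketch
    private final String baseUrl = "http://localhost:8080/people";

    @Test
    public void createThenFetchPerson() {
        PersonVO person = new PersonVO();
        person.setFirstName("Ganesh");
        person.setLastName("Ghag");

        // Strongly typed calls: if PersonVO changes, this test no longer compiles.
        PersonVO created = restTemplate.postForObject(baseUrl, person, PersonVO.class);
        PersonVO fetched = restTemplate.getForObject(baseUrl + "/{id}", PersonVO.class, created.getId());

        assertEquals("Ghag", fetched.getLastName());
    }
}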