Friday, December 11, 2015

Microservices using Pivotal Cloud Foundry

Have you ever faced the following issues in your projects:
1. Roll out features quickly to production:
The customer needs a very minor change to functionality/UI and would like it rolled out at the earliest, because it has a high business impact. However, we are not able to deliver it, because even though it is a small change, we have to go through the whole nine yards of the release process, including a full regression cycle.

2. Move your application from one datacenter to another:
You need to move your application to a different data center in a different geography for various reasons (legal requirements, mergers and acquisitions, etc.). This might seem like a herculean task. More often than not, you get caught in a web of discovering (often to your own surprise) a myriad of system configurations spread across the file system, the database, and, even worse, the code itself. This only makes the task more daunting.

3. Scale out your application based on modules:
You might observe that a few modules of your application face a much higher load than the others, and you might want to dedicate more resources to those modules instead of scaling out the entire application. However, you are not able to do this at a module level, and you end up wasting money/resources by scaling the entire application.
Is there a better way out?
Think microservices!
Microservices help us break down what was hitherto a monolithic application into smaller deployable units. So no more one huge .ear file holding a large number of .war files. Typically, they are built as stateless REST services, each providing a modular piece of functionality, and each of these services can be built and deployed individually.
This loosely coupled architecture allows us to seamlessly develop and deploy individual microservices quickly and be truly agile.
So microservices address issues #1 and #3 above.
But will just breaking the application down into smaller microservices serve as a magic bullet and solve all the problems?
Well, not really. It might introduce complications and more problems, unless we manage it well.

How to manage micro services?

Use of a PaaS (Platform as a Service) helps us manage the microservices environment better. In this article, I will dwell upon Pivotal Cloud Foundry (PCF), a PaaS offering from Pivotal Inc. that can be deployed on top of IaaS offerings like VMware, AWS, etc.

Why PCF?

1. PCF supports a large number of runtimes, including Java, Python, PHP, etc. So there is no vendor lock-in, and you can pretty much develop using the language of your choice (as long as it is supported by PCF).

2. One of the principles to follow here is to isolate configuration from code. The same code/deployable artifact (a .war, for example) should work in all environments without change.
(This principle addresses issue #2 mentioned earlier in the article.) PCF provides a configuration server which serves as a centralized source of all the system configuration, thereby allowing the platform to spawn new instances of the application seamlessly.
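
As a small illustration of what this looks like from the application's side, a Spring bean can refer only to a property key and let the environment or configuration server supply the actual value, so the same artifact runs unchanged everywhere. The property name and bean below are hypothetical; this is just a sketch:

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class PaymentGatewayClient {

    // Resolved from the environment or a central configuration server at
    // startup, not from a file baked into the deployable artifact.
    @Value("${payment.gateway.url}")
    private String gatewayUrl;

    public String describe() {
        return "Payment gateway at " + gatewayUrl;
    }
}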

3. PCF provides a wide range of services through its marketplace, including NoSQL databases (like MongoDB), messaging systems (like RabbitMQ), etc., all offered as services. These services ease the development and deployment/configuration effort. Even better, a service can be bound to our application with a single command. Look ma, no hands!
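
For instance, once a marketplace service is bound, Cloud Foundry exposes its connection details to the application through the VCAP_SERVICES environment variable as a JSON document. A minimal sketch of reading it (parsing is left to the JSON library of your choice):

public class BoundServices {

    public static void main(String[] args) {
        // Cloud Foundry injects the credentials of bound services (host,
        // port, username, password, etc.) as JSON into this variable.
        String vcapServices = System.getenv("VCAP_SERVICES");

        if (vcapServices == null) {
            System.out.println("Not running on Cloud Foundry, or no services bound.");
        } else {
            // Parse with your preferred JSON library and pick out the
            // credentials block of the service you need.
            System.out.println(vcapServices);
        }
    }
}
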
Overall, PCF appears to be a suitable solution for quickly building and deploying microservice applications onto the cloud.

Sunday, March 22, 2015

Map Reduce Explained

In this post, I'm going to write about the MapReduce algorithm.
MapReduce has taken the computing world by storm in recent years. Its massive scalability has given it an edge in solving complicated problems.

MapReduce is often explained with the famous counting example (word count more often than not). I will take the same route, albeit with a different example: let us count the number of books with the same title.


Example:
Suppose there are 3 book shelves and each of them holds different books.

For brevity, let us say there are 4 different titles on each of these shelves.

First Shelf:

"Harry Potter and the Half-Blood Prince" (count: 3)
"The Da Vinci Code" (count: 1)
"Think and Grow Rich" (count: 2)
"The Bridges of Madison County" (count: 2)

Second Shelf:

"Harry Potter and the Half-Blood Prince" (count: 2)
"The Da Vinci Code" (count: 2)
"Think and Grow Rich" (count: 3)
"The Bridges of Madison County" (count: 1)

Third Shelf:

"Harry Potter and the Half-Blood Prince" (count: 1)
"The Da Vinci Code" (count: 4)
"Think and Grow Rich" (count: 2)
"The Bridges of Madison County" (count: 1)


Map function:

The map function emits key/value pairs. In our case, since we are counting books by title, we can use the book name as the key and emit a value of 1 for each book.

Following is the pseudo code for the map function:
function map(books) {
    for each book in books
        emit(book.name, 1)
    end for
}
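
For reference, here is a minimal sketch of how this mapper could look against Hadoop's Java API. The class name is mine, and it assumes each input line carries exactly one book title:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BookCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is assumed to hold exactly one book title.
        String title = value.toString().trim();
        if (!title.isEmpty()) {
            context.write(new Text(title), ONE); // emit (book name, 1)
        }
    }
}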

Let us apply the map function to the first shelf. Note that since there are 3 copies of "Harry Potter and the Half-Blood Prince", there will be 3 key/value pairs for this title, each with value 1. The mapper (in this example) does not try to aggregate the counts.

Similarly, a mapper is associated with each of the three shelves, and similar key/value pairs are generated for each shelf, as shown in the following diagram.



Shuffling:


In this step, all the map entries with the same key will be assigned to one reducer. For example, all "Harry Potter and the Half-Blood Prince" key/value entries will be assigned to one reducer.

Reduce:


A reduce function receives a key and the list of values collected for that key as input.
For "Harry Potter and the Half-Blood Prince", it will receive the book name as the key and the list [1,1,1,1,1,1] as the values: three 1s from the first shelf, two from the second and one from the third.

function reduce(name, listOfValues) {
    sum = 0
    for each value in listOfValues
        sum += value
    end for
    emit(name, sum)
}
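
And the corresponding Hadoop-style sketch of the reducer (again, the class name is mine):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BookCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up the 1s emitted for this title
        }
        context.write(key, new IntWritable(sum)); // emit (book name, total count)
    }
}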

The following diagram shows the reducers in action:



So, for "Harry Potter and the Half-Blood Prince" the reducer emits the book name as the key and the total count 6 as the value.

Overall Architecture:




Depending on the requirement, the system can have multiple mappers and reducers, typically in a master/slave architecture, with the master deciding on the shuffling and providing data to each reducer.


There are different frameworks that provide the MapReduce programming model, Hadoop being one of them.

Saturday, January 31, 2015

Spring Integration - Bulk processing Example

In this post, I'm going to share my experience in using Spring Integration for a bulk processing task. This was the first time I used Spring Integration, and it did not disappoint: it proved pretty robust in terms of error handling and definitely scalable.

Disclaimer:

The actual task has been recast as a fictitious e-commerce domain task, since I cannot reveal the original task, which is customer specific and confidential. However, the nature of the task in terms of processing CLOB and XML data remains the same, and the complexity of the original task has been retained.

Also, the views expressed in this post are strictly personal.


Objective:

The objective is to take customer information from an e-commerce application, process it, and save it into a database as CLOB data.

The task is to read XML from a file on the file system, process it, and then store it as a CLOB in a table.

Following are the high-level tasks to be performed:
  1. There will be one XML file for each customer, and the file will be prefixed with the corresponding customer ID.
  2. The XML holds the customer's personal information. The file is named customerID_personal.xml, for example 1000_personal.xml.
  3. Read the personal info file (say 1000_personal.xml) from the file system.
  4. Perform some transformation on the XML.
  5. Perform some more processing on the resulting XML and then save it into a database table as CLOB data against the customerID.

Constraints:

  1. The volume of files is high, running to more than half a million.
  2. We need to track which files processed successfully and which ones errored out, with reliable/robust error handling.
  3. This is not a real-time processing application; i.e., files are dumped into the file system by a separate system and the processing can happen offline.

Design Approach:

  1. The customerIDs are initially saved into a table (say ALL_CUSTOMERS) with a status column (STATUS_ID), with values enumerated for NOT_STARTED, IN PROGRESS, ERROR, COMPLETE.
  2. Instead of polling the files in the file system, we will poll the ALL_CUSTOMERS table to get the records which are in NOT_STARTED status and process them.
  3. When we start processing each record, we will update it to IN PROGRESS.
  4. If there is any error during processing, we will update it to ERROR.
  5. If there is no error during processing, we will update it to COMPLETE.

Use of Spring Integration to solve this problem:

Following diagram shows the process flow. 


Step 1 - Polling records from database:

The goal is to poll records which are in NOT_STARTED status and place them on a channel.
  1. Create an inbound channel adapter which polls the database at regular intervals. The inbound adapter can be a JDBC adapter, and we can associate a poller with the adapter to control how often it polls.
  2. The JDBC adapter:
    • has a query attribute, where we can specify the query to be run when the poller wakes up.
    • has a row-mapper attribute, where we can specify a RowMapper that maps each record retrieved from the database to a POJO (see the sketch after this list).
    • has an "update" attribute, where we can specify a query which will be run after polling each record. We will use this to update the status of the polled record to "IN PROGRESS".
  3. The inbound channel adapter connects to a channel, “dbInputChannel” where it places the data retrieved from database.
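
A minimal sketch of such a row mapper follows. The Customer POJO and the column names are my assumptions, since the actual table layout is not shown in this post:

import java.sql.ResultSet;
import java.sql.SQLException;

import org.springframework.jdbc.core.RowMapper;

// Hypothetical POJO representing one row of ALL_CUSTOMERS.
class Customer {
    private long customerId;
    private int statusId;

    public long getCustomerId() { return customerId; }
    public void setCustomerId(long customerId) { this.customerId = customerId; }
    public int getStatusId() { return statusId; }
    public void setStatusId(int statusId) { this.statusId = statusId; }
}

public class CustomerRowMapper implements RowMapper<Customer> {

    @Override
    public Customer mapRow(ResultSet rs, int rowNum) throws SQLException {
        Customer customer = new Customer();
        customer.setCustomerId(rs.getLong("CUSTOMER_ID")); // assumed column name
        customer.setStatusId(rs.getInt("STATUS_ID"));
        return customer;
    }
}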

Step 2 – Arrange for parallel processing of each of these records

Connect dbInputChannel to a record splitter, which splits the polled result set into individual records and puts each one on another channel, "storageDataFileReadChannel". Each record can then be processed in parallel, effectively one thread per record read from the database (in Spring Integration this is typically achieved by backing the channel with a task executor).


Step 3 – Use of Headers

Place each record that was read into a message header (which acts like a session context), so that the customerID is available to all channels in the workflow till the end.
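
One way to do this is with a small transformer that stamps the customerID onto the message as a header using MessageBuilder. The sketch below reuses the hypothetical Customer POJO from the earlier row-mapper sketch; the channel names follow the ones used in this post, but the exact wiring is my assumption:

import org.springframework.integration.annotation.Transformer;
import org.springframework.integration.support.MessageBuilder;
import org.springframework.messaging.Message;
import org.springframework.stereotype.Component;

@Component
public class CustomerIdHeaderEnricher {

    // Takes the Customer payload coming off the splitter and stamps its id
    // onto the message as a header, so every downstream channel can read it.
    @Transformer(inputChannel = "storageDataFileReadChannel",
                 outputChannel = "richStorageDataFileReadChannel")
    public Message<Customer> enrich(Message<Customer> message) {
        return MessageBuilder.fromMessage(message)
                .setHeader("customerId", message.getPayload().getCustomerId())
                .build();
    }
}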


Step 4 – Read Personal Info file from file system.

Create a service activator (which is essentially a Java service class), which reads data from a channel, invokes the service method and then places the result on an output channel.

  • Input channel: richStorageDataFileReadChannel
  • Service method: fileReadAdapter.readFile
  • Output channel: richStorageDataOutputChannel
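
A sketch of what such a service class might look like, wired to the channels listed above. The base directory and the file-reading details are my assumptions:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.stereotype.Component;

@Component("fileReadAdapter")
public class FileReadAdapter {

    // Base directory where the per-customer XML files are dumped (assumed).
    private static final String BASE_DIR = "/data/customer-files";

    @ServiceActivator(inputChannel = "richStorageDataFileReadChannel",
                      outputChannel = "richStorageDataOutputChannel")
    public String readFile(@Header("customerId") long customerId) throws Exception {
        // Builds file names of the form 1000_personal.xml from the header value.
        byte[] bytes = Files.readAllBytes(
                Paths.get(BASE_DIR, customerId + "_personal.xml"));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}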

Step 5 – Perform XSLT transformation on the personal info data

Create a transformer which acts on the personal info data, applies an XSLT transformation, and places the output on an output channel.
  • Input channel: transactionSourceChannel
  • XSLT: xsl/customTransformation.xsl
  • Output channel: transactionOutputChannel
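
Under the hood this step boils down to a plain XSLT transformation. Spring Integration ships a dedicated XSLT payload transformer, so you would not normally hand-roll it, but a JDK-only sketch of the same idea looks like this:

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class PersonalInfoTransformer {

    // Applies xsl/customTransformation.xsl (loaded from the classpath) to the
    // personal-info XML string and returns the transformed XML as a string.
    public String transform(String personalInfoXml) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(
                getClass().getClassLoader()
                        .getResourceAsStream("xsl/customTransformation.xsl")));

        StringWriter output = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(personalInfoXml)),
                new StreamResult(output));
        return output.toString();
    }
}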

Step 6 – Perform processing logic on the transformed data

Create a service activator (which is essentially a Java service class), which reads data from a channel, invokes the service method and then places the result on an output channel. This service method also saves the data back into the database and marks the status of the record in the ALL_CUSTOMERS table as COMPLETE.
  • Input channel: transactionOutputChannel
  • Service method: transactionService.processData
  • Output channel: processedTransactionChannel
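
A sketch of that final service method, using Spring's JdbcTemplate. The SQL, the table and column names, and the status value are my assumptions based on the design described above:

import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.stereotype.Component;

@Component("transactionService")
public class TransactionService {

    private static final int STATUS_COMPLETE = 4; // assumed enumeration value

    private final JdbcTemplate jdbcTemplate;

    public TransactionService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @ServiceActivator(inputChannel = "transactionOutputChannel",
                      outputChannel = "processedTransactionChannel")
    public String processData(String transformedXml,
                              @Header("customerId") long customerId) {
        // Any further business processing on the transformed XML goes here.

        // Store the XML as a CLOB against the customer (assumed table/columns).
        jdbcTemplate.update(
                "INSERT INTO CUSTOMER_PERSONAL_INFO (CUSTOMER_ID, PERSONAL_XML) VALUES (?, ?)",
                customerId, transformedXml);

        // Mark the record COMPLETE in ALL_CUSTOMERS.
        jdbcTemplate.update(
                "UPDATE ALL_CUSTOMERS SET STATUS_ID = ? WHERE CUSTOMER_ID = ?",
                STATUS_COMPLETE, customerId);

        return transformedXml;
    }
}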

Error handling:

A global error handler is used which sets the record status to "ERROR". This error handler is invoked if there is any error at any step in the entire workflow.
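
A sketch of such a handler, listening on Spring Integration's default errorChannel. The status value and the lookup of the customerID from the failed message's headers are my assumptions:

import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.messaging.MessagingException;
import org.springframework.stereotype.Component;

@Component
public class GlobalErrorHandler {

    private static final int STATUS_ERROR = 3; // assumed enumeration value

    private final JdbcTemplate jdbcTemplate;

    public GlobalErrorHandler(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Failures anywhere in the flow arrive on errorChannel wrapped in a
    // MessagingException that carries the message which failed.
    @ServiceActivator(inputChannel = "errorChannel")
    public void handle(MessagingException exception) {
        Object customerId = exception.getFailedMessage().getHeaders().get("customerId");
        jdbcTemplate.update(
                "UPDATE ALL_CUSTOMERS SET STATUS_ID = ? WHERE CUSTOMER_ID = ?",
                STATUS_ERROR, customerId);
    }
}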

Results, Performance:

Overall, the workflow worked like a charm. Some errors were encountered occasionally, mainly due to invalid XML in the file system caused by the presence of special characters.
However, the error handling mechanism took care of them: the record was marked as ERROR and, importantly, the processing didn't stop there.


The speed was good too, with a processing rate of 120 records every 5 minutes (on a not-so-high-end processor, and the original task involved two different file reads per record). Note that I had set the poller to poll the database every 5 minutes; it could process 120 records in parallel before the next poll started.

Conclusion:

I would recommend using Spring Integration because of its ease of use and robustness. It is well documented too and is a good fit for bulk processing.