Friday, December 5, 2008

Favorite Projects Series, Installment 2

I chose the previous project in this series as a favorite for the impact it had on the user. I consider this second project a favorite for the interesting technical challenges it presented. While it certainly had an impact for a lot of users, that impact was far less visible to me.

I led a team of about three developers on this project at A.G. Edwards. The project was named BLServer, after a service of the same name that we used from an external provider. The provider's service was implemented as a COBOL program that handled requests over a limited number of network connections. It had a number of limitations that we needed to overcome, which we did by wrapping it with a message-driven Web Service running in a WebLogic cluster.

The function of the BLServer service was to receive messages for account creation and modification, including changes in holdings of various securities. The provider's BLServer had a nightly maintenance window (I think it was about four hours), and used a proprietary message format. It was secured by using a dedicated leased line and a single common password.

Our wrapping BLServer service was required 1) to expose a standards-based (SOAP) interface, 2) to provide 24x7 uptime, 3) to preserve message ordering, 4) to secure access to the service without exposing the single common password, and 5) to queue requests during maintenance windows, delivering them when the provider's BLServer became available. There were also scalability and performance requirements which, in combination with the message ordering requirement, drove us to an interesting solution. I'm not sure about the exact scalability numbers, since that was four years ago. If I remember correctly, we initially had to be able to handle about 300,000 requests during a normal 8-hour business day, with the ability to handle peak loads of around 1.5 million per day.

The first benefit that our service provided was to expose a standards-based (SOAP) interface and interact with the provider's BLServer, which took requests and delivered responses using a proprietary protocol and message format. Our service was then used by application Web Services to provide customer value.

In order to meet the scalability and availability requirements, we proposed standing up a small cluster of WebLogic engines to host a BLServer WebService. This WebService would receive requests and (using JMS) queue them for processing. Responses would then be queued for later retrieval by the calling services. By queuing requests in this way, we could use the transactionality of the JMS provider to guarantee that each message was processed once and only once. Furthermore, we could queue up a backlog of messages and feed them through the finite number of connections made available by the provider BLServer.
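To make that concrete, here's a rough sketch of what the queuing side can look like with JMS. The JNDI names and the ORDER_KEY/KEY_GROUP properties are illustrative stand-ins, not the names we actually used; the key-group property comes into play later when we get to message ordering.

```java
import javax.jms.*;
import javax.naming.InitialContext;

// Sketch of the request side: the Web Service endpoint queues each incoming
// request in a transacted JMS session, so a request is either fully enqueued
// or not at all. The JNDI names and message properties are hypothetical.
public class RequestEnqueuer {

    public void enqueue(String requestXml, String orderingKey) throws Exception {
        InitialContext ctx = new InitialContext();
        QueueConnectionFactory cf =
                (QueueConnectionFactory) ctx.lookup("jms/BLServerConnectionFactory");
        Queue requestQueue = (Queue) ctx.lookup("jms/BLServerRequestQueue");

        QueueConnection conn = cf.createQueueConnection();
        try {
            // true = transacted session; the send is committed atomically below.
            QueueSession session = conn.createQueueSession(true, Session.AUTO_ACKNOWLEDGE);
            QueueSender sender = session.createSender(requestQueue);

            TextMessage msg = session.createTextMessage(requestXml);
            // Numeric group derived from the ordering key (here, its first two digits),
            // used later by consumers to select only the key ranges they have leased.
            msg.setIntProperty("KEY_GROUP", Integer.parseInt(orderingKey.substring(0, 2)));
            msg.setStringProperty("ORDER_KEY", orderingKey);
            sender.send(msg);

            session.commit();
        } finally {
            conn.close();
        }
    }
}
```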

By using a cluster, we would be able to handle the necessary load of incoming requests, queue them, and run them through the provider BLServer, keeping it as fully loaded as possible over the finite number of available connections.

Aye, but here's the rub. We had a cluster of WebLogic engines pulling messages from the queue. How do you go about maintaining message order while at the same time leveraging the parallelization of the cluster to handle the load? Consider what happens if you have two messages in the queue, in order. The first is a stock sell that results in an increase in available cash. The second is a buy that uses that cash to buy some other stock. You can see that these must be processed in order. If one server in the cluster grabs the first from the queue, and another grabs the second, there's no guarantee that the sell will be handled first by the provider's BLServer. Therefore, we have to guarantee the order in our BLServer service.

How to do that? The solution became more obvious once we realized that total message ordering was not required. What's really required is that messages within certain groups be correctly ordered. These groups are identified by key, and all messages for a given key must be ordered. Depending on the type of request, that key might be a CUSIP, an account number, or some other identifier.
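In code, deriving that ordering key amounts to a simple mapping from the request type to the identifier that has to stay ordered. The request types and field names below are purely hypothetical, since the real message format was the provider's proprietary one:

```java
// Hypothetical sketch of deriving the ordering key. The request type strings
// and parameters are illustrative only, not the provider's actual format.
public final class OrderingKeys {

    /** Returns the key whose messages must be processed in order. */
    public static String orderingKey(String requestType, String cusip, String accountNumber) {
        if ("SECURITY_UPDATE".equals(requestType)) {
            // Updates to a security are ordered per CUSIP.
            return cusip;
        }
        // Trades and account changes are ordered per account.
        return accountNumber;
    }
}
```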

Now message ordering with scalability becomes simpler. If all messages for a given key are handled by a single engine, then we can guarantee ordering by pulling one message at a time from the queue and processing it to completion before beginning the next. Other engines in the cluster will be doing the same thing at the same time for other keys. Thus, we gain some scalability.

Oooh, but we've just introduced Single Points Of Failure (SPOFs) for each key. If a given server handles keys that start with '17', for example, and that server crashes, then messages for those keys won't be processed, and we have failed to meet our availability requirements. That's where the second bit of creativity came into play. We employed a lease mechanism. Leases were stored in a highly-available database. Upon startup, a given engine would go to the database and grab a lease record. Each lease represented a group of keys. For example, a lease record might exist for all keys in the range '00' to '03'. An engine starts up, finds that this lease is the next available, and grabs it. To 'grab' a lease, an engine updates the lease with a time in the not-too-distant future, say, five minutes out. As long as the engine is up, it continues to update the lease every two minutes or so with a new time. If the engine crashes, the time expires, and some other engine grabs the lease.
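Here's a sketch of what that lease grab and renewal can look like against a simple lease table. The table and column names (LEASE, RANGE_START, OWNER, EXPIRES_AT) are my own illustration, not the actual schema we used:

```java
import java.sql.*;

// Sketch of the lease mechanism under assumed names: a LEASE table keyed by
// RANGE_START, with OWNER and EXPIRES_AT columns. Illustrative only.
public class LeaseManager {

    private static final long LEASE_MILLIS = 5 * 60 * 1000;   // hold the lease ~5 minutes

    private final Connection db;
    private final String engineId;

    public LeaseManager(Connection db, String engineId) {
        this.db = db;
        this.engineId = engineId;
    }

    /** Try to claim the key range identified by rangeStart; true if this engine now owns it. */
    public boolean tryAcquire(String rangeStart) throws SQLException {
        Timestamp now = new Timestamp(System.currentTimeMillis());
        Timestamp expiry = new Timestamp(now.getTime() + LEASE_MILLIS);

        // Optimistic claim: succeeds only if the lease is unowned or has expired.
        PreparedStatement claim = db.prepareStatement(
            "UPDATE LEASE SET OWNER = ?, EXPIRES_AT = ? " +
            " WHERE RANGE_START = ? AND (OWNER IS NULL OR EXPIRES_AT < ?)");
        claim.setString(1, engineId);
        claim.setTimestamp(2, expiry);
        claim.setString(3, rangeStart);
        claim.setTimestamp(4, now);
        return claim.executeUpdate() == 1;
    }

    /** Called every couple of minutes while the engine is healthy to push the expiry forward. */
    public boolean renew(String rangeStart) throws SQLException {
        Timestamp expiry = new Timestamp(System.currentTimeMillis() + LEASE_MILLIS);
        PreparedStatement renew = db.prepareStatement(
            "UPDATE LEASE SET EXPIRES_AT = ? WHERE RANGE_START = ? AND OWNER = ?");
        renew.setTimestamp(1, expiry);
        renew.setString(2, rangeStart);
        renew.setString(3, engineId);
        return renew.executeUpdate() == 1;
    }
}
```

If the engine dies, it simply stops renewing, EXPIRES_AT passes, and the next engine's tryAcquire succeeds.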

As long as an engine has a lease for a given range, it can use a selector to receive messages from the queue for that given range. We now have scalability, message ordering and high availability. Everybody say, "Woah, that's so cool!"
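Putting the pieces together, the consuming side looks roughly like this: the engine holding the lease builds a JMS selector for its key range and works through one message at a time, committing only after the provider call succeeds. As before, the destination names and the KEY_GROUP property are illustrative assumptions carried over from the earlier sketch:

```java
import javax.jms.*;
import javax.naming.InitialContext;

// Sketch of the consuming side: an engine holding the lease for a key range uses
// a JMS message selector so it only receives messages in that range, processing
// one message to completion before taking the next. Names are hypothetical.
public class RangeConsumer {

    public void consumeRange(int groupLow, int groupHigh) throws Exception {
        InitialContext ctx = new InitialContext();
        QueueConnectionFactory cf =
                (QueueConnectionFactory) ctx.lookup("jms/BLServerConnectionFactory");
        Queue requestQueue = (Queue) ctx.lookup("jms/BLServerRequestQueue");

        QueueConnection conn = cf.createQueueConnection();
        try {
            QueueSession session = conn.createQueueSession(true, Session.AUTO_ACKNOWLEDGE);

            // Only deliver messages whose key group falls inside this engine's leased range.
            String selector = "KEY_GROUP BETWEEN " + groupLow + " AND " + groupHigh;
            QueueReceiver receiver = session.createReceiver(requestQueue, selector);
            conn.start();

            while (leaseStillHeld()) {
                Message msg = receiver.receive(1000);          // wait up to a second
                if (msg == null) {
                    continue;
                }
                try {
                    forwardToProvider((TextMessage) msg);      // call out over one provider connection
                    session.commit();                          // only now is the message truly consumed
                } catch (Exception e) {
                    session.rollback();                        // message stays queued, order preserved
                }
            }
        } finally {
            conn.close();
        }
    }

    private boolean leaseStillHeld() {
        return true;   // placeholder; in practice this checks the lease described above
    }

    private void forwardToProvider(TextMessage msg) throws JMSException {
        // placeholder for the call to the provider's BLServer over its proprietary protocol
    }
}
```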

At this point, we've solved a significant technical issue that should be captured as an architectural pattern. We never did that. It may be that this solution is documented somewhere as a pattern, but I'm not aware of it.

At the end of the project I moved on to other things, and left BLServer in the capable hands of my friend Brian S. I heard some months down the road that the service was in active use in production, and had seen only one minor bug. I've always been proud of the product quality that our team delivered. We went through at least four or five variations of possible solutions before arriving at the one described above. In each case, we'd get into the details of the solution only to ask, "yeah, but what happens if..." and realize that we were close, but had some problem because of the distributed nature of the environment, or whatever. It was very satisfying to finally arrive at the elegant solution that we delivered.

6 comments:

Anonymous said...

Very informative post! Thank you for writing about this. I am looking to do this exact thing in my current project. We need to offload web insert requests to a message queue due to insert lock contention, but they need to be processed in the order they are received.

Or as you pointed out, at least in the order they are received on a per-login basis.

You've given me some great ideas, so thanks for that.

Don Branson said...

You're welcome, and thanks for the feedback. It makes me glad that I could help you out.

Marcio Joffily said...

That is exactly the problem I'm trying to solve. Sequential processing of messages for accounts within a JBoss cluster. You've nailed it.
Well done

Don Branson said...

Thank you! Glad it helped.

Anonymous said...

Thanks, Don. I have implemented a similar pattern and was worried about my solution. Then I came across your blog, which describes a similar kind of solution.

Don Branson said...

Cool. Always glad to help. :)