Tuesday, October 12, 2010

Time! What is time?

Sooner or later, when you are working on a software project, you will face a little element that can cause quite a big mess: time.
When I say time, I am not talking about the time to finish the project, the time to perform a specific task, or the software's performance; here, time is the time element of the data you are working with. It doesn't matter whether it comes as an hour-and-minute representation, a timestamp, or a millisecond counter; sooner or later it will be there, and even if it looks simple, it can evolve into an irritating problem.

The usual answer to the statement "We need a timestamp!!!" is to read the computer clock in milliseconds and store the value in the database, but in almost all cases that is not enough. To know whether it is enough, we first need to understand why we need the timestamp: is it used only to inform the user, for example the date and hour at which something happened, or does it have a meaning for the system?

If it is only for user information, maybe a timestamp is not needed: you can have date fields (which include the time) and make your user happy. When it starts to mean something to the system, things get a little more complicated. If you need it for both purposes, I would suggest splitting it: the user information goes in one field (even if it is created by the system) and the timestamp goes in another. This way you have a field with comprehensible information for the user AND a field that the system can manage without magic.
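As a rough illustration of that split, here is a minimal Python sketch. The names `AuditRecord`, `displayed_at`, `order_stamp` and `make_record` are made up for this example, and the in-process counter is only a placeholder for whatever ordering source the system really uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

# Hypothetical ordering source: any strictly increasing integer would do here.
_sequence = count(1)


@dataclass(frozen=True)
class AuditRecord:
    message: str
    displayed_at: datetime  # for the user: shown on screen, never compared by the system
    order_stamp: int        # for the system: the only value used to order records


def make_record(message: str) -> AuditRecord:
    """Create a record carrying both the human-readable date and the ordering value."""
    return AuditRecord(
        message=message,
        displayed_at=datetime.now(timezone.utc),
        order_stamp=next(_sequence),
    )


first = make_record("order created")
second = make_record("order paid")
```

The system orders `first` and `second` by `order_stamp` only; `displayed_at` can be formatted in any way the user likes without affecting the logic.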

So the next question is: why does the system need a timestamp? To answer that, I will start by explaining the properties of time; based on those properties, it is easy to identify where a timestamp could be used.

"But what, in fact, is time? If one concentrates on the structural aspects, then the prevalent view of time is that of a set of “instants” with a temporal precedence order satisfying certain obvious conditions such as transitivity, irreflexivity, linearity, and density. Interestingly, however, in most cases when we make use of time and clocks we do not need all these properties – for example, digital clocks obviously do not realize the density axiom but are nevertheless useful in many cases."[1]

So basically, time is something that gives you a temporal notion: did A happen before B? Did something happen between A and B? When you want your system to answer this kind of question, a time element is required. The next question is: how do I do it?

The first reaction is to use the machine timestamp, but that is a little dangerous, because two events for the same process can happen at the same time (remember that you are working at millisecond resolution, and you can easily have more than one event in the same millisecond). You could also have more than one machine, even in the server environment; with today's clusters and large-scale processing that is really likely. Imagine two events for the same process happening on different machines: they could end up with the same timestamp, although they must follow the linearity rule.
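To get a feeling for how easily this happens even on a single machine, the small Python sketch below reads the system clock in a tight loop and counts how many readings share a millisecond. The function name `millisecond_stamps` and the figure of 10,000 events are arbitrary choices for the illustration; the exact numbers printed will differ from machine to machine.

```python
import time


def millisecond_stamps(n: int) -> list[int]:
    """Read the system clock n times, truncated to whole milliseconds."""
    return [time.time_ns() // 1_000_000 for _ in range(n)]


stamps = millisecond_stamps(10_000)
distinct = len(set(stamps))
collisions = len(stamps) - distinct  # events that cannot be ordered by timestamp alone
print(f"{len(stamps)} events, {distinct} distinct timestamps, {collisions} collisions")
```

Every collision is a pair of events for which the timestamp alone cannot say which one came first.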

I will skip the notation for time-related events (before, after, depends on, does not depend on); the important rule is that if events are time-related they must have a sequence, so I should be able to tell which one happened first and which happened after. For that we need a consistent timestamp, and in most cases the easiest implementation is a logical time. This means a time that has the ordering properties we need (transitivity, irreflexivity and linearity; density, as the quote above points out, is often not required) but that does not depend on a real clock.

There are different ways to implement a logical clock; one of the simplest is a scalar clock. In a nutshell, a scalar clock creates a timestamp every time it is asked, and each timestamp it generates is greater than the previous one; it never moves backward, always forward. (In real life, can you go back in time and do the things you would have liked to do, or undo something? No, so in software land it should not be possible either.)

You can see it as a counter that always increases. It looks simple and stupid, but this kind of clock must guarantee that, no matter what happens, it will never create a timestamp in the past. Sequential timestamps are not among those guarantees; in fact, the clock may skip values.
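A single-process version of such a counter could look like the Python sketch below. The class name `ScalarClock` and the `step` parameter are inventions of this example; the lock is there only so that several threads can share the clock without ever receiving the same value.

```python
import threading


class ScalarClock:
    """Hands out strictly increasing integer timestamps."""

    def __init__(self, start: int = 0) -> None:
        self._last = start
        self._lock = threading.Lock()

    def next_timestamp(self, step: int = 1) -> int:
        """Return a value greater than every value returned before."""
        if step < 1:
            raise ValueError("step must be at least 1: the clock never goes backward")
        with self._lock:
            self._last += step
            return self._last


clock = ScalarClock()
a = clock.next_timestamp()
b = clock.next_timestamp(step=5)  # gaps are allowed, going backward is not
c = clock.next_timestamp()
```

Here `a < b < c` always holds, even though the values are not consecutive.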

Simple? Well, not that much. If you have multiple machines in your cluster, the clock must provide valid timestamps for all of them; in the end it should not matter where a timestamp was requested, every timestamp must conform to the constraints.

A bottleneck? Yes, it could be, but according to Peng and Dabek the Percolator scalar clock "serves around 2 million timestamps per second from a single machine"[2], so a well-implemented scalar clock can be a feasible solution.
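One way a central clock can stay fast is to reserve timestamps in ranges: it persists only the highest value of the range and then answers requests from memory. The paper cited as [2] describes its timestamp service along these lines; the Python sketch below is only a loose illustration of the idea and not the paper's implementation. `InMemoryStore` is a stand-in for real stable storage, and `TimestampOracle` and `batch_size` are names chosen for this example.

```python
import threading


class InMemoryStore:
    """Stand-in for stable storage (hypothetical: a real one would survive a crash)."""

    def __init__(self) -> None:
        self.highest_allocated = 0


class TimestampOracle:
    """Serves increasing timestamps, persisting only one value per reserved range."""

    def __init__(self, store: InMemoryStore, batch_size: int = 1000) -> None:
        self._store = store
        self._batch_size = batch_size
        self._lock = threading.Lock()
        # After a restart, resume above the highest value ever reserved.
        self._next = store.highest_allocated + 1
        self._limit = store.highest_allocated  # nothing reserved in memory yet

    def next_timestamp(self) -> int:
        with self._lock:
            if self._next > self._limit:
                # Reserve a whole range with a single "slow" write.
                self._limit = self._next + self._batch_size - 1
                self._store.highest_allocated = self._limit
            stamp = self._next
            self._next += 1
            return stamp


store = InMemoryStore()
oracle = TimestampOracle(store, batch_size=1000)
before_restart = [oracle.next_timestamp() for _ in range(3)]

restarted = TimestampOracle(store, batch_size=1000)  # simulate a crash and restart
after_restart = restarted.next_timestamp()
```

After the simulated restart the clock jumps forward past the whole reserved range instead of reusing values, which is exactly the "may skip values, never goes backward" behaviour of a scalar clock.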

Notice that a scalar clock doesn't differentiate between processes, so even unrelated events will have a time relation. When that is not needed or desired, you can start looking at different types of clocks, like a vector clock or a matrix clock[1][3].
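For comparison, here is a minimal Python sketch of a vector clock, where each process keeps one counter per process and unrelated events stay unordered. The function names (`tick`, `merge`, `happened_before`, `concurrent`) and the process names `p1` and `p2` are assumptions of this example, not taken from the references.

```python
from typing import Dict

VectorClock = Dict[str, int]


def tick(clock: VectorClock, process: str) -> VectorClock:
    """Return a copy of the clock with the local component advanced by one."""
    updated = dict(clock)
    updated[process] = updated.get(process, 0) + 1
    return updated


def merge(local: VectorClock, received: VectorClock, process: str) -> VectorClock:
    """On receiving a message: component-wise maximum, then a local tick."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return tick(merged, process)


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True when a is <= b in every component and strictly smaller in at least one."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and any(
        a.get(k, 0) < b.get(k, 0) for k in keys
    )


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True when the clocks differ and neither event causally precedes the other."""
    keys = set(a) | set(b)
    same = all(a.get(k, 0) == b.get(k, 0) for k in keys)
    return not same and not happened_before(a, b) and not happened_before(b, a)


e1 = tick({}, "p1")       # an event on process p1
e2 = tick({}, "p2")       # an unrelated event on process p2
e3 = merge(e2, e1, "p2")  # p2 receives a message stamped with e1
```

With these clocks, `e1` and `e2` are reported as concurrent, while both `e1` and `e2` are reported as having happened before `e3`. The price is that a timestamp is no longer a single number.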

In the end, remember: it is important to have consistent timestamps, and most of the time the system timestamp is not enough to conform to the required constraints (transitivity, irreflexivity and linearity). So if you need a timestamp, before calling getTimestamp, look around and check what other options you have.

[1] Mattern, F. – Logical Time. Darmstadt University of Technology.
[2] Peng, D., Dabek, F. Large-scale Incremental Processing Using Distributed Transactions and Notifications, pp. 06.
[3] Aguilera, M. K., Merchant, A., Shah, M., Veitch, A., Karamanolis, C. Sinfonia: a new paradigm for building scalable distributed systems. In SOSP ’07 (2007), ACM, pp. 159–174.

The natural song choice for this post is "Time What Is Time", by Blind Guardian.
