SIGSTP: August 2013

According to some recent news, "Google has finally admitted they don't respect privacy". In other words, Google has "admitted" that it scans Gmail emails for profit.

I've been using Gmail for quite a long time now and, well, this is no news at all. It has always been quite clear that by using Gmail, you agree that Google will scan your email content to target advertising. This is how the (great) service remains free. No mystery there. Nothing to admit really, and as Google puts it, it has always been part of Gmail's standard, ordinary business practices. Since 2007.
It's 2013 and Consumer Watchdog called the "revelation" a "stunning admission". Are those people so dumb it took them 6 years to realise what Gmail is?

The answer is probably yes, given that John Simpson, Consumer Watchdog’s privacy project director, has declared: "sending an email is like giving a letter to the Post Office. I expect the Post Office to deliver the letter based on the address written on the envelope. I don't expect the mail carrier to open my letter and read it."

What is striking in this statement is that the man clearly doesn't know what he is talking about. Sending an email the standard way (meaning without any additional encryption) has never been like sending a letter (that is, a message hidden in an envelope). Since the late 60s/early 70s, sending a standard email has been more like sending a postcard without an envelope. It will arrive at its destination, but you can't expect the postmen not to have a quick look at it. By the way, this is how spam/malware/virus detection works.

A good email analogy

The fact that this John Simpson is not already aware of that clearly demonstrates his lack of technical culture, which is quite ridiculous and worrying for a guy who is a "privacy project director". If I were a "privacy project director", I would probably encourage people who actually care about their privacy to use encryption. Many solutions are available and, again, this is no news at all: businesses use them routinely.

All of that makes me think that 20 or so years after the incredible expansion of the web, and 40-something years after the invention of email, the general public (by that I mean people who are not in IT) still lacks essential basic IT knowledge, like how email works (not in detail of course) or how a (simple) website works. The solution probably lies in better basic computer education.

Gearman is a great tool to run asynchronous and/or distributed jobs over a cluster of machines. It's not a full general-purpose message queue à la RabbitMQ, but rather a minimalist piece of software that is very simple to use, does only one thing and does it very well.
To use gearman from your favourite language, CPAN provides two modules: the original Gearman and the more recent Gearman::XS. For this post we're going to use the original Gearman, but you should be able to implement the example using Gearman::XS without much trouble.
In a nutshell, writing for gearman is a two-sided process: you write some client code and some worker code to actually do the job, with gearman just sitting in the middle and distributing the jobs:

Client(s) <--> Gearman Daemon <--> Worker(s)

Vanilla gearman

Let's have a quick pseudo code look at an asynchronous example inspired by the Perl package Gearman:

Worker:

my $worker = Gearman::Worker->new(...);
$worker->register_function('sum' => sub {
    my $job = shift;
    ## .. Calculate the sum of decoded($job->arg) and store it somewhere ..
});
$worker->work();

Client:

my $client = Gearman::Client->new(...);
my $handler = $client->dispatch_background('sum', encoded([1, 2, 3]), { uniq => 'some string unique enough' });
## Wait for the job to be done: known() stays true while the job is queued or running.
while( my $status = $client->get_status($handler) ){
    last unless $status->known();
    sleep(1);
}
.. Retrieve sum from the storage and print it ..

By the way 'encoded' and 'decoded' are entirely up to you, as long as they encode to bytes and decode from bytes. I personally use JSON, but it's a matter of taste (and performance but that's another story).
Also, I deliberately skip the task sets and synchronous mechanisms as this is not supposed to be a gearman tutorial :)
So here we go. You make sure that your gearman daemon is running, you fire up your worker script and while it's running, each time you run your client, it will print 6.
You feel great. You've written your first minimalist gearman application and you're ready to gearmanize the rest of your long running and/or easily distributable code.
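
For the record, here is a minimal sketch of what 'encoded' and 'decoded' could look like with JSON. The helper names come from the pseudocode above. I use the core JSON::PP module here so the snippet runs anywhere, and the gearman round trip is left out, so the "worker" part is just the sum itself:

```perl
use strict;
use warnings;
use JSON::PP ();    # core module, used here so the sketch runs anywhere

# Hypothetical helpers, named after the pseudocode above.
my $json = JSON::PP->new->utf8->canonical;

# Perl data structure -> bytes
sub encoded { my ($data) = @_; return $json->encode($data); }

# bytes -> Perl data structure
sub decoded { my ($bytes) = @_; return $json->decode($bytes); }

# What the 'sum' worker would do with the argument it receives.
my $bytes = encoded([1, 2, 3]);
my $sum   = 0;
$sum += $_ for @{ decoded($bytes) };
print "$sum\n";
```

Anything that goes from a Perl structure to bytes and back would do just as well.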

Why this approach is a pain

So following the example, the temptation is great to just extend the worker like that:

$worker->register_function('my_long_specific_thing', \&gm_long_specific_thing);

I guess that if you're reading this post, you probably already have something in your model code that is long and already packed into a function:

package MyApp::Object;

sub do_long_specific_thing{
my ($self, $arg1 , ... ) = @_;
...
}

So in reality your gearman specialised 'gm_long_specific_thing' will probably look like that:

sub gm_long_specific_thing{
    my($job) = @_;

    my $args = decoded($job->arg());

    my $application = .. Build or get application ..;

    $application->get_object( build object getting from args )->do_long_specific_thing( build arg1 , arg2 from the args);
}

Then you know the rest of the story. Every time you need something else to be gearmanized, or every time you need to change the arguments of one of your gearmanized methods, you have to propagate your changes to the specific gearman registered functions. Your code has become a bit less maintainable, just because you want the benefits of gearman.

Fixing it

But what if.. you could write something like that:

$client->application_launch(sub{
        my $app = shift;
        my ($oid) = @_;
        $app->get_object($oid)->do_long_specific_thing(1, 'whatever');
    },
    [ $oid ] );

No more worker specific code to write. Everything is done from the application's point of view, leaving the back-end details out of the way.
Want to change the API of do_long_specific_thing? No problem, just apply parameter changes where they appear in the code. No more headaches propagating the API change through the gearman specific methods.
Want to make your gearmanized process longer without changing do_long_specific_thing? No problem:

$client->application_launch(sub{
        my $app = shift;
        my ($oid) = @_;
        $app->get_object($oid)->do_long_specific_thing(1, 'whatever');
        .. and something else ..
    },
    [ $oid ] );

What if you have another long thing to gearmanize? Well, you get the picture:

$client->application_launch(sub{
        my $app = shift;
        my ($oid, $arg1) = @_;
        $app->get_object($oid)->do_another_thing($arg1);
        .. and something else ..
    },
    [ $oid, $arg1 ] );

Implementing it

The client 'application_launch' method

Thanks to B::Deparse, we can turn any sub into a plain string. The rest is trivial, so here we go:

## In some object that wraps the $gearman_client.
## You can also inherit if you prefer, but I like wrapping more, because you can store utilities.
use B::Deparse;
use Carp qw(confess);
use Digest::SHA;
use Gearman::Task;
use JSON::XS;

sub application_launch{
    my ($self, $code, $args ) = @_;

    $code //= sub{};
    $args //= [];

    my $gearman_client = $self->gearman_client(); ## Or just $self if you inherit.
    my $deparser = B::Deparse->new('-sC'); ## -sC: cuddled else/elsif style
    ## I like pretty. I know it's larger but well..
    ## canonical() sorts the keys, so the same call always gives the same $uniq.
    my $json = JSON::XS->new()->ascii()->pretty()->canonical();

    my $code_string = $deparser->coderef2text($code);

    my $gearman_arg = $json->encode({ code => $code_string,
                                      args => $args });

    my $uniq = Digest::SHA::sha256_hex($gearman_arg);

    my $task = Gearman::Task->new('gm_application_do', \$gearman_arg, { uniq => $uniq });

    my $gearman_handler = $gearman_client->dispatch_background($task);
    unless($gearman_handler){
        confess("Your task cannot be launched. Is gearman exposing the gm_application_do function?");
    }
    return $gearman_handler;
}

And that's pretty much it.

The worker 'gm_application_do' code

The worker code is very similar.

use Carp qw(confess);
use JSON::XS;

my $json = JSON::XS->new(); ## Note you need a $json object.

## The registered name must match the one the client dispatches to.
$worker->register_function('gm_application_do', sub{ _gm_application_do($application, shift) });
sub _gm_application_do{
    my ($app , $job) = @_;

    my $gm_args = $json->decode($job->arg());
    my $code_string = $gm_args->{code};
    my $code_args = $gm_args->{args};

    my $code = eval 'sub '.$code_string;
    $code || confess("EVAL ERROR for $code_string: ".$@); ## That shouldn't happen but well..

    ## This is not supposed to return anything, as we call that asynchronously.
    $code->($app, @$code_args);
    return 1;
}
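
If you want to convince yourself that the deparse/eval round trip works before involving gearman at all, here is a small self-contained sketch. It assumes nothing but core modules: pack_call and run_call are hypothetical stand-ins for the client and worker sides, JSON::PP replaces JSON::XS, and the "application" is just a plain hash.

```perl
use strict;
use warnings;
use B::Deparse ();
use JSON::PP   ();

my $deparser = B::Deparse->new('-sC');
my $json     = JSON::PP->new->ascii->canonical;

# "Client" side (hypothetical helper): sub + arguments -> one string.
sub pack_call {
    my ($code, $args) = @_;
    return $json->encode({
        code => $deparser->coderef2text($code),
        args => $args // [],
    });
}

# "Worker" side (hypothetical helper): rebuild the sub, then call it
# with the application as first argument.
sub run_call {
    my ($app, $payload) = @_;
    my $call = $json->decode($payload);
    my $code = eval('sub ' . $call->{code})
        or die "Cannot compile shipped code: $@";
    return $code->($app, @{ $call->{args} });
}

# A stand-in application: a plain hash, purely for illustration.
my $demo_app = { name => 'demo' };

my $payload = pack_call(
    sub { my ($app, $x, $y) = @_; return "$app->{name}: " . ($x + $y); },
    [ 2, 3 ],
);
my $result = run_call($demo_app, $payload);
print "$result\n";
```

With gearman in the picture, $payload is simply what travels between dispatch_background and $job->arg().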

Adapting to synchronous calls

Adapting that to synchronous calls is quite straightforward.
Remember that gearman exposed functions should always return a single scalar.
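
One possible way to respect that single scalar rule, sketched under a few assumptions: the worker side function (hypothetically named _gm_application_call here) calls the shipped code in list context and returns the JSON encoding of whatever came back, and the client decodes it. The gearman plumbing is left out, JSON::PP stands in for JSON::XS, and the code string is written by hand instead of deparsed to keep things short.

```perl
use strict;
use warnings;
use JSON::PP ();

my $json = JSON::PP->new->ascii->canonical;

# Worker side (hypothetical name): like _gm_application_do, except that
# the result of the shipped code is packed into one scalar and returned.
sub _gm_application_call {
    my ($app, $payload) = @_;
    my $call = $json->decode($payload);
    my $code = eval('sub ' . $call->{code})
        or die "Cannot compile shipped code: $@";
    # Call in list context and wrap, so any return shape fits in one string.
    my @result = $code->($app, @{ $call->{args} });
    return $json->encode(\@result);
}

# Client side: decode the single scalar that came back.
my $payload = $json->encode({
    code => '{ my ($app, $n) = @_; return map { $_ * 2 } 1 .. $n; }',
    args => [3],
});
my $reply  = _gm_application_call(undef, $payload);
my @values = @{ $json->decode($reply) };
print "@values\n";
```

The same restriction as for the arguments applies to the return values: plain Perl structures only.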

Conclusion

Gotchas:

This doesn't work with closures, so really your subs should be pure functions and all parameters should be given as such.
As far as the magic goes, the parameters can ONLY be pure Perl structures; something that's serializable in vanilla JSON.
If you try to pass blessed objects, bad things will happen.
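
To see the closure gotcha in action, here is a small sketch using only B::Deparse. The variable names are made up for the illustration. The deparsed text of a closure only carries the name of the captured variable, not its value, so it cannot be rebuilt anywhere that variable doesn't exist:

```perl
use strict;
use warnings;
use B::Deparse ();

my $deparser = B::Deparse->new('-sC');

# A closure: $factor lives outside the sub and is captured by it.
my $closure = do {
    my $factor = 10;
    sub { my ($n) = @_; return $n * $factor; };
};

# The deparsed text only mentions the *name* $factor, so recompiling it
# where $factor does not exist fails under 'use strict'.
my $rebuilt_closure = eval('sub ' . $deparser->coderef2text($closure));
print "closure: ", (defined $rebuilt_closure ? "rebuilt" : "failed to compile"), "\n";

# A pure function: everything it needs arrives through @_.
my $pure         = sub { my ($n, $factor) = @_; return $n * $factor; };
my $rebuilt_pure = eval('sub ' . $deparser->coderef2text($pure));
print "pure: ", $rebuilt_pure->(2, 10), "\n";
```

Without strict on the worker side it would be worse: the code would compile and quietly use an undefined value.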

Disadvantages:

Insecure. What if anyone injects sub{ destroy_the_world(); } in your gearman server? That's kind of easily fixed. Just implement some secure signing of the code in transit.
No strict control of what can be done through gearman. Developer enlightenment is the key here.
No strict control about what 'flavour' of gearman worker is running your code. Some people like to have specialised gearman workers exposing only a subset of functions. This can easily be fixed by adding a 'target' option to the application_launch method and exposing the same general purpose gm_application_do under different names on different machines. But again, choosing the right target falls under the developer's responsibility.
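
As for the signing idea, here is one way it could look, as a rough sketch rather than a hardened implementation. It assumes the clients and workers share a secret out of band, and uses an HMAC from the core Digest::SHA module. The sign_payload and verify_payload names are made up for the example.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);
use JSON::PP ();

my $json = JSON::PP->new->ascii->canonical;

# Assumption: clients and workers share this secret out of band.
my $secret = 'replace-me-with-a-real-secret';

# Client side (hypothetical helper): wrap the payload with its HMAC.
sub sign_payload {
    my ($payload) = @_;
    return $json->encode({
        payload   => $payload,
        signature => hmac_sha256_hex($payload, $secret),
    });
}

# Worker side (hypothetical helper): hand back the payload only if the
# signature matches, and die before anything gets eval'ed otherwise.
sub verify_payload {
    my ($signed) = @_;
    my $envelope = $json->decode($signed);
    my $expected = hmac_sha256_hex($envelope->{payload} // '', $secret);
    # A production version would want a constant-time comparison here.
    die "Bad signature, refusing to eval\n"
        unless ($envelope->{signature} // '') eq $expected;
    return $envelope->{payload};
}

my $original = '{"args":[],"code":"{ 1 }"}';
my $signed   = sign_payload($original);
print verify_payload($signed), "\n";

# Changing the payload after signing makes the verification fail.
my $envelope = $json->decode($signed);
$envelope->{payload} .= ' ';
my $accepted = eval { verify_payload($json->encode($envelope)); 1 };
print $accepted ? "tampered payload accepted\n" : "tampered payload rejected\n";
```

The worker would call verify_payload on $job->arg() first, and only decode and eval what comes out of it.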

To sum up the advantages:

Stable gearman worker code that's decoupled from the application code itself.
Flexible gearmanization of any application code you like.
Clarity of what's going on. No more parameter encoding/decoding to write, and no API change propagation through what should be infrastructure only code.

Perl offers us the flexibility that empowers us to clearly separate code that deals with different concerns.
As developers, we should take advantage of it and build reactive, flexible and generic enough business components.
As an infrastructure developer, I don't really know what people are going to do with gearman, nor should I care too much. This approach lets me concentrate on what is important: the stability, scalability and security of my gearman workers.
As an application developer, I don't want to care about encoding my function parameters in a certain way. And I don't want to bother changing code in some obscure gearman module when I make changes to my business code. What I want is to use gearman as a facility that helps me design the best possible application without getting in my way too much.

Hope you enjoyed this post.

Until next one, happy coding!

Jerome.

Thursday, 15 August 2013

The fuss about Gmail and emails.

Tuesday, 13 August 2013

Get a B::Deparse piggy back through Gearmany