Monday, April 20, 2009

Marblecake: überhacking a badly written poll

Wow, now this is enthusiasm. The 4chan crowd hacked a Time.com poll quite carefully:

To actually manipulate the poll, Zombocom wrote two perl scripts. The first one, auto.pl is pretty simple. It finds the highest rated person in the poll that is not in the desired top 21 (recall, there are 21 characters in the Message) and down-votes them (you can view this as eliminating the riff-raff). The second perl script, the_game.pl is responsible for maintaining the proper order of the top 21 by inspecting the rating of a particular person and comparing that rating to what it should be to maintain the proper order and then up-voting or down-voting as necessary to get the desired rating. With these two scripts, (less than 200 lines of perl) Zombocom can put the poll in any order he wants.


Read on for the full review; there are many other interesting hacks in the story.

/via Schneier on Security/

Wednesday, April 15, 2009

That strange language called PHP

PHP is generally a nice language: it's quick n' dirty enough to make it real easy to hack together simple utilities, it happens to avoid the drawbacks of the mother of all hack-together languages, Perl (as PHP code does not tend to converge to its own MD5 hash), and its resemblance to C in syntax and its forgiveness for smaller programmer mistakes make the learning path extremely short. On the other hand, its object-oriented language elements and good templating support make it usable for larger projects -- such as SSB and SCB here at our company.

However, there are some really bad quirks that can make your life miserable if you forget that you're dealing with PHP and not with a more "mature" object-oriented language like C++ or Java. These are not news to anyone seriously involved in PHP development, but they can raise an eyebrow for someone experienced in other languages.

One such group of problems originates from types in PHP. PHP is a weakly typed language: you don't have to care about specifying types for variables, and conversions happen dynamically. At first that seems to make things easier, but that's not necessarily true. To quote one of our fine PHP developers: "The time you save on not having to care about types is exactly the time you lose trying to hunt down errors caused by types". A good example of how confusing things can get is the following piece of code:


if (0 == false)
    echo "first\n";

if ("foo" == 0)
    echo "second\n";

if ("foo" == false)
    echo "third\n";


The output will be "first" and "second", but not "third". It's not illogical if you think about it: the first is an int --> bool cast, where everything that's not zero is true, so that expression is going to be true. The second one is a string --> int cast, where PHP tries to interpret the string as an integer and, if it fails to do so, the result is zero, so this expression is going to be true as well. The third is a string --> bool cast, where the empty string and the string "0" are interpreted as false and everything else is true. All steps make sense and they couldn't be done better, but overall this means that the "==" operator is not transitive in PHP. Which feels just plain wrong.

The other big issue is the handling of references. In C, however complex all the pointer wizardry may be for a newbie programmer, once you get a grip on it, it's really simple and straightforward. In PHP, lots of things happen automatically based on context and some heuristics, which helps to avoid large and bad segfaults, but makes it much harder to do what you really want.
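To make the three conversions described above visible on their own, here is a minimal sketch that spells out each cast explicitly instead of leaving it to "==". It only uses explicit casts and the strict "===" operator, which never converts its operands; the particular values are just the ones from the example.

```php
<?php
// int --> bool: only zero is false
var_dump((bool) 0);       // bool(false)

// string --> int: a string with no leading digits becomes 0
var_dump((int) "foo");    // int(0)

// string --> bool: only "" and "0" are false
var_dump((bool) "foo");   // bool(true)
var_dump((bool) "0");     // bool(false)
var_dump((bool) "");      // bool(false)

// "===" compares type and value, so no cast takes place at all
var_dump(0 === false);    // bool(false)
var_dump("foo" === 0);    // bool(false)
```

If the surprise matters in real code, comparing with "===" (or casting explicitly first) keeps the conversion step in plain sight.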

By default, PHP passes function arguments by value, right? Assigning a variable to another using a mere "=" creates a copy, doesn't it? Yeah, well, not always. The huge exception is objects. Take the following example code:


class TestClass
{
    public $testData;

    function __construct($data) {
        $this->testData = $data;
    }
}


$testobj1 = new TestClass("foo");
$testobj2 = $testobj1;

$testobj1->testData = "bar";
echo $testobj1->testData."|".$testobj2->testData."\n";

$testarr1 = array("foo");
$testarr2 = $testarr1;
$testarr1[0] = "bar";

echo $testarr1[0]."|".$testarr2[0]."\n";


It will output "bar|bar" on the first line and "bar|foo" on the second, which clearly shows that assigning an object to another variable does not create a copy. You can use the "clone" keyword for that (and define the __clone() magic function as a copy constructor if you will). On the other hand, unlike in C, arrays behave like nice normal variables and get copied by value. The very same thing happens when passing function parameters:


class TestClass
{
    public $testData;

    function __construct($data) {
        $this->testData = $data;
    }
}

function testFunc1($obj) {
    $obj->testData = "bar";
}
function testFunc2($arr) {
    $arr[0] = "bar";
}


$testobj = new TestClass("foo");
$testarr = array ("foo");

testFunc1($testobj);
testFunc2($testarr);

echo $testobj->testData."|".$testarr[0]."\n";


This will output "bar|foo".
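For completeness, here is a small sketch of the two ways to get the "other" behaviour: "clone" (with a __clone() method) to really copy an object, and an explicit "&" in the parameter list to let a function modify the caller's array. The class and function names (Inner, Outer, setFirstToBar) are made up for this illustration.

```php
<?php
class Inner
{
    public $value;

    function __construct($value) {
        $this->value = $value;
    }
}

class Outer
{
    public $label;
    public $inner;

    function __construct($label, $innerValue) {
        $this->label = $label;
        $this->inner = new Inner($innerValue);
    }

    // Runs on the new object right after "clone" has made a shallow copy;
    // without it, both Outer objects would share the same Inner instance.
    function __clone() {
        $this->inner = clone $this->inner;
    }
}

// The "&" makes the function work on the caller's array, not on a copy.
function setFirstToBar(array &$arr) {
    $arr[0] = "bar";
}

$original = new Outer("foo", "foo");
$copy = clone $original;

$original->label = "bar";
$original->inner->value = "bar";

echo $original->label."|".$copy->label."\n";                 // bar|foo
echo $original->inner->value."|".$copy->inner->value."\n";   // bar|foo

$testarr = array("foo");
setFirstToBar($testarr);
echo $testarr[0]."\n";                                       // bar
```

Dropping the __clone() method would change the second line to "bar|bar", since "clone" by itself only copies the top-level properties.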

The last odd thing I'd like to share is the way you can return a reference from a function. You have to 1) prepend the function name at its definition with "&" and 2) use "=&" when you assign its return value to a variable. Forgetting either of these will not, in any way, trigger a warning or an error -- your variable just gets copied. Good luck hunting down the missing & over a trail of 10+ embedded function calls. The following code will write "bar|bar|foo|foo":



function &func (&$value) {
    return $value;
}
function func2 (&$value) {
    return $value;
}

$testvar = "foo";
$testvar2 =& func($testvar);
$testvar3 = func($testvar);
$testvar4 =& func2($testvar);

$testvar = "bar";

echo "$testvar|$testvar2|$testvar3|$testvar4\n";


As a last journey into PHP oddity I'd like to give the following as an exercise -- what does it print? (No cheating, please)



$a = 'asdasd';
var_dump($a);
var_dump(isset($a['foo']));
var_dump($a['foo']);

Tuesday, April 7, 2009

The biggest minor release: SSB 1.0.2 is out

After a long development and testing cycle full of unexpected troubles and nice challenges, yesterday evening I pressed that Enter button at the end of the command line "zbs bundle gen ssb-1.0/1.0.2/export" and we officially released the next version of syslog-ng Store Box.

I'm pretty sure Marci will post about the shiny new features in the release (after all, he's the product architect whose job is to be the visionary behind the product and oversee these things :)). They are quite numerous, including support for SANs and the X4540, a huge performance increase in the indexer on multi-processor systems, an under-the-hood overhaul of the dashboard and reporting, and an automated troubleshooting information collector. But for me, as the one who spends most of the time translating between written and talked-over feature specs, actual code and the testers, this release was mostly about how many features and fixes we've crammed into it.

I'm stealing some stats Marci collected for the internal news here to prove my point: we've dealt with 266 bugs (most of them new features or bugs caught during the preliminary testing of those new features, so it's not that we shipped 1.0.1 in such bad shape :)), changed 10k+ lines in appliance code in ~400 commits and ~1500 lines in the underlying syslog-ng in 60 commits. We had several bugs containing more than 30 comments, showing the intense disagreements we sometimes had between our stringent testers and the developers. Gitstats revealed some other interesting pieces of information: by far the most popular time of the week for commits was, for some weird reason, 16h on Tuesdays, but we've committed a significant amount after nine in the evening, and more than 5% of the code was integrated on Saturdays. We've sent in more patches in the last month than in the previous two months of development combined, and most of the team managed to gather 20+ hours of overtime in the rush of the last weeks, working hard to get this release out.

All in all, it was an incredibly busy three months and I think we've done much more than produce a simple maintenance release with some small bugfixes here and there -- and that we can be really proud of our achievement. Thanks to everyone involved.

Oh yeah, and the most important fact that brought a real wide grin to the face of us web guys: we've dropped support for IE6 -- sorry, SaveIE6 activists :)

Thursday, April 2, 2009

Life beyond /dev/sdz

One of the two bigger features in SSB 1.0.2 (by the way, release party scheduled for tonight!) is the support for the Sun x4540 server. It's quite an impressive piece of hardware containing two quad-core CPUs, 32 gigs of RAM and a whopping 48 SATA hard drives. It actually has huge labels printed on its 4-unit rack-mountable case warning the sysadmin that it's heavy and he should not try to lift it alone, and the initial RAID syncing after a fresh install can clearly be heard well outside the server room.



It's quite fun and amazing to work on a system with such specs if you're used to SOHO-grade hardware having at most a couple of gigabytes of RAM and 1-2 terabytes of disk space. I'd like to share an overview of some of the quirks and challenges we faced during the development.

Let's start with the size of files. PHP's filesize() fails above 2GB -- it's documented, we knew about that and it has already been worked around in SCB 1.x, so that was an easy catch. Well, it would have been, if du and ls -l returned usable values, but a bug in the cooperation of XFS and AUFS (we chose to use XFS on x4540) resulted in totally messed up (mostly zero) file sizes. It's always a joy and a real man's job to hunt down kernel-level FS errors, and it's great to have people here who can actually fix them.
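As an illustration of the kind of workaround meant here (not necessarily the one used in SCB), a common trick on 32-bit PHP builds is to ask the operating system for the size and keep the result as a float. The sketch below assumes a Linux box with GNU coreutils "stat" available; the function name bigFileSize() is made up for this example.

```php
<?php
// Hypothetical helper: returns the file size in bytes as a float,
// or false on failure. Shells out to GNU "stat" so that the value
// never has to pass through a 32-bit PHP integer.
function bigFileSize($path) {
    $output = array();
    $status = 1;
    exec('stat -c %s ' . escapeshellarg($path), $output, $status);

    if ($status !== 0 || !isset($output[0]) || !ctype_digit($output[0])) {
        return false;
    }
    return (float) $output[0];
}

// Usage on a small temporary file
$tmp = tempnam(sys_get_temp_dir(), 'size');
file_put_contents($tmp, 'hello');
var_dump(bigFileSize($tmp));   // float(5)
unlink($tmp);
```

Of course this only helps as long as the filesystem itself reports sane sizes, which is exactly what the XFS and AUFS bug mentioned above broke.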

Huge files can fit on huge disks, but it's nearly impossible to create a single file that occupies the whole storage, so the place to get really big numbers is when you query the total, the used and the available disk space in bytes. If you work a lot with PHP you tend to forget about the type and byte length of variables, but for such big numbers you have to start to care. Life's still easier than in C, as PHP automagically converts integer values to floats when it hits its limit at around 2 billion, but that also has the strange side effect that ($a + $b)/100 is no longer necessarily equal to $a/100 + $b/100, which can mess up calculations quite badly. Oh yeah, and displaying the free disk space in megabytes when it's around 9 TB is not too readable, and JavaScript will not handle huge numbers well.

One other challenge was using RAID50 instead of a plain, simple RAID1. The beautifully hardcoded locations for the various partitions had to be replaced by screen-long allocation tables (by the way, did you know that after /dev/sdz we get /dev/sdaa, sdab, sdac etc.?), and parsing /proc/mdstat to provide easy-to-understand aggregated feedback to the user is just a wee bit harder if you have multiple RAID blocks with 10+ disks in each of them, all syncing at different speeds...
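A minimal sketch of how both problems can be tamed: pick a readable unit for big byte counts, and compare floats with a tolerance instead of "==". The helper names formatBytes() and floatsEqual() and the default tolerance are assumptions for this example, not code from SSB; disk_free_space() is the stock PHP function and already returns a float.

```php
<?php
// Hypothetical helper: turn a byte count into a short, readable string.
function formatBytes($bytes, $precision = 1) {
    $units = array('B', 'KB', 'MB', 'GB', 'TB', 'PB');
    $value = (float) $bytes;
    $i = 0;
    while ($value >= 1024 && $i < count($units) - 1) {
        $value /= 1024;
        $i++;
    }
    return round($value, $precision) . ' ' . $units[$i];
}

// Hypothetical helper: float comparison with a relative tolerance.
function floatsEqual($a, $b, $epsilon = 1e-9) {
    $scale = max(abs($a), abs($b), 1.0);
    return abs($a - $b) <= $epsilon * $scale;
}

echo formatBytes(1536), "\n";                 // 1.5 KB
echo formatBytes(9 * pow(1024, 4)), "\n";     // 9 TB
echo formatBytes(disk_free_space('/')), "\n"; // depends on the machine

// The classic float surprise, and the tolerant comparison
var_dump(0.1 + 0.2 == 0.3);                   // bool(false)
var_dump(floatsEqual(0.1 + 0.2, 0.3));        // bool(true)
```

Doing the unit conversion on the server side also means the browser never has to deal with the raw byte count.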

...and many others, including the previously used DRBD version not supporting >8TB partitions and having problems with running out of lowmem while trying to handle all the 32GBs in the box, but I'll leave those to Bazsi and Marci, who finally found solutions for them.
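As for the device naming mentioned above, the sdz, sdaa, sdab sequence is essentially spreadsheet-style column lettering. Here is a small sketch that generates the name for the n-th disk; the function name diskName() is made up, and the order in which the kernel actually assigns names to physical disks is a separate matter that this does not cover.

```php
<?php
// Hypothetical helper: map a zero-based disk index to a name in the
// sda ... sdz, sdaa, sdab ... sequence (bijective base-26 lettering).
function diskName($index) {
    $suffix = '';
    $n = $index + 1;                 // work with a 1-based number
    while ($n > 0) {
        $n--;                        // shift so that 0..25 map to a..z
        $suffix = chr(ord('a') + $n % 26) . $suffix;
        $n = (int) ($n / 26);
    }
    return 'sd' . $suffix;
}

echo diskName(0), "\n";    // sda
echo diskName(25), "\n";   // sdz
echo diskName(26), "\n";   // sdaa
echo diskName(47), "\n";   // sdav, the 48th disk
```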

After all, size does matter.