June 2008

Monthly Archives

The Economics of Database Sharding

Karl Seguin of Fuel Industries makes some interesting points about the economics of Database Sharding. Seguin unkindly speculated that the major database vendors have ignored Database Sharding for commercial reasons:

There are a lot of expensive ways to scale your database – all of which are highly touted by the big three database vendors because, well, they want to sell you all types of really expensive stuff. Despite what an “engagement consultant” might tell you though, most of the high-traffic websites on the web (google, digg, facebook) rely on far cheaper and better strategies: the core of which is called sharding.

What’s really astounding is that sharding is database agnostic – yet only the MySQL crowd seem to really be leveraging it. The sales staff at Microsoft, IBM and Oracle are doing a good job selling us expensive solutions.

The generated script approach to running shell commands from Java

Running an external process from Java is simple enough using Runtime.exec(), but there are some well-documented limitations, which are covered in detail in an old but still relevant JavaWorld article entitled When Runtime.exec() won’t.

I recently needed to launch a MySQL backup from Java and redirect stdout and stderr to files. Unfortunately, Runtime.exec() can only be used to run an executable file and pass parameters to it. It does not support more complex operations such as piping the output of one process into another process, or even redirecting stdout or stderr. In other words, Runtime.exec() can’t be used as a replacement for the Linux command line.

For example, this works correctly:

Runtime.getRuntime().exec( "/usr/bin/mysqldump mydb --result-file=mydb.dump" );

But this will not work:

Runtime.getRuntime().exec( "/usr/bin/mysqldump mydb --result-file=mydb.dump >stdout.txt 2>stderr.txt" );

This last code fragment runs without error but the stdout.txt and stderr.txt files are not created.

The standard solution to this problem is to write Java code that launches two threads, one to read the output stream from the process and one to read the error stream, and then writes the output from those streams to disk from within Java. This seems like a heavyweight solution in this instance. There is also a risk of the subprocess hanging if the Java code does not read the output from the process quickly enough, as outlined in the javadocs:

“Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock.”
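For reference, the standard two-thread approach described above might look roughly like the following. This is only a sketch: the class name StreamToFileExample and the methods drain and run are made up for illustration, the files are written using the platform default character set, and the lambda and try-with-resources syntax assume Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;

// Hypothetical helper, not part of the JDK or of the original post.
final class StreamToFileExample {

    // Starts a background thread that copies the stream to the target file, line by line.
    static Thread drain(InputStream in, File target) {
        Thread t = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(in));
                 PrintWriter w = new PrintWriter(target)) {
                String line;
                while ((line = r.readLine()) != null) {
                    w.println(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        t.start();
        return t;
    }

    // Runs the command, writes its stdout and stderr to the given files,
    // and returns the exit code of the process.
    static int run(String[] command, File stdout, File stderr)
            throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(command);
        // Read both streams concurrently so that neither buffer fills up and blocks the process.
        Thread out = drain(p.getInputStream(), stdout);
        Thread err = drain(p.getErrorStream(), stderr);
        int exitCode = p.waitFor();
        // Wait for the readers to finish flushing to disk.
        out.join();
        err.join();
        return exitCode;
    }
}
```

With a helper like this, the backup could be launched as run(new String[] {"/usr/bin/mysqldump", "mydb", "--result-file=mydb.dump"}, new File("stdout.txt"), new File("stderr.txt")).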

A simpler solution that allows arbitrary Linux command lines to be run from Java is to write the command to a shell script and then execute the shell script. For example:

// requires java.io.File, java.io.IOException and java.io.PrintWriter
private static void runCommand(String cmd) throws IOException, InterruptedException {

    // generate a script file containing the command to run
    final File scriptFile = new File("/tmp/runcommand.sh");
    PrintWriter w = new PrintWriter(scriptFile);
    w.println("#!/bin/sh");
    w.println(cmd);
    w.close();

    // make the script executable and wait for chmod to finish
    Process p = Runtime.getRuntime().exec(
        "chmod +x " + scriptFile.getAbsolutePath());
    p.waitFor();

    // execute the script and wait for it to complete
    p = Runtime.getRuntime().exec(scriptFile.getAbsolutePath());
    p.waitFor();
}

Using this approach, I can now simply run:

runCommand("/usr/bin/mysqldump mydb --result-file=mydb.dump >stdout.txt 2>stderr.txt");

This approach works fine for my requirement, without the overhead of creating additional Java threads. However, this approach is not suitable if the Java application needs to read the output of the process before the process has completed, in which case the standard approach of launching threads to read the output streams should be used.

The generated script approach is very convenient for enabling general-purpose Linux command line usage from Java.

Wikipedia’s Scalability Architecture

Domas Mituzas presented Wikipedia’s scalability strategy at Velocity 2008 this week (presentation is available here). Mituzas is a Wikipedia performance engineer and database administrator, and a member of the Board of Trustees of the Wikimedia Foundation. Mituzas is also a MySQL (now Sun) employee and was not shy about reminding people that the entire site is driven from a MySQL database.

There was a big emphasis in the presentation on achieving results with minimal resources because the Wikimedia Foundation is a non-profit organization with a comparatively small budget.

The Wikipedia scalability statistics are impressive – 80,000 SQL queries per second, 18 million page objects in the English language version of the site, 220 million revisions, and 1.5 terabytes of compressed data.

Wikipedia uses Database Sharding to set up master-slave relationships between databases, which are divided logically by use case and language. Mituzas points out that the Wikipedia team only found out that their database architecture was an example of Database Sharding after they had implemented it. Mituzas said the MySQL instances range from 200 to 300 gigabytes.

IONA Acquired by Progress

It’s been a big year for application development industry acquisitions – MySQL, BEA, Borland CodeGear, Cape Clear. It’s now the turn of IONA Technologies, which has been acquired by Progress Software for $148.4 million. The acquisition follows a few false starts, including a bid from Software AG.

IONA’s legacy CORBA product is widely deployed in the telecommunications and financial industries, although that cannot be too attractive to Progress, which already has plenty of legacy products. Perhaps the encouraging results for IONA’s new Service Component Architecture-based product called Artix influenced the acquisition?

Skype 4.0 promises high definition full screen video

I’m looking forward to getting my hands on Skype 4.0. I wonder how well full screen video will really work on standard broadband connections. Sounds a little too good to be true. You can do full screen video with the current Skype version but the picture quality is pretty poor.

Third Installment of Database Sharding Unraveled

Bogdan Nicolau has published the third article in his “Database Sharding Unraveled” series. He makes an interesting point about planning for database scalability from the start:

Before really diving into high scalability principles, I want to take a moment to talk about why database sharding has an important role even in small startups or medium sized web-sites (5 – 30k unique visitors/day).

It is equally important and benefic for a smaller web business to prepare itself from the beginning to tackle large amounts of users cheap. If it’s not obvious enough, think about what happens to a web-page that gets some plain old Digg attention. The server quickly collapses and the user experience immediately turns from positive to mega negative.

As I’ve explained before, the whole purpose of sharding is to be able to use an unlimited number of cheap machines topped by an open-source database. As experience taught me, the web server will rarely die. Instead, the DB server will choke easily when having to deal with many simultaneous connections.

The database doesn’t even have to be very big.

Bogdan’s focus is building scalable database-driven Web sites – but his comments apply to general applications as well.

A Database Sharding Plan for Twitter

There’s a very interesting post by Hank Williams called Why does everything suck?: A Detailed Five Step Twitter Scaling Plan that goes into great detail about how Twitter can solve its database scalability problems using Database Sharding.

Database Sharding with Python

I just read an interesting post over on highscalability.com. There is an early stage open source project called Pyshards that provides database sharding for Python developers. It’s interesting to see sharding toolkits emerge for languages other than Java.

I’ve been working extensively with database sharding for around 9 months now and it’s an exciting area of technology that offers a very cost-effective way to implement near-linear database scalability using commodity hardware.
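As a rough illustration of the basic idea (not taken from Pyshards or any other product mentioned here), a sharded application needs some way to decide which database holds a given row. The sketch below uses the simplest possible scheme, the shard key modulo the number of shards. The ShardRouter class, the customer ID key and the JDBC URLs are all hypothetical, and real deployments generally need something more flexible so that shards can be added without moving most of the data.

```java
import java.util.List;

// Hypothetical example class: maps a numeric shard key to one of a fixed list of databases.
final class ShardRouter {

    private final List<String> shardUrls;

    ShardRouter(List<String> shardUrls) {
        if (shardUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shardUrls = shardUrls;
    }

    // Returns an index in the range [0, number of shards), including for negative keys.
    int shardIndexFor(long customerId) {
        return (int) Math.floorMod(customerId, (long) shardUrls.size());
    }

    // Returns the connection URL of the shard that owns the given customer.
    String shardUrlFor(long customerId) {
        return shardUrls.get(shardIndexFor(customerId));
    }
}
```

A router built with three URLs, for example new ShardRouter(Arrays.asList("jdbc:mysql://db0/app", "jdbc:mysql://db1/app", "jdbc:mysql://db2/app")), always sends the same customer to the same database, so each commodity server only handles roughly a third of the queries.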

The Good, the Bad, and JSF 2.0

The JSF 2.0 Expert Group has released an Early Access Draft of the next version of the specification, and it’s looking ugly. While it would be all too easy to provide a long list of reasons why the new version of the specification is disappointing, it should be sufficient to point to the comments on The Server Side discussion to understand why CodeFutures will not be adding JSF support to FireStorm/DAO.

The simple fact is that not a single FireStorm/DAO user has ever requested JSF support, and there are technologies out there that are far more impressive, such as Flex and some Ajax implementations.

Is Maven the right choice for your project?

Over the past 6 months or so I have been working with some projects that have a Maven2 build system and other projects that use plain Ant build systems. I love the concepts behind Maven and see tremendous value in standardizing the build process across projects as well as having a better way of managing dependencies on third party jars without storing them in each project’s source repository.

However, there are some pitfalls to be aware of before adopting Maven. Whether they are real issues for you depends on your requirements.

First of all, adopting Maven will typically require a Maven server (repository) to be set up and maintained. For some smaller development departments this creates an extra IT support burden that can be avoided with an Ant build system.

Another potential issue is that third party projects might not have good support for Maven. For instance, another Apache project, Axis2, recently released version 1.4 but there is an open support ticket AXIS2-3069 regarding the Maven2 java2wsdl plugin which does not seem to be in a working state yet. The ticket has not been updated since 2007. Sure, the problem can be worked around by calling Ant from within the Maven2 project, but that’s adding another layer of complexity compared to a standard Ant project.

Another issue that I have hit is that Maven2 has poor error reporting when it is unable to resolve a dependency. I eventually tracked my issue down to a Java keystore / server certificate error, but Maven didn’t provide any hints that this was the problem, even with debug flags set.

Overall, I’m still a fan of Maven2 and I think the combination of Maven and Ant is extremely powerful but in my recent experience it does add extra cost to the development process.
