Recently, I gave a presentation during one of my classes at KTH. I presented DryadLINQ, and I thought it would be a nice idea to share the knowledge with a broader audience.
Dryad is Microsoft's answer to the now very popular Hadoop MapReduce. It is designed to help programmers write data-parallel programs that scale to thousands of machines. The biggest difference between the two systems is that Dryad lets programmers build their own execution graph, while in MapReduce you are tied to map and reduce.
A Dryad job, as I mentioned earlier, is an acyclic graph (made of vertices and edges, of course). All vertices in this graph are programs (or pieces of code) that run on machines. Edges are data channels used to transfer data between those programs. So a Dryad job can be seen as a program, and a vertex as a single execution unit of that program.
In Dryad, the machine responsible for coordinating the execution of jobs is called the Job Manager (JM). It:
- instantiates a job's data-flow graph,
- schedules processes (vertices) on a cluster of computers,
- provides fault tolerance by rescheduling failed or slow vertices,
- monitors the execution and collects statistics.
The figure representing the Dryad architecture can be seen on slide 4. You have already been introduced to the Job Manager. The Name Server (NS) controls the group membership of the PD nodes; a PD node is an execution daemon that runs the actual vertices. They all communicate with each other through a data plane. Here comes an interesting part: in Dryad, vertices can pass data in several ways:
- Through files. A vertex can write a file locally so that another vertex executed on the same node can read it.
- Through TCP. A PD transfers data directly to a vertex executing on another PD.
- Through an in-memory data structure. If the data fits in memory, why not keep it there, so that the next vertex can access it faster?
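Dryad's channel implementations live inside the PD daemons, but the in-memory variant is easy to picture with standard .NET primitives. Here is a rough analogy of my own (not Dryad code): two "vertices" in one process connected by a bounded in-memory FIFO.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class FifoChannelSketch
{
    static void Main()
    {
        // The in-memory channel: a bounded FIFO between two "vertices".
        var channel = new BlockingCollection<int>(boundedCapacity: 16);

        // Upstream vertex: produces records into the channel.
        var producer = Task.Run(() =>
        {
            for (int i = 0; i < 5; i++)
                channel.Add(i * i);
            channel.CompleteAdding();   // signal end-of-stream
        });

        // Downstream vertex: consumes records as they arrive,
        // with no file or network I/O in between.
        foreach (int record in channel.GetConsumingEnumerable())
            Console.WriteLine(record);

        producer.Wait();
    }
}
```

In real Dryad the two endpoints can be separate processes, and the runtime picks the channel type (file, TCP, or shared memory) for you.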
OK, so this sounds nice on paper. However, Dryad (just like MapReduce) runs into the problem of complexity: a programmer has to design their own DAG (directed acyclic graph, i.e. a Dryad job) and code it. That may sound simple, but it isn't. Since programmers already know SQL pretty well, a number of solutions implement an SQL-like syntax on top of MapReduce; Hive and Pig are two examples. Their main goal is to make programmers' lives easier and let them write simple SQL, while the system takes responsibility for executing the query efficiently on a cluster of machines. Nevertheless, pure SQL has several shortcomings: no custom types (like structs), no common programming constructs (loops, conditionals), and so on. Microsoft had already built a solution to these problems: LINQ. So why not make a LINQ for Dryad? This would definitely have a couple of advantages, since LINQ:
- is already integrated into C#, so you don't have to write SQL parsers;
- provides a simple, SQL-like query syntax;
- has common programming constructs, such as iteration;
- supports custom types (.NET objects);
- has strong Visual Studio support.
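To make the list above concrete, here is what an ordinary, purely local LINQ query looks like in C# (a plain in-memory word count of my own, nothing distributed yet):

```csharp
using System;
using System.Linq;

class WordCount
{
    static void Main()
    {
        var words = new[] { "dryad", "linq", "dryad", "map", "reduce", "dryad" };

        // SQL-like syntax, but fully type-checked C# over regular .NET
        // objects; the anonymous type { Word, Count } is a custom type
        // you get for free.
        var counts = from w in words
                     group w by w into g
                     orderby g.Count() descending
                     select new { Word = g.Key, Count = g.Count() };

        foreach (var c in counts)
            Console.WriteLine($"{c.Word}: {c.Count}");
    }
}
```

The same query shape, pointed at a distributed table instead of a local array, is essentially what DryadLINQ runs on a cluster.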
Indeed, this is what the Microsoft guys did: DryadLINQ. The system compiles a LINQ query into an execution plan and runs it on Dryad. But before jumping into details, go and see slides 9 and 10; they show what changes are needed to turn LINQ into DryadLINQ. GetTable and ToDryadTable are the essential function calls in DryadLINQ. The former tells Dryad where to look for the input and what .NET type the input records should have. The latter tells where the output should be stored and actually starts the execution.

Now let's talk about DryadLINQ execution. When ToDryadTable is invoked, DryadLINQ takes the LINQ expression and compiles it into a Dryad execution plan: it decomposes the expression into sub-expressions, generates code and static data for the Dryad vertices, and generates object-serialization code. DryadLINQ then passes this plan to the Dryad Job Manager (a custom Job Manager, actually), which transforms the execution plan into a job graph and schedules vertex execution on the PDs. The executing vertices read all the input they need from input tables (the data store provided in the call to GetTable). When execution is over, the output is written to output tables (the data store provided in the call to ToDryadTable). DryadLINQ then wraps the output table and gives the application an iterator from which it can read regular .NET objects.

DryadLINQ also performs some optimizations. The most important static optimizations are:
- Pipelining. When multiple operators can run in a single process, it is better to execute them as a pipeline.
- Removing redundancy. Unnecessary hash or range partitioning steps are removed.
- Eager aggregation. When aggregation is needed, it is better to pre-aggregate each part of the data before sending it over the network.
- I/O reduction. When the data is small and fits in memory, it is better to pass it to the next vertex through in-memory FIFO channels.
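Eager aggregation is the same idea as a combiner in MapReduce. Sketched in plain LINQ terms (illustrative only; `partitionWords` and `allPartials` are hypothetical names for one node's input and for the union of every node's partial results):

```csharp
// Eager aggregation sketch: each partition pre-counts its own words, so
// only one (word, partialCount) pair per distinct word crosses the
// network instead of every single occurrence.
var partial = from w in partitionWords          // words on ONE node
              group w by w into g
              select new { Word = g.Key, Count = g.Count() };

// A downstream vertex then merges the partial counts from all partitions.
var total = from p in allPartials               // all nodes' partial results
            group p by p.Word into g
            select new { Word = g.Key, Count = g.Sum(p => p.Count) };
```

This works because counting (like summing) is decomposable: partial results can be combined into the same answer you would get from counting everything in one place.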
The system also provides dynamic optimizations: run-time optimizations that depend on the size of the data, such as dynamic aggregation-tree generation and data partitioning. And that's it, folks! This is a brief introduction to DryadLINQ. Now, should you jump in and start using it? Well, ZDNet announced that Microsoft will no longer offer Dryad in its cloud services and will use Hadoop instead. I would reason that they made this choice because Hadoop is much more popular, and it's much easier to attract clients when offering Hadoop. The rest is up to you!