I wanted to develop a formula that could suggest how to split the videos in order to minimize its transcoding time when transcoding on several machines. For this I needed some numbers.
I found a movie on Vimeo that is longer than 1 hour and downloaded it. Below you can find its parameters (reported by ffprobe):
Duration: 01:08:34.00, bitrate: 2118 kb/s
Video: h264 (High) (avc1), 1280x720, 29.97 fps
Audio: aac (mp4a), 44100 Hz, stereo, 157 kb/s
I have decided to split the whole movie into segments of roughly 60 seconds length and then split the movie into segments of roughly 120 seconds length. Afterwards I have performed two sets of operations on each set of video segments: transcoded the video segments of original picture size, and rescaled video to a smaller size (downscaled to 853x480) and then transcoded.
I've measured how much time does it take to split and join videos and realized that it does not really differ from a simple file copy. It suggests that the bottleneck of splitting and joining most probably is a hard drive, that is why from now on I will treat these operations (split and join) to be the same as a file copy operation (some numbers: splitting 1.1GB size movie took 67 seconds and copying the same movie took 68 seconds).
When I split the movie I've noticed that file size of each segment differed quite significantly. This is natural, since the level of video compression depends on the complexity of the video: the more moving and drastically changing parts in the video picture, the more it will take space on your disk. I wondered if video transcoding time depends on the file size of the segment. For this I've plotted these graphs:
Transcoding times of video segments (1 min length)
Transcoding times of video segments (2 min length)
From these graphs it is quite clear that video transcoding time does not really depend from segment file size. The major factors are video length (s) and picture size (number of pixels). Of course there are other factors too, including scene complexity, frame similarity, but I will consider only the former two. Now when we have a little understanding on what transcoding time depends, we can try to build a formula that would tell us what is an optimal number of segments for transcoding a movie in distributed fashion. Let's assume that every worker node has the same computation capabilities, then we can visualise transcoding timeline like this:
Distributed transcoding timeline
From here we can see that total transcoding time is:
Here:
is time that it takes to split the video into segments,
is the time that it takes to send one segment over the network,
is the time it takes to transcode one segment,
is the time it takes to join all transcoded segments back into a single video, and
is the number of segments.
If we add more variables and/or constants:
an original video file size,
the length of video (seconds),
disk copying speed (B/s),
transferring through network speed (B/s),
transcoded file size (B),
transcoding speed,
then we can expand the previous function like this:
Now, let's get a derivative of this function:
Since neither nor can be a negative value, it means that it is not possible to get an optimal value. In other words, the bigger , the better.
As it is visualised in the transcoding timeline (see above), master does not perform any transcoding, so all the video has to be transferred over the network. There was also an assumption made that a transcoding process is long enough so that the time it takes to transfer transcoded video over the network does not influence overall transcoding time.
Can we predict what will be the file size of a transcoded video assuming that transcoded video will almost always be encoded with H.264 and AAC? I plotted the data I have collected from couple of video transcoding sessions and derived linear functions that could be used for a file size prediction:
Video (H.264/AAC) file size dependency on video length
Here is the length of a video in minutes. Of course, we can't predict the outcome size very accurately, however, a clear linear tendency is visible in the graph, which means that we can guess what would be the transcoded video size on avarage.
Final thoughts
The formula I was trying to find did not gave any reason why it should be used. Then what should I do with it? I will have to measure what is the overhead of a master process, if it is not that big it will probably be possible to put a worker process on the same machine, thus saving the time of a network transfer. In such case the formula will look a bit more useful.
I will have to run some experiments later and compare the results to a video transcoding on one machine and to a theoretical speed-up (Amdahl's law):
Here - fraction of the calculation that must be executed serially, - number of processors.
The main idea of my future amazing system is to split video file into pieces, send them to workers in order to re-encode these pieces and then concatenate them back into a single video file. What tool will come to your mind for completing such task? For me it's FFmpeg. It's an astonishing tool for decoding, encoding, resizing and performing other manipulations on video files. You may cut and concatenate videos as well! How cool is that?
The ffmpeg that comes with Ubuntu is actually avconv. Since I wanted the true version of FFmpeg, I've first downloaded the source code:
Finally, I've made the last step in order to build everything:
make
According Y. Sambe et al. work "High-speed distributed video transcoding for multiple rates and formats", a good result can be achieved when you split the video in between the GOPs. This makes sense, since every GOP should start with an i-frame (the frame that contains all the information, not just differences between frames). But in video files Decode Time Stamp (DTS) and Playback Time Stamp (PTS) may differ which may introduce some problems. Authors state that this may lead into a situation where despite i-frame being first in the GOP it may not be the one that will be played first. They call such GOP an Open-GOP. To me it seems a bit strange. I haven't confirmed such thing yet, but it doesn't make sense to play an i-frame after the b-frame. Authors continues that because of existence of Open-GOP and several other reasons, it is good to split videos in such a way that every piece (except the first one) would have one additional GOP in the beginning (which is the last GOP of the previous piece). They did some tests and somehow showed that it does have a slight effect on the resulting video quality after transcoding process.
For testing purposes, let's try splitting videos so that every piece would contain complete GOPs. For this, we need to know how big the GOP of the video is. There is a tool called ffprobe that shows various information about streams in a video container, but to my disappointment this tool cannot show the GOP size. In order to make it show this information, I needed to add a single line of code to ffprobe.c:
staticvoid show_stream(WriterContext *w, AVFormatContext *fmt_ctx,int stream_idx){
...
case AVMEDIA_TYPE_VIDEO:
print_int("width", dec_ctx->width);
print_int("height", dec_ctx->height);
print_int("has_b_frames", dec_ctx->has_b_frames);
print_int("gop_size", dec_ctx->gop_size);// A single line is all I need...
After recompiling and then launching ffprobe, I've learned the details about my video clip:
Good, so now I know that there should be exactly two i-frames per second. This means that it should be possible to nicely split video into pieces of 2 seconds length. In order to test this little theory, I wrote a small python script that would generate me an ffmpeg command for splitting the video:
importsysif __name__=="__main__":
s="ffmpeg -i video.mp4 \n"for i inrange(0,60,2):
s+="-vcodec copy -acodec copy -ss 00:00:"+str(i).zfill(2)
s+=" -t 00:00:02 out"+str(i)+".mp4 "if i
And also a script that concatenates these pieces back into a single video file:
I've uploaded the resulting video (you can also see an original video) to YouTube:
As you can see, this method of splitting and concatenating greatly reduces the quality at splitting points (every 2 seconds). There is no degradation of quality in audio, despite this, such level of quality is unacceptable for production use.
In fact, after the video was split I was expecting that the duration of video file pieces would be exactly 2 seconds. Instead it turned out to be like this:
Björn recommended me to use libavcodec library directly instead of using ffmpeg. This sounded like a solution, so I spent a couple of days reading libavcodec code. But what I've found out is not very pleasing.
There is a function in libavcodec called av_seek_frame(). However, it is not very reliable. First, you cannot specify a frame number where you want to jump to. Moreover, according to a blog post Picture Go Back, it is not possible to reliably jump to a frame you want:
I repeatedly tried to seek forward and backwards to different frames -- frame 5000, 10,000, and 15,000 in divx, avi, and other video formats. Each time, the resulting location is close, but not exact. FFmpeg thinks it knows the frame number after seeking, but usually it is off. Frankly, when I want to jump to frame 5000, I want to be at frame 5000 and not 5015, 4079, or some other nearby frame.
So, I've just thought that maybe I can just scan the file without decoding it and check where are the beginnings of GOPs. However, I did not find any field that could provide this kind of information, but since all GOPs should start with an i-frame, I may try to just cut before each i-frame. However, I have to decode a frame in order to learn if it's an i-frame or not, and I don't really want to do that. And I really don't want to develop my own tool, because it will not stand a chance against ffmpeg in terms of supported formats, even if I use libavcodec.
And my research continues... Now, I'm thinking to look into VLC, see if it can cut accurately and if so, see if it is possible to use it as a library. Another option is to actually try to implement a new option in ffmpeg that would perform video copying, but will split video file nicely into pieces so that it would be possible to playback smoothly after joining these pieces back into a single video file.
Edit: I have to mention that I've found another way how to split videos using ffmpeg:
After splitting a video using this method and joining the pieces back together, artefacts are still created in between split points. Slight time spaces appears between each segment and the total length gets increased as well. At least no frames are dropped which means that there is probably a slight bug somewhere. It may be a good idea to report this to FFmpeg community and see what are they thinking.
To visualise the behaviour, I have performed a simple test, I've end up with a video that is 00:00:08.05 length (as reported by ffprobe. It also reported errors with STTS twice), however it actually contains around 1 minute video. What I did then is:
ffmpeg-i joined.mp4 -c copy fixed.mp4
Then ffprobe reported that a duration of a file is 00:01:07.40 (still reported STTS error once). Here is the resulting video:
Once you have decided what you want to do for your distributed systems project, you have a broad selection of tools out there that may or may not help you. Actually, there are so many, that after wasting several hours just by looking through them, you can get a headache. This is what happened to me, so if you are looking into it, let me reduce your burden by summarizing my thoughts. There are couple of approaches for developing a distributed system. First, you can use an existing framework or platform (such as the famous Apache Hadoop) for managing a big portion of work for you. In fact, that would be a preferable approach, it would help to avoid bugs and reduce development time. There are number of such frameworks to choose from:
Apache Hadoop is the blockbusting project that contains distributed file-system and map-reduce like distributed computation model. It's written in Java and therefore requires your code to be in Java too (I suppose, Scala would fit too). However, it is still possible to code in different languages once you use Hadoop Streaming API. Hadoop provides a map-reduce programming model, where the data is first devided into groups and assigned to workers and the later collected and "reduced".
Disco projectis a Hadoop MapReduce alternative developed by Nokia Research Center. It is written in Erlang, but users usually write algorithms for it in Python.
Spring Batchis part of a Spring project and used to distributed the workload across computers. It seems that it fits whenever you want to use Java EE and split the work according to master-workers programming model.
Gearmanyet another framework for developing distributed systems. May be worth to take a look.
Keep in mind that this list is very short (or maybe too short). There is a lot of research in this field, e.g. a lot of researchers try to escape map-reduce paradigm and write systems that improve performance for more difficult computational tasks. Such systems tend to use directed acyclic graphs or so. Examples of these systems would be Dryad or Spark. If you do not want to use the framework, or maybe the framework does not fit for the task you want to solve, you may build your own architecture. For this it may be a good idea to use an actor model or some kind of message passing library. A couple examples of actor model frameworks:
If nothing touches your heart then at least you may want some tools for building your network protocols. Tools that could help serialize and de-serialize objects you send over network, such as:
Protobufhas support for Java, C++ and Python and a very good documentation;
Apache Thriftsupports many languages (including Python, Ruby, Java, C++), however does not have as good documentation as protobuf has.
Up until now, I don't really know what I will be using for my project, however my heart falls down to custom protocol thingie!