Glass, Metal and Silicium

AAC in FFMPEG

Thursday, August 21, 2008, 04:20 PM - Audio coding

There is a lot of activity regarding AAC in FFMPEG since July.

The AAC decoder from Summer of Code (SoC) 2006 is now included in the official FFMPEG tree, after cleanups by Robert Swain. It is an AAC-LC and AAC-Main profile decoder, so it cannot yet replace FAAD/FAAD2, as it doesn't decode SBR and PS (used in HE-AAC and HE-AACv2), but it's a good step in the right direction. There is also an SBR decoder in the works, so we will perhaps see this in a few months.

On the encoder side, Konstantin Shishkov has been working on an AAC-LC encoder this summer. While it is not yet competitive with the best AAC-LC encoders, it already provides a good feature set and, best of all, it features a psychoacoustic model (it should be the first psychoacoustic model to hit FFMPEG). The psy model is inspired by the 3GPP model, which is a stripped-down encoder from Coding Technologies, provided with some basic documentation of the algorithms.


Is Vorbis really a standard?

Tuesday, December 11, 2007, 11:45 AM - Audio coding

Until now, Ogg/Theora and Ogg/Vorbis were included within the HTML 5 recommendation as a feature that "should" be supported:

"User agents should support Ogg Theora video and Ogg Vorbis audio, as well as the Ogg container format."

Nokia just issued a position paper about it, recommending the removal of this part from the HTML 5 draft until further clarification. This caused a lot of noise, with several people questioning the move.

While I agree that there are some strange points in Nokia's position paper, I also have to agree that the W3C should not substitute itself for ISO, ITU, or SMPTE regarding audio and video coding standards, and that neither Vorbis nor Theora can really be considered a standard at this point.

For sure, the current Vorbis specs would probably not be accepted by the usual standardization bodies. Most of the issues would be easily solved, but that is, to me, a demonstration of the usefulness of ISO/ITU/SMPTE standardization processes.

As I'm not familiar with the Theora specs, I'm just going to have a quick look at the issues in the current Vorbis specification:

*Is the current spec a draft or a final document?

Right now, it's marked as 0.8, without any mention of a "freezing" date. It's very likely to be a final document, but why isn't that stated anywhere?

*Lack of proper external references.

Granted, this is a very minor issue, but why is there no consistent use of external references? In section "1.3.2.3. Window shape decode" there is a proper reference to an external paper, but section 7.1 refers to Bresenham’s algorithm without any external reference. Any reader interested in the Vorbis spec is very likely to be aware of this algorithm, but this is just one example of a place where external references should be added.

*Lack of profiles

A few things within the Vorbis specs can be problematic for implementation within heavily constrained environments. This could be solved by having profiles within the specs, and is acknowledged by section "1.1.6. Hardware Profile".
The problem is that this hardware profile is not actually defined within the standard.

*Codebook transmission

Vorbis mandates the use of custom codebooks, which have to be transmitted as side information in one way or another. That gives great flexibility, but in the real world, most cases could happily be covered by a standard set of codebooks. Having a base profile with fixed codebooks would allow storing them in ROM instead of RAM, which could be of great value for some embedded devices.

*MDCT window size

According to the Vorbis specs, a stream must feature two different window sizes, which is a usual feature for such an audio format. The problem is that each size can be 64, 128, 256, 512, 1024, 2048, 4096 or 8192 samples, instead of the format having only 2 fixed sizes. That is additional complexity that is mostly unneeded: having only 2 sizes would ease code optimizations, and would avoid the 8192 size, which can be stressful regarding memory use (and memory transfers).
Once again, this issue should be taken care of by defining some profiles.
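
To give an idea of what a decoder has to be ready for, here is a minimal Python sketch that enumerates the allowed window size pairs and the worst-case buffer size. The function names are made up for the illustration, the 4 bytes per sample figure is an assumption (32-bit samples), and the short size is assumed to be strictly smaller than the long one.

```python
# Window sizes listed above: powers of two from 64 to 8192 samples.
VALID_BLOCKSIZES = tuple(1 << e for e in range(6, 14))

def check_blocksizes(short, long_):
    """True if (short, long_) is an allowed pair under the assumptions above."""
    return (short in VALID_BLOCKSIZES
            and long_ in VALID_BLOCKSIZES
            and short < long_)  # assumption: the two sizes must differ

def allowed_pairs():
    """All (short, long) combinations a generic decoder may encounter."""
    return [(s, l) for s in VALID_BLOCKSIZES for l in VALID_BLOCKSIZES if s < l]

def window_bytes(blocksize, bytes_per_sample=4):
    """Memory for one window of samples (bytes_per_sample is an assumption)."""
    return blocksize * bytes_per_sample

if __name__ == "__main__":
    print(len(allowed_pairs()), "possible size pairs")
    print(window_bytes(max(VALID_BLOCKSIZES)), "bytes for the largest window")
```

A profile with 2 fixed sizes would reduce the list returned by allowed_pairs() to a single entry, which is what makes size-specific optimizations practical.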

*The two types of floor curves

As defined by section 4.2.4.3, Vorbis allows two different kinds of floor curves. However, only 1 of those is really used by all the current Vorbis encoders. Why not just keep one, and possibly put the other type in an "extended" profile?

*Floor+residue dynamic range

As mentioned in the specs, some parts of decoding (see section 1.3.2.8) cannot be done using 32-bit fixed point. You either need a movable binary point (quite inefficient) or 64-bit fixed point (64-bit computations also being quite inefficient on many architectures). This should really be addressed in one way or another, either by profiles or by "limited accuracy decoders" defined by the specs.
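
As a rough illustration of why the word size matters, here is a small Python sketch converting a dynamic range in dB into the number of bits a fixed-point word needs. It relies only on the usual rule of about 6.02 dB per bit; the function names and the dB figures in the example are illustrative assumptions, not values taken from this post.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB of dynamic range per bit

def bits_for_range(range_db):
    """Magnitude bits needed to cover range_db of dynamic range."""
    return math.ceil(range_db / DB_PER_BIT)

def fits_in_word(range_db, word_bits=32):
    """True if a signed fixed-point word of word_bits can hold the range."""
    return bits_for_range(range_db) + 1 <= word_bits  # +1 for the sign bit

if __name__ == "__main__":
    # Illustrative figures only: one operand spanning 140 dB, and a value
    # that has to span the combined 280 dB range of two such operands.
    for db in (140, 280):
        print(db, "dB:", bits_for_range(db) + 1, "bits with sign,",
              "fits in 32 bits" if fits_in_word(db) else "needs a wider word")
```

With such figures, a single operand fits comfortably in 32 bits, but a value covering the combined range does not, which is the situation that forces either a movable binary point or 64-bit arithmetic.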

*No maximum packet size

According to section 1.1.3, there is no maximum packet size. While this will work on a personal computer, once you go into embedded devices, there is no such thing as unlimited memory, especially if you are considering the local memory of your processor/DSP/whatever.
Moreover, if there is no maximum packet size or maximum bitrate, how do you test performance of your decoder?

*No constant bitrate mode

The specifications do not define any constant bitrate (CBR) mode. That would not be that hard to do: all that is needed is defining a coding buffer/reservoir, and allowing a way to signal its current level (this could be done at the transport level). In many streaming cases, CBR or "capped VBR" (i.e. VBR with a maximum bitrate, considering a given buffer size) is a necessity.
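
The reservoir idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not anything defined by the Vorbis specs: the function name is hypothetical, every packet is assumed to cover the same duration, and the reservoir is assumed to start full.

```python
def check_capped_vbr(packet_bits, budget_bits, reservoir_bits):
    """Check a sequence of packet sizes against a bit reservoir.

    packet_bits    -- size of each packet, in bits
    budget_bits    -- bits granted per packet (max bitrate * packet duration)
    reservoir_bits -- capacity of the reservoir, in bits

    Returns (ok, levels): ok is False as soon as a packet needs more bits
    than the reservoir holds; levels lists the reservoir level after each
    packet, which is the value that would have to be signaled.
    """
    level = reservoir_bits  # assumption: start with a full reservoir
    levels = []
    for used in packet_bits:
        level -= used
        if level < 0:
            return False, levels
        levels.append(level)
        # Refill for the next packet; unused bits beyond capacity are lost.
        level = min(level + budget_bits, reservoir_bits)
    return True, levels

if __name__ == "__main__":
    print(check_capped_vbr([2000, 1000, 500, 1500], 1000, 2000))
    print(check_capped_vbr([2000, 1500], 1000, 2000))
```

With a bound like this, the largest possible packet is simply the reservoir capacity, which also answers the maximum packet size question raised above.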

*A standard should, as much as possible, be set in stone

"However, the Xiph.org Foundation and the Ogg project (xiph.org) reserve the right to set the Ogg Vorbis specification and certify specification compliance."
Sounds a bit frightening. Please mention somewhere that the standard is in its final stage and won't be changed at a later stage.


Quad cores benchmarks and LAME

Tuesday, November 7, 2006, 12:39 PM - LAME

Some samples of Intel "quad cores" processors are now available. As with the introduction of dual cores, we'll see a lot of hardware websites doing some benchmarks. Several of them will also use LAME encoding as one of the tests.

In the current version (3.97), LAME is not multithreaded at all, so for a single encoding it's obvious that adding cores will not change anything. There is no need to run a benchmark to know it, and no need to display bar graphs of encoding speed results (they are quite pointless).

Message to benchmarkers:

If you want to use mp3 encoding for a multicore/multiprocessor benchmark, then you have the following choices:

*use a natively multithreaded encoder (example: iTunes)

*use a multithreaded frontend and encode several files. As an example, EAC allows you to choose the number of threads. If you use more than 1, of course you will benefit from the added cores.
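
The second option can be approximated with a few lines of Python. This is a minimal sketch, not what EAC actually does: it assumes a lame executable is available on the PATH, uses default settings ("lame input.wav output.mp3"), and the function names are made up for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def lame_command(wav_path):
    """Build the command line for one encode at default settings."""
    wav = Path(wav_path)
    return ["lame", str(wav), str(wav.with_suffix(".mp3"))]

def encode_all(wav_paths, workers):
    """Encode several files at once, one lame process per worker thread.

    Returns the exit code of each lame process, in input order.
    """
    def run(path):
        return subprocess.run(lame_command(path), check=False).returncode

    # Threads are enough here: each one just waits for an external
    # process, so the real work is spread over the cores by the OS.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, wav_paths))

if __name__ == "__main__":
    files = sorted(Path(".").glob("*.wav"))
    print(encode_all(files, workers=4))
```

For a benchmark, the number to report is then the wall-clock time for the whole batch as the number of workers goes from 1 up to the number of cores.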


LAME encoding speed evolution

Saturday, October 28, 2006, 01:27 PM - LAME

For the sake of curiosity, I tested the evolution of encoding speed of Lame using different versions. To simplify things, encoding is done at default settings (128kbps/cbr). Computer features MMX/3dnow/SSE/SSE2, audio file is about 1 minute long.

Here are the results:

3.20: 3.51s
3.50: 6.49s
3.70: 3.78s
3.80: 3.70s
3.90: 3.74s
3.93.1: 4.58s
3.96.1: 5.53s
3.97: 5.31s

Looking at this, it's clear that overall Lame has become slower over time. We tried to keep speed reasonable, but as we increased quality, our speed optimisations were not enough to compensate for the extra computations.

We had a notable speed decrease when releasing 3.94. In 3.94, we switched our default psychoacoustic model from GPsycho to NSPsytune. NSPsytune was already used (starting with 3.90) when using the preset/alt-preset settings. To have a comparison value, here is the "preset cbr 128" speed of versions prior to 3.94:

3.90 --alt-preset cbr 128: 9.62s
3.93.1 --preset cbr 128: 6.55s

So between 3.90 and 3.97, NSPsytune's encoding time decreased from 9.62s to 5.31s. Not bad at all, but still quite a bit slower than early Lame releases.
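
For those who prefer ratios, the figures above can be turned into relative numbers with a short Python sketch. The timings are the ones listed in this post; the helper names are made up for the example.

```python
# Encoding times in seconds at default settings, as listed above.
DEFAULT_TIMES = {
    "3.20": 3.51, "3.50": 6.49, "3.70": 3.78, "3.80": 3.70,
    "3.90": 3.74, "3.93.1": 4.58, "3.96.1": 5.53, "3.97": 5.31,
}

def slowdown(version, baseline="3.20"):
    """How many times slower a version is compared to the baseline."""
    return DEFAULT_TIMES[version] / DEFAULT_TIMES[baseline]

def percent_faster(old_seconds, new_seconds):
    """Reduction in encoding time, as a percentage of the old time."""
    return 100.0 * (old_seconds - new_seconds) / old_seconds

if __name__ == "__main__":
    for version in DEFAULT_TIMES:
        print(f"{version}: {slowdown(version):.2f}x the 3.20 encoding time")
    # NSPsytune: 3.90 --alt-preset cbr 128 versus 3.97 default settings.
    print(f"NSPsytune: {percent_faster(9.62, 5.31):.1f}% less encoding time")
```

In other words, 3.97 takes about 1.5 times as long as 3.20 at default settings, while NSPsytune itself lost roughly 45% of its encoding time between 3.90 and 3.97.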
