
Original wav file : http://ccrma-www.stanford.edu/~bosse/
Pseudo code
Initialization : calculate the number of frames
REPEAT
  Read wav file one frame : read 16bit signed integer PCM data
  Encode one frame : transform PCM data to a character string consisting of '0' and '1'.
  Write mpg file one frame : write bits according to the '01' string.
UNTIL last frame
STOP
note: Without the reservoir, each frame is independent of the other frames, which makes the encoding process far easier. Therefore the reservoir is not used here.
In the wav file, 16bit signed integer PCM (Pulse Code Modulation) data are written. In the first part, PCM data for one frame are read.
In UZURA, the 16bit signed integers are returned as double precision real numbers, because the tables of the ISO documents[1] give figures with up to 9 digits, while single precision reals carry only 7 to 8 significant digits.
The reason why the ISO tables are given to 9 digits is perhaps to handle 32bit integer PCM input (2^-31 = 4.66d-10).
UZURA expects a 16bit PCM wav file ripped from a CD.
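As a rough illustration of the reading step, the following Python sketch decodes one frame of interleaved 16bit PCM into double precision values. It is not UZURA's code: the function name `read_frame`, the zero padding of a short last frame, and the absence of any rescaling are assumptions of this sketch. The frame length of 1152 samples per channel is that of MPEG-1 Layer III.

```python
import struct

SAMPLES_PER_FRAME = 1152  # samples per channel in one MPEG-1 Layer III frame


def read_frame(raw: bytes, n_channels: int = 2):
    """Decode one frame of little-endian 16bit signed PCM into floats.

    Returns one list of Python floats (double precision) per channel.
    A short final frame is padded with zeros (an assumption of this sketch).
    """
    n_expected = SAMPLES_PER_FRAME * n_channels
    n_available = len(raw) // 2              # two bytes per sample
    n = min(n_available, n_expected)
    ints = struct.unpack("<%dh" % n, raw[: 2 * n])
    samples = [float(v) for v in ints] + [0.0] * (n_expected - n)
    # WAV data interleave the channels: L0 R0 L1 R1 ...
    return [samples[ch::n_channels] for ch in range(n_channels)]
```

Python floats are IEEE double precision, so the 16bit values are represented exactly.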
The ISO document defines only the Encode part.
In this encoding part, the PCM data are transformed into the frequency domain by applying the hybrid filter, which consists of a polyphase filter bank and an MDCT (Modified Discrete Cosine Transform) (array x).
Next these data are quantized (array ix).
Then the quantized data are compressed by Huffman coding.
In UZURA, the Huffman coded bits are returned as a string of '0' and '1' characters.
In this part, the Huffman coded data for a frame are written to a file, together with the MPEG header and side information.
In UZURA, the '0' and '1' characters are gathered 8 at a time into bytes and written out.
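The packing of the '0'/'1' string into bytes can be sketched as follows. The function name `bits_to_bytes` and the zero padding of an incomplete last byte are assumptions of this sketch; the bits are packed most significant bit first, as in an MPEG bit stream.

```python
def bits_to_bytes(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, most significant bit first."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bit string may contain only '0' and '1'")
    # Pad with zeros so that the length is a multiple of 8 (assumption of this sketch).
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

For example, the 12 sync bits followed by '1011' give the two bytes FF FB.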
Pseudo code
Polyphase filter bank : Matrix multiplication
MDCT : FFT
Bit allocation
  Outer loop : Minimization problem of a function
    Inner loop : Solve an equation by iteration
Huffman code : Table look up
RETURN
PCM data are transformed to frequency domain.
The polyphase filter bank consists of 32 band pass filters of equal width.
In the ISO document[1], a prototype filter is given as a table.
By shifting this in frequency, the 32 band pass filters are obtained.
(A shift in the frequency domain corresponds to multiplication by a phase factor in the time domain, which seems to be why it is called a 'polyphase' filter bank.)
In UZURA, this is done by matrix multiplication[2].
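A minimal sketch of the analysis step, following the flow chart of the ISO document as far as I can reconstruct it: window the 512 most recent samples, fold them into 64 values, and multiply by a 32 x 64 cosine matrix. The 512 window coefficients come from the ISO table and are not reproduced here, so `window` is a parameter; the function name is hypothetical and this is not UZURA's implementation.

```python
import math

# Analysis matrix: M[i][k] = cos((2i + 1)(k - 16) * pi / 64), i = 0..31, k = 0..63
MATRIX = [[math.cos((2 * i + 1) * (k - 16) * math.pi / 64) for k in range(64)]
          for i in range(32)]


def polyphase_analysis(x, window):
    """Return 32 subband samples.

    x      : the 512 most recent input samples
    window : the 512 analysis window coefficients (ISO table, supplied by the caller)
    """
    if len(x) != 512 or len(window) != 512:
        raise ValueError("x and window must both have 512 entries")
    z = [w * v for w, v in zip(window, x)]                           # windowing
    y = [sum(z[k + 64 * j] for j in range(8)) for k in range(64)]    # partial sums
    return [sum(m * v for m, v in zip(row, y)) for row in MATRIX]    # matrix multiplication
```

With an all-ones window (not the real one) and a single unit sample at index 16, every subband output is cos(0) = 1, which is a convenient sanity check of the matrix.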
The outputs of the polyphase filter bank are divided much more finely in frequency by the MDCT.
In UZURA, an MDCT of length N is implemented as an FFT of length N/4[3] (proof).
A radix-3 FFT is required here.
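For checking an FFT-based MDCT, a direct evaluation of the MDCT definition is useful as a reference. The sketch below is the plain O(n^2) sum, not the FFT method used in UZURA; the function name is hypothetical and the input is assumed to be already windowed.

```python
import math


def mdct_direct(z):
    """Direct O(n^2) MDCT of a windowed block z of even length n; returns n/2 values.

    x(i) = sum_k z(k) * cos( pi/(2n) * (2k + 1 + n/2) * (2i + 1) ),  i = 0 .. n/2 - 1
    """
    n = len(z)
    if n % 2:
        raise ValueError("block length must be even")
    return [sum(z[k] * math.cos(math.pi / (2 * n) * (2 * k + 1 + n / 2) * (2 * i + 1))
                for k in range(n))
            for i in range(n // 2)]
```

For a long block n = 36 gives 18 coefficients per subband, and for a short block n = 12 gives 6; the corresponding FFT lengths N/4 are 9 and 3, which is where the radix-3 FFT comes from.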
The data x(576) out of the hybrid filter are quantized as ix(576).
This routine has a double loop structure, i.e. an outer loop and an inner loop.
Pseudo code
Initialization :
  obtain allowable distortion
REPEAT : [Outer loop] search best scale factor
  set scale factor
  REPEAT : [Inner loop] decide quantization step
    set quantization step
    calculate required bits
  UNTIL (required bits < allowed bits)
  calculate distortion
UNTIL (exit condition is satisfied)
RETURN
The outer loop is essentially a minimization problem of a function.
This problem reduces to a search in the scale factor space for the set of values that minimizes a quantity called 'distortion'.
Here we have to define two things.
One is the scalar value 'distortion', which mathematically is the definition of a 'norm' in the x-space.
The other is a search algorithm in the scale factor space that can decrease the distortion.
In UZURA, the example in the ISO document is adopted.
Euclidean norm is chosen.
However, it is normalized by the width of each scale factor band.
    norm_long = 0.0
    DO iscfb = 0, 21
      tmp = 0.0
      bw = iend(iscfb) - istart(iscfb) + 1 : band width
      DO i = istart(iscfb), iend(iscfb)
        tmp = tmp + | x(i) / scale_factor(iscfb) |^2
      END DO
      norm_long = norm_long + tmp / bw : normalize with band width (old form)
      norm_long = norm_long + tmp : non-weighted Euclidean norm (current form)
    END DO
    norm_long = SQRT(norm_long)
(2002-7- 8) distortion is now scaled by 'scale_factor'
(2002-7-21) norm is now not weighted by the scale factor band widths (due to the change of the ATH function)
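The norm above can be sketched in Python as follows. The function name and the `weight_by_width` switch (which selects the older band-width-weighted form) are assumptions of this sketch, and `scale_factor` is taken here to be a non-zero linear gain per band, which the pseudo code does not state explicitly.

```python
import math


def distortion_norm(x, scale_factor, istart, iend, weight_by_width=False):
    """Euclidean norm of x over the scale factor bands, each band scaled by its gain.

    istart[b] .. iend[b] (inclusive) are the indices of band b.
    """
    total = 0.0
    for b in range(len(istart)):
        tmp = sum(abs(x[i] / scale_factor[b]) ** 2
                  for i in range(istart[b], iend[b] + 1))
        if weight_by_width:
            tmp /= iend[b] - istart[b] + 1   # older form: normalize with band width
        total += tmp
    return math.sqrt(total)
```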
The square sum of quantization noise and allowed noise for each scale factor band is calculated.
If the square sum of the quantization noise for a scale factor band is larger than that of the allowed noise, the scale factor of that band is increased by 1.
    DO iscfb = 0, 21
      dx(iscfb) = 0.0
      dt(iscfb) = 0.0
      DO i = istart(iscfb), iend(iscfb)
        dx(iscfb) = dx(iscfb) + |x(i) - x'(i)|^2
        dt(iscfb) = dt(iscfb) + |th(i)|^2
      END DO
      IF ( dx(iscfb) > dt(iscfb) ) THEN
        ds(iscfb) = 1
      ELSE
        ds(iscfb) = 0
      END IF
      scale_factor(iscfb) = scale_factor(iscfb) + ds(iscfb)
    END DO
Here x' is defined as x'(i) = SIGN(ix(i)) * |ix(i)|^(4/3) * 2^( (qquant + quantanf) / 4 ).
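The update rule above can be written as a small Python function. This is only a sketch: the function name is hypothetical, the dequantized spectrum x' is passed in as `x_deq` rather than computed, and the scale factors are plain integers.

```python
def update_scale_factors(x, x_deq, th, scale_factor, istart, iend):
    """One step of the scale factor search.

    For each band, compare the quantization noise sum |x - x'|^2 with the
    allowed noise sum |th|^2 and raise the scale factor by 1 where the
    noise is larger. Returns (new_scale_factor, ds).
    """
    ds = []
    for b in range(len(istart)):
        idx = range(istart[b], iend[b] + 1)
        dx = sum(abs(x[i] - x_deq[i]) ** 2 for i in idx)
        dt = sum(abs(th[i]) ** 2 for i in idx)
        ds.append(1 if dx > dt else 0)
    new_scale_factor = [s + d for s, d in zip(scale_factor, ds)]
    return new_scale_factor, ds
```

When every entry of `ds` is 0, no band needs a larger scale factor and the search has converged.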
Basically there are two exit conditions from the outer loop.
One is when the s-vector (the set of scale factors) has converged in the scale factor space; that is, when the condition ds(iscfb) = 0 is satisfied for all scale factor bands while applying the above algorithm.
This means that the quantization noise has become small enough in all scale factor bands.
The other is when the s-vector leaves the region of the scale factor space defined in the ISO document while applying the above algorithm.
In this case, the s-vector for the least distortion along the searching path is returned.
The inner loop reduces to the problem of solving an equation by iteration.
The purpose of this part is to obtain the minimum quantization step (quant) for which the required bits do not exceed the allowed bits.
The smaller the quantization step, the smaller the quantization distortion becomes.
The problem to be solved can be written as: find the smallest quant_min such that required_bits(quant_min) <= allowed bits.
Here, the function required_bits(quant) is globally a decreasing function.
Apart from some exceptional cases, it can be regarded as a monotonically decreasing function.
With this assumption, when quant is searched upward from a small value, the above inequality becomes satisfied at some point, and that is the wanted value.
In UZURA, this is solved by the bisection method.
Because the problem is an integer inequality, the employed method is slightly modified from the general form.
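An integer bisection for this kind of inequality might look like the sketch below. It assumes that required_bits is non-increasing in quant, as discussed above; the function name, the search interval and the `None` return value are assumptions of this sketch, not UZURA's actual routine.

```python
def smallest_quant(required_bits, allowed_bits, lo, hi):
    """Smallest integer q in [lo, hi] with required_bits(q) <= allowed_bits.

    Assumes required_bits is non-increasing in q.
    Returns None if even q = hi needs too many bits.
    """
    if required_bits(hi) > allowed_bits:
        return None
    # Invariant: the answer lies in [lo, hi] and hi always satisfies the inequality.
    while lo < hi:
        mid = (lo + hi) // 2
        if required_bits(mid) <= allowed_bits:
            hi = mid          # mid is acceptable, try smaller steps
        else:
            lo = mid + 1      # mid needs too many bits
    return lo
```

Compared with a linear search from the small side, this needs only about log2(hi - lo) evaluations of required_bits, each of which involves a full Huffman bit count.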
In this part, the quantized data ix are compressed by Huffman coding.
This is essentially a uniquely determined process of table lookup.
In UZURA, this is implemented following the example in the ISO document.
Because of the physical limits of the human body, there are sounds that are audible in principle and sounds that are inaudible. Because a sound masks other sounds near it in both the frequency and time domains, sounds that are audible in principle often go unnoticed by our consciousness. On the other hand, physically non-existent sounds are sometimes heard. Psychoacoustic analysis is the study of such effects. By utilizing these characteristics, the required information can be reduced while keeping the perceived quality of the sound.
Here, the minimal psychoacoustic effects required for implementing an encoder are described. There are four things that should be decided by psychoacoustic analysis.
The main purpose of long/short switching is to prevent 'pre-echo' when the sound rises suddenly.
By reducing the block length, the spread of the quantization noise in the time domain can be shortened, while more bits are required in the frequency domain.
The selection of long/short/mixed block must be made before the MDCT.
In UZURA, it is decided in the subband domain, just after the polyphase filter.
A switch to the short block occurs when the square sum of the subband intensities within a granule increases strongly from the previous granule.
If Sum|Subband_present|^2 > Sum|Subband_previous|^2 * switch is satisfied, the short block will be chosen.
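The criterion above is a simple energy comparison between two granules. In this sketch the function name is hypothetical and `switch` is a tunable threshold whose value is not given in the text.

```python
def use_short_block(subband_present, subband_previous, switch):
    """True if the subband energy of the present granule exceeds
    `switch` times the energy of the previous granule."""
    e_now = sum(abs(s) ** 2 for s in subband_present)
    e_prev = sum(abs(s) ** 2 for s in subband_previous)
    return e_now > e_prev * switch
```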
In the masking process, sounds that physically exist but are psychoacoustically inaudible are omitted from the data.
This may have to be done before the NS/MS switching.
In UZURA, only masking by the ATH (Absolute Threshold of Hearing) is applied.
This corresponds to omitting sounds that are principally inaudible.
Generally speaking, more or less the same sounds reach the right and left ears.
By utilizing this correlation between the L and R channels, the required information can be reduced.
In MPEG/Layer 3, a transformation to the average of these channels (Mid-channel) (L+R)/SQRT(2) and to the difference between them (Side-channel) (L-R)/SQRT(2) can be used for that purpose.
The choice between normal stereo and MS-stereo may have to be made before calculating the allowed distortion (noise).
In UZURA, this is decided in the MDCT domain, referring to annex G of the ISO document.
If Sum( ABS(|L|^2-|R|^2) ) < Sum( |L|^2 + |R|^2 ) * xms is satisfied, the MS-stereo is used.
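Both the decision and the transformation are short enough to sketch directly. The function names are hypothetical, `xms` is a tunable threshold whose value is not given in the text, and `left`/`right` stand for the MDCT spectra of the two channels.

```python
import math


def use_ms_stereo(left, right, xms):
    """True if Sum |  |L|^2 - |R|^2  |  <  xms * Sum ( |L|^2 + |R|^2 )."""
    diff = sum(abs(abs(l) ** 2 - abs(r) ** 2) for l, r in zip(left, right))
    total = sum(abs(l) ** 2 + abs(r) ** 2 for l, r in zip(left, right))
    return diff < total * xms


def to_mid_side(left, right):
    """Mid = (L + R) / sqrt(2), Side = (L - R) / sqrt(2)."""
    s = math.sqrt(2.0)
    mid = [(l + r) / s for l, r in zip(left, right)]
    side = [(l - r) / s for l, r in zip(left, right)]
    return mid, side
```

The factor 1/SQRT(2) makes the transformation orthogonal, so the total energy of Mid and Side equals that of L and R.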
To search for the best scale factor in the outer loop, the allowed distortion (noise) is required.
Therefore, this quantity should be obtained before the outer loop is called.
This quantity should be essentially the same as the masking threshold.
However, good results cannot be expected by using the ATH for this purpose.
In UZURA, it is assumed that the allowed distortion is linearly proportional to the intensity when both are expressed in dB, and that the coefficients are independent of frequency.
In short, dX(dB) = A * X(dB) + B.
This corresponds to the limiting case, where the width of the spreading function of masking is zero.
Changing the unit from the dB, it can be rewritten as th(i) = MAX( a * |x(i)|^p, ath(i) )(a = 10^(B/20), p = A).
(2002-7- 8) modified to th(i) = a * |x(i)|^p
(2002-7-21) changed back to the original form th(i) = MAX( a * |x(i)|^p, ath(i) ) (due to new ATH)
In the program, a, p are chosen as system parameters.
Considering the effect of the ATH, an equation th(i) = MAX( a * |x(i)|^p, ath(i) ) is adopted for the estimation of the allowed distortion.
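The estimate can be sketched as below, together with the conversion of the dB-domain coefficients A and B into a and p. The function names are hypothetical, and no particular values of a and p are implied.

```python
def params_from_db(A, B):
    """Convert dX(dB) = A * X(dB) + B into th = a * |x|^p: a = 10^(B/20), p = A."""
    return 10.0 ** (B / 20.0), A


def allowed_distortion(x, ath, a, p):
    """th(i) = MAX( a * |x(i)|^p, ath(i) ) for every spectral line i."""
    return [max(a * abs(xi) ** p, ti) for xi, ti in zip(x, ath)]
```

The MAX with ath(i) means that lines lying below the absolute threshold never get an allowed noise smaller than the threshold itself.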
Because the use of the reservoir breaks the independence of frames, I don't think it is a good idea.
I've never considered it.
I heard that when the wavelength becomes shorter than the diameter of the head, the diffraction of the wave can be ignored and the stereophonic effect is determined well by the intensity ratio between the right and left ears...but...
ATH function is often given as,
ATH(f[kHz])[dB] = 3.64 f^-0.8 - 6.5 exp(-0.6(f - 3.3)^2) + 0.001 f^4.
This function rises at the low and high frequency ends as powers of f and has a Gaussian dip around 3.3kHz.
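The formula translates directly into code. The function name `ath_db` is an assumption of this sketch; the frequency is in kHz and the result in dB, as in the formula above.

```python
import math


def ath_db(f_khz):
    """ATH(f) [dB] = 3.64 f^-0.8 - 6.5 exp(-0.6 (f - 3.3)^2) + 0.001 f^4, f in kHz."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 0.001 * f_khz ** 4)
```

Evaluating it at 3.3kHz gives about -5dB, the bottom of the Gaussian dip, while at 1kHz it gives about +3.4dB.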
(It seems to me that this function was decided by drawing straight lines on log paper at both ends. And the Gaussian is one of the two most famous functions with a symmetrical peak. [The other one is the Lorentzian.])
It is known that this ATH function does not reproduce the actual ATH.
The LAME group pointed out that encoder output is improved by replacing this ATH function with a more accurate function. (quality - what's 'athtype 3')
In UZURA, the ATH function is decided according to the LAME group's results and my own ATH measurement. (The error bar of the measurement is supposed to be over 10dB.)
I am not sure how to decide the absolute value of the ATH function. With 16bit PCM data, because 2^-15 = 3.05d-5 = -90.3dB, the bottom of the ATH is expected to be near -90dB[3]. But I am not sure. From experience it seems to be around -90 to -120dB.
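The figure of -90.3dB is just the amplitude ratio of one 16bit step relative to full scale, expressed as 20 log10. A short check (the function name is hypothetical):

```python
import math


def level_db(amplitude):
    """Amplitude ratio expressed in dB: 20 * log10(amplitude)."""
    return 20.0 * math.log10(amplitude)


one_step_16bit = 2.0 ** -15      # one quantization step relative to full scale
```

Here `level_db(one_step_16bit)` evaluates to about -90.3.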
scalefactor_scale shows whether scale factors are given as powers of SQRT(2) [0] or 2 [1].
x = x * sqrt(2)^( (1 + scalefactor_scale) * scale_factor(scfb) )
When scalefactor_scale = 0, finer tuning is possible.
When scalefactor_scale = 1, a broader dynamic range can be obtained.
In UZURA, according to ISO document C.1.5.4.3, the scale factor space is first searched with scalefactor_scale = 0; if convergence is not obtained, the scale factor space is searched again with scalefactor_scale = 1.
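The effect of scalefactor_scale on the step size can be seen in a one-line sketch (the function name is hypothetical):

```python
import math


def apply_scale_factor(x, scale_factor, scalefactor_scale):
    """x * sqrt(2)^((1 + scalefactor_scale) * scale_factor)."""
    return x * math.sqrt(2.0) ** ((1 + scalefactor_scale) * scale_factor)
```

With scale_factor = 2, the gain is 2 for scalefactor_scale = 0 and 4 for scalefactor_scale = 1: each step is SQRT(2) in the first case and 2 in the second, so the same range of scale factor values covers twice the range in dB.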
Preemphasis is defined only for long blocks.
It is a sort of offset added to the scale factors of the high frequency bands, and is defined as
xr = SIGN(ix) * |ix|^(4/3) * 2^( (global_gain[gr] - 210) / 4 ) * 2^-( scalefac_multiplier * (scalefac_l[gr,ch,sfb] + preflag[gr,ch] * pretab[sfb]) ).
In UZURA, according to ISO document C.1.5.4.3.4, preemphasis is used if all of the scale factor bands 17 to 20 exceed the allowed distortion after the first call of the inner loop.
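The long block formula for one spectral line can be sketched as follows. The function name and the flat argument list are assumptions of this sketch; `pretab_sfb` stands for the single table entry pretab[sfb], and the pretab table itself is not reproduced here.

```python
def requantize_long(ix, global_gain, scalefac_multiplier, scalefac_l, preflag, pretab_sfb):
    """xr = SIGN(ix) * |ix|^(4/3) * 2^((global_gain - 210) / 4)
            * 2^-(scalefac_multiplier * (scalefac_l + preflag * pretab_sfb))"""
    sign = -1.0 if ix < 0 else 1.0
    gain = 2.0 ** ((global_gain - 210) / 4.0)
    attenuation = 2.0 ** -(scalefac_multiplier * (scalefac_l + preflag * pretab_sfb))
    return sign * abs(ix) ** (4.0 / 3.0) * gain * attenuation
```

With preflag = 1 the pretab entry simply adds to the scale factor of that band, which is why preemphasis acts as an offset for the high frequency bands.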
subblock_gain(3) is defined only for short blocks.
It is a sort of scale factor for each window, and is defined as
xr = SIGN(ix) * |ix|^(4/3) * 2^( (global_gain[gr] - 210 - 8 * subblock_gain[gr,window]) / 4 ) * 2^-( scalefac_multiplier * scalefac_s[gr,ch,sfb,window] ).
In UZURA, the subblock_gain is used when the scale_factor has reached its maximum, so that the intensities of the three windows are evened out.
These procedures are in SUBROUTINE outer_loop.
[OUTER LOOP]
LOOP1 : DO scalefactor_scale = 0, 1
  LOOP2 : DO subblock_gain = (0,0,0), (7, 7, 7)
    LOOP3 : DO scalefactor = (0...0), (15....7...) ! scale factor loop (outer loop)
      CALL inner_loop
      CALL calc_distortion
      save best parameters
      IF (converged) EXIT LOOP1
      CALL check_preemphasis_on?
      IF (preemphasis_on) CYCLE LOOP3
      CALL increase_scalefactor
      IF (scale_factor reached max) EXIT LOOP3
    END DO LOOP3
    IF ( subgain reached max.) EXIT LOOP2
  END DO LOOP2
  IF ( scalefactor_scale ) EXIT LOOP1
END DO LOOP1
load best parameters
RETURN
There are three points to be decided about the switching of the block length: (1) window selection, (2) MDCT length, and (3) alias reduction.
In section 2.4.2.7 'block_type' of the ISO document (p26), the long block is defined as
    "In the case of long blocks (block_type not equal to 2 or in the lower subband of block_type 2 if the mixed_block_flag is set) ...."
and the short block is defined as
    "In the case of short blocks (in the upper subbands of a type 2 block if the mixed_block_flag is set, or in all subbands of a type 2 block if mixed_block_flag is not set) ...."
On the other hand, in section 2.4.3.4.10.1 'Alias reduction', it is written that
    "For long block_type granules (block_type != 2) the input to the synthesis filterbank is processed for alias reduction before processing by the IMDCT. .... Alias reduction is not applied for granules with block_type == 2 (short block) ...."
The definitions of the long block and the short block in these two sections do not match. If we follow the latter lines, in the case of mixed_block_flag == 1 && block_type == 2, encoders should not apply alias reduction to subbands 0 and 1.
For several reasons, however, it is more reasonable to think that the mixed block was overlooked in section 2.4.3.4.10.1, and that the definition of long/short block should be the first one. Therefore alias reduction should be applied to subbands 0-1 in the case of mixed_block_flag == 1 && block_type == 2. With this definition, the conditions for points 2 (MDCT length) and 3 (anti-alias) become the same as the definition of long/short block.
Although the definition of long/short block is enough for points 2 and 3, it is not enough for point 1 (window selection).
If one considers switching between a normal long block (window_switching_flag == 0) and a mixed block (block_type == 2 && mixed_block_flag == 1), it is reasonable to apply the normal window to subbands 0-1 even when block_type is start/stop.
From the definition of the long/short block, the start and stop blocks are always long blocks regardless of the state of the mixed_block_flag.
Considering the combination of the flags, (window_switching_flag, block_type, mixed_block_flag), it seems most reasonable to select windows as follows,
Nothing is written about VBR in the ISO document.
However, since the ISO document does not forbid changing the MPEG header parameters within a file, the bit rate may be changed frame by frame as one wishes.
And this seems to work.
In UZURA, the bit rate is decided by the strength of the 'Psychoacoustic Moment'.
The 'Psychoacoustic Moment (PM)' is defined as PM = Sum( f * |x(f)|^2 ) / Sum( |x(f)|^2 ), where f is the frequency and x(f) is the spectral intensity at f.
Encoders often run short of bits when strong peaks rise in the high frequency region.
The 'Psychoacoustic Moment' is a quantity that I defined myself for the purpose of detecting such situations.
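The Psychoacoustic Moment is a power-weighted mean frequency, which can be sketched as follows. The function name and the return value of 0 for a silent frame are assumptions of this sketch; how PM is mapped to a bit rate is not described in the text and is not attempted here.

```python
def psychoacoustic_moment(freqs, x):
    """PM = Sum( f * |x(f)|^2 ) / Sum( |x(f)|^2 )."""
    power = [abs(v) ** 2 for v in x]
    total = sum(power)
    if total == 0.0:
        return 0.0            # silent frame (assumption of this sketch)
    return sum(f * p for f, p in zip(freqs, power)) / total
```

A spectrum with equal power at 1000Hz and 3000Hz gives PM = 2000Hz; moving the power to the higher line raises PM toward 3000Hz, which is the situation the quantity is meant to detect.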
About RIO500 tail-of-VBR-file skip bug
It is known that there is a tail-of-VBR-file skip bug in firmware Ver.2.15 of the portable MP3 player RIO500 (Diamond Multimedia Inc./Sonic Blue Inc.). In the course of experiments on VBR files, I found that the amount of the skip is roughly proportional to the bit rate of the first frame of the MP3 file. By forcing the bit rate of the first frame to 32kbps, the skip can be made practically negligible. I would like to report on this in the near future (maybe...). (added 2002-6-28)
It was found that, in RIO500 firmware 2.15 and 2.16, the playing time of a file is decided by the bit rate of the first frame and the size of the mp3 file. Therefore it is not necessary for the bit rate of the first frame to be 32kbps. It is enough if the bit rate of the first frame is less than the average bit rate of the file. I uploaded, under this directory, 14 files encoded at 128kbps except for the first frame, whose bit rate ranges from 32kbps to 320kbps. One can check the above mentioned behavior by playing these files while displaying "remaining time" on the RIO500.
Although the ISO document is written in English, figures are written in the French/German style (the decimal point is a comma, not a period). The ISO document consists of a normative part and an informative part. Section 2 and Annexes A and B are normative and describe the decoder. Annex C describes the implementation of an encoder as a reference example.
DIST10 is the sample reference MPEG/Audio Encoder/Decoder code by ISO. DIST10 can be found on the net easily. By reading the code without the ISO document, MP1 can be understood because of its simplicity, MP2 may not be understood, and MP3 is probably impossible to understand. The main routines of the MP1/MP2 encoders are written by Davis Pan, and the code is not so C-ish, so it is possible for a non-C user to read. The MP3 encoder is written by many persons and is quite C-ish; it is impossible for a non-C user to read.
I am indebted to many people for making UZURA.
I thank the people at the BBS at the site held by Mr. Karajan-kyo. (Especially I am obliged to Mr. /|/|, Mr. Katajang-kyo, Mr. Shibata, Mr. Tominaga, and Mr. Nekojiro (in order of A.I.U.E.O.) in many ways.)
I thank Mr. efu for his programs 'WaveSpectra' and 'WaveGena'. These are quite useful tools for tuning UZURA. Without them I could not debug UZURA at all.
I thank Mr. Gabriel Bouvigne for uploading UZURA to his site and for the many kind responses received there. UZURA is inspired by his minimal MP3 encoder Shine.
I thank Mr. Robert Leslie for giving me information about the mixed block.
-Me and UZURA and MP3 (in Japanese)
-MP1 encoder UZURA1
The relation of 'dX = AX + B' is used for allowed distortion in psychoacoustic routine.
MP1 Encoder UZURA1 for CVF
MP1 Encoder UZURA1 (Standard F90)
This page is link free.