Untangling Phase and Time in Monophonic Sounds

We are looking for a mathematical model of monophonic sounds with independent time and phase dimensions. With such a model we can resynthesise a sound with arbitrarily modulated frequency and progress of the timbre. We propose such a model and show that it exactly fulfils some natural properties, like a kind of time-invariance, robustness against non-harmonic frequencies, envelope preservation, and inclusion of plain resampling as a special case. The resulting algorithm is efficient and allows to process data in a streaming manner with phase and shape modulation at sample rate, what we demonstrate with an implementation in the functional language Haskell. It allows a wide range of applications, namely pitch shifting and time scaling, creative FM synthesis effects, compression of monophonic sounds, generating loops for sampled sounds, synthesise sounds similar to wavetable synthesis, or making ultrasound audible.


INTRODUCTION
An example of our problem is illustrated in Figure 1. Given is a signal of a monophonic sound of a known constant pitch. We want to alter its pitch and the progression of its waveshape independently, possibly time-dependent, possibly rapidly. The sound must not contain noise portions such as speech does. We also do not try to preserve formants, that is, like in resampling, we accept that the spectrum of harmonics is stretched by the same factor as the base frequency. E.g. a square waveform shall remain square and so on. For some natural instruments this is appropriate (e.g. guitar, piano) whereas for other natural sounds this is inappropriate (e.g. speech).
The organisation of this article is inspired by [1]. With the paper we like to contribute the following: • In Section 2.1 we specify our problem. In Section 2.2 we propose a mathematical model for monophonic sounds given as real functions. This model untangles phase and time and allows us to describe frequency modulation and waveshape control. In Section 2.3 we show how we utilise this model for phase and time modification and we formulate natural properties of this process. • Section 3 is dedicated to theoretical details. To this end we introduce some notations and definitions in Section 3.1 and Section 3.2. We investigate the properties from Section 2. time t Figure 1: A typical use case of our method: From the above signal of a single tone we want to compute the signal below. That is, we want to alter the pitch while maintaining the progression of its waveshape and without knowing, how the signal was generated.
of simple resampling and time warping as a special case (Section 3.3.7), and we prove that our model satisfies these properties exactly. That is, our method is altogether theoretically sound. (I could not resist that pun!) As bonus we verified some of the statements using the proof assistant PVS in Section A. • The problems of handling discrete signals are treated in Section 4, including notes on the implementation in the purely functional programming language Haskell. • We suggest a range of applications of our method in Section 5. • In Section 6 you find a survey of related work and in Section 7 we compare some results of our method with the ones produced by the similar wavetable synthesis. • We finish our paper in Section 8 with a list of issues that we still need to work on.

Problem
If we want to transpose a monophonic sound, we could just play it faster for higher pitch or slower for lower pitch. This is how resampling works. But this way the sound becomes also shorter or longer. For some instruments like guitars this is natural, but for other sounds like that of a brass, it is not necessarily so. The problem we face is, that with ongoing time both the waveform and the phase within the waveform change. Thus we can hardly say, what the waveshape at a precise time point is. If we could untangle phase and shape this would open a wide range of applications. We could independently control progress of phase (i.e. frequency) and progress of the waveshape.

Model
The wish for untangled phase and shape leads us straight forward to the model we want to propose here. If phase and shape shall be independent variables of a signal, then our signal is actually a two-dimensional function, mapping from phase and shape to the (particle) displacement. Since the phase ϕ is a cyclic quantity, the domain of the signal function is actually a cylinder. For simplicity we will identify the time point t in a signal with the shape parameter. That is, in our model the time points to the instantaneous shape.
However, we never get signals in terms of a function on a cylinder. So, how is this model related to real-word onedimensional audio signals? According to Figure 2 the easy direction is to get from the cylinder to the plain audio signal: We move along the cylinder while increasing both the phase and shape parameter proportionally to the time in the audio signal. This yields a helical path. The phase to time ratio is the frequency, the shape to time ratio is the speed of shape progression. The higher the ratio of frequency to shape progression, the more dense the helix. For constant ratio the frequency is proportional to the speed with which we go along the helix. We can change phase and shape non-proportionally to the time, yielding non-helical paths.
When going from the one-dimensional signal to the twodimensional signal, there is a lot of freedom of interpretation. We will use this freedom to make the theory as simple as possible. E.g. we will assume, that the one-dimensional input signal is an observation of the cylindrical function at a helical path. Since we have no data for the function values beside the helix, we have to guess them, in other words, we will interpolate. This is actually a nice model that allows us to perform many operations in an intuitive way and thus it might be of interest beyond pitch shifting and time scaling.

Interpolation principle
An application of our model will firstly cover the cylinder with data that is interpolated from a one-dimensional signal x by an operator F and secondly it will choose some data along a curve around that cylinder by an operator S. The operator that we will work with here has the structure where κ is an interpolation kernel such as a hat function or a sinus cardinalis (sinc). Intuitively spoken, it lays the signal on a helix on the cylinder. Then on each line parallel to the time axis there are equidistant discrete data points. Now, F interpolates them along the time direction using the interpolation kernel κ. You may check that F x(t, ϕ) has period 1 with respect to ϕ. This is our way to represent the radian coordinate of the cylinder within this section. The observation operator S shall sample along a helix with time progression v and angular speed α: Interpolation and observation together, yield This operator turns out to have some useful properties: 1. Time-invariance In audio signals often the absolute time is not important, but the time differences. Where you start an audio recording should not have substantial effects on an operation you apply to it. This is equivalent to the statement, that a delay of the signal shall be mapped to a delayed result signal.
In particular it would be nice to have the property, that a delay of the input by v · t yields a delay by t of the output. However this will not work. To this end consider pure time-stretching (α = 1) applied to grains, and we become aware that this property implies plain resampling, which clearly changes the pitch. What we have at least, is a restricted time invariance: You have a discrete set of pairs of delays of input and output signal that are mapped to each other wherever the helices in Figure 2 cross, that is wher- However the construction F of our model is time invariant in the sense

Linearity
Since both F and S are linear, our phase and time modification process is linear as well. This means that physical units and overall magnitudes of signal values are irrelevant (homogeneity) and mixing before interpolation is equivalent to mixing after interpolation (additivity). Homogeneity 3. Resampling as special case We think, that pitch shifting and time scaling by factor 1 should leave the input signal unchanged. We also think, that resampling is the most natural answer to pitch shifting and time scaling by the same factor α = v. For interpolating kernels, that is κ(0) = 1, ∀j ∈ Z \ {0} : κ(j) = 0, this actually holds.
4. Mapping of sine waves Our phase and time manipulation method maps sine waves to sine waves if the kernel is the sinus cardinalis normalised to integral zeros.  Figure 3: The first graph presents the lower part of the absolute spectrum of a piano sound. Its pitch is shifted 2 octaves down (factor 4) in the second graph.
Choosing this kernel means WHITTAKER interpolation. Now we consider a complex wave of frequency a as input for the phase and time modification.
Note that for frac a = 1 2 , the WHITTAKER interpolation will diverge. If b = 0, that is the input frequency a is integral, then the time progression has no influence on the frequency mapping, i.e. the input frequency a is mapped to α · a. We should try to fit the input signal as good as possible to base frequency 1 by stretching or shrinking, since then all harmonics have integral frequency. The fact, that sine waves are mapped to sine waves, implies, that the effect of M to a more complex tone can be described entirely in frequency domain. An example of a pure pitch shift is depicted in Figure 3. The peaks correspond to the harmonics of the sound. We see that the peaks are only shifted. That is, the shape and width of each peak is maintained, meaning that the envelope of each harmonic is the same after pitch shifting.

Preservation of envelope
Consider a static wave x, i.e. ∀t x(t) = x(t + 1), that is amplified according to an envelope f . If interpolation with κ is able to reconstruct f and all of its translates from their respective integral values, then on the cylinder wave and envelope become separated and the overall phase and time manipulation algorithm modifies frequency and time separately: Examples for κ and f are: • κ being the sinus cardinalis as defined in item 4 and f being a signal bandlimited to (− 1 2 , 1 2 ), • κ = χ (−1,0] and f being constant, • κ(t) = max(0, 1−|t|) and f being a linear function, • κ being an interpolation kernel, that preserves polynomial functions up to degree n and f being such a polynomial function.

CONTINUOUS SIGNALS: THEORY
In this section we want to give proofs of the statements found in Section 2 and we want to check what we could have done alternatively given the properties that we found to be useful. You can safely skip the entire section if you are only interested in practical results and applications.

Notation
In order to give precise, concise, even intuitive proofs, we want to introduce some notations.
In signal processing literature we find often a term like x(t) being called a signal, although from the context you derive, that actually x is the signal and thus x(t) denotes a displacement value of that signal at time t. We like to be more strict in our paper. We like to talk about signals as objects without always going down to the level of single signal values. Our notation should reflect this and should clearly differentiate between signals and signal values. This way, we can e.g. express a statement like "delay and convolution commute" by (22)) which would be more difficult in a pointwise and correct (!) notation. This notation is inspired by functional programming, where functions that process functions are called higher-order functions. It allows us to translate the theory described here almost literally to functional programs and theorem prover modules. Actually some of the theorems stated in this paper have been verified using PVS [2]. For a more detailed discussion of the notation, see [3].
In our notation function application has always higher precedence than infix operators. Thus Qx → t means (Qx) → t and not Q(x → t). Function application is left associative, that is, Qx(t) means (Qx)(t) and not Q(x(t)). This is also the convention in Functional Analysis. We use anonymous functions, also known as lambda expressions. The expression x → Y denotes a function f where ∀x f (x) = Y and Y is an expression that usually contains x. Arithmetic infix operators like "+" and "·" shall have higher precedence than the mapping arrow, and logical infix operators like "=" and "∧" shall have lower precedence. That is,

Definition (Function set). With
A → B we like to denote the set of all functions mapping from set A to set B. This operation is treated right associative, that is, This convention matches the convention of left associative function application.

Basic functions
For the description of the cylinder we first need the notion of a cyclic quantity.
2 Definition (Cyclic quantity). Intuitively spoken, cyclic (or periodic) quantities are values in the range [0, 1) that wrap around at the boundaries. More precisely, a cyclic quantity ϕ is a set of real numbers that all have the same fractional part. Put differently, a periodic quantity is an equivalence class with respect to the relation, that two numbers are considered equivalent when their difference is integral. In terms of a quotient space this can concisely be written as ϕ ∈ R /Z .
3 Definition (Periodisation). Periodisation c means mapping a real value to a cyclic quantity, i.e. choosing the equivalence class belonging to a representative.
It holds c(0) = Z. We define the inverse of c as picking a representative from the range [0, 1).
In a computer program, we do not encode the elements of R /Z by sets of numbers, but instead we store a representative between 0 and 1, including 0 and excluding 1. Then c is just the function, that computes the fractional part, i.e. c t = t -floor t.
A function y on the cylinder is thus from (R × R /Z) → V , where V denotes a vector space. E.g. for V = R we have a mono signal, for V = R × R we obtain a stereo signal and so on.
The conversion S from the cylinder to an audio signal is entirely determined by given phase control curve g and shape control curve h. It consists of picking the values from the cylinder along the path that corresponds to these control curves.
For the conversion F from a prototype audio signal to a cylindrical model we have a lot of freedom. In section Section 2.3 we have seen what properties a certain F has, that we use in our implementation. We will going on to check what choices for F we have, given that these properties hold. For now we will just record, that

Time-Invariance
4 Definition (Translation, Rotation). Shifting a signal x forward or backward in time or rotating a waveform with respect to its phase shall be expressed by an intuitive arrow notation that is inspired by [4,5] and was already successfully applied in [3]: For a cylindrical function we have two directions, one for rotation and one for translation. We define analogously The first notion of time-invariance that comes to mind, can be easily expressed using the arrow notation by ∀t F (x → t) = F x → (t, c(0)). However, this will not yield any useful conversion. Shifting the time always includes shifting the phase and our notion of time-invariance must respect that. We have already given an according definition in (1) that we can now write using the arrow notation.
5 Definition (Time-invariant cylinder interpolation). We call an interpolation operator F time-invariant whenever it satisfies Using this definition, we do not only force F to map translations to translations, but we also fix the factor of the translation distance to 1. That is, when shifting an input signal x, the according model F x is shifted along the unit helix, that turns once per time difference 1.
Enforcing the time-invariance property restricts our choice of F considerably.
We see, that actually only a ring slice of F (x ← t) at time point zero is required and we can substitute , that turns a straight signal into a waveform. Now we know, that time-invariant interpolations can only be of the form The last line can be read as: In order to obtain a ring slice of the cylindrical model at time t, we have to move the signal, such that time point t becomes point 0, then apply I to get a waveform on a ring, then rotate back that ring correspondingly. We may check, that any F defined this way is indeed timeinvariant in the sense of (12).

Linearity
We like that our phase and time modification process is linear (as in (2) and (3)). Since sampling S from the cylinder is linear, the interpolation F to the cylinder must be linear as well. Homogeneity x(t) Figure 4: Constant interpolation (below) of a sine wave (above) that is out of sync. The interpolation picture represents the surface of the cylinder after cutting and flattening. A black dot means y(t, ϕ) = −1 and a white dot represents 1. The sine wave can be found in the interpolation image at the right border of each of the skew stripes. Along the vertical line from bottom to top you find the first period of the input signal, where "first" is measured from time point 0.
The properties of F are equivalent to

Static wave preservation
Another natural property is, that an input signal consisting of a wave of constant shape is mapped to the cylinder where each ring contains that waveform. A static waveform can be written concisely as w • c. It denotes the function composition of w and c, that is, w is applied to the result of c, for example . Thus w and w • c both represent periodic functions, but w has domain R /Z and thus is periodic by its type, whereas w • c is an ordinary real function, that happens to satisfy the periodicity property (w • c) = (w • c) → 1. We can write our requirement as As an example we have a constant interpolation We illustrate the constant interpolation in Figure 4, but with a sine wave, that does not have frequency 1, and thus looks for the interpolation operator F like a non-static waveform. This way, we can better demonstrate how constant interpolation works, and we think one can verify intuitively, how it preserves static waves.
We can consider an input signal of the form w • c as a wave with constant envelope and we will generalise this to other envelopes in Section 3.3.6.

Mapping of pure sine waves
We like to derive, how frequencies are mapped when converting from an audio signal to the cylindrical model and observing the signal along a different but uniform helix. To this end, we need an interpolation that maps sine waves to sine waves. Actually, the WHITTAKER interpolation has this property.
Since ϕ ∈ R /Z, when τ ∈ ϕ then τ assumes all values that differ from c −1 (ϕ) by an integer. The infinite sum . The proof of F being time-invariant according to Definition 5 is deferred to Section 3.3.5, where we perform the proof for any interpolating kernel, not just sinc1.
We will now demonstrate, that sinc1-interpolation preserves sine waves and how frequencies are mapped.
Mapping a complex sine wave to the cylinder Since exponential laws are much easier to cope with than addition theorems for sine and cosine, we use a complex wave defined by For the following derivation we need the WHITTAKER-SHANNON interpolation formula [6] in the form We choose a complex wave of frequency a as input for the conversion to the cylinder. The fractional frequency part b and the integral frequency n are chosen as in (4).
The result can be viewed in Figure 5. We obtain, that for every t the function on a ring slice ϕ → F x(t, ϕ) is a sine wave with the integral frequency n that is closest to a. That is, the closer a is to an integer, the more harmonics of a non-sine wave are mapped to corresponding harmonics in a ring slice of F x. ϕ t 0 Figure 5: The sine wave as in Figure 4 is interpolated by WHIT-TAKER interpolation. Along the diagonal lines you find the original sine wave.
Mapping a complex wave from the cylinder to an audio signal For time progression speed v and frequency α we get This proves (5).

Interpolation using kernels
Actually, for the two-dimensional interpolation F we can use any interpolation kernel κ, not only sinc1 as in (15).
The constant interpolation corresponds to κ = χ (−1,0] . Linear interpolation is achieved using a hat function. 6 Lemma (Time invariance of kernel interpolation). The operator F defined with an interpolation kernel as in (18) is time-invariant according to Definition 5. Proof.
Conversely, we like to note, that kernel interpolation is not the most general form when we only require time-invariance, linearity and static wave preservation.
The following considerations are simplified by rewriting general kernel interpolation to a more functional style using a discretisation operator and a mixed discrete/continuous convolution.
7 Definition (Quantisation). With quantisation we mean the operation that picks the signal values at integral time points from a continuous signal.
Here is, how quantisation operates on pointwise multiplied signals and on periodic signals: 8 Definition (Mixed Convolution). For u ∈ Z → V and x ∈ R → R then mixed discrete/continuous convolution is defined by We can express mixed convolution also by purely discrete convolutions: It holds because translation can be written as convolution with a translated DIRAC impulse and convolution is associative in this case (and generally when infinity does not cause problems). Thus we will omit the parentheses. We like to note, that this example demonstrates the usefulness of the functional notation, since without it even a simple statement like (22) is hard to formulate in a correct and unambiguous way. These notions allow us to rewrite kernel interpolation (18): The last line can be read as follows: The signal on the cylinder along a line parallel to the time axis can be obtained by taking discrete points of x and interpolate them using the kernel κ.

Envelope preservation
We can now generalise the preservation of static waves from Section 3.3.3 to envelopes different from a constant function.
9 Lemma. Given an envelope f from R → R and an interpolation kernel κ that preserves any translated version of f , i.e. ∀t then and only then, a wave of constant shape w enveloped by f is converted to constant waveshapes on the cylinder rings enveloped by f in time direction: Proof. 20,21,9) Now the implication (24) ⇒ (25) should be obvious, whereas the converse (25) ⇒ (24) can be verified by setting ∀ϕ w(ϕ) = 1. This special case means that the envelope f used as input signal is preserved in the sense 10 Corollary. When we convert back to a one-dimensional audio signal under the condition (24), then the time control only affects the envelope and the phase control only affects the pitch:

Special cases
As stated in item 3 of Section 2.3 we like to have resampling as special case of our phase and time manipulation algorithm. It turns out, that this property is equivalent to putting the input signal x on the diagonal lines as in Figure 4 and Figure 5. We will derive, what this imposes on the choice of the kernel κ when F is defined via a kernel as in (23).
if and only if Qκ = δ, that is, κ is a so called interpolating kernel.
Here, δ is the discrete DIRAC impulse, that is consider only t ∈ Z and rename it to For Qx = δ we get δ = δ * Qκ = Qκ.
"⇐" Conversely, every interpolating kernel κ asserts (26): Now, when our conversion from the cylinder to the onedimensional signal does only walk along the unit helix, we get general time warping as special case of our method: For h = id we get the identity mapping, for h(t) = v · t we get resampling by speed factor v.

DISCRETE SIGNALS
For the application of our method to sampled signals we could interpolate a discrete signal u containing a wave with period T , thus getting a continuous signal x with x( n T ) = u(n) and proceed with the technique for continuous signals from Section 2. However, when working out the interpolation this yields a skew grid with two alternating cell heights and a doubled number of parallelogram cells, which seems to be unnatural to us. Additionally it would require three distinct interpolations, e.g. two distinct interpolations in the unit helix direction and one interpolation in time direction. Instead we want to propose a periodic scheme where we need two interpolations with the same parameters in unit helix ("step") direction and one interpolation in the skew "leap" direction. This interpolation scheme is also time-invariant in the sense of item 1 in Section 2.3 and Definition 5 when we restrict the translation distances to multiples of the sampling period.
The proposed scheme is shown in Figure 6. We have a skew coordinate system with steps s and leaps l. We see, that this scheme can cope with non-integral wave periods, that is, T can be a fraction (in Figure 6 we have T = 11 3 ). Whenever the wave period is integral, the leap direction coincides with the time direction. The grid nicely matches the periodic nature of the phase. The cyclic phase yields ambiguities, e.g. a leap could also go to where l is placed, since this denotes the same signal value. We will later see, that this ambiguity is only temporary and will vanish at the end (29). Thus we use the unique representative c −1 (ϕ) of ϕ. To get (l, s) from (t, c −1 (ϕ)) we have to convert the coordinate systems, i.e. we have to solve the simultaneous linear equations where round is any rounding function we like. E.g. in Figure 6 it is round T = 4. Its solution is Using the interpolated input x we may interpolate y linearly or more detailed n = l · round T + s a = lerp(u(n), u(n + 1))(frac s) b = lerp(u(n + round T ), u(n + round T + 1)) (frac s) y(t, ϕ) = lerp(a, b)(frac l) .
Actually, we do not even need to compute s since by expansion of s the formula for r can be simplified and it is frac s = frac r. From l we actually only need frac l. This proves, that every representative of ϕ could be used in (27).

General Interpolations
Other interpolations than the linear one use the same computations to get frac l and r, but they access more values in the environment of n, i.e. u(n + j + k · round T ) for some j and k. E.g. for linear interpolation in the step direction and cubic interpolation in the leap direction, it is j ∈ {0, 1}, k ∈ {−1, 0, 1, 2}.

Coping with Boundaries
So far we have considered only signals that are infinite in both time directions. When switching to signals with finite time domain we become aware that our method consumes more data than it produces at the boundaries. This is however true for all interpolation methods.
We start considering linear interpolation: In order to have a value for any phase at a given time, a complete vertical bar must be covered by interpolation cells. That happens the first time at time point 1. The same consideration is true for the end of the signal. That is, our method always reduces the signal by two waves. Analogously, for k node interpolation in leap direction we lose k waves by pitch shifting.
If we would use extrapolation at the boundaries, then for the same time but different phases we would sometimes have to interpolate and sometimes we would extrapolate. In order to avoid this, we just alter any t ∈ [0, 1) to t = 1 and limit t accordingly at the end of the signal.

Efficiency
The algorithm for interpolating a value on the cylinder is actually very efficient. The computation of the interpolation parameters and signal value indices in (29) needs constant time, and the interpolation is proportional to the number of nodes in step direction and the number of nodes in leap direction. Thus for a given interpolation type, generating an audio signal from the cylinder model needs time proportional to the signal length and only constant memory additional to the signal storage.

Implementation
A reference implementation of the developed algorithm is written in the purely functional programming language Haskell [7].
The tree of modules is located at http://darcs.haskell. org/synthesizer/src/. In [8] we have already shown, how this language fulfils the needs of signal processing. The absence of side effects makes functional programming perfect for parallelisation. Recent progress on parallelisation in Haskell [9] and the now wide availability of multi-core machines in the consumer market justifies this choice.
We can generate the cylindrical wave function with the function Synthesizer.Basic.Wave.sampledTone given the interpolation in leap direction, the interpolation in step direction, the wave period of the input signal and the input signal. The result of this function can then be used as input for an oscillator that supports parametrised waveforms, like Synthesizer.Plain.Oscillator.shapeMod. By the way, this implementation again shows, how functional programming with higher order functions supports modularisation: The shape modulating oscillator can be used for any other kind of parametrised waveform, e.g. waveforms given by analytical functions. This way, we have actually rendered the tones with morphing shape in the figures of this paper. In an imperative language you would certainly call the waveform being implemented as callback function. However due to aggressive inlining the compiled program does not actually need to callback the waveform function but the whole oscillator process is expanded to a single loop.

Streaming
Due to its lazy nature, Haskell allows simple implementation of streaming, that is, data is processed as it comes in, and thus processing consumes only a constant amount of memory. If we apply our pitch shifting and time stretching algorithm to an ascending sequence of time values, streaming is possible. This applies, since it is warranted, that r T is not too far away from t. Since frac l ∈ [0, 1) it holds Thus we can safely move our focus to t·T −round T in the discrete input signal u, which is equivalent to a combined translation and turning of the wave function on the cylinder. What makes the implementation complicated is the handling of boundaries. At the beginning we limit the time parameter as described in Section 4.2. However at the end, we have to make sure that there is enough data for interpolation. It is not so simple to limit t to the length of input signal minus size of data needed for interpolation, since determining the length of the input signal means reading it until the end. Instead when moving the focus, we only move as far as there is enough data available for interpolation. The function is implemented by Synthesizer.Plain. Oscillator.shapeFreqModFromSampledTone.

Combined pitch shifting and time scaling
With a frequency control curve f and a shape control g we get combined pitch shifting and time scaling out of our model using the conversion S R f, g (see (7)).

Wavetable synthesis
Our algorithm might be used as alternative to wavetable synthesis in sampling synthesisers [10]. For wavetable synthesis a monophonic sound is reduced to a set of waveforms, that is stored in the synthesiser. On replay the synthesiser plays those waveforms successively in small loops, maybe fading from one waveform to the next one. If we do not reduce the set of waveforms, but just chop the input signal into wave periods, then apply wavetable synthesis with fading between waveforms, we have something very similar to our method. In Figure 7 we compare wavetable synthesis and our algorithm using the introductory example of Figure 1.
In this example both the wavetable synthesis and our method perform equally well. If not stated otherwise, in this and all other figures we use linear interpolation. This minimises artifacts from boundary handling and the results are good enough.

Compression
Wavetable synthesis can be viewed as a compression scheme: Sounds are saved in the compressed form of a few waves in the wavetable synthesiser and are decompressed in realtime when playing the sound. Analogously we can employ our method for compression of monophonic sounds. For compression we simply shrink the time scale and for decompression we stretch it by the reciprocal factor. An example is given in Figure 8. The shrinking factor, and thus the compression factor, is limited by non-harmonic frequencies. These are always present in order to generate envelopes or phasing effects. Consider the frequency a that is decomposed into b + n as in (4), no pitch shift, i.e. α = 1, and the shrinking factor v. According to (5), the frequency b+n is mapped to b·v+n. In order to be able to decompose b·v+n into b·v and n again on decompression, it must be b·v ∈ (− 1 2 , 1 2 ). This implies, that if b is the maximum absolute deviation from an integral frequency, that you want to be able to reconstruct, then it must be v < 1 2·b . The mapping of frequencies can be best visualised using the frequency spectrum as in Figure 9. Note how the peaks become wider by the compression factor while their shape is maintained. The resolution is divided by the compression factor, and this is why the compressed data actually consumes less space. The shape of a peak expresses the envelope of the according harmonic and widening it, means a time shrunken envelope.  Figure 9: The first graph presents the lower part of the absolute spectrum of a piano sound. This is then compressed by a factor 4 in the second graph.
If we compress too much, then peaks will overlap and we get aliasing effects on decompression. Aliasing can be suppressed by smoothing across the same phase of all waves. That is, for the monophonic sound x with period T and a smoothing filter window w, we should compress x * (w ↑ round T ) instead of x. We use the up arrow for the upsampling operator where Actually, we could use the frequency spectrum not only for visualising the compression (or pitch-shifting), but we could also use the frequency spectrum itself for compression. The advantages would be simpler anti-aliasing (we would just throw away values outside bands around the harmonics) and we could also strip high harmonics, once they fall below a given threshold. The advantage of computing in the time-domain is, that it consumes only linear time with respect to the signal length, not linear-logarithmic time like the FOURIER transform, that it can be applied in a streaming way and allows to adapt the compression factor to local characteristics of a sound. For instance, you may use a shrinking factor close to 1 for fast varying portions of the signal and use a larger shrinking factor on slowly modulated portions.

Loop sampled sounds
Another way to save memory in sampling synthesisers is to loop sounds. This is especially important in order to get infinite sounds like string sounds out of a finite storage. Looping means to repeat portions of a sampled sound. The problem is to find positions of matching sound characteristics: A loop that causes a jump or an abrupt change of the waveform is a nasty audible artifact. Especially in samples of natural sounds there might be no such matching positions, at all. Then the question is, whether the sample can be modified in a way that preserves the sound but provides fine loop boundaries. Several solutions using fading or time reversal have been proposed.
Our method offers a new way: We may move the time forth and back while keeping pitch constant. In Figure 10 we show two reasonable time control curves. Both control curves start with exactly reproducing the sampled sound and then smoothly enter a cycle. Actually, we copy the first part verbatim instead of running time stretching with factor 1, since our method cannot reproduce the beginning of the sound due to interpolation margins. The cycle of the first control curve consists of a sine, that warrants smooth changes of the time line. However with this control, interferences are prolonged at the loop boundaries, which is clearly audible. It turns out that the second control curve, namely the zig-zag curve, sounds better. It preserves any chorus effect and the change of the time direction is not as bad as expected.
A nice property of this approach is, that the loop duration is doubled with respect to the actually looped data. In contrast to that, a loop body generated by simple cross-fading of parts of the sound, say, with a VON HANN window, would half the loop body size and sounds more hectically.
Since the time control affects only the waveform, it is warranted that at the cycle boundaries of the time control the waveforms of the time manipulated sound match, too. In order to assert the also the phases match you have to choose a time control cycle length that is an integral multiple of the wave period.

Making inaudible harmonics audible
Remember, that our model does not preserve formants. Another application, where this is appropriate, is to process sounds, where formants are not audible anyway, namely ultrasound signals. Our method can be used, to make monophonic ultrasound signals audible by decreasing the pitch and while maintaining the length. In Figure 11 we show an echolocation call of a bat. It is a chirp from about 35 kHz to 25 kHz sampled at 441 kHz. The chirp nature does not match the requirements of our algorithm, so it is not easy to choose a base frequency. We have chosen 25 kHz and divide the frequency by factor 5 while maintaining the length. Unfortunately the waves have no special form that we can preserve. So this example might serve a demonstration of the robustness of our algorithm with respect to non-harmonic frequencies and the preservation of the envelope. In the same way our method might be used to increase the pitch of infrasound.

FM synthesis
Since we can choose the phase parameter per sample, we can not only do regular pitch shifting, but we can also apply FM synthesis effects [11]. An FM effect alone could also be achieved with synchronised time warping, however with our method we can perform pitch shifting, time scaling and FM synthesis in one go. See Figure 12 for an example.  Figure 12: Above is a sine wave that is distorted by v → sgn v · |v| p for p running from 1 2 to 4. Below we applied our pitch shifting algorithm in order to increase the pitch and change the waveshape by modulating the phase with a sine wave of the target frequency.

Tone generation by time stretching
The inability to reproduce noise can be used for creative effects. By time stretching we can get a tone out of every sound. This is exemplified in Figure 13. If we stretch time by a factor n for a specific period T (source and target period shall be equal), then in the spectrum the peak for each harmonic of frequency 1 T is narrowed by a factor n.

RELATED WORK
The idea of separating parameters (here phase and shape) that are in principle indistinguishable is not new. For example it is used in [12] for separation of sine waves of considerably different frequencies. This way a numerically problematic ordinary differential equation is turned into a well-behaved partial differential equation.
Also the specific tasks of pitch shifting and time scaling are addressed by a broad range of algorithms [13]. Some of them are intended for application on complex music signals and are relatively simple, like "Overlap and Add" (OLA), "Synchronous Overlap and Add" (SOLA) [14,15], or the three-phase overlap algorithm using cosine windows presented in [16]. They take segments of an audio signal as they are, rearrange them and reduce the artifacts of the new composition. Other methods are based on a model of the sound. E.g. "pitch-synchronous overlap-add" (PSOLA) is roughly based on the excitation+filter model for speech [17,18,19], sinusoidal models interpret sounds as mixture of sine waves that are modulated in amplitude and frequency [20], even more sophisticated models treat sounds as mix of sine waves, transients and a residual [21]. There are also methods specific to monophonic sig- nals, like wavetable synthesis [10] and advanced methods, that can cope with frequency modulated input signals [22].
In the following two sections we like to compare our method with the two methods that are most similar to the one we introduced here, namely with wavetable synthesis and PSOLA.

Comparison with Wavetable Synthesis
When we chop our input signal into wave periods and use the waves as wavetable, then wavetable synthesis becomes rather similar to our method [10]. Wavetable synthesis also preserves waveforms, rather than formants, it allows frequency and shape modulation at sample rate. However, due to the treatment of waveforms as discrete objects, the wavetable synthesis cannot cope well with non-harmonic frequencies ( Figure 16). Thus, in wavetable synthesisers, phasing is usually implemented using multiple wavetable oscillators. A minor deficiency is, that fractional periods of the input signal are not supported. The wavetables always have to have an integral length. We consider this deficiency to be not so important, since when we do not match the wave period exactly, this will appear to the wavetable synthesis algorithm as a shifting waveform. But that algorithm must handle varying waveshapes anyway.
The wavetables in a wavetable synthesiser are usually created by a more sophisticated preprocessing than just chopping a signal into pieces of equal length. However, for comparison purposes we will just use this simple procedure.
Chopping and subsequent wavetable synthesis can also be interpreted as placing the sample values on a cylinder and interpolating between them. It yields the pattern shown in Figure 14. The variable s denotes the "step" direction, which coincides with the direction of the phase in this scheme. The variable l denotes the "leap" direction, which coincides with the time direction. In order to fit the requirement of a wave period of 1 we shrink the discrete input signal. Say, the discrete input signal is u, the wave period is T , that must be integral, and the real input signal is x, that we define at some discrete fractional points by x( n T ) = u(n) and at the other ones by interpolation. In Figure 14 it is T = 4 and for example y(1.7, c(0.6)) is located in the rectangle spanned by the time points 6, 7, 10, 11. For simplicity let us use linear interpolation as in (28). We would interpolate y(1.7)(c(0.6)) = lerp(lerp(u(6), u(7))(0.4), lerp(u(10), u(11))(0.4))(0.7).
The handling of waveform boundaries points us to a problem of this method: Also at the waveform boundaries we interpolate between adjacent values of the input signal u. That is, we do not wrap around. This way, waveforms can become discontinuous by interpolation. We could as well wrap around the indices at waveform boundaries. This would complicate the computation and raises the question, what values should naturally be considered neighbours. We remember, that we also have the ambiguity of phase values in our method. But there, the ambiguity vanishes in a subsequent step.

Boundaries
If we have an input signal of n wave periods, then we have only n−1 sections where we can interpolate linearly. Letting alone that this approach cannot reconstruct a given signal, it loses one wave at the end for linear interpolation. If there is no integral number of waves, than we may lose up to (but excluding) two waves. For interpolation between k nodes in time direction we lose k − 1 waves. Of course, we could extrapolate, but this is generally problematic.
That is, the wavetable oscillator cuts away between one and two waves, whereas our method always reduces the signal by two waves. Thus the wavetable oscillator is slightly more economic.

Comparison with PSOLA
Especially for speech processing, we would have to preserve formants rather than waveshapes. The standard method for this application is "(Time Domain) Pitch-Synchronous Overlap/Add" (TD-PSOLA) [17,18]. PSOLA decomposes a signal into wave atoms, that are rearranged and mixed while maintaining their time scale. The modulation of the timbre and the pitch can only be done at wave rate. As for wavetable synthesis it is also true for PSOLA, that due to the discrete handling of waveforms, non-harmonic frequencies are not handled well.
Incidentally, time shrinking at constant pitch with our method is similar to PSOLA of a monophonic sound. For time shrinking with factor v and interpolating with kernel κ our algorithm computes: We see that the interpolation kernel κ acts like the segment window in PSOLA, but it is applied to different phases of the waves. For v = 1, only the non-translated x is passed to the output. Intuitively we can say, that PSOLA is source oriented or pushdriven, since it dissects the input signal into segments independent from what kind of output is requested. Then it computes, where to put these segments in the output. In these terms, our method is target oriented or pull-driven, as it investigates for every output value, where it can get the data for its construction from.
Actually, it would be easy to add another parameter to PSOLA for time stretching the atoms. This way one could interpolate between shape preservation and formant preservation.

RESULTS AND COMPARISONS
Finally we like to show some more results of our method and compare them with the wavetable synthesis.
In Figure 15 we show, that signals with band-limited amplitude modulation can be perfectly reconstructed, except at the boundaries. Although we do not employ WHITTAKER interpolation but simple linear interpolation the result is convincing.
In Figure 16 we apply our method to a sine with a frequency that is clearly distinct from 1. To a monophonic pitch shifter this looks like a rapidly changing waveform. As derived for WHIT-TAKER interpolation in (17) our method can at least reconstruct the sine shape, however the frequency of the pitch shifted signal  Figure 16: Pitch shifting performed on a sine tone with a frequency that deviates from the required frequency 1. The graphs are arranged analogously to Figure 15.
differs from the intended one. Again, the used linear interpolation does not seem to be substantially worse.
We also like to show how phase modulation at sample rate can be used for FM synthesis combined with pitch shifting. In Figure 17 we use a sine wave with changing distortion as input, whereas in Figure 18 the sine wave is not distorted, but detuned to frequency 1.2, which must be treated as changing waveform with respect to frequency 1.
As a kind of counterexample we demonstrate in Figure 19, how the boundary handling forces our method to limit the time parameter to values above 1 and thus it cannot reproduce the beginning of the sound properly. For completeness we also present the same sound transposed by PSOLA in Figure 20.
Please note that the examples have a small number of periods (7 to 10) compared to signals of real instruments (say, 200 to 2000 per second). On the one hand, graphs of real world sounds would not fit on the pages of this journal at a reasonable resolution. On the other hand, only for those small numbers of periods we get a visible difference between the methods we compare here. However, if you are going to implement a single tone pitch shifter from scratch you might prefer our method, because it handles the corner cases better and the complexity is comparable to that of the wavetable oscillator. Also for theoretical considerations we recommend our method since it exposes the nice properties presented in Section 2.

Conclusions
We shall note that despite the differences between our method and existing ones, many of the properties discussed in Section 2.3 hold approximately also for the existing methods. Thus the worth of  Figure 17: Above is a sine wave that is distorted by v → sgn v · |v| p for p running from 1 2 to 4. Below we applied our pitch shifting algorithm in order to increase the pitch and change the waveshape by modulating the phase with a sine wave of the target frequency. The graphs are arranged analogously to Figure 15.    Figure 19 that preserves formants performed by PSOLA.
our work is certainly to contribute a model where these properties apply exactly. This should serve a good foundation for further development of a sound theory of pitch shifting and time scaling. It also pays off, when it comes to corner cases, like FM synthesis as extreme pitch shifting.

Band Limitation
In our paper we have omitted how to avoid aliasing effects in pitch shifting caused by too high harmonics in the waveforms. In some way we have to band-limit the waveforms. Again, we should do this without actually constructing the two-dimensional cylindrical function. When we use interpolation that does not extend the frequency band, that is imposed by the discrete input signal, then it should be fine to lowpass filter the input signal before converting to the cylinder. The cut-off frequency must be dynamically adapted to the frequency modulation used on conversion from the cylinder to the audio signal.

Irregular Interpolation
We could also handle input of varying pitch. We would then need a function of time describing the frequency modulation which is used to place the signal nodes at the cylinder. This would be an irregular pattern and renders the whole theory of Section 3 useless. We had to choose a generalised 2D interpolation scheme.

ACKNOWLEDGMENTS
I like to thank Alexander Hinneburg for fruitful discussions and creative suggestions. I also like to acknowledge Sylvain Marchand and Martin Raspaud for their comments on my idea and their encouragement. Finally I am grateful to Stuart Parsons, who kindly permitted usage of his bat recordings in this paper. Figure 21: Excerpt from a PVS module containing two statements: The first claim is that the interpolation of the form given in (14) is time-invariant in the sense of Definition 5. The second claim is that all time-invariant interpolations can be expressed in that form. In contrast to the PVS language, the according proof script can only be understood when interactively running it step by step in PVS and looking at how the expressions evolve.
To give an impression of automated proving, we show the derivation of time-invariant interpolations from Section 3.3.1 expressed by two lemmas in PVS [2] in Figure 21. See http://darcs.haskell.org/synthesizer/src/ Synthesizer/Plain/ToneModulation/ for the according modules. The lemma, that constant interpolation preserves static waves is shown in Figure 22