# torchaudio-contrib

Goal: To propose audio processing PyTorch code with nice and easy-to-use APIs and functionality. :open_hands:

This should be seen as a community-based proposal and the basis for a discussion we should have inside the PyTorch audio user community. Everyone is welcome to join and discuss.

Our motivation is:

- API design: Clear, readable names for classes/functions/arguments, sensible default values, and shapes.
  - References: [librosa](http://librosa.github.io/librosa/) (audio and MIR on NumPy), [kapre](https://github.com/keunwoochoi/kapre) (audio on Keras), [pytorch/audio](https://github.com/pytorch/audio) (audio on PyTorch)
- Fast processing on GPU
- Methodology: Both layers and functionals
  - Layers (`nn.Module`) for reusability and easier use
  - and identical implementations as functionals
- Simple installation
- Multi-channel support

## Contribution

Making things quicker and open! We're a `-contrib` repo, hence it's *easy to enter but hard to graduate*.

1. Make a new [Issue](https://github.com/keunwoochoi/torchaudio-contrib/issues) for a potential PR
2. Until it's in good shape,
   1. Make a PR following the current conventions and unit tests
   2. Review and merge.
3. Based on it, make a PR to [pytorch/audio](https://github.com/pytorch/audio)

Discussion on how to contribute: https://github.com/keunwoochoi/torchaudio-contrib/issues/37

## Current issues/future work

- Better module/sub-module hierarchy
- Complex number support
- More time-frequency representations
- Signal processing modules, e.g., vocoders
- Augmentation

# API suggestions

## Notes

* Audio signals can be multi-channel.
* `STFT`: short-time Fourier transform, outputting a complex-valued representation
* `Spectrogram`: magnitudes of the `STFT`
* `Melspectrogram`: a mel filterbank applied to the `Spectrogram`

## Shapes

* Audio signals: `(batch, channel, time)`
  * E.g., the `STFT` input shape
  * Based on the `torch.stft` input shape
* 2D representations: `(batch, channel, freq, time)`
  * E.g., the `STFT` output shape
  * Channel-first, following the torch convention, then `(freq, time)`, following `torch.stft`.
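To make the shape convention concrete, here is a minimal sketch using plain `torch.stft`; the variable names and the channel-folding step are illustrative assumptions, not part of the proposed API:

```python
import torch

batch, channel, time = 4, 2, 16000
fft_len, hop_len = 2048, 512

signal = torch.randn(batch, channel, time)  # (batch, channel, time)

# torch.stft expects a 1D or 2D input, so fold channels into the batch dim.
folded = torch.stft(
    signal.reshape(batch * channel, time),
    n_fft=fft_len,
    hop_length=hop_len,
    window=torch.hann_window(fft_len),
    return_complex=True,  # requires PyTorch >= 1.7
)  # (batch * channel, freq, time)

# Unfold back to channel-first: (batch, channel, freq, time).
complex_spec = folded.reshape(batch, channel, *folded.shape[-2:])
print(complex_spec.shape)  # torch.Size([4, 2, 1025, 32])
```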
## Overview

### `STFT`

```python
class STFT(fft_len=2048, hop_len=None, frame_len=None, window=None, pad=0, pad_mode="reflect", **kwargs)

def stft(signal, fft_len, hop_len, window, pad=0, pad_mode="reflect", **kwargs)
```

### `MelFilterbank`

```python
class MelFilterbank(num_bands=128, sample_rate=16000, min_freq=0.0, max_freq=None, num_bins=1025, htk=False)

def create_mel_filter(num_bands, sample_rate, min_freq, max_freq, num_bins, to_hertz, from_hertz)
```

### `Spectrogram`

```python
def Spectrogram(fft_len=2048, hop_len=None, frame_len=None, window=None, pad=0, pad_mode="reflect", power=1., **kwargs)
```

Creates an `nn.Sequential`:

```
Sequential(
  (0): STFT(fft_len=2048, hop_len=512, frame_len=2048)
  (1): ComplexNorm(power=1.0)
)
```

### `Melspectrogram`

```python
def Melspectrogram(num_bands=128, sample_rate=16000, min_freq=0.0, max_freq=None, num_bins=None, htk=False, mel_filterbank=None, **kwargs)
```

Creates an `nn.Sequential`:

```
Sequential(
  (0): STFT(fft_len=2048, hop_len=512, frame_len=2048)
  (1): ComplexNorm(power=2.0)
  (2): ApplyFilterbank()
)
```

### `AmplitudeToDb`/`amplitude_to_db`

```python
class AmplitudeToDb(ref=1.0, amin=1e-7)

def amplitude_to_db(x, ref=1.0, amin=1e-7)
```

Argument names and the default value of `ref` follow librosa. The default value of `amin`, however, follows Keras's float32 epsilon, which seems sensible. (A functional sketch is included at the end of this README.)

### `DbToAmplitude`/`db_to_amplitude`

```python
class DbToAmplitude(ref=1.0)

def db_to_amplitude(x, ref=1.0)
```

### `MuLawEncoding`/`mu_law_encoding`

```python
class MuLawEncoding(n_quantize=256)

def mu_law_encoding(x, n_quantize=256)
```

### `MuLawDecoding`/`mu_law_decoding`

```python
class MuLawDecoding(n_quantize=256)

def mu_law_decoding(x_mu, n_quantize=256)
```

----------

# A Big Issue - Remove SoX Dependency

We propose to remove the SoX dependency because:

* Many audio ML tasks don't require the functionality included in SoX (filtering, cutting, effects).
* Many torchaudio issues are related to the installation of SoX. While this could be simplified by a [conda build or a wheel](https://github.com/pytorch/builder/issues/279), the repo would remain difficult to maintain.
* SoX doesn't support MP4 containers, which makes it unusable for multi-stream audio.
* Loading speed with torchaudio is good, but e.g. for __wav__ files it is not faster than other libraries (including the cast to a torch tensor), as shown in the graph below. See more detailed benchmarks [here](https://github.com/faroit/python_audio_loading_benchmark).

![](https://raw.githubusercontent.com/faroit/python_audio_loading_benchmark/master/results/benchmark_pytorch.png)

## Proposal

Introduce I/O backends and move the functions that depend on `_torch_sox` into a `backend_sox.py` that is *not* required for installation. We could then introduce further backends such as scipy.io or pysoundfile. Each backend imports its (optional) third-party lib within the backend file, and each backend implements a minimum spec such as the following (a dispatch sketch is included at the end of this README):

```python
import _torch_sox

def load(...)  # returns (audio, rate)
def save(...)  # writes a file
def info(...)  # returns metadata without reading the full file
```

### Backend proposals

* `scipy.io` or `soundfile` as the default for __wav__ files
* `aubio` or `audioread` for __mp3__ and __mp4__

### Installation

```bash
pip install -e .
```

### Importing

```python
import torchaudio_contrib
```

## Authors

Keunwoo Choi, Fabian-Robert Stöter, Kiran Sanjeevan, Jan Schlüter
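## Appendix: `amplitude_to_db` sketch

A minimal functional sketch of `amplitude_to_db` under the defaults above, following librosa's definition; this is an illustration, not the repo's implementation:

```python
import math

import torch

def amplitude_to_db(x, ref=1.0, amin=1e-7):
    # 20 * log10(max(x, amin)) - 20 * log10(max(ref, amin)):
    # decibels relative to `ref`, with `amin` guarding against log(0).
    x_db = 20.0 * torch.log10(torch.clamp(x, min=amin))
    return x_db - 20.0 * math.log10(max(ref, amin))

print(amplitude_to_db(torch.tensor([1.0, 0.1, 0.0])))
# approximately tensor([   0., -20., -140.])
```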
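## Appendix: backend dispatch sketch

A minimal sketch of the backend dispatch from the proposal above, assuming `soundfile` as the default __wav__ backend. The `set_backend` helper and the `backend_*.py` module layout are illustrative assumptions, not an implemented API:

```python
import importlib

_BACKEND = "soundfile"  # no hard SoX dependency by default

def set_backend(name):
    """Select an I/O backend, e.g. 'soundfile' or 'sox'."""
    global _BACKEND
    _BACKEND = name

def _backend():
    # Each backend module imports its (optional) third-party lib itself,
    # so e.g. _torch_sox is only needed once the sox backend is selected.
    return importlib.import_module("backend_" + _BACKEND)

def load(filepath, **kwargs):
    return _backend().load(filepath, **kwargs)  # returns (audio, rate)

def save(filepath, audio, rate, **kwargs):
    return _backend().save(filepath, audio, rate, **kwargs)

def info(filepath):
    # Returns metadata without reading the full file.
    return _backend().info(filepath)
```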