
Kaldi Study Notes: The Kaldi Speech Recognition Toolkit (Part 1)

Date: 2022-06-11 02:19:05


I recently read the paper on Kaldi; here is an introduction to it.

Abstract:

We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

Notes:

1. Kaldi is a free, open-source toolkit for speech recognition research.

2. A finite-state transducer (FST) is a finite-state automaton with two tapes.

3. OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs).

4. Means and mixture weights vary in a subspace of the total parameter space. We call this a Subspace Gaussian Mixture Model (SGMM).

I INTRODUCTION

Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.

The goal of Kaldi is to have modern and flexible code that is easy to understand, modify and extend. Kaldi is available on SourceForge (see /). The tools compile on the commonly used Unix-like systems and on Microsoft Windows.

Notes:

1. Kaldi is available on SourceForge (see /).

2. The tools compile on Unix-like systems and on Microsoft Windows.

Researchers on automatic speech recognition (ASR) have several potential choices of open-source toolkits for building a recognition system. Notable among these are: HTK [1], Julius [2] (both written in C), Sphinx-4 [3] (written in Java), and the RWTH ASR toolkit [4] (written in C++). Yet our specific requirements, namely a finite-state transducer (FST) based framework, extensive linear algebra support, and a non-restrictive license, led to the development of Kaldi. Important features of Kaldi include:

Integration with Finite State Transducers: We compile against the OpenFst toolkit [5] (using it as a library).

Extensive linear algebra support: We include a matrix library that wraps standard BLAS and LAPACK routines.

Extensible design: We attempt to provide our algorithms in the most generic form possible. For instance, our decoders work with an interface that provides a score for a particular frame and FST input symbol. Thus the decoder could work from any suitable source of scores.

Open license: The code is licensed under Apache v2.0, which is one of the least restrictive licenses available.

Complete recipes: We make available complete recipes for building speech recognition systems that work from widely available databases such as those provided by the Linguistic Data Consortium (LDC).

Thorough testing: The goal is for all or nearly all the code to have corresponding test routines.

The main intended use for Kaldi is acoustic modeling research; thus, we view the closest competitors as being HTK and the RWTH ASR toolkit (RASR). The chief advantage versus HTK is modern, flexible, cleanly structured code and better WFST and math support; also, our license terms are more open than either HTK or RASR.

Notes:

1. Kaldi's main intended use is acoustic modeling research.

2. Advantages: modern, flexible, cleanly structured code; better WFST and math support; more open license terms.

The paper is organized as follows: we start by describing the structure of the code and design choices (section II). This is followed by describing the individual components of a speech recognition system that the toolkit supports: feature extraction (section III), acoustic modeling (section IV), phonetic decision trees (section V), language modeling (section VI), and decoders (section VIII). Finally, we provide some benchmarking results in section IX.

II OVERVIEW OF THE TOOLKIT

We give a schematic overview of the Kaldi toolkit in Figure 1. The toolkit depends on two external libraries that are also freely available: one is OpenFst [5] for the finite-state framework, and the other is numerical algebra libraries. We use the standard "Basic Linear Algebra Subroutines" (BLAS) and "Linear Algebra PACKage" (LAPACK) routines for the latter.

Notes:

1. External libraries: OpenFst and numerical algebra libraries (BLAS/LAPACK).

The library modules can be grouped into two distinct halves, each depending on only one of the external libraries (cf. Figure 1). A single module, the DecodableInterface (section VIII), bridges these two halves.

Notes:

1. The DecodableInterface bridges these two halves.
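To make the role of this bridge concrete, here is a minimal Python sketch of the idea (the names `Decodable` and `MatrixDecodable` are illustrative, not Kaldi's actual C++ API): the decoder only ever asks the interface for a score given a frame index and an FST input symbol, so any score source can be plugged in behind it.

```python
import numpy as np
from abc import ABC, abstractmethod

class Decodable(ABC):
    """Sketch of the bridging idea: the decoder side sees only this
    interface, never the acoustic model behind it."""
    @abstractmethod
    def log_likelihood(self, frame: int, index: int) -> float:
        """Score for an (FST input symbol, frame) pair."""
    @abstractmethod
    def num_frames(self) -> int:
        """Number of frames available."""

class MatrixDecodable(Decodable):
    """One possible score source: a precomputed (frames x symbols)
    matrix of acoustic log-likelihoods."""
    def __init__(self, scores: np.ndarray):
        self.scores = scores
    def log_likelihood(self, frame, index):
        return float(self.scores[frame, index])
    def num_frames(self):
        return self.scores.shape[0]
```

Because the decoder depends only on `Decodable`, swapping a GMM-based scorer for any other model requires no decoder changes, which is the point of the bridge.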

Access to the library functionalities is provided through command-line tools written in C++, which are then called from a scripting language for building and running a speech recognizer. Each tool has very specific functionality with a small set of command-line arguments: for example, there are separate executables for accumulating statistics, summing accumulators, and updating a GMM-based acoustic model using maximum likelihood estimation. Moreover, all the tools can read from and write to pipes, which makes it easy to chain together different tools.

To avoid "code rot", we have tried to structure the toolkit in such a way that implementing a new feature will generally involve adding new code and command-line tools rather than modifying existing ones.

Notes:

1. Command-line tools written in C++ provide access to the library functionalities.

III FEATURE EXTRACTION

Our feature extraction and waveform-reading code aims to create standard MFCC and PLP features, setting reasonable defaults but leaving available the options that people are most likely to want to tweak (for example, the number of mel bins, minimum and maximum frequency cutoffs, etc.). We support most commonly used feature extraction approaches: e.g. VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.

Notes:

1. Features: MFCC and PLP.

2. Feature extraction approaches: VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.
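As a small illustration of one of the approaches listed above, here is a NumPy sketch of per-utterance cepstral mean and variance normalization (the function name `cmvn` and its signature are ours, not Kaldi's): each feature dimension is shifted to zero mean and, optionally, scaled to unit variance over the utterance.

```python
import numpy as np

def cmvn(feats: np.ndarray, norm_vars: bool = True) -> np.ndarray:
    """Per-utterance cepstral mean (and variance) normalization.
    feats: (num_frames, feat_dim) matrix of e.g. MFCC features."""
    out = feats - feats.mean(axis=0)      # zero mean per dimension
    if norm_vars:
        out = out / feats.std(axis=0)     # unit variance per dimension
    return out
```

In practice Kaldi computes such statistics per speaker or per utterance via its command-line tools; this sketch only shows the underlying arithmetic.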

IV ACOUSTIC MODELING

Our aim is for Kaldi to support conventional models (i.e. diagonal GMMs) and Subspace Gaussian Mixture Models (SGMMs), but to also be easily extensible to new kinds of model.

Notes:

1. Diagonal GMMs

2. Subspace Gaussian Mixture Models (SGMMs)

A. Gaussian mixture models

We support GMMs with diagonal and full covariance structures. Rather than representing individual Gaussian densities separately, we directly implement a GMM class that is parametrized by the natural parameters, i.e. means times inverse covariances and inverse covariances. The GMM classes also store the constant term in likelihood computation, which consists of all the terms that do not depend on the data vector. Such an implementation is suitable for efficient log-likelihood computation with simple dot-products.
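The natural-parameter trick can be sketched in a few lines of NumPy for the diagonal-covariance case (`DiagGmmSketch` is an illustrative toy, not Kaldi's `DiagGmm` class): storing means times inverse variances, the inverse variances, and a per-component constant turns each per-component log-likelihood into two dot products with the data vector.

```python
import numpy as np

class DiagGmmSketch:
    """Toy diagonal-covariance GMM stored in natural parameters."""
    def __init__(self, weights, means, variances):
        # Natural parameters: inverse variances and means times inverse variances.
        self.inv_vars = 1.0 / variances
        self.means_invvars = means * self.inv_vars
        # Per-component constant: everything that does not depend on the data,
        # log w_m - 0.5 * sum_d (log(2*pi*var_md) + mean_md^2 / var_md).
        self.gconsts = (np.log(weights)
                        - 0.5 * np.sum(np.log(2.0 * np.pi * variances)
                                       + means ** 2 * self.inv_vars, axis=1))

    def log_likelihood(self, x):
        # Per-component log-likelihoods via two dot products per component.
        ll = (self.gconsts
              + self.means_invvars @ x
              - 0.5 * (self.inv_vars @ (x * x)))
        m = ll.max()
        return m + np.log(np.sum(np.exp(ll - m)))  # log-sum-exp over components
```

Since `x` and `x * x` are shared across all components, the whole evaluation reduces to matrix-vector products, which is what makes this parametrization efficient.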

B. GMM-based acoustic model

The "acoustic model" class AmDiagGmm represents a collection of DiagGmm objects, indexed by "pdf-ids" that correspond to context-dependent HMM states. This class does not represent any HMM structure, but just a collection of densities (i.e. GMMs). There are separate classes that represent the HMM structure, principally the topology and transition-modeling code and the code responsible for compiling decoding graphs, which provide a mapping between the HMM states and the pdf index of the acoustic model class. Speaker adaptation and other linear transforms like maximum likelihood linear transform (MLLT) [6] or semi-tied covariance (STC) [7] are implemented by separate classes.

C. HMM Topology

It is possible in Kaldi to separately specify the HMM topology for each context-independent phone. The topology format allows nonemitting states, and allows the user to pre-specify tying of the p.d.f.'s in different HMM states.

D. Speaker adaptation

We support both model-space adaptation using maximum likelihood linear regression (MLLR) [8] and feature-space adaptation using feature-space MLLR (fMLLR), also known as constrained MLLR [9]. For both MLLR and fMLLR, multiple transforms can be estimated using a regression tree [10]. When a single fMLLR transform is needed, it can be used as an additional processing step in the feature pipeline. The toolkit also supports speaker normalization using a linear approximation to VTLN, similar to [11], or conventional feature-level VTLN, or a more generic approach for gender normalization which we call the "exponential transform" [12]. Both fMLLR and VTLN can be used for speaker adaptive training (SAT) of the acoustic models.

Notes:

1. Model-space adaptation uses maximum likelihood linear regression (MLLR); feature-space adaptation uses feature-space MLLR (fMLLR).
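When a single fMLLR transform is used as a feature-pipeline step, it is just an affine map applied to every feature vector. A minimal sketch, assuming the transform is given as a D x (D+1) matrix W = [A | b] (a common convention for constrained MLLR; the helper name `apply_fmllr` is ours):

```python
import numpy as np

def apply_fmllr(feats: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply one fMLLR (constrained MLLR) transform to feature rows.
    feats: (num_frames, D) matrix; W: D x (D+1) matrix [A | b].
    Each frame x is mapped to A @ x + b."""
    A, b = W[:, :-1], W[:, -1]
    return feats @ A.T + b
```

Because the transform acts on the features rather than the model, the same adapted features can feed any acoustic model downstream, which is why fMLLR fits naturally into the feature pipeline.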

E. Subspace Gaussian Mixture Models

For subspace Gaussian mixture models (SGMMs), the toolkit provides an implementation of the approach described in [13]. There is a single class AmSgmm that represents a whole collection of pdf's; unlike the GMM case, there is no class that represents a single pdf of the SGMM. Similar to the GMM case, however, separate classes handle model estimation and speaker adaptation using fMLLR.
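The subspace idea from note 4 above (means varying in a subspace of the full parameter space) can be sketched numerically: each state j is described by a low-dimensional vector v_j, and shared projection matrices M_i map it to that state's mean for Gaussian index i, mu_{j,i} = M_i v_j. The function name and array layout here are illustrative, not Kaldi's `AmSgmm` API.

```python
import numpy as np

def sgmm_means(M: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Sketch of the SGMM mean construction mu_{j,i} = M_i @ v_j.
    M: (I, D, S) shared projections for I Gaussian indices,
       feature dim D, subspace dim S.
    v: (J, S) per-state vectors for J states.
    Returns a (J, I, D) array of state- and index-specific means."""
    return np.einsum('ids,js->jid', M, v)
```

The payoff is parameter sharing: the per-state cost is only the S-dimensional vector v_j, while the (much larger) projections M_i are shared across all states.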
