Implementing Two-Way Intercom over the JT/T 1078 Protocol

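#### Preface
First, a disclaimer: I worked from third-party C# code that my company purchased. Its overall approach is close to what I had in mind, but it touches on audio encoding and decoding issues I had never dealt with before. So here I only share the approach plus some usable test code. Too much is involved for me to publish a complete, ready-to-run program, so the full implementation is left to you.
#### Analysis
When I first got this requirement and read the 1078 protocol document, I found that it only briefly covers the commands for real-time audio/video transmission and the RTP payload packet format the device uses to upload audio and video after receiving those commands.
What? The document only describes audio/video upload from the device. If this is an intercom, how does the microphone audio get to the device?
Personally I find the technology behind these ministry standards rather dated. Retrieving recorded video from a device and uploading the file still uses FTP? Nowadays files usually go straight to a third-party object storage service such as Alibaba Cloud OSS. With FTP we may have to receive the file on the FTP side and then copy it to OSS again, which is redundant. Real-time audio/video is the same story: why not push to a third-party streaming media server directly, instead of sending RTP packets that then have to be forwarded to one?
Enough ranting. Back to the question of how microphone audio reaches the device: since the document does not say, we have to guess. It turns out the downlink works the same way as the device-to-forwarding-server direction and uses the same RTP payload packet format from the document.
For the intercom I will not discuss the video stream, only the pitfalls I ran into with the audio stream.
Start by debugging the `device audio -> video forwarding server` direction. Below is the audio stream document supplied by the device vendor: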
```
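MDVR terminal audio parameter configuration

Sample rate: 8 kHz
Sample precision: 16 bits
Samples per frame: 320

Encoding formats:
	AUDIO_CODEC_G711A,
	AUDIO_CODEC_G726_MEDIA_40K,
	AUDIO_CODEC_G726_MEDIA_32K,
	AUDIO_CODEC_G726_40K,
	AUDIO_CODEC_G726_32K,
Any of the above encoding formats can be selected.

Important notes:
1. Because encoding is done on a HiSilicon chip, every audio frame has a 4-byte audio frame header prepended to the standard audio frame; the platform must strip these 4 bytes before decoding.
2. G726 comes in several compression-rate variants. The decoding libraries of many standard players only support AUDIO_CODEC_G726_MEDIA_40K and AUDIO_CODEC_G726_MEDIA_32K, not AUDIO_CODEC_G726_40K and AUDIO_CODEC_G726_32K. The compression rates look the same but the encoding rules differ, so pay attention to the audio decoder parameter configuration.
3. G711A is simpler than G726, so it is a good format to start debugging with.
4. Example: sample rate 8 kHz, sample precision 16 bits, 320 samples per frame
Raw audio bit rate: 8K*16bit=128kbps
Audio frames per second: 8K/320=25 frames
With AUDIO_CODEC_G726_40K, for example, each frame is 40Kbit/8bit/25=200 bytes, i.e. each encoded audio frame is 200 bytes long, so the device uploads 204 bytes, the first 4 of which are the HiSilicon private audio frame header.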
```
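The arithmetic in note 4 and the header stripping in note 1 can be sketched in C as follows. This is only a minimal illustration under the vendor document's figures; the function names `encoded_frame_bytes` and `strip_hisi_header` are made up for this example and do not come from any vendor SDK.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the private header described in the vendor document. */
#define HISI_AUDIO_HEADER_LEN 4

/* Encoded bytes per frame = bit rate / 8 bits per byte / frames per second. */
size_t encoded_frame_bytes(unsigned bitrate_bps, unsigned sample_rate_hz,
                           unsigned samples_per_frame)
{
    unsigned frames_per_second = sample_rate_hz / samples_per_frame; /* 8000 / 320 = 25 */
    return (size_t)(bitrate_bps / 8 / frames_per_second);
}

/* Skip the 4-byte private header of an uploaded audio frame.
 * Returns a pointer to the codec payload and stores its length in *out_len,
 * or returns NULL if the frame is too short to contain any payload. */
const uint8_t *strip_hisi_header(const uint8_t *frame, size_t frame_len,
                                 size_t *out_len)
{
    if (frame == NULL || out_len == NULL || frame_len <= HISI_AUDIO_HEADER_LEN)
        return NULL;
    *out_len = frame_len - HISI_AUDIO_HEADER_LEN;
    return frame + HISI_AUDIO_HEADER_LEN;
}
```

For `G726_40K` this gives 200 bytes of payload inside a 204-byte uploaded frame, matching the vendor's example; for `G726_32K` the same formula gives 160 bytes.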
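During testing I only tried `G726_32K` and `G726_40K`; the encoding format is changed on the device with its remote control.
Having had no prior exposure to audio, I was naive at first: I pushed the received audio stream straight to the RTMP streaming server without caring about the audio encoding format at all.
What kind of audio formats are `G726_32K` and `G726_40K`? I had only seen common ones such as mp3, ogg and aac, and honestly could not name a fourth...
A web search turned up nobody pushing either of these formats directly as a stream.
Since that is not an option, convert to AAC, because there is plenty of material about AAC online. The conversion steps are:
 1. Decode `G726_32K` or `G726_40K` to get `PCM` audio data
 2. Encode the `PCM` data to `AAC` (after some time exploring, I found that most conversions between audio formats go through PCM; PCM seems to act as the raw form of audio, and conversions between different formats use it as the intermediate step)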

[For the G726 to AAC conversion, see this open-source project][1]
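A G.726 decoder is too long to show here, so as an illustration of step 1 ("decode to PCM") here is a sketch for the simpler G.711A format that the vendor document recommends for debugging. It follows the widely used reference A-law expansion; it is not taken from the linked project, and the function names are my own.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one G.711 A-law byte to a 16-bit linear PCM sample. */
int16_t alaw_to_pcm16(uint8_t a)
{
    a ^= 0x55;                       /* undo the even-bit inversion */
    int t = (a & 0x0F) << 4;         /* 4-bit mantissa */
    int seg = (a & 0x70) >> 4;       /* 3-bit segment (exponent) */
    if (seg == 0) {
        t += 8;
    } else {
        t += 0x108;
        t <<= seg - 1;
    }
    /* In A-law a set sign bit means a positive sample. */
    return (int16_t)((a & 0x80) ? t : -t);
}

/* Decode n A-law bytes into n PCM samples; out must hold n samples. */
void g711a_decode(const uint8_t *in, size_t n, int16_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = alaw_to_pcm16(in[i]);
}
```

Each A-law byte expands to one 16-bit sample, so the PCM output is twice the size of the input. The resulting samples are what step 2 would hand to an AAC encoder.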

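Once `G726` has been converted to `AAC`, the stream is pushed to the RTMP streaming server. That part gave me little trouble, so I will not go into it.

Now for the main topic, getting microphone audio from the platform to the device:
We build a WEB platform, so the intercom has to run in a web page. Doing intercom in a web page sounded far-fetched at first, and my first idea was not to use the 1078 protocol at all. The company had developed its own devices earlier, so adding features to them is flexible and we are not bound to the ministry standard. The initial plan was therefore:
Web page pushes microphone audio to the RTMP streaming server -> a command is sent to the device telling it to pull the audio stream and play it
The result was not great, because none of the rtmp players we found for the device could be configured for zero-latency playback.

Later, when I was asked to do it over JT/T 1078 after all, my first thought was again to have the web page push to the RTMP streaming server, have the video forwarding server pull the stream to get the audio data, and finally forward it to the device.

That whole pipeline brings us back to audio encoding and decoding.
First, pushing from a web page to an RTMP streaming server leaves Flash as the only choice, and Flash offers two audio codecs for RTMP publishing: Nellymoser and Speex.
We choose Speex here. It has to be set explicitly on the Flash microphone, otherwise the default is `Nellymoser`: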
```
mic = Microphone.getMicrophone();
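mic.encodeQuality = 6;			// use the default audio quality (without setting it, decoding to PCM on the video server seemed to have problems)
mic.rate = 8;				// sample rate (Speex uses 16 by default; changing rate has no effect)
mic.codec = SoundCodec.SPEEX;		// use Speex encoding
mic.noiseSuppressionLevel = 0;		// noise suppression level
mic.setLoopBack(false);			// no loopback, to avoid echo
mic.framesPerPacket = 1;		// the video forwarding server can only decode one Speex frame per pull, so encode one Speex frame per packet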
```
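One thing to note above: the audio stream between the device and the video forwarding server carries two frames per packet (this follows from the 320 samples per frame in the vendor document and the size of each transmitted G726 payload), so `framesPerPacket` should be set to 1 or 2 (my `speex`-to-`pcm` decoder does not support multi-frame decoding, so I set it to 1). Otherwise you may run into problems such as how to split the frames apart when several arrive in one packet (frame length is not fixed; see [this article][2]).

We pull the Speex audio stream from RTMP with `librtmp`, because each read returns exactly one RTMP frame, which makes the data easy to handle (if framesPerPacket is not set to 1, one RTMP frame may contain several speex frames). The captured data looks like this: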
![微信截图_20181228114025.png][3]

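Ignore the garbage after the hex output; that is just a glitch in how my code prints it. You can see that when the microphone picks up no speech, each RTMP frame is 26 bytes long. According to the article above, the 10 bytes after `B2` are the `speex` audio frame, and the number of irrelevant bytes before and after is fixed, so the `speex` data is easy to extract.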
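Likewise, when someone is speaking, I noted the pulled RTMP frame length as 68; subtracting the 15 irrelevant bytes and then the one-byte `B2` header gives the `speex` audio length, which I recorded as 42. (68 − 15 − 1 is actually 52, so one of those two figures was noted down wrong; the rule itself, frame length minus 16, is what matters.)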
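I searched a lot online and could not find how many bytes to pass when decoding `speex`. Some say 20, others say `PCM` encoded to `speex` is 38 bytes long, but neither matched what I saw. We simply decode using the length of the Speex frame extracted from RTMP, i.e. pull one frame, decode one frame.
The C code for pulling the `Speex` audio stream is below (adapted from an article by 雷霄骅 (Lei Xiaohua)):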
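The extraction rule described above can be sketched like this. I am assuming that the 15 fixed bytes split into an 11-byte FLV tag header in front of `B2` and a 4-byte previous-tag-size field after the payload, which matches the 26-byte and 10-byte figures above; the function name `extract_speex_frame` is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define FLV_TAG_HEADER_LEN     11    /* assumed: fixed bytes before the audio header */
#define FLV_PREV_TAG_SIZE_LEN  4     /* assumed: fixed bytes after the payload */
#define FLV_AUDIO_HEADER_SPEEX 0xB2  /* the B2 byte seen in the capture */

/* Locate the Speex payload inside one frame returned by the RTMP read.
 * On success returns 0 and sets *payload / *payload_len.
 * Returns -1 if the frame is too short or the B2 marker is not where expected. */
int extract_speex_frame(const uint8_t *tag, size_t tag_len,
                        const uint8_t **payload, size_t *payload_len)
{
    const size_t overhead = FLV_TAG_HEADER_LEN + 1 + FLV_PREV_TAG_SIZE_LEN;

    if (tag == NULL || payload == NULL || payload_len == NULL || tag_len <= overhead)
        return -1;
    if (tag[FLV_TAG_HEADER_LEN] != FLV_AUDIO_HEADER_SPEEX)
        return -1;

    *payload = tag + FLV_TAG_HEADER_LEN + 1;
    *payload_len = tag_len - overhead;
    return 0;
}
```

With a 26-byte silent frame this yields a 10-byte Speex payload. Checking for `B2` at the expected offset also guards against reads that do not start on a frame boundary.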
```C
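// LibRtmpTest.cpp: this file contains the "main" function, where program execution begins and ends.
//

#include "pch.h"
#include "librtmp/rtmp_sys.h"
#include "librtmp/log.h"
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Initialise Winsock on Windows; nothing to do on other platforms.
int InitSockets()
{
#ifdef WIN32
	WORD version;
	WSADATA wsaData;
	version = MAKEWORD(1, 1);
	return (WSAStartup(version, &wsaData) == 0);
#else
	return 1;
#endif
}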

void CleanupSockets()
{
#ifdef WIN32
	WSACleanup();
#endif
}

void ByteToHexStr(const unsigned char* source, char* dest, int sourceLen);

int main()
{
	InitSockets();

	int nRead;
	// is this a live stream?
	bool bLiveStream = true;

	int bufsize = 1024 * 1024 * 10;
	char *buf = (char*)malloc(bufsize);
	if (!buf) {
		RTMP_LogPrintf("Out of memory.\n");
		CleanupSockets();
		return -1;
	}
	memset(buf, 0, bufsize);
	long countbufsize = 0;

	FILE *fp = fopen("C:\\Users\\Administrator\\Desktop\\abc.spx", "wb");
	if (!fp) {
		RTMP_LogPrintf("Open File Error.\n");
		free(buf);
		CleanupSockets();
		return -1;
	}

	/* set log level */
	//RTMP_LogLevel loglvl=RTMP_LOGDEBUG;
	//RTMP_LogSetLevel(loglvl);

	RTMP *rtmp = RTMP_Alloc();
	RTMP_Init(rtmp);
	// set connection timeout, default 30s
	rtmp->Link.timeout = 10;

	// HKS's live URL
	if (!RTMP_SetupURL(rtmp, "rtmp://192.168.0.6:19350/live/015986787373"))
	{
		RTMP_Log(RTMP_LOGERROR, "SetupURL Err\n");
		RTMP_Free(rtmp);
		fclose(fp);
		free(buf);
		CleanupSockets();
		return -1;
	}
	if (bLiveStream) {
		rtmp->Link.lFlags |= RTMP_LF_LIVE;
	}

	// 1 hour
	RTMP_SetBufferMS(rtmp, 3600 * 1000);

	if (!RTMP_Connect(rtmp, NULL)) {
		RTMP_Log(RTMP_LOGERROR, "Connect Err\n");
		RTMP_Free(rtmp);
		fclose(fp);
		free(buf);
		CleanupSockets();
		return -1;
	}

	if (!RTMP_ConnectStream(rtmp, 0)) {
		RTMP_Log(RTMP_LOGERROR, "ConnectStream Err\n");
		RTMP_Close(rtmp);
		RTMP_Free(rtmp);
		fclose(fp);
		free(buf);
		CleanupSockets();
		return -1;
	}

	// stop when the read reports end of stream (0) or an error (negative)
	while ((nRead = RTMP_Read(rtmp, buf, bufsize)) > 0) {
		//fwrite(buf, 1, nRead, fp);

		// two hex digits per byte, plus a terminating NUL so it prints cleanly
		char *hex = (char*)malloc(nRead * 2 + 1);
		if (!hex)
			break;
		memset(hex, 0, nRead * 2 + 1);
		ByteToHexStr((const unsigned char*)buf, hex, nRead);

		countbufsize += nRead;
		RTMP_LogPrintf("Receive: %5dByte, Total: %5.2fkB\n", nRead, countbufsize*1.0 / 1024);
		printf("%s\n", hex);
		free(hex);
	}

	if (fp)
		fclose(fp);

	if (buf) {
		free(buf);
	}

	if (rtmp) {
		RTMP_Close(rtmp);
		RTMP_Free(rtmp);
		CleanupSockets();
		rtmp = NULL;
	}
	return 0;
}

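// Convert sourceLen bytes to upper-case hex digits.
// dest must hold at least sourceLen * 2 chars; no NUL terminator is written.
void ByteToHexStr(const unsigned char* source, char* dest, int sourceLen)
{
	int i;	// int, not short: a single read can exceed 32767 bytes
	unsigned char highByte, lowByte;

	for (i = 0; i < sourceLen; i++)
	{
		highByte = source[i] >> 4;
		lowByte = source[i] & 0x0f;

		// '0'..'9' start at 0x30; values above 9 are shifted up to 'A'..'F'
		highByte += 0x30;
		if (highByte > 0x39)
			dest[i * 2] = highByte + 0x07;
		else
			dest[i * 2] = highByte;

		lowByte += 0x30;
		if (lowByte > 0x39)
			dest[i * 2 + 1] = lowByte + 0x07;
		else
			dest[i * 2 + 1] = lowByte;
	}
	return;
}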
```
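If your video forwarding server is not written in C or C++, wrap the code above in a DLL and expose an interface that other languages can call.

Transcoding `Speex` to `PCM` also needs some care. The vendor document above specifies an 8 kHz sample rate and 320 samples per frame, from which I take each transmitted `PCM` frame to be 320 bytes. That corresponds to `Speex` narrowband (160 samples at two bytes per sample, so the decoded PCM is exactly 320 bytes), which means we should choose the narrowband mode when decoding `Speex` to `PCM`.

Here is a C# wrapper class for the `Speex` codec that I found: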
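To make the 160-sample / 320-byte relationship concrete, here is a small C sketch that sizes a narrowband frame and flattens decoded 16-bit samples into bytes. The little-endian byte order and both function names are assumptions for this example; check what your encoder or device actually expects.

```c
#include <stddef.h>
#include <stdint.h>

/* Speex narrowband: 160 samples per frame (20 ms at 8 kHz). */
#define SPEEX_NB_FRAME_SAMPLES 160

/* Bytes needed for a given number of 16-bit PCM samples. */
size_t pcm16_bytes(size_t samples)
{
    return samples * sizeof(int16_t);
}

/* Write count samples to out as little-endian byte pairs.
 * out must hold pcm16_bytes(count) bytes. */
void pcm16_to_le_bytes(const int16_t *samples, size_t count, uint8_t *out)
{
    for (size_t i = 0; i < count; i++) {
        uint16_t u = (uint16_t)samples[i];
        out[2 * i]     = (uint8_t)(u & 0xFF);  /* low byte first */
        out[2 * i + 1] = (uint8_t)(u >> 8);    /* then high byte */
    }
}
```

`pcm16_bytes(SPEEX_NB_FRAME_SAMPLES)` is 320, the per-frame PCM size used above.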
```C#
#region Using directives

using System.Runtime.InteropServices;
using System;

// using System;
// using System.IO;
// using System.Collections.Generic;
// using System.Text;
// using System.Runtime.InteropServices;
// using System.Diagnostics;

#endregion Using directives

namespace GpsNET.RTMP
{
    public enum SpeexCtlCode {
        // Set enhancement on/off (decoder only)
        SPEEX_SET_ENH=0,
        // Get enhancement state (decoder only)
        SPEEX_GET_ENH=1,
        // Obtain frame size used by encoder/decoder
        SPEEX_GET_FRAME_SIZE=3,
        // Set quality value
        SPEEX_SET_QUALITY=4,
        // Get current quality setting
        // SPEEX_GET_QUALITY=5 -- Doesn't make much sense, does it?
        // Set sub-mode to use
        SPEEX_SET_MODE=6,
        // Get current sub-mode in use
        SPEEX_GET_MODE=7,

        // Set low-band sub-mode to use (wideband only)
        SPEEX_SET_LOW_MODE=8,
        // Get current low-band mode in use (wideband only)
        SPEEX_GET_LOW_MODE=9,

        // Set high-band sub-mode to use (wideband only)
        SPEEX_SET_HIGH_MODE=10,
        // Get current high-band mode in use (wideband only)
        SPEEX_GET_HIGH_MODE=11,

        // Set VBR on (1) or off (0)
        SPEEX_SET_VBR=12,
        // Get VBR status (1 for on, 0 for off)
        SPEEX_GET_VBR=13,

        // Set quality value for VBR encoding (0-10)
        SPEEX_SET_VBR_QUALITY=14,
        // Get current quality value for VBR encoding (0-10)
        SPEEX_GET_VBR_QUALITY=15,

        // Set complexity of the encoder (0-10)
        SPEEX_SET_COMPLEXITY=16,
        // Get current complexity of the encoder (0-10)
        SPEEX_GET_COMPLEXITY=17,

        // Set bit-rate used by the encoder (or lower)
        SPEEX_SET_BITRATE=18,
        // Get current bit-rate used by the encoder or decoder
        SPEEX_GET_BITRATE=19,

        // Define a handler function for in-band Speex request
        SPEEX_SET_HANDLER=20,

        // Define a handler function for in-band user-defined request
        SPEEX_SET_USER_HANDLER=22,

        // Set sampling rate used in bit-rate computation
        SPEEX_SET_SAMPLING_RATE=24,
        // Get sampling rate used in bit-rate computation
        SPEEX_GET_SAMPLING_RATE=25,

        // Reset the encoder/decoder memories to zero
        SPEEX_RESET_STATE=26,

        // Get VBR info (mostly used internally)
        SPEEX_GET_RELATIVE_QUALITY=29,

        // Set VAD status (1 for on, 0 for off)
        SPEEX_SET_VAD=30,

        // Get VAD status (1 for on, 0 for off)
        SPEEX_GET_VAD=31,

        // Set Average Bit-Rate (ABR) to n bits per seconds
        SPEEX_SET_ABR=32,
        // Get Average Bit-Rate (ABR) setting (in bps)
        SPEEX_GET_ABR=33,

        // Set DTX status (1 for on, 0 for off)
        SPEEX_SET_DTX=34,
        // Get DTX status (1 for on, 0 for off)
        SPEEX_GET_DTX=35,

        // Set submode encoding in each frame (1 for yes, 0 for no, setting to no breaks the standard)
        SPEEX_SET_SUBMODE_ENCODING=36,
        // Get submode encoding in each frame
        SPEEX_GET_SUBMODE_ENCODING=37,

        // SPEEX_SET_LOOKAHEAD=38,
        // Returns the lookahead used by Speex
        SPEEX_GET_LOOKAHEAD=39,

        // Sets tuning for packet-loss concealment (expected loss rate)
        SPEEX_SET_PLC_TUNING=40,
        // Gets tuning for PLC
        SPEEX_GET_PLC_TUNING=41,

        // Sets the max bit-rate allowed in VBR mode
        SPEEX_SET_VBR_MAX_BITRATE=42,
        // Gets the max bit-rate allowed in VBR mode
        SPEEX_GET_VBR_MAX_BITRATE=43,

        // Turn on/off input/output high-pass filtering
        SPEEX_SET_HIGHPASS=44,
        // Get status of input/output high-pass filtering
        SPEEX_GET_HIGHPASS=45,

        // Get "activity level" of the last decoded frame, i.e,
        // how much damage we cause if we remove the frame
        SPEEX_GET_ACTIVITY=47
    }

    // Preserving compatibility:
    public enum SpeexCompatCode {
        // Equivalent to SPEEX_SET_ENH
        SPEEX_SET_PF=0,
        // Equivalent to SPEEX_GET_ENH
        SPEEX_GET_PF=1
    }

    // Values allowed for mode queries
    public enum SpeexModeQuery {
        // Query the frame size of a mode
        SPEEX_MODE_FRAME_SIZE=0,

        // Query the size of an encoded frame for a particular sub-mode
        SPEEX_SUBMODE_BITS_PER_FRAME=1
    }

    public enum SpeexVersion {
        // Get major Speex version
        SPEEX_LIB_GET_MAJOR_VERSION=1,
        // Get minor Speex version
        SPEEX_LIB_GET_MINOR_VERSION=3,
        // Get micro Speex version
        SPEEX_LIB_GET_MICRO_VERSION=5,
        // Get extra Speex version
        SPEEX_LIB_GET_EXTRA_VERSION=7,
        // Get Speex version string
        SPEEX_LIB_GET_VERSION_STRING=9,

        //     SPEEX_LIB_SET_ALLOC_FUNC=10,
        //     SPEEX_LIB_GET_ALLOC_FUNC=11,
        //     SPEEX_LIB_SET_FREE_FUNC=12,
        //     SPEEX_LIB_GET_FREE_FUNC=13,

        //     SPEEX_LIB_SET_WARNING_FUNC=14,
        //     SPEEX_LIB_GET_WARNING_FUNC=15,
        //     SPEEX_LIB_SET_ERROR_FUNC=16,
        //     SPEEX_LIB_GET_ERROR_FUNC=17,
    }
    
    // Modes supported by Speex
    public enum SpeexBand {
        // modeID for the defined narrowband mode
        SPEEX_MODEID_NB=0,
        // modeID for the defined wideband mode
        SPEEX_MODEID_WB=1,
        // modeID for the defined ultra-wideband mode
        SPEEX_MODEID_UWB=2,
        // Number of defined modes in Speex
        SPEEX_NB_MODES=3
    }

    // Preprocessor control codes
    public enum PreprocessCtlCode {

        // Set preprocessor denoiser state
        SPEEX_PREPROCESS_SET_DENOISE=0,
        // Get preprocessor denoiser state
        SPEEX_PREPROCESS_GET_DENOISE=1,

        // Set preprocessor Automatic Gain Control state
        SPEEX_PREPROCESS_SET_AGC=2,
        // Get preprocessor Automatic Gain Control state
        SPEEX_PREPROCESS_GET_AGC=3,

        // Set preprocessor Voice Activity Detection state
        SPEEX_PREPROCESS_SET_VAD=4,
        // Get preprocessor Voice Activity Detection state
        SPEEX_PREPROCESS_GET_VAD=5,

        // Set preprocessor Automatic Gain Control level
        SPEEX_PREPROCESS_SET_AGC_LEVEL=6,
        // Get preprocessor Automatic Gain Control level
        SPEEX_PREPROCESS_GET_AGC_LEVEL=7,

        // Set preprocessor dereverb state
        SPEEX_PREPROCESS_SET_DEREVERB=8,
        // Get preprocessor dereverb state
        SPEEX_PREPROCESS_GET_DEREVERB=9,

        // Set preprocessor dereverb level
        SPEEX_PREPROCESS_SET_DEREVERB_LEVEL=10,
        // Get preprocessor dereverb level
        SPEEX_PREPROCESS_GET_DEREVERB_LEVEL=11,

        // Set preprocessor dereverb decay
        SPEEX_PREPROCESS_SET_DEREVERB_DECAY=12,
        // Get preprocessor dereverb decay
        SPEEX_PREPROCESS_GET_DEREVERB_DECAY=13,

        // Set probability required for the VAD to go from silence to voice
        SPEEX_PREPROCESS_SET_PROB_START=14,
        // Get probability required for the VAD to go from silence to voice
        SPEEX_PREPROCESS_GET_PROB_START=15,

        // Set probability required for the VAD to stay in the voice state (integer percent)
        SPEEX_PREPROCESS_SET_PROB_CONTINUE=16,
        // Get probability required for the VAD to stay in the voice state (integer percent)
        SPEEX_PREPROCESS_GET_PROB_CONTINUE=17,

        // Set maximum attenuation of the noise in dB (negative number)
        SPEEX_PREPROCESS_SET_NOISE_SUPPRESS=18,
        // Get maximum attenuation of the noise in dB (negative number)
        SPEEX_PREPROCESS_GET_NOISE_SUPPRESS=19,

        // Set maximum attenuation of the residual echo in dB (negative number)
        SPEEX_PREPROCESS_SET_ECHO_SUPPRESS=20,
        // Get maximum attenuation of the residual echo in dB (negative number)
        SPEEX_PREPROCESS_GET_ECHO_SUPPRESS=21,

        // Set maximum attenuation of the residual echo in dB when near end is active (negative number)
        SPEEX_PREPROCESS_SET_ECHO_SUPPRESS_ACTIVE=22,
        // Get maximum attenuation of the residual echo in dB when near end is active (negative number)
        SPEEX_PREPROCESS_GET_ECHO_SUPPRESS_ACTIVE=23,

        // Set the corresponding echo canceller state so that residual echo suppression can be performed (NULL for no residual echo suppression)
        SPEEX_PREPROCESS_SET_ECHO_STATE=24,
        // Get the corresponding echo canceller state
        SPEEX_PREPROCESS_GET_ECHO_STATE=25,

        // Set maximal gain increase in dB/second (int32)
        SPEEX_PREPROCESS_SET_AGC_INCREMENT=26,

        // Get maximal gain increase in dB/second (int32)
        SPEEX_PREPROCESS_GET_AGC_INCREMENT=27,

        // Set maximal gain decrease in dB/second (int32)
        SPEEX_PREPROCESS_SET_AGC_DECREMENT=28,

        // Get maximal gain decrease in dB/second (int32)
        SPEEX_PREPROCESS_GET_AGC_DECREMENT=29,

        // Set maximal gain in dB (int32)
        SPEEX_PREPROCESS_SET_AGC_MAX_GAIN=30,

        // Get maximal gain in dB (int32)
        SPEEX_PREPROCESS_GET_AGC_MAX_GAIN=31,

        //  Can't set loudness
        // Get loudness
        SPEEX_PREPROCESS_GET_AGC_LOUDNESS=33
    }

    public enum JitterBufferRetCode {
        // Packet has been retrieved
        JITTER_BUFFER_OK = 0,
        // Packet is lost or is late
        JITTER_BUFFER_MISSING = 1,
        // A "fake" packet is meant to be inserted here to increase buffering
        JITTER_BUFFER_INSERTION = 2,
        // There was an error in the jitter buffer
        JITTER_BUFFER_INTERNAL_ERROR = -1,
        // Invalid argument
        JITTER_BUFFER_BAD_ARGUMENT = -2,
    }

    // Jitter Buffer Control Codes
    public enum JitterBufferCtlCode {
        // Set minimum amount of extra buffering required (margin)
        JITTER_BUFFER_SET_MARGIN = 0,
        // Get minimum amount of extra buffering required (margin)
        JITTER_BUFFER_GET_MARGIN = 1,
        /* JITTER_BUFFER_SET_AVAILABLE_COUNT wouldn't make sense */

        // Get the amount of available packets currently buffered
        JITTER_BUFFER_GET_AVAILABLE_COUNT = 3,
        // Included because of an early misspelling (will remove in next release)
        JITTER_BUFFER_GET_AVALIABLE_COUNT = 3,

        // Assign a function to destroy unused packet. When setting
        // that, the jitter buffer no longer copies packet data.
        JITTER_BUFFER_SET_DESTROY_CALLBACK = 4,
        JITTER_BUFFER_GET_DESTROY_CALLBACK = 5,

        // Tell the jitter buffer to only adjust the delay in
        // multiples of the step parameter provided
        JITTER_BUFFER_SET_DELAY_STEP = 6,
        JITTER_BUFFER_GET_DELAY_STEP = 7,

        // Tell the jitter buffer to only do concealment in multiples of the size parameter provided
        JITTER_BUFFER_SET_CONCEALMENT_SIZE = 8,
        JITTER_BUFFER_GET_CONCEALMENT_SIZE = 9,

        // Absolute max amount of loss that can be tolerated
        // regardless of the delay. Typical loss should be half of
        // that or less.
        JITTER_BUFFER_SET_MAX_LATE_RATE = 10,
        JITTER_BUFFER_GET_MAX_LATE_RATE = 11,

        // Equivalent cost of one percent late packet in timestamp units
        JITTER_BUFFER_SET_LATE_COST = 12,
        JITTER_BUFFER_GET_LATE_COST = 13
    }
    
    public unsafe class SpeexCodec {

        public struct SpeexBits
        {
            char* chars;        /* "raw" data */
            int nbBits;         /* Total number of bits stored in the stream*/
            int charPtr;        /* Position of the byte "cursor" */
            int bitPtr;         /* Position of the bit "cursor" within the current char */
            int owner;          /* Does the struct "own" the "raw" buffer (member "chars") */
            int overflow;       /* Set to one if we try to read past the valid data */
            int buf_size;       /* Allocated size for buffer */
            int reserved1;      /* Reserved for future use */
            void* reserved2;    /* Reserved for future use */
        }

        public struct JitterBuffer {
        }

        public struct JitterBufferPacket {
            public byte *data;         /* Data bytes contained in the packet */
            public uint len;           /* Length of the packet in bytes */
            public uint timestamp;     /* Timestamp for the packet */
            public uint span;          /* Time covered by the packet (timestamp units) */
        }

        public struct SpeexPreprocessState {
        }
        
        // EXPORTED ENCODER METHODS

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void *speex_encoder_init_new(int modeID);

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_encoder_ctl(void *state, int request, void *ptr);

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_encode_int(void *state, short *input, SpeexBits *bits);	// IntPtr

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_encoder_destroy(void* state);

        // EXPORTED ENCODER BIT-OPERATION METHODS

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_bits_write(SpeexBits *bits, byte *bytes, int max_len);	// char *

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_bits_write_whole_bytes(SpeexBits *bits, byte *bytes, int max_len);	// char *

        // EXPORTED DECODER METHODS

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void* speex_decoder_init_new(int modeID);

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_decoder_ctl(void *state, int request, void *ptr);

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_bits_read_from(SpeexBits *bits, byte *inputBuffer, int inputByteCount);

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_decode_int(void* state, SpeexBits *bits, short* output);	// IntPtr

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_decoder_destroy(void* state);

        // Preprocessor API

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern SpeexPreprocessState *speex_preprocess_state_init(int frame_size, int sampling_rate);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void speex_preprocess_state_destroy(SpeexPreprocessState *st);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_preprocess_run(SpeexPreprocessState *st, short *x);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void speex_preprocess_estimate_update(SpeexPreprocessState *st, short *x);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_preprocess_ctl(SpeexPreprocessState *st, int request, void *ptr);

        // Jitter Buffer API
        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern JitterBuffer *jitter_buffer_init(int step_size);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void jitter_buffer_reset(JitterBuffer *jitter);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void jitter_buffer_destroy(JitterBuffer *jitter);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void jitter_buffer_put(JitterBuffer *jitter, JitterBufferPacket *packet);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int jitter_buffer_get(JitterBuffer *jitter, JitterBufferPacket *packet, int desired_span, int *start_offset);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int jitter_buffer_get_another(JitterBuffer *jitter, JitterBufferPacket *packet);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int jitter_buffer_get_pointer_timestamp(JitterBuffer *jitter);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void jitter_buffer_tick(JitterBuffer *jitter);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void jitter_buffer_remaining_span(JitterBuffer *jitter, uint rem);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int jitter_buffer_ctl(JitterBuffer *jitter, int request, void *ptr);

        [DllImport("libspeexdsp.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int jitter_buffer_update_delay(JitterBuffer *jitter, JitterBufferPacket *packet, int *start_offset);

        // Utility methods

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern void speex_bits_init(SpeexBits* bits);

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_bits_reset(SpeexBits* bits);

        [DllImport("libspeex.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.Cdecl)]
        public static extern int speex_bits_destroy(SpeexBits* bits);

        // SpeexCodec data members

        private int frameSize;
        private int maxFrameSize;

        private void *encoderState;
        private SpeexBits encodedBits;

        private SpeexPreprocessState *preprocessState;
        
        private bool validJitterBits = false;
        private JitterBuffer *jitterBuffer = null;
        private byte[] encodedJitterFrame = new byte[2048];
        private int encodedJitterFrameLength;
        private int encodedJitterFrameErrorCode;
        
        // These are just used for logging
        public byte[] EncodedJitterFrame {
            get {
                return encodedJitterFrame;
            }
        }
        
        public int EncodedJitterFrameLength {
            get {
                return encodedJitterFrameLength;
            }
        }
        
        public int EncodedJitterFrameErrorCode {
            get {
                return encodedJitterFrameErrorCode;
            }
        }
        
        // Provide something to lock
        private class JitterBufferLockable {
        }
        private JitterBufferLockable jitterBufferLockable = new JitterBufferLockable();
        
        void *decoderState;
        SpeexBits decodedBits;

        public int SetOneCodecSetting(bool encoder, SpeexCtlCode setting, int value) {
            int retcode = 0;
            unsafe {
                int *intPtr = &value;
                if (encoder)
                    retcode = speex_encoder_ctl(encoderState, (int)setting, (void *)intPtr);
                else
                    retcode = speex_decoder_ctl(decoderState, (int)setting, (void *)intPtr);
            }
            return retcode;
        }
        
        public int GetOneCodecSetting(bool encoder, SpeexCtlCode setting, ref int value) {
            int retcode = 0;
            unsafe {
                fixed (int *intPtr = &value) {
                    if (encoder)
                        retcode = speex_encoder_ctl(encoderState, (int)setting, (void *)intPtr);
                    else
                        retcode = speex_decoder_ctl(decoderState, (int)setting, (void *)intPtr);
                }
            }
            return retcode;
        }
        
        public int SetOnePreprocessorSetting(PreprocessCtlCode setting, int value) {
            int retcode = 0;
            unsafe {
                int *intPtr = &value;
                retcode = speex_preprocess_ctl(preprocessState, (int)setting, (void *)intPtr);
            }
            return retcode;
        }
        
        public int SetOnePreprocessorSetting(PreprocessCtlCode setting, float value) {
            int retcode = 0;
            unsafe {
                float *floatPtr = &value;
                retcode = speex_preprocess_ctl(preprocessState, (int)setting, (void *)floatPtr);
            }
            return retcode;
        }
        
        public int SetOnePreprocessorSetting(PreprocessCtlCode setting, bool bValue) {
            int value = (bValue ? 1 : 0);
            int retcode = 0;
            unsafe {
                int *intPtr = &value;
                retcode = speex_preprocess_ctl(preprocessState, (int)setting, (void *)intPtr);
            }
            return retcode;
        }
        
        public int GetOnePreprocessorSetting(PreprocessCtlCode setting, ref int value) {
            int retcode = 0;
            unsafe {
                fixed (int *intPtr = &value) {
                    retcode = speex_preprocess_ctl(preprocessState, (int)setting, (void *)intPtr);
                }
            }
            return retcode;
        }
        
        public int GetOnePreprocessorSetting(PreprocessCtlCode setting, ref float value) {
            int retcode = 0;
            unsafe {
                fixed (float *floatPtr = &value) {
                    retcode = speex_preprocess_ctl(preprocessState, (int)setting, (void *)floatPtr);
                }
            }
            return retcode;
        }
        
        public int GetOnePreprocessorSetting(PreprocessCtlCode setting, ref bool value) {
            int retcode = 0;
            int intValue = 0;
            unsafe {
                retcode = speex_preprocess_ctl(preprocessState, (int)setting, (void *)&intValue);
            }
            value = (intValue == 0 ? false : true);
            return retcode;
        }
        
        public int SetOneJitterBufferSetting(JitterBufferCtlCode setting, int value) {
            int retcode = 0;
            unsafe {
                int *intPtr = &value;
                retcode = jitter_buffer_ctl(jitterBuffer, (int)setting, (void *)intPtr);
            }
            return retcode;
        }
        
        public int InitEncoder(int maxFrameSize, int samplesPerFrame, int samplingRate) {
            this.maxFrameSize = maxFrameSize;
            encodedBits = new SpeexBits();
            encoderState = speex_encoder_init_new(0);
            // Don't set VAD in the codec, because we're setting it in
            // the preprocessor instead
            fixed (int *fSize = &frameSize) {
                speex_encoder_ctl(encoderState, (int)SpeexCtlCode.SPEEX_GET_FRAME_SIZE, fSize);
            }
            fixed (SpeexBits *bitsAdd = &encodedBits) {
                speex_bits_init(bitsAdd);
            }
            preprocessState = speex_preprocess_state_init(samplesPerFrame, samplingRate);
            return frameSize;
        }
    
        public int PreprocessFrame(short[] sampleBuffer) {
            fixed (short *fixedSamples = sampleBuffer) {
                return speex_preprocess_run(preprocessState, fixedSamples);
            }
        }
        
        public int EncodeFrame(short[] inputFrame, byte[] outputFrame) {
            int encodedDataSize = 0;
            fixed (short *inputAdd = inputFrame) {
                fixed (SpeexBits *bitsAdd = &encodedBits) {
                    speex_encode_int(encoderState, inputAdd, bitsAdd);
                    fixed (byte* outputBytes = outputFrame) {
//                         encodedDataSize = speex_bits_write_whole_bytes(bitsAdd, outputBytes, maxFrameSize);
                        encodedDataSize = speex_bits_write(bitsAdd, outputBytes, maxFrameSize);
                    }
                }
            }
            fixed (SpeexBits *bitsAdd = &encodedBits) {
                speex_bits_reset(bitsAdd);
            }
            return encodedDataSize;
        }

        public void ResetEncoder() {
            fixed (SpeexBits *bitsToAdd = &encodedBits) {
                speex_bits_destroy(bitsToAdd);
            }
            if (encoderState != null) {
                speex_encoder_destroy(encoderState);
                encoderState = null;
            }
        }
    
        public void InitDecoder(bool useJitterBuffer, int stepSize, int frameSize) {
            this.frameSize = frameSize;
            decodedBits = new SpeexBits();
            decoderState = speex_decoder_init_new(0);
            fixed (SpeexBits *bitsDecode = &decodedBits)
            {
                speex_bits_init(bitsDecode);
            }
            if (useJitterBuffer) {
                jitterBuffer = jitter_buffer_init(stepSize);
                validJitterBits = false;
            }
        }

        const int FrameSize = 320;

        private static log4net.ILog logger = log4net.LogManager.GetLogger(typeof(SpeexCodec));

        public int DecodeFrame(byte[] inputToDecode, int encodedByteCount, short[] decodedFrame)
        {
            DecoderReadFrom(inputToDecode, encodedByteCount);
            return DecoderDecodeBits(decodedFrame);
        }

        public void DecoderReadFrom(byte[] inputToDecode, int encodedByteCount) {
            fixed (SpeexBits *bitsDecoder = &decodedBits) {
                fixed (byte *inputFrame = inputToDecode) {
                    speex_bits_read_from(bitsDecoder, inputFrame, encodedByteCount);
                }
            }
        }

        public int DecoderDecodeBits(short[] decodedFrame)
        {
            fixed (SpeexBits* bitsDecoder = &decodedBits)
            {
                fixed (short* outputFrame = decodedFrame)
                {
                    return speex_decode_int(decoderState, bitsDecoder, outputFrame);
                }
            }
        }

        public int DecoderDecodeNullBits(short[] decodedFrame)
        {
            fixed (short* outputFrame = decodedFrame)
            {
                return speex_decode_int(decoderState, null, outputFrame);
            }
        }

        public void ResetDecoder()
        {
            fixed (SpeexBits* bitsDecoder = &decodedBits)
            {
                speex_bits_destroy(bitsDecoder);
            }
            if (decoderState != null)
            {
                speex_decoder_destroy(decoderState);
                decoderState = null;
            }
            if (jitterBuffer != null)
            {
                jitter_buffer_destroy(jitterBuffer);
                jitterBuffer = null;
            }
        }

        // Jitter buffer wrapper API

        // Locking must be done at the application level to ensure
        // that two threads can't be in jitter buffer methods.
        // timestamp is a counter incremented once per "tick"
        public void JitterBufferPut(byte[] frame, int startIndex, uint byteCount, uint timestamp)
        {
            if (jitterBuffer == null)
                throw new Exception("JitterBufferPut: jitterBuffer is null!");
            lock (jitterBufferLockable)
            {
                JitterBufferPacket p = new JitterBufferPacket();
                unsafe
                {
                    fixed (byte* frameBytes = &frame[startIndex])
                    {
                        p.data = frameBytes;
                        p.len = byteCount;
                        p.timestamp = timestamp;
                        p.span = (uint)frameSize;
                        jitter_buffer_put(jitterBuffer, &p);
                    }
                }
            }
        }

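        // Decodes the next frame from the jitter buffer into decodedFrame.
        // When a packet is found, its _encoded_ length in bytes is stored
        // in encodedJitterFrameLength (the method itself returns nothing).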
        public void JitterBufferGet(short[] decodedFrame, uint timestamp, ref int startOffset)
        {
            int i;
            int ret;
            int activity = 0;

            if (jitterBuffer == null)
                throw new Exception("JitterBufferGet: jitterBuffer is null!");

            lock (jitterBufferLockable)
            {
                if (validJitterBits)
                {
                    // Try decoding last received packet
                    ret = DecoderDecodeBits(decodedFrame);
                    if (ret == 0)
                    {
                        jitter_buffer_tick(jitterBuffer);
                        return;
                    }
                    else
                        validJitterBits = false;
                }

                JitterBufferPacket packet = new JitterBufferPacket();
                packet.span = (uint)frameSize;
                packet.timestamp = timestamp;
                // The encoded buffer must be fixed, because
                // jitter_buffer_get refers to it through packet
                unsafe
                {
                    fixed (byte* pData = &encodedJitterFrame[0])
                    {
                        fixed (int* pStartOffset = &startOffset)
                        {
                            packet.data = pData;
                            packet.span = (uint)frameSize;
                            packet.len = 2048;
                            ret = jitter_buffer_get(jitterBuffer, &packet, frameSize, pStartOffset);
                        }
                    }
                }
                encodedJitterFrameErrorCode = ret;
                if (ret != (int)JitterBufferRetCode.JITTER_BUFFER_OK)
                {
                    // No packet found: Packet is late or lost
                    DecoderDecodeNullBits(decodedFrame);
                }
                else
                {
                    encodedJitterFrameLength = (int)packet.len;
                    DecoderReadFrom(encodedJitterFrame, encodedJitterFrameLength);
                    /* Decode packet */
                    ret = DecoderDecodeBits(decodedFrame);
                    if (ret == 0)
                        validJitterBits = true;
                    else
                    {
                        /* Error while decoding */
                        for (i = 0; i < frameSize; i++)
                            decodedFrame[i] = 0;
                    }
                }

                GetOneCodecSetting(false, SpeexCtlCode.SPEEX_GET_ACTIVITY, ref activity);
                if (activity < 30)
                    jitter_buffer_update_delay(jitterBuffer, &packet, null);
                jitter_buffer_tick(jitterBuffer);
            }
        }

    }
}
```
There is also another open-source library worth trying. I can no longer find it on GitHub, so I am uploading the code directly: [SpeexUtil.zip][4]

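Note that in all of this source code the audio encoders and decoders work in units of `short[]`, which matches the rule that `1 sample corresponds to two bytes`. Converting between `short[]` and `byte[]` only takes a memory copy.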
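The memory copy mentioned above can be sketched with `Buffer.BlockCopy`. This is only a minimal illustration: the `PcmBytes` class and its method names are hypothetical and not part of the original source. The copy keeps the platform's native byte order (little-endian on common x86 and ARM machines).

```csharp
using System;

// Hypothetical helper (not from the original code): converts between
// 16-bit PCM samples and raw bytes with a plain memory copy.
public static class PcmBytes
{
    // Copies the first sampleCount samples into a new byte array (2 bytes per sample).
    public static byte[] ToBytes(short[] samples, int sampleCount)
    {
        if (sampleCount < 0 || sampleCount > samples.Length)
            throw new ArgumentOutOfRangeException(nameof(sampleCount));
        byte[] bytes = new byte[sampleCount * sizeof(short)];
        // Buffer.BlockCopy takes offsets and the count in bytes, not in elements.
        Buffer.BlockCopy(samples, 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Copies the first byteCount bytes into a new sample array.
    public static short[] ToSamples(byte[] bytes, int byteCount)
    {
        if (byteCount < 0 || byteCount > bytes.Length || byteCount % sizeof(short) != 0)
            throw new ArgumentException("byteCount must be an even number within the buffer", nameof(byteCount));
        short[] samples = new short[byteCount / sizeof(short)];
        Buffer.BlockCopy(bytes, 0, samples, 0, byteCount);
        return samples;
    }
}
```

For example, `PcmBytes.ToBytes(decodedFrame, 160)` would turn 160 decoded samples into 320 bytes, and `PcmBytes.ToSamples` reverses the conversion.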

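That covers decoding `Speex` into `PCM`. After decoding, every two `PCM` frames are encoded into one `G726` frame, wrapped in the RTP payload packet format (the encoding type field in the header must match the codec actually used), and sent to the device. Why are two frames, 640 bytes in total, encoded into a single `G726` frame? Because the vendor document specifies 25 frames per second, that is, one frame every 40 ms, while Speex produces one frame every 20 ms. You can also see this from the other direction: each audio frame uploaded by the device decodes to `PCM` of length 640.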
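The pairing of two 20 ms frames into one 40 ms frame might look roughly like the sketch below. It assumes each decoded Speex frame holds 160 samples (20 ms at 8 kHz, 320 bytes), which is what the 640-byte figure implies. The `FramePairer` class is hypothetical, and the actual G726 encoding and RTP packing are left to whatever encoder and packet builder you use.

```csharp
using System;

// Hypothetical helper: joins two 20 ms decoded Speex frames into one
// 40 ms frame of the size the device expects.
public sealed class FramePairer
{
    public const int SpeexFrameSamples = 160;   // assumption: 20 ms at 8 kHz
    public const int DeviceFrameSamples = 320;  // vendor document: 320 sample points

    private readonly short[] pending = new short[DeviceFrameSamples];
    private int filled;

    // Adds one decoded Speex frame. Returns true and fills deviceFrame
    // once two frames (320 samples, 640 bytes) have been collected.
    public bool Push(short[] speexFrame, short[] deviceFrame)
    {
        if (speexFrame.Length != SpeexFrameSamples)
            throw new ArgumentException("expected 160 samples", nameof(speexFrame));
        if (deviceFrame.Length < DeviceFrameSamples)
            throw new ArgumentException("needs room for 320 samples", nameof(deviceFrame));

        Array.Copy(speexFrame, 0, pending, filled, SpeexFrameSamples);
        filled += SpeexFrameSamples;
        if (filled < DeviceFrameSamples)
            return false;

        Array.Copy(pending, deviceFrame, DeviceFrameSamples);
        filled = 0;
        return true;
    }

    // Encoded bytes per 40 ms frame, following the vendor formula
    // bit rate / 8 / 25 (for example 40 kbit/s gives 200 bytes).
    public static int EncodedFrameBytes(int bitsPerSecond)
    {
        return bitsPerSecond / 8 / 25;
    }
}
```

Each time `Push` returns true, the 320-sample buffer is ready to be handed to the G726 encoder. With the vendor formula, a 32 kbit/s stream gives 160 encoded bytes per frame and a 40 kbit/s stream gives 200.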

Finally, here is the Flash player code (the player and the intercom stream publishing are integrated together): [f4player-modified.zip][5]

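The whole flow is very simple:
Start the intercom -> the web page publishes the microphone audio stream via Flash -> send the intercom command to the device -> once the video forwarding server receives the device's audio stream, it starts pulling the web microphone audio -> decode the web microphone's Speex audio stream into PCM -> encode the PCM data as G726 (or another format) and wrap it in RTP packets -> forward to the device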

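However, the audio encoding and decoding involved was all new to me, so I ran into a lot of pitfalls.
What you need to learn:
 1. Encoding and decoding between G726 <-> PCM <-> AAC
 2. Parsing the Speex audio stream data format in RTMP
 3. Flash ActionScript programming
 4. Using the librtmp library (to pull the audio stream from RTMP), which in turn involves C programming (building and configuring its openssl and zlib dependencies is fairly tedious)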

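All in all, debugging this intercom feature involved four programming languages: Java (web and GPS servers), C# (video forwarding server), C (wrapping the librtmp library), and ActionScript (modifying the Flash player).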

My head hurts. Please give me a moment of peace...

[1]: https://github.com/EasyDarwin/EasyAACEncoder
  [2]: https://blog.csdn.net/simongyley/article/details/8469914
  [3]: https://0o0.me/usr/uploads/2018/12/1164427476.png
  [4]: https://0o0.me/usr/uploads/2018/12/2147469326.zip
  [5]: https://0o0.me/usr/uploads/2018/12/665081256.zip
