Abstract:
Improving the efficiency of biometric authentication through speech recognition and identification, and its use on mobile devices, has been one of the most heavily funded research areas in the computing industry worldwide. From the earliest speech recognition algorithms up to the present year, 2018, different strategies and combinations have been used to optimize results, with the aim of surpassing human recognition capacity and going beyond it.
For many years, various speech signal processing techniques have been tried and optimized using expectation maximization, gradient descent, or their variations across end-to-end speech feature extraction and recognition schemes, but the results remained below a satisfactory level despite the considerable time, cost and effort invested.
Very recently, large improvements in the computing power of devices have made it possible to use complex multi-layered neural network technologies (i.e., deep learning or deep neural networks) such as convolutional networks, long short-term memory, and bidirectional recurrent neural networks, as well as complex statistical or evolutionary strategies and their variations, to further optimize results and reduce error rates.
By combining various deep learning algorithms across end-to-end speech features and language modelling, some large companies and joint-venture investments have attained a notable achievement: their experiments have just surpassed human efficiency.
Even so, recognition efficiency is still far from what is needed for a practically useful and achievable optimal solution, one that performs equally well in noisy environments and under mutations of speech features.
This thesis focuses mainly on devising an efficient technique that reduces the time, cost and complexity of the large efforts made so far, so that future improvements can be made along this optimum path.
To this end, we find that text-independent speech recognition can be trained efficiently if deep learning is guided by a genetic algorithm (GA) that intelligently chooses the hyper-parameters of the networks.
Our experiments indicate that a series of iterations that estimate and re-estimate the hyper-parameters can lead to a better, optimal solution in far less time and at far lower cost. The runtime is estimated to be O(number of generations) instead of O(variations ^ network parameters), a substantial time saving compared with legacy processes for selecting among a series of deep learning networks.
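The estimate and re-estimate loop described above can be illustrated with a minimal sketch. It is not the thesis implementation: the search space, the `toy_fitness` function (a stand-in for the validation accuracy of a trained network), and all function names are assumptions made for illustration. The sketch only shows why the number of fitness evaluations grows with the number of generations, while an exhaustive grid grows with the product of the choices per hyper-parameter.

```python
import math
import random

# Hypothetical search space: a few candidate values per hyper-parameter.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3, 4],
    "hidden_units": [64, 128, 256, 512],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "dropout": [0.0, 0.25, 0.5],
}


def grid_size():
    """Number of configurations an exhaustive search would have to train."""
    return math.prod(len(values) for values in SEARCH_SPACE.values())


def toy_fitness(ind):
    """Stand-in for validation accuracy (assumption); the maximum is 0.0."""
    score = -abs(ind["num_layers"] - 3)
    score -= abs(ind["hidden_units"] - 256) / 256
    score -= abs(ind["learning_rate"] - 1e-3) * 100
    score -= abs(ind["dropout"] - 0.25)
    return score


def random_individual(rng):
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    # Uniform crossover: each hyper-parameter comes from one of the parents.
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}


def mutate(ind, rng, rate=0.2):
    # With probability `rate`, resample a hyper-parameter from its candidates.
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in ind.items()
    }


def evolve(fitness, generations=5, population_size=8, seed=0):
    """Return (best individual, best score, number of fitness evaluations)."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    best, best_score, evaluations = None, float("-inf"), 0
    for _ in range(generations):
        # Estimate: score every candidate (survivors are re-scored here too).
        scored = sorted(
            ((fitness(ind), ind) for ind in population),
            key=lambda pair: pair[0],
            reverse=True,
        )
        evaluations += len(scored)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        # Re-estimate: keep the top half, refill with mutated offspring.
        parents = [ind for _, ind in scored[: population_size // 2]]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return best, best_score, evaluations


if __name__ == "__main__":
    best, score, evaluations = evolve(toy_fitness)
    print("best configuration:", best)
    print("fitness:", score)
    print("evaluations:", evaluations, "vs. exhaustive grid:", grid_size())
```

In this sketch the cost is `generations * population_size` fitness evaluations, whatever the size of the search space. In a real setting each evaluation would mean training and validating a network, so the saving over an exhaustive grid becomes larger as more hyper-parameters are added.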
As a way forward, we suggest that more automated parameter fixing, followed by automated iterations, could be attempted in future implementations.