Microsoft patented techniques for generating voice with predetermined emotion type

Computer speech synthesis is increasingly becoming prevalent in the modem computing devices. For example, smartphones are expected to offer an intelligent personal assistant interface for a user of the smartphone, providing services such as answering user questions and providing reminders or other useful information.

In the existing speech synthesis systems, the emotional content is absent from the speech rendering of an underlying text.

Microsoft has patented techniques for generating voice with predetermined emotion type as shown in Table below.

First, multiple emotionally diverse candidate speech segments associated with a given semantic content are generated.

The table in the figure above shows four candidate speech segments, which offer a diversity of emotional content corresponding to the specified semantic content (“The Red Sox have won the World Series”), in that each candidate speech segment has text content and heuristic characteristics that will provide the listener with a perceived emotional content distinct from the other candidate speech segments.

Thereafter, the candidate speech segments are ranked according a predetermined criteria. The highest ranked candidate speech segment is converted into audio form and played to the user.

Patent Information
Publication number: US 20160071510
Patent Title: Voice Generation With Predetermined Emotion Type
Publication date: 10 Mar 2016
Filing date: 8 Sep 2014
Inventors: Chi-Ho Li; Baoxun Wang; Max Leung;
Original Assignee: Microsoft Corporation

US20160071510