Programming+ : How Siri listens to you, answers, and gets things done

Just by speaking to it, Siri adds appointments to the calendar, makes phone calls, and searches for popular restaurants nearby. In a TV commercial, Cookie Monster says "Hey Siri, set a timer for 14 minutes" (showing that Siri can be used even when your hands are dirty), which became a topic of conversation.

I think Siri has made smartphones more convenient and more appealing ever since it was introduced in iOS 5 with the launch of the iPhone 4S in 2011. What's more, with iOS 10 APIs have finally been opened to developers (although the functions available are still very limited), so apps that work together with Siri are also something to look forward to.

Besides Siri, services that let you search and operate a device by voice are appearing one after another: Google voice search and voice actions (available on Android, iOS, and the web, and familiar from "OK Google"), NTT DOCOMO's "Shabette Concier", Microsoft's "Cortana", Yahoo!'s "Voice Assist", and more.

Following the previous installment, "How a message is delivered on LINE", this time let's look at how Siri works. As before, the full picture of Siri's internals is not publicly disclosed, so please keep in mind that what follows is a rough guess based on observable behavior — "roughly speaking, it probably works something like this."

You can read about the CALO project at DARPA (the U.S. Defense Advanced Research Projects Agency), which was the origin of Siri, in the paper "Design and Implementation of the CALO Query Manager" (2006).

The one you are talking to is a server across the network

First of all, a basic point: when you talk to Siri, your smartphone is not producing the replies all by itself.

Siri is not running on a smartphone alone.

This can also be confirmed from the fact that Siri itself cannot be used when "airplane mode" is turned on and the network is completely cut off.

If it is not connected to the network, Siri itself cannot be used.

In other words, when you talk to Siri, you are connecting to an external service over the network. The party you are conversing with is not the iPhone itself, but an Apple server at the far end of the network, reached via the iPhone.

Siri accesses the server via a network.

As an aside, if you use a network packet capture and analysis tool such as Wireshark, you can see that the server being connected to is guzzoni.apple.com.
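
If you want to see this for yourself, here is a rough sketch of the idea. It assumes the Python library scapy and root privileges, and guzzoni.apple.com is simply the host name observed above, not an officially documented endpoint. It prints DNS queries for Apple hosts while you start Siri.

    # A rough sketch, assuming the scapy library (pip install scapy) and
    # root privileges: print DNS queries for Apple hosts while Siri starts.
    from scapy.all import sniff, DNS, DNSQR

    def show_query(pkt):
        # qr == 0 means this DNS packet is a query, not an answer
        if pkt.haslayer(DNSQR) and pkt[DNS].qr == 0:
            name = pkt[DNSQR].qname.decode(errors="replace")
            if "apple" in name:
                print(name)   # e.g. guzzoni.apple.com.

    sniff(filter="udp port 53", prn=show_query, store=False)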

What is the data sent to the server?

Here a question arises: when you speak to Siri on the iPhone, what exactly is sent to the server?

Since Siri cannot be used on anything other than an iPhone, it is thought that a device-specific identification number is sent when communication with the server begins, right after Siri is started. Looking at the packets with Wireshark as before, the iPhone does indeed appear to connect to the Apple server and exchange data immediately after Siri starts (the traffic is encrypted, so the contents cannot be inspected). In any case, it seems that device authentication and connection authentication with the server take place here, before Siri is actually used.

The iPhone connects to the Siri server, and once the connection is established, this screen appears.

Now, when you actually speak, what is sent to the server at that moment? Is it text data — the spoken content already converted into a string on the iPhone? Or is it the audio data itself?

Certainly, the iPhone keyboard has a voice-dictation button, so you might well think that the speech is converted to text on the iPhone and the text is sent to the server.

However, when the network cannot be used at all, as in airplane mode, you can confirm that the voice-dictation button is grayed out and unusable (in fact this is the case for Japanese; for some languages, such as English, text input by voice recognition works even without a network).

When not connected to the network, Japanese voice recognition itself cannot be used.

In other words, when you use Siri, the audio itself — your speech to the iPhone — is sent to the server. The server then converts the audio data into string data, sends it back to the iPhone, and the string is displayed on the iPhone.

When you talk to Siri, the audio data is sent to the server and converted into a string there.

What kind of processing happens on the server? (1)

So how is the audio converted on the server? First, the data as audio is turned into data as text. This is called speech recognition. Siri is said to use speech recognition technology from Nuance Communications.

Generally, speech recognition uses an acoustic model and a language model. The acoustic model is built from a large amount of speech waveform data together with transcriptions of that speech as text. It learns that such-and-such a waveform corresponds to "a" and such-and-such a waveform corresponds to "i", and it is used to convert the waveform you spoke into phonetic elements (hiragana, letters of the alphabet, phonetic symbols, and so on). To recognize the utterances of any speaker correctly, it processes a large amount of audio data statistically (for this kind of sound, the probability of this phoneme is high, and so on).

The language model, on the other hand, consists of a dictionary of the words themselves and a dictionary that statistically represents knowledge about how words connect. Using these, the sequence of phonemes is converted into the most plausible sentence by statistical processing (for this sequence, the probability of this kanji or kana is high) — that is, into a character string.

Using a large amount of data statistically, audio data is converted into text data.
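
To make the idea concrete, here is a deliberately tiny toy in Python — not Siri's actual implementation, and the probabilities are made up — in which an "acoustic" score and a "language" score are multiplied together and the highest-scoring candidate wins. That is the essence of the statistical processing described above.

    # A toy illustration, not Siri's implementation: an "acoustic" score
    # (how well a word matches the observed sound) is combined with a
    # "language" score (how plausible the word is in context), and the
    # candidate with the highest combined probability wins.
    acoustic = {"遠藤": 0.7, "園藤": 0.3}                    # both sound like "endō"
    language = {("<s>", "遠藤"): 0.8, ("<s>", "園藤"): 0.2}  # plausibility after start of sentence

    def best_word(prev):
        # score = acoustic probability x language-model probability
        return max(acoustic, key=lambda w: acoustic[w] * language[(prev, w)])

    print(best_word("<s>"))   # -> 遠藤 (the more plausible transcription wins)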

Unlike languages such as English, where word boundaries and parts of speech are clear, Japanese writes nouns, verbs, particles, and so on as one continuous string, so the string must first be split into words. This is called morphological analysis. "MeCab" is famous for Japanese morphological analysis, and it is also used inside iOS and macOS.

The result of actually performing morphological analysis with the open-source software MeCab.

The output may look a bit cryptic (laughs), but you can see that the input sentence has been analyzed and split into its parts of speech.
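
If you want to try morphological analysis yourself, a minimal sketch follows. It assumes the mecab-python3 package and a Japanese dictionary are installed on your machine.

    # A minimal sketch, assuming the mecab-python3 package and a Japanese
    # dictionary are installed.
    import MeCab

    tagger = MeCab.Tagger()
    # Each output line shows one token with its part of speech, reading, etc.
    print(tagger.parse("遠藤さんに打ち合わせに遅れますとメールを送って"))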


Up to this point, the utterance "Endō-san ni uchiawase ni okuremasu to mēru wo okutte" has been converted into the Japanese character string meaning "Send an email to Endo-san saying I will be late for the meeting."

The story may have seemed a little difficult, but in short, these processes are carried out on the Siri server: the audio is converted into a text string almost instantly, sent back to the iPhone repeatedly while you are still speaking, and the content of the utterance is displayed. It is fun to watch the sentence change moment by moment as you talk — and this is what is happening behind the scenes.

What kind of processing happens on the server? (2)

Now the speech has been converted into text data, but of course that is not the end. Next comes the process of reading the meaning of what was said. The main steps are syntax analysis and semantic analysis.

Syntax analysis is the process of working out how the words obtained by splitting the utterance relate to one another. For Japanese, this is also called dependency (kakari-uke) analysis. Let's analyze the sentence with CaboCha, which, like MeCab, is open-source software developed in Japan.

The result of actually running the analysis with the open-source software CaboCha.

It may look a bit confusing, but it means the following:

"to Endo-san" → "send"

"for the meeting" → "will be late"

"(saying) I will be late for the meeting" → "send"

"an email" → "send"

In this way, processing is performed to grasp the logical structure of the sentence. Doing this helps the system understand the meaning of the sentence more accurately.
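
You can try dependency analysis yourself as well. The sketch below assumes the CaboCha Python binding is installed; the exact API name may differ slightly depending on your installation.

    # A sketch assuming the CaboCha Python binding is installed; the exact
    # API may differ slightly depending on your installation.
    import CaboCha

    parser = CaboCha.Parser()
    # Prints the dependency structure (which chunk modifies which) as text.
    print(parser.parseToString("遠藤さんに打ち合わせに遅れますとメールを送って"))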

And finally, semantic analysis is performed. This is the process of reading the meaning of the sentence — more precisely, of working out what the speaker (the user) wants to do (and what can be done with the iPhone). Current Siri is said to use neural networks and machine learning technology here. It is truly impressive that all of the processing up to this point takes no more than a few seconds.

Incidentally, Siri and its peers are often described as examples of AI (artificial intelligence), but they do not really possess intelligence (setting aside the philosophical question of what intelligence is). Everything described so far is extremely sophisticated, high-speed information processing and statistical processing; after that, the request is matched to the most appropriate of the "patterns for operating apps and the OS" prepared in advance by Apple's developers.

(There are a great many such patterns.)

In this way, the result of the semantic analysis determines which of the operation patterns provided by Apple the request corresponds to. If the request is judged to be an information search rather than an app operation, the Siri server may obtain an answer using an appropriate external service; many such questions appear to be handed to the question-answering system Wolfram|Alpha.
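
As a rough illustration of this "match the request to an operation pattern" step, here is a toy Python sketch. The pattern names and keywords are invented for this example; Apple's actual patterns are not public.

    # A deliberately simplified sketch of matching a request to an
    # "operation pattern". Pattern names and keywords are made up here.
    PATTERNS = {
        "send_email": {"keywords": ["メール", "送って"], "app": "Mail"},
        "set_alarm":  {"keywords": ["アラーム", "セット"], "app": "Clock"},
    }

    def match_intent(tokens):
        # pick the pattern whose keywords overlap most with the utterance
        def score(name):
            return sum(k in tokens for k in PATTERNS[name]["keywords"])
        best = max(PATTERNS, key=score)
        return best if score(best) > 0 else None   # None -> search or chat

    tokens = ["遠藤さん", "に", "打ち合わせ", "に", "遅れます", "と", "メール", "を", "送って"]
    print(match_intent(tokens))   # -> "send_email"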

And if the request matches no command and no search, it is handed to a conversation engine made up of a huge number of rules of the form "if the user says this, reply with that" — which is what produces Siri's many humorous answers in casual chat. Siri can feel as if it converses freely, like a person, but in fact even those replies are, oddly enough, built in ahead of time in the same way. Some readers may be reminded of the well-known Japanese term "artificial non-intelligence" (jinkō munō; see the Q&A at the end of this article).
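
The following is a tiny ELIZA-style sketch of such a rule-based conversation engine — a handful of "if the utterance contains this, answer with that" rules plus a fallback reply. Real engines work the same way, just with far more rules.

    # A tiny ELIZA-style rule engine: scan the utterance for a keyword and
    # return a canned reply; if nothing matches, fall back to a stock phrase.
    import re

    RULES = [
        (re.compile(r"眠い|tired"),       "You should get some rest."),
        (re.compile(r"こんにちは|hello"), "Hello! What can I do for you?"),
    ]

    def reply(utterance):
        for pattern, answer in RULES:
            if pattern.search(utterance):
                return answer
        return "Interesting. Tell me more."   # fallback reply

    print(reply("こんにちは"))   # -> "Hello! What can I do for you?"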

What is the information returned from the server?

Now, returning to the example from the beginning, the request only needs to be reduced to a concrete "instruction". In this case it boils down to something like "using the Mail app, send an email to Endo-san about being late for the meeting", and this content is sent back to the iPhone from the server.

From the contacts registered on the iPhone, an entry that seems to correspond to "Endo-san" is looked up, and the destination email address is obtained from it. In this case, of the information needed to send an email (recipient, subject, body, and the app to use), only the body was judged to be still undetermined.

In response, the "Mail" app has a program that works with Siri to prompt for voice input of whatever is missing (if the body is not specified, ask for the body; if the recipient is not specified, ask for the recipient; and so on). As a result, the message "I understand. What would you like the email to say?" appears on the screen.

After all this processing, we arrive at the state of "What would you like the email to say?"

Similarly, if the iPhone contacts contain two or more people with the surname Endo, the recipient cannot be determined, so Siri prompts you to choose: "Which Endo did you mean?"
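
Putting the last two paragraphs together, the behavior can be sketched as a simple "ask for whatever is still missing" check. The slot names below are illustrative only and are not Apple's actual data model.

    # A sketch of "ask for whatever is still missing". Slot names are
    # illustrative only, not Apple's actual data model.
    REQUIRED_SLOTS = ["recipient", "subject", "body", "app"]

    def next_prompt(slots):
        if len(slots.get("recipient_candidates", [])) > 1:
            return "Which one did you mean?"    # e.g. "Which Endo?"
        for name in REQUIRED_SLOTS:
            if not slots.get(name):
                return f"What would you like for the {name}?"
        return None   # everything is filled in; the email can be composed

    slots = {"recipient": "遠藤さん", "subject": "打ち合わせに遅れます", "app": "Mail"}
    print(next_prompt(slots))   # -> "What would you like for the body?"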

A combination of advanced technologies, yet the principle is surprisingly simple?

This time, while inferring Siri's operating principles, we have taken a rough look at speech recognition and natural language processing. Recent voice dialogue systems skillfully combine the results of more than half a century of natural language processing research with newer technologies such as machine learning and neural networks to improve efficiency and accuracy. It is not the kind of program you could write in a day, but if you think of Siri and its kind as the culmination of all these accumulated techniques and knowledge, perhaps even its occasionally mysterious answers will feel a little different.

Let's ask some questions: Q&A

Q. The term "artificial non-intelligence" (jinkō munō) came up in the explanation. What is it?

a."Artificial incompetence" (or "artificial cerebral") is called "Chatterbot" in English.It means "talk bot" ("bot" is a short -rated "robot").

The program said to be the original chatterbot is "ELIZA", created in 1966 by Joseph Weizenbaum. It was written to converse with a human via the keyboard, and reportedly quite a few people carried on serious conversations with it even after being told that the other party was a computer.

In reality, however, the program merely extracted keywords from the sentence the human typed, transformed them according to predetermined rules, and returned a reply. At first glance it looks like artificial intelligence, but because it has no intelligence at all, in Japanese it came to be called "artificial non-intelligence" as a play on "artificial intelligence".

(The source code of a version of the same program as ELIZA, written in the Emacs Lisp programming language, can be read at https://www.csee.umbc.edu/courses/471/papers/emacs-doctor.shtml, and a JavaScript version can be tried at http://www.masswerk.at/elizabot/.)

Later, chatterbots that feel more natural (though it would still be hard to say they have intelligence) were developed, including some with functions that learn automatically from past conversations, and several have been developed for Japanese as well.

More recently, Microsoft's "high-school-girl AI" Rinna has also become a topic of conversation. It is built by combining the Bing search engine, a machine learning platform, big-data analysis, and other technologies.

Meanwhile, attention has turned to using this artificial non-intelligence approach for "conversation robots that accept and process instructions sent as messages", and this year in particular, environments for easily building LINE bots and Facebook Messenger bots have been put in place. For Slack, the chat tool popular with engineers, there is also a bot-building framework called Hubot.

It seems that for many people, giving instructions through messenger-style text or by voice is easier than operating an app by filling in input fields on a screen.

Surprisingly many reference examples can be found on the web, so you can easily build your own artificial non-intelligence bot. Give it a try and have fun with it.

Born in 1970. Withdrew from the doctoral program at the Graduate School of Engineering Science, Osaka University. After working as a research assistant at the Faculty of Science and Technology, Ryukoku University, and then at Red Hat, he has worked at ヴァインカーブ (Vine Curve) on consulting, development and construction of custom systems, R&D related to open source, and writing books and manuscripts. Freelance since 2014. Deputy representative of Project Vine, the organization that develops Vine Linux. Contributed as a volunteer to the Japanese localization of the photo app "Instagram". In 2015, published "If You Want Your Child to Become a Billionaire, Teach Them the Basics of Programming" (KADOKAWA / Media Factory).