Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on January 2nd, 2024 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on January 31st, 2024.
  • The first revision was submitted on March 1st, 2024 and was reviewed by 3 reviewers and the Academic Editor.
  • The article was Accepted by the Academic Editor on March 11th, 2024.

Version 0.2 (accepted)

· Mar 11, 2024 · Academic Editor

Accept

I confirm that the authors have addressed all of the reviewers' comments. The second-round reviewer reports indicate the same.

[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]

Reviewer 1 ·

Basic reporting

OK, my concerns have been addressed.

Experimental design

My concerns have been addressed

Validity of the findings

The authors have addressed my concerns

Reviewer 2 ·

Basic reporting

The authors have revised the manuscript according to the previous comments. I am grateful to the authors for carefully addressing each comment and am happy to suggest the acceptance of the article.

Experimental design

No further comments

Validity of the findings

No further comments

Reviewer 3 ·

Basic reporting

The Authors have successfully addressed the comments mentioned earlier.

Experimental design

No more changes

Validity of the findings

No further changes suggested

Additional comments

No more changes are required

Version 0.1 (original submission)

· Jan 31, 2024 · Academic Editor

Major Revisions

The authors should revise the article in view of the comments and provide a detailed response letter with the revised submission.

**PeerJ Staff Note:** Please ensure that all review and editorial comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at copyediting@peerj.com for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff

Reviewer 1 ·

Basic reporting

Introduction:

While listing the key contributions of the proposal, the authors have not clearly explained which research gaps they fill and how. For example, please explain clearly why the existing real-time model needs improvement.

Incremental Clustering (Method)

The authors improve an existing incremental clustering approach. Here again, it would be good to elaborate on which aspects of the existing method the proposed approach improves.

Results and discussion:
The authors mention that their method is more suitable for practical real-time applications but never elaborate on how.

Experimental design

Introduction:
Similar to the reporting issue, while listing their main contributions the authors highlight that their method is more suitable for practical real-time applications, but they never follow up with any quantitative measurements to support the claim.

Experimental Setup (Experimental Setup)
The authors manually set many threshold values in their method without providing any formal validation or motivation for the values chosen, e.g.:
a. Advance step duration
b. Probability value for identifying silence
c. Similarity measure value for a new speaker


Results and Discussion:
The authors acknowledge that their method is suitable only when there are fewer than three speakers. Can they list application scenarios in which this constraint holds, especially considering that the method is presented as a speaker diarization method?

Validity of the findings

Baseline (Experimental Setup)
The authors must justify, with reasoning, their selection of the baseline method they adopted.

Results and Discussion:
The proposed method achieves better results than the baseline only for two speakers, and worse results when there are more than two. I think this is a serious concern, and the authors must explain clearly why their method is still relevant.

Reviewer 2 ·

Basic reporting

In the abstract, the authors state that previous schemes often fall short in speech recognition systems. How do the authors support this claim? Is there any previous work that the authors have discussed anywhere in the introduction or related work?
Much of the wording in the introduction appears to be AI-generated. Did the authors use AI solely for language purposes? If so, there should be a declaration statement at the end of the paper. Moreover, I suggest using simpler vocabulary so that novice readers can gain insight into the novelty of this work.
There is no related work section; how will readers distinguish the proposed work from recent benchmarks?
The authors did not provide any insights about the dataset classes, variations, missing values, or other details that are crucial to the reproducibility of this work.
How is the Whisper model superior to recent speech recognition models? A clear discussion of the superiority of the proposed framework is missing.
In the proposed Whisper model, the authors use a 1D convolutional network, but there is no clear description of its applicability, layers, dropout, neurons, channels, and other details.

Experimental design

The experimental design needs a clearer description with more details, such as the dataset classes, neural network details, etc.

Validity of the findings

The results seem to be valid.

Reviewer 3 ·

Basic reporting

In this paper, titled “Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation”, the authors attempt to develop a real-time multilingual speech recognition and speaker diarization system leveraging OpenAI's Whisper model. The authors are advised to address the following comments while revising the paper.


The literature review conducted in this paper is not sufficient. The authors should add more literature on speech recognition in general (English, Mandarin, Urdu, Arabic, etc.) and then provide a more focused review of the literature on accented Mandarin. It would be interesting to see how accent is studied for other languages.

Experimental design

The title reflects that the work is multilingual. What about generalization to other accents and languages? Since the focus is on Mandarin speech with Taiwanese accents, it would be valuable to assess the generalization of the model to other languages and accents, and to present and discuss the results.

Validity of the findings

a. The results and discussion need to be elaborated in more detail and, where possible, compared with existing studies.
b. Mention the weaknesses/limitations of this study.

Additional comments

The authors need to review the manuscript carefully before submitting the revision.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.