VinBigdata shares 100 hours of speech data with the community

To help build a scientific playground for the Speech and Language Processing community in Vietnam, Vingroup Big Data Institute (VinBigdata) has shared two Vietnamese datasets, supporting VLSP in organizing its 2020 ASR challenge.

One of the two datasets shared by VinBigdata is the speech corpus for the automatic speech recognition (ASR) task in VLSP-2020. This small training dataset of 100 hours, named VinBigdata-VLSP2020-100h, was created specifically for Task-01 of the international workshop VLSP-2020. Utterances are stored as audio files in WAV format, with text files containing the corresponding transcripts. The dataset covers two speech styles. The first is read speech (about 20 hours): speakers read manually prepared transcripts into their smartphones in a variety of environments, on topics such as news, stories, and wiki articles. The second is spontaneous speech (about 80 hours), which was crawled from open sources and manually transcribed with an accuracy of 96%. You can download the speech corpus here.
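
For readers who want to inspect such a corpus, the pairing of audio files and transcripts can be sketched roughly as follows. This is only an illustration: the directory layout, the same-name .wav/.txt convention, and the function names load_utterances and total_hours are assumptions made for this example, not details confirmed by the release.

```python
import wave
from pathlib import Path


def load_utterances(corpus_dir):
    """Pair each .wav file with a transcript of the same name.

    Assumption: every utterance "x.wav" sits next to a UTF-8 text file
    "x.txt". The real corpus layout may differ.
    """
    utterances = []
    for wav_path in sorted(Path(corpus_dir).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.exists():
            continue  # skip audio that has no transcript file
        with wave.open(str(wav_path), "rb") as wav_file:
            # duration in seconds = number of frames / frames per second
            duration = wav_file.getnframes() / wav_file.getframerate()
        transcript = txt_path.read_text(encoding="utf-8").strip()
        utterances.append((wav_path, transcript, duration))
    return utterances


def total_hours(utterances):
    """Sum the durations of all loaded utterances, in hours."""
    return sum(duration for _, _, duration in utterances) / 3600.0
```

Calling total_hours(load_utterances(path)) on a local copy gives a quick check of how much audio was actually loaded.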

The other shared dataset is for English-Vietnamese machine translation. The machine translation shared task has only one track: text translation from English to Vietnamese in the news domain. The training data consists of two kinds of corpora. The parallel corpora are UTF-8 plain text, 1-to-1 sentence aligned, with one sentence per line. They include an in-domain news dataset of 20k samples (80% in the training set, 10% in the dev set, and 10% in the test set) and out-of-domain parallel datasets totaling roughly 4M samples, such as openSub (3.5M), ted-like (55k), evbcorpus (45k), wiki-alt (20k), and basic (8.8k). The monolingual corpora are also UTF-8 plain text, with one “sentence” per line, and include 2M Vietnamese samples crawled from the web. The parallel corpora are now available here. You can also download the monolingual corpora here.
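
The one-sentence-per-line, 1-to-1 aligned format described above can be read with a few lines of code. The sketch below is illustrative only: the file paths are supplied by the caller, the function names read_parallel and split_80_10_10 are hypothetical, and the shuffled split merely reproduces the stated 80/10/10 proportions (the released in-domain data may already come pre-split).

```python
import random


def read_parallel(src_path, tgt_path):
    """Read two line-aligned UTF-8 files into (source, target) pairs."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        src_lines = [line.rstrip("\n") for line in src]
        tgt_lines = [line.rstrip("\n") for line in tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("files are not 1-to-1 sentence aligned")
    return list(zip(src_lines, tgt_lines))


def split_80_10_10(pairs, seed=0):
    """Shuffle pairs and split them into train/dev/test (80/10/10)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # integer arithmetic avoids floating-point rounding surprises
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

With 20k in-domain pairs, this split would yield 16,000 training, 2,000 dev, and 2,000 test pairs.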

VLSP 2020 is expected to be held in Hanoi in December. Since 2012, the VLSP community has held annual activities to share applied research results, tools, and resources in the field of language processing, and to plan development strategies for the community. Its annual seminars attract hundreds of participants, and nearly 5,000 members have joined the VLSP community's Facebook forum.

