Current big data projects in many disciplines, including biomedicine, reach petabyte or even exabyte scale and pose great challenges for storing, managing, analyzing, and processing data. Alongside today’s trend toward open data, open science, and open platforms, sharing and standardizing diverse, heterogeneous data while ensuring data security raises further challenges that need to be addressed. Developing a system for managing, analyzing, and sharing data that ensures portability, scalability, and reproducibility has become urgent and has been emphasized in a number of recent major biomedical projects. The 1000 Vietnamese Genome Project and other big data projects, ongoing and planned, at Vingroup Big Data Institute face similar challenges; developing such a system is therefore particularly important to ensure the long-term efficiency of these projects.
The overall objective of this project is to develop a system for the management, analysis, and sharing of large datasets (MASH), which initially focuses on health data and will gradually be expanded to other data sources. MASH needs to (1) work with the data models of each project and integrate with that project’s analysis workflows; (2) adapt flexibly to changes in data models and analysis workflows; (3) provide a front end for importing, exporting, and displaying relevant data, and a back end for indexing, storing, managing, and securing large-scale data, where individual datasets could be up to terabytes in size and the total capacity could reach petabytes or even exabytes; (4) be built on the most advanced open-source technologies available to ensure portability, scalability, and reproducibility in managing and analyzing big data; and (5) be deployable on-premises or in the cloud.
MASH makes it easy, convenient, and fast to manage, share, explore, visualize, and analyze data. It allows partners, including bioinformatics and biomedical researchers, data scientists, doctors, and students, to search, explore, and analyze data on the website, saving significant time and money in their research. Partners can also upload their own data to the system, use the system’s resources and services to perform analyses, and share data with the community. Through MASH, partners can perform specialized analyses using readily available features, without prior knowledge of programming, visualization, or in-depth data analysis. MASH is developed and deployed with multiple layers of security to protect data integrity and security as well as partners’ privacy.
This is currently the largest system for biomedical data management, analysis, and sharing in Vietnam, built on the most advanced technologies in the world. The system is expected to become one of the most valuable reference portals for applied biomedical research and development, benefiting the community of researchers and professionals both in Vietnam and around the world.
- CAPABLE OF PROCESSING MILLIONS OF GIGABYTES OF DATA WITH TENS OF THOUSANDS OF SAMPLES
- CAPABLE OF PROVIDING A WHOLE-GENOME-ANALYSIS SERVICE WITH HIGH ACCURACY IN LESS THAN A DAY
- SUPPORTS BIOMEDICAL RESEARCHERS, DOCTORS, AND GENETIC SPECIALISTS IN DETERMINING DISEASE RISKS AND DRUG SIDE EFFECTS