The quality of online health information is an important concern in healthcare. People now turn to the internet for information on every aspect of daily life, including health. Because the internet is open, anyone can easily create a website and publish any content. This includes health websites and health information on blogs or forums, which can be distributed without careful verification and could seriously affect people’s health. A recent study found that health information quality varied across medical domains and across websites, and concluded that overall health information quality remains problematic.
Health information quality is currently evaluated manually, using standard instruments with predefined criteria such as the JAMA benchmarks, the HONcode, and DISCERN. There is scope to develop a tool that automatically assesses the quality of health documents on the internet, saving users time and effort. The aim of this research is to demonstrate that natural language processing and machine learning techniques can be used to improve the measurement of information quality in health documents, utilising existing metrics as well as new content-based metrics. We also aim to investigate whether such metrics can capture a deeper semantic notion of the reliability of health information documents.