Medical-Knowledge-Graph
- This repository presents our research team’s work on constructing a knowledge graph for the medical domain. It relies mainly on simple statistical methods, namely naive Bayes and NoisyOR.
A knowledge graph construction method for smart medical care
- Technical field
- Background technique
- Contents of the invention
- Main contribution of this article
Technical field
The invention belongs to the field of knowledge graph technology. It relates to a method for constructing a knowledge graph and to the learning of the associated statistical models.
Background technique
With the rapid development of artificial intelligence, key problems in knowledge graphs such as knowledge extraction, representation, fusion, reasoning, and question answering have been partly solved, with notable breakthroughs. Knowledge graphs have become a new hotspot in the field of knowledge services and have attracted wide attention from scholars and industry at home and abroad.
Knowledge graphs are a frontier research problem in intelligent big data. They suit the information age through several technical advantages: incremental design of data schemas; good data integration; support from existing standards such as RDF and OWL; and capabilities for semantic search and knowledge reasoning. In the medical field, the development of regional health information systems and medical information systems has accumulated a large amount of medical data. How to extract information from this data and then manage, share, and apply it is the key to advancing medical intelligence. It underpins medical knowledge retrieval, clinical diagnosis, medical quality management, and the intelligent processing of electronic medical records and health records.
In recent years, demand for clinical decision support and computer-aided diagnosis systems has increased dramatically. Existing platforms and systems rely on knowledge bases that are either edited manually by large numbers of professionals or generated from simple statistics. By constructing a knowledge graph for the medical field, medical knowledge can be extracted from massive data and managed, shared, and applied reasonably and efficiently. This work mines disease-symptom and disease-examination relationships directly from electronic medical records, and it studies and implements a construction process for a medical knowledge graph based on those records. The result provides a basis for downstream applications such as intelligent question answering and disease diagnosis.
Electronic medical records generally include disease records, examination and test results, medical orders, nursing records, and so on, and they contain both structured information and unstructured free text. The process starts with the preparatory work of data collection and data processing. On that basis, maximum likelihood estimation of two probabilistic models, naive Bayes and a Bayesian network using NoisyOR, is used to learn entity relationships automatically. A knowledge graph of disease-symptom and disease-examination relationships is then derived from the extracted entities and the learned relationship weights.
Contents of the invention
This study explores an automated process for constructing a Chinese medical knowledge graph from electronic medical records, linking each disease to the symptoms it may cause and to the related medical examinations.
1. Data collection and preparation
1.1 Extract concepts from electronic medical records
We extracted positively mentioned diseases and symptoms (concepts) from both structured and unstructured data. The structured data consists of ICD-9 (International Classification of Diseases) diagnosis codes. The unstructured data consists of the chief complaint, triage assessment, nursing notes, and physician comments. The triage assessment is the free-text nursing assessment recorded at triage.
The set of diseases and symptoms considered is taken from GHKG, which establishes a basis for later comparison. We use string matching to search for concepts by their common names, aliases, or acronyms, where the aliases and acronyms come from GHKG and from known mappings in the Unified Medical Language System (UMLS). Similarly, if a concept is linked to an ICD-9 code, we search for that code in the structured administrative data of the record. Mentions that fall within the scope of a negation are not counted.
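The matching step described above can be sketched as follows. This is a minimal illustration, not the extraction code used in the study: the alias table, the negation cue list, and the fixed three-token negation window are all assumptions made for the example, and a real system would load aliases from GHKG and UMLS and use a proper negation detector.

```python
import re

# Hypothetical alias table: canonical concept -> surface forms to match.
ALIASES = {
    "fever": ["fever", "febrile", "pyrexia"],
    "influenza": ["influenza", "flu"],
    "headache": ["headache", "head pain"],
}
# Hypothetical negation cues and window size (assumed, not from the source).
NEGATION_CUES = {"no", "denies", "without", "not"}
NEGATION_WINDOW = 3  # tokens after a cue that are treated as negated


def extract_positive_concepts(text):
    """Return the concepts mentioned in `text` outside any negation window."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    # Mark the token positions that follow a negation cue.
    negated = set()
    for i, token in enumerate(tokens):
        if token in NEGATION_CUES:
            negated.update(range(i + 1, i + 1 + NEGATION_WINDOW))

    found = set()
    for concept, forms in ALIASES.items():
        for form in forms:
            form_tokens = form.split()
            width = len(form_tokens)
            for i in range(len(tokens) - width + 1):
                span = range(i, i + width)
                if tokens[i:i + width] == form_tokens and not any(k in negated for k in span):
                    found.add(concept)
    return found


note = "Patient reports fever and headache, denies flu symptoms."
print(extract_positive_concepts(note))
```

In this example "flu" is skipped because it falls inside the window opened by "denies", while "fever" and "headache" are kept as positive mentions.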
1.2 Google Health Knowledge Graph (GHKG)
A novel aspect of our research is the use of an extensive, manually curated health knowledge graph provided and licensed by Google. The Google Health Knowledge Graph was first released in 2015 to help users make health decisions. Google created the graph through a multi-step process that combined data mining techniques with extensive manual curation by a team of experts. The graph is intended for patients who search Google for health information (i.e., it is patient-facing).
We use a subset of GHKG for which our data provides sufficient support. We define sufficient support as at least 100 positive mentions for a disease and at least 10 positive mentions for a symptom. This resulted in 156 diseases and 491 symptoms. The graph consists of medical concepts (diseases and symptoms) as nodes and disease-symptom relationships as edges.
A few concepts in GHKG are classified as both diseases and symptoms (for example, “type II diabetes” is a disease, but also a symptom of “polycystic ovary syndrome”). In these cases, we treat the concept only as a disease. Each concept comes with its common name, its aliases, and the ICD-9 codes and UMLS concepts to which it can be mapped. In addition, a measure of expected frequency is provided for both diseases and symptoms. For symptom nodes, the expected frequency of the symptom given the disease is provided as “frequent” or “always”. For disease nodes, the frequency is described by age group (‘old’, ‘adult’, ‘young’, etc.) as “very frequent”, “frequent”, “rare”, “very rare” or “never”.
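As a rough sketch of how such a support filter could be applied, the snippet below counts positive mentions per concept and keeps those above a threshold. The record layout (a list of `(diseases, symptoms)` pairs of sets) and the function name `select_supported` are assumptions for illustration; each concept is counted at most once per record.

```python
from collections import Counter


def select_supported(records, min_disease_mentions=100, min_symptom_mentions=10):
    """Keep diseases and symptoms with enough positive mentions.

    records: list of (diseases, symptoms) pairs, each a set of concept names
    positively mentioned in one medical record (assumed layout).
    """
    disease_counts = Counter()
    symptom_counts = Counter()
    for diseases, symptoms in records:
        disease_counts.update(diseases)
        symptom_counts.update(symptoms)

    kept_diseases = {d for d, c in disease_counts.items() if c >= min_disease_mentions}
    kept_symptoms = {s for s, c in symptom_counts.items() if c >= min_symptom_mentions}
    return kept_diseases, kept_symptoms


# Toy data with small thresholds so the effect is visible.
toy_records = [
    ({"flu"}, {"fever", "cough"}),
    ({"flu"}, {"fever"}),
    ({"gout"}, {"fever"}),
]
print(select_supported(toy_records, min_disease_mentions=2, min_symptom_mentions=3))
```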
2. Entity relationship learning algorithm
For the disease-symptom relationship (and likewise the disease-examination relationship), the weight that the model learns for an entity relationship measures how likely the disease is to cause the symptom. The model is therefore essentially a statistical learning mechanism driven by the co-occurrence of disease and symptom entities in electronic medical records, and it also aims to learn and measure the causal relationship between diseases and symptoms. Diseases and symptoms are the nodes, and the relationships between them, together with their weights, are the edges. Because the disease is the cause and the symptom or examination result is the effect, each edge is directed from a disease node to a symptom or examination node. The disease, symptom, and examination nodes and their edges therefore naturally form a directed acyclic graph.
2.1 Causal analysis of the model:
To reason broadly about differences in model performance, note that the disease-symptom knowledge graph is essentially a causal graph that describes how diseases cause symptoms. What makes a good knowledge graph can be defined with causal queries through Pearl’s ‘do’ operator. With this operator we intervene by setting a disease to “present” or “absent” and observe how the intervention changes the likelihood of a symptom. This effect is expressed by the importance measure (IMPT) between a disease and a symptom.
The weight of the edge (i, j) in the true graph is closely related to the likelihood ratio of the symptom under the intervention that sets the disease to “present” or “absent”. By using Pearl’s do-calculus to derive the importance measure, we can separate a genuine disease-symptom relationship from one that appears only because the disease often co-occurs with other diseases that cause the symptom.
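One way to write such a measure is shown below. The notation is an assumption made for illustration: y_i is the indicator that disease i is present, x_j is the indicator that symptom j is observed, and the log-ratio form is only one plausible reading of "likelihood ratio", not a formula stated in this text.

```latex
\mathrm{IMPT}_{ij}
  \;=\;
  \log \frac{p\left(x_j = 1 \mid \mathrm{do}(y_i = 1)\right)}
            {p\left(x_j = 1 \mid \mathrm{do}(y_i = 0)\right)}
```

Under this form, a large positive value means that forcing disease i to be present makes symptom j much more likely than forcing it to be absent, and a value near zero means the intervention has little effect on the symptom.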
3. Entity relationship learning based on the naive Bayes model
3.1 Naive Bayes assumptions in constructing a medical knowledge graph:
Naive Bayes assumes conditional independence among the symptom child nodes. In the present study, this means that symptoms are conditionally independent of each other given their parent disease node. This is an oversimplification, because for a given disease the appearance of one symptom may make another symptom more likely. For example, “congestion” and “headache” are common symptoms of “bronchitis”. Although the two do not always occur together when the disease is present, “congestion” increases the likelihood of “headache”.
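Written out, the conditional independence assumption factorizes the joint distribution of the symptoms given a disease. The symbols are assumed here for illustration: y_i is the indicator for disease i and x_1, ..., x_n are the symptom indicators.

```latex
p(x_1, \dots, x_n \mid y_i) \;=\; \prod_{j=1}^{n} p(x_j \mid y_i)
```

In the bronchitis example, this factorization implies p(headache | congestion, bronchitis) = p(headache | bronchitis), which is exactly the equality that fails when congestion makes headache more likely.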
3.2 Specific steps for constructing a medical knowledge graph using the naive Bayes model:
Based on the analysis of how the naive Bayes model applies in this study and on the calculation scheme for IMPT, the overall process of learning disease-symptom and disease-examination relationships and constructing the corresponding knowledge graph is divided into four steps:
Step A: From each medical record, extract the entity concepts (diseases, symptoms, and examinations) that are mentioned outside the scope of any negation.
Step B: Use the extracted entities to build the patient-disease, patient-symptom, and patient-examination co-occurrence matrices.
Step C: First set a threshold on the number of co-occurrences between a disease entity and a symptom or examination entity as a denoising measure. Then, using the co-occurrence matrices, compute the conditional probabilities that IMPT requires for each disease-symptom or disease-examination pair, and from them the value of the importance measure for that pair.
Step D: Assemble the knowledge graph of disease-symptom and disease-examination relationships from the entities and the importance values of each pair.
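Steps B to D can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: records are `(diseases, symptoms)` pairs of sets, the importance measure is taken to be the log ratio log p(s | d present) - log p(s | d absent), and add-alpha smoothing is introduced here only to avoid taking the log of zero.

```python
import math
from collections import Counter


def naive_bayes_edges(records, min_cooccurrence=1, alpha=1.0):
    """Return {(disease, symptom): importance} learned from co-occurrence counts.

    records: list of (diseases, symptoms) pairs of sets (assumed layout).
    min_cooccurrence: denoising threshold on disease-symptom co-occurrences.
    alpha: smoothing constant (an assumption of this sketch).
    """
    n = len(records)
    disease_n = Counter()   # records mentioning each disease
    symptom_n = Counter()   # records mentioning each symptom
    pair_n = Counter()      # records mentioning both disease and symptom
    for diseases, symptoms in records:
        disease_n.update(diseases)
        symptom_n.update(symptoms)
        pair_n.update((d, s) for d in diseases for s in symptoms)

    edges = {}
    for (d, s), both in pair_n.items():
        if both < min_cooccurrence:
            continue  # too few co-occurrences: treat as noise
        # Smoothed estimates of p(s = 1 | d = 1) and p(s = 1 | d = 0).
        p_with = (both + alpha) / (disease_n[d] + 2 * alpha)
        p_without = (symptom_n[s] - both + alpha) / (n - disease_n[d] + 2 * alpha)
        edges[(d, s)] = math.log(p_with) - math.log(p_without)
    return edges


toy_records = [
    ({"flu"}, {"fever", "cough"}),
    ({"flu"}, {"fever"}),
    ({"flu"}, {"fever"}),
    ({"migraine"}, {"headache"}),
    ({"migraine"}, {"headache", "fever"}),
    ({"migraine"}, {"headache"}),
]
for edge, weight in sorted(naive_bayes_edges(toy_records, min_cooccurrence=2).items()):
    print(edge, round(weight, 3))
```

The returned dictionary is already a weighted edge list from disease nodes to symptom nodes, so it can be loaded directly into whatever graph store is used. The same function applies to disease-examination pairs if examinations are passed in place of symptoms.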
4. Entity relationship learning based on the NoisyOR model
4.1 The assumption of NoisyOR in constructing a medical knowledge graph:
One of the core assumptions of the NoisyOR model is the independence of causal effects. This assumption is an oversimplification because, when several diseases are present at once, a symptom may behave quite differently from how it would under each disease alone.
Because we learn the model parameters by maximum likelihood estimation and derive the importance measure from the conditional probability distributions, we make no assumption about the prior distribution of the diseases. This is an important distinction between NoisyOR and naive Bayes, which implicitly assumes that diseases are independent. In the setting considered in this study, diseases are certainly not independent. For example, since a patient usually presents with only a few diseases, the presence of one disease typically lowers the likelihood of others.
NoisyOR is a conditional probability distribution that describes the causal mechanism by which parent nodes affect the state of a child node. In the construction method proposed here, it describes how the parent disease nodes affect whether their child symptom nodes are observed. In a deterministic, noise-free setting, the presence of an underlying disease always causes its symptoms to be observed, and a symptom is observed if any of its parent diseases is “present”. For example, a patient who has the “flu” or “mononucleosis” will have a “fever.”
In real life, however, this process is far from deterministic, which is where the “noisy” part comes in: even with the “flu”, a patient may not have a “fever”. Conversely, a “fever” may be caused by neither “flu” nor “mononucleosis.” NoisyOR handles this inherent noise by introducing failure and leak probabilities. A failure probability is the chance that a present disease does not produce the symptom, and the leak probability is the chance that the symptom appears when none of the modeled diseases is present.
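The NoisyOR conditional distribution can be illustrated with the fever example. In the standard formulation, the symptom is absent only if the leak does not fire and every present disease independently fails to cause it. The numeric failure and leak values below are made up for the example.

```python
def noisy_or_probability(present_diseases, failure, leak):
    """P(symptom observed | the given diseases are present) under NoisyOR.

    failure: {disease: probability that the disease fails to cause the symptom}
    leak: probability that the symptom appears with no modeled disease present
    """
    p_absent = 1.0 - leak            # the leak does not fire
    for disease in present_diseases:
        p_absent *= failure[disease]  # each present disease fails independently
    return 1.0 - p_absent


# Illustrative (made-up) parameters for the symptom "fever".
failure = {"flu": 0.2, "mononucleosis": 0.4}
leak = 0.05

print(noisy_or_probability([], failure, leak))                        # leak only
print(noisy_or_probability(["flu"], failure, leak))                   # one parent present
print(noisy_or_probability(["flu", "mononucleosis"], failure, leak))  # both parents present
```

Setting every failure probability and the leak to zero recovers the deterministic OR described earlier: the symptom is observed exactly when at least one parent disease is present.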
4.2 Specific steps for constructing a medical knowledge graph using the NoisyOR model:
Based on the analysis of how the NoisyOR model applies in this study and on the calculation scheme for IMPT, the overall process of constructing the knowledge graph of disease-symptom and disease-examination relationships is divided into four steps:
Step A: From each medical record, extract the entity concepts (diseases, symptoms, and examinations) that are mentioned outside the scope of any negation.
Step B: Use the extracted entities to build the patient-disease, patient-symptom, and patient-examination co-occurrence matrices.
Step C: Using the co-occurrence matrices, compute the conditional probabilities that IMPT requires for each disease-symptom or disease-examination pair, and from them the value of the importance measure for that pair.
Step D: Assemble the knowledge graph of disease-symptom and disease-examination relationships from the entities and the importance values of each pair.
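For Step C, the NoisyOR parameters have to be estimated before an importance value can be read off. The sketch below fits the leak and failure probabilities of a single symptom by maximum likelihood using projected gradient ascent. The optimizer, step size, iteration count, and record layout are all assumptions of this example; the text above only states that maximum likelihood estimation is used.

```python
import math


def fit_noisy_or(records, diseases, steps=5000, lr=0.1):
    """Maximum likelihood fit of NoisyOR parameters for one symptom.

    records: list of (present_diseases, symptom_observed) pairs (assumed layout);
             every present disease must appear in `diseases`.
    Parameters are stored as theta = -log(probability) >= 0, so that
    P(symptom absent) = exp(-(theta_leak + sum of theta[d] for present d)).
    """
    floor = 1e-6  # keeps every theta strictly positive
    theta = {d: 0.5 for d in diseases}
    theta_leak = 0.5
    n = len(records)

    for _ in range(steps):
        grad = {d: 0.0 for d in diseases}
        grad_leak = 0.0
        for present, observed in records:
            s = theta_leak + sum(theta[d] for d in present)
            # Derivative of the record log-likelihood with respect to s:
            #   symptom observed: d/ds log(1 - exp(-s)) = 1 / (exp(s) - 1)
            #   symptom absent:   d/ds (-s)             = -1
            g = 1.0 / math.expm1(s) if observed else -1.0
            grad_leak += g
            for d in present:
                grad[d] += g
        # Gradient ascent step on the mean log-likelihood, projected onto theta >= floor.
        theta_leak = max(floor, theta_leak + lr * grad_leak / n)
        for d in diseases:
            theta[d] = max(floor, theta[d] + lr * grad[d] / n)

    leak = 1.0 - math.exp(-theta_leak)
    failure = {d: math.exp(-theta[d]) for d in diseases}
    return leak, failure


# Toy data for the symptom "fever": 8 of 10 flu patients have it,
# and 1 of 10 patients without flu has it.
toy = ([({"flu"}, True)] * 8 + [({"flu"}, False)] * 2
       + [(set(), True)] * 1 + [(set(), False)] * 9)
leak, failure = fit_noisy_or(toy, ["flu"])
print(round(leak, 3), {d: round(f, 3) for d, f in failure.items()})
```

One natural choice of edge weight, assumed here, is `1 - failure[d]`: the probability that the disease alone turns the symptom on. Because the log-likelihood is concave in the theta parameterization, plain gradient ascent is enough for a sketch, although the step size may need tuning on real data.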
Main contribution of this article
1) The medical knowledge graph is the cornerstone of smart healthcare and is expected to bring more efficient and accurate medical services. However, existing knowledge graph construction techniques generally suffer from low efficiency, many restrictions, and poor scalability when applied to the medical field, because medical data is cross-lingual, highly specialized, and structurally complex. With the development of medical informatization, electronic medical data has accumulated. Building a knowledge graph in the medical field makes it possible to extract medical knowledge from massive data and to manage, share, and apply it reasonably and efficiently. This is of great significance to today’s medical industry and is a research hotspot for many enterprises and research institutions. The medical knowledge graph combines knowledge graph technology with medical knowledge and will promote the automated and intelligent processing of medical data, bringing new development opportunities to the medical industry.
2) This work provides a method for constructing medical knowledge graphs that uses two machine learning (statistical) methods to learn relationships between medical entities. Starting from the extracted medical entity concepts, we select the entities (diseases, symptoms, and examinations) that are positively mentioned in the medical records, remove entities that fall within the scope of a negation, and use naive Bayes and NoisyOR to learn disease-symptom and disease-examination relationships. The corresponding disease-symptom and disease-examination knowledge graphs are then built from the entities and the learned relationships.
Copyright statement: This is an original article by the author. Please include a link to the original blog post when reposting.