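Predicting COVID-19 statistics using machine learning regression model: Li-MuLi-Poly
Country: India
Source: CNKI (China National Knowledge Infrastructure)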
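Abstract: In this paper, linear regression (LR), multi-linear regression (MLR) and polynomial regression (PR) techniques are applied to propose the Li-MuLi-Poly model. The model predicts COVID-19 deaths in the United States of America. Experiments were carried out on a machine learning model, a minimum mean square error model and a maximum likelihood ratio model. The models were assessed on the measures of mean square error, adjusted mean square error, root mean square error (RMSE) and maximum likelihood ratio, and a statistical t-test was used to verify the results. The dataset is analyzed, cleaned and wrangled before being applied to the proposed regression models. The correlation of the selected independent parameters is determined with a heat map and the Karl Pearson correlation matrix. It was found that the LR model fits the dataset most accurately when all the independent parameters are used in modeling; however, its RMSE and mean absolute error (MAE) are high compared with the PR models. A high-degree PR model is required to best fit the dataset when few independent parameters are considered in modeling, whereas a low-degree PR model best fits the dataset when independent parameters from all dimensions are considered.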
Keywords: Machine learning, Linear regression, Polynomial regression, t-test, COVID-19, Accuracy
Introduction:
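The terms endemic, epidemic, outbreak and pandemic are closely related. An endemic is a disease that is constantly present in a particular location or region. For example, ice is endemic to Antarctica, and malaria is endemic to Africa and parts of India. An epidemic, by contrast, is a disease that is localized to a region but whose new cases spread faster than expected. In an epidemic the problem gets out of control; for example, COVID-19 was an epidemic while it was confined to the city of Wuhan, China. Going one step further, an endemic becomes an outbreak when the rise in the number of cases exceeds what was anticipated. If the outbreak is not controlled at this point, it becomes an epidemic. When the epidemic spreads more widely geographically, over multiple countries or continents, it becomes a pandemic [1].
A dataset used for analysis should be pre-processed from different viewpoints. Multi-view methods preserve the diverse characteristics of data well [2-6]. Many researchers have analyzed the spread patterns of diseases and tried to predict their impact so that policies can be developed to combat them and prevent the damage they cause. A number of statistical models have been developed for this purpose. In this paper, mainly machine-learning-based linear and polynomial regression models are surveyed and analyzed. A non-linear regression model with a high confidence level and high efficiency was developed for modeling and forecasting the incidence of malaria [7]. The authors applied non-linear regression analysis to three types of data (long time series, short time series and spatial data) and tested the models with ANOVA tests. Support vector regression was applied to predict the number of COVID-19 cases; the non-linear models with the highest degree of non-linearity, based on the Gaussian kernel function, performed well but suffered from over-fitting [8]. Exponential, polynomial and auto-regressive integrated moving average (ARIMA) regression mechanisms were used to predict the growth of COVID-19 cases in India. The authors traced the growth of COVID-19 cases and found that it follows a power regime, moving from exponential to quadratic and then from quadratic to linear. The models were fitted using p values, R-square error values and the ANOVA test, and the experiments showed that the ARIMA model was the best [9]. An auto-regression technique was used to improve the predictive ability of linear and multi-linear regression models for the death rate in India; however, the predicted death rate did not pass the test of statistical significance [10].
In another work, the authors proposed a susceptible-infectious-recovered-dead (SIDR) model for estimating the growth in COVID-19 cases; it uses the basic reproduction number, the mortality rate and the recovery rate as parameters in a linear regression with least squares as the cost function. The accuracy of the model was checked on R-square and RMSE [11].
Some other research works based on advanced techniques have also been studied. A deep learning and artificial intelligence framework was used to categorize the illness [12]. A long short-term memory (LSTM) based model [13] and a deep learning approach with LSTM networks [14] were used to show the trends of infection and death rates of the COVID-19 pandemic. A linear regression model was used to predict the impact of COVID-19 on confirmed, recovered and death cases in many countries [15]. A mathematical model considered the parameters that influence the spread of COVID-19 cases, and a non-parametric model based on Fourier decomposition was proposed to best fit the available data [16]. A model based on the trust-region-reflective (TRR) algorithm has been proposed, which uses a real-time optimization technique to fit COVID-19 data; the uncertainty of the mechanism was quantified with the LHS-PRCC coefficient test [17]. The effects of population density and climatic factors have been used in modeling COVID-19 statistics [18, 19]. In another study, the authors considered the effect of the availability of health-care facilities on controlling COVID-19 cases and developed an SEIR epidemic model [20].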
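Statistical machine learning regression models have also been applied in many other fields, including business, climate control, education and academia, and sports. In another approach proposed for analyzing the patterns and relationships between dependent and independent parameters, the authors used lagged polynomial fractional regression (LPFR), an extension of polynomial fractional regression (PFR). The proposed approach was shown to perform better on the R-square error and adjusted R-square error measures [21]. A polynomial regression model was applied to predict the relationship between strain and drilling depth, with the model parameters estimated by the least-squares method [22]. The market value of football players was predicted from their physical and past-performance characteristics with a multi-linear regression model [23], and multi-linear regression was applied to the academic evaluation of students [24]. In another study, the authors examined the impact of COVID-19 on education systems worldwide [25].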
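The work presented in this paper proposes machine-learning-based linear and polynomial regression models that best fit the COVID-19 pandemic statistics in the Johns Hopkins dataset [26]. The dataset covers COVID-19 statistics from 58 states. The number of deaths is predicted with linear and polynomial regression on the basis of four independent parameters. Experiments were carried out on a machine learning model, a minimum mean square error model and a maximum likelihood ratio model. The measures of mean square error, adjusted mean square error, root mean square error and maximum likelihood ratio, together with a statistical t-test, are used to validate the results. The dataset is analyzed, cleaned and wrangled before being applied to the proposed regression models. The correlation of the selected independent parameters is determined with a heat map and the Karl Pearson correlation matrix. The magnitude of the correlation between data fields is shown as a variation in hue or color intensity in two dimensions. High values in the matrix (in the range 0-1) indicate that the five fields are strongly correlated and can be considered as independent parameters in fitting the model.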
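The correlation step described above can be illustrated with a minimal sketch, assuming plain Python lists as columns. The field names `x1`, `x2` and `deaths` and the toy values are hypothetical placeholders, not the columns or figures of the Johns Hopkins dataset; a heat map would simply color each cell of the resulting matrix by its value.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict {name: {name: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy data; names and values are assumptions for illustration only.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.1, 5.9, 8.2, 9.8],
    "deaths": [10.0, 19.0, 31.0, 42.0, 50.0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(r, 3) for r in row.values()])
```

Values close to 1 in a row suggest that the corresponding field moves together with the others and is a candidate parameter for the regression.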
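The rest of the paper is organized as follows. Section 2 describes the data pre-processing methodology and the proposed model. Section 3 presents the accuracy results of the linear and polynomial regression models on the four accuracy evaluation metrics. Section 4 discusses the results and validates the model with a statistical t-test. Section 5 concludes the paper.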
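The proposed Li-MuLi-Poly model
After pre-processing the dataset (data collection, analysis, cleaning, wrangling and selection of independent parameters), we propose machine-learning-based linear and polynomial regression models following the flowchart shown in Fig. 1. The following four accuracy metrics are used to check the models: R-square error, adjusted R-square error, root mean square error (RMSE) and mean absolute error (MAE).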
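The four accuracy metrics can be computed as in the following sketch, which uses their standard textbook definitions. The function name `regression_metrics` and the sample values are assumptions made for illustration and are not taken from the paper.

```python
from math import sqrt

def regression_metrics(y_true, y_pred, n_features):
    """Return R-square, adjusted R-square, RMSE and MAE for a set of predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-square penalizes the number of independent parameters used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    rmse = sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae}

# Hypothetical observed and predicted values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.5, 11.5]
scores = regression_metrics(y_true, y_pred, n_features=1)
print(scores)
```

R-square and adjusted R-square are unit-free, whereas RMSE and MAE are expressed in the units of the predicted quantity, so a model can score well on the former while its RMSE and MAE remain large.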
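Fig. 1
The following four independent parameters are used in the dataset for LR and PR, and the number of deaths, "Deaths", is predicted as the dependent parameter.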
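Using machine learning techniques, the model is trained on 80% of the input dataset for linear and polynomial regression with a single parameter, two parameters, three parameters and four parameters taken from the feature set "x1", "x2", "x3" and "x4". At each stage the model is evaluated with the four evaluation metrics, and the intercept and coefficients are obtained. The input data come from a p-dimensional real space and the output also comes from a real space. The data are drawn from some joint distribution that is unknown a priori.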
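The training stage can be sketched as follows, assuming an 80/20 random split and ordinary least squares solved through the normal equations. The helper names, the random seed and the synthetic data are hypothetical; the paper's own implementation and library are not given in the text.

```python
import random

def solve_linear_system(a, b):
    """Solve a·w = b by Gaussian elimination with partial pivoting (a must be non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def fit_least_squares(rows, targets):
    """Ordinary least squares; each row must already contain the constant term 1."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def poly_features(x, degree):
    """Expand a single feature x into [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

def train_test_split(xs, ys, train_fraction=0.8, seed=0):
    """Shuffle the indices and keep the first train_fraction for training."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])

# Hypothetical noise-free data following y = 2 + 3x + 0.5x^2 (not the paper's data).
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x + 0.5 * x * x for x in xs]
x_train, y_train, x_test, y_test = train_test_split(xs, ys)

# Degree-2 polynomial regression on a single parameter; coeffs[0] is the intercept.
coeffs = fit_least_squares([poly_features(x, 2) for x in x_train], y_train)
predictions = [sum(c * f for c, f in zip(coeffs, poly_features(x, 2))) for x in x_test]
print(coeffs)
```

A linear model on several parameters uses the same `fit_least_squares` call with rows of the form `[1, x1, x2, ...]`, so the single-, two-, three- and four-parameter stages differ only in how the rows are built.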
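We try to learn a function f(x) on the training sample dataset and validate it on the test dataset.
[…],
where x consists of […]; each component corresponds to an attribute describing the data.
It can also be written as:
[…]
where […] is set.
In vector form it is expressed as:
[…]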
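The formulas for this passage did not survive in the text. A standard way to write a linear model that matches the description (a sum over the p attributes, a rewritten form with an extra constant component, and a vector form) is sketched below. The coefficient symbol \beta and the convention x_0 = 1 are assumptions, not the authors' confirmed notation.

```latex
\begin{align}
f(x) &= \beta_0 + \sum_{j=1}^{p} \beta_j x_j, & x &= (x_1, \dots, x_p) \in \mathbb{R}^p \\
f(x) &= \sum_{j=0}^{p} \beta_j x_j, & x_0 &= 1 \\
f(x) &= \langle \beta, x \rangle = \beta^{\top} x, & \beta &= (\beta_0, \beta_1, \dots, \beta_p)
\end{align}
```

In the second and third lines x is extended with the constant component x_0 = 1, so that the intercept \beta_0 is absorbed into the sum and the model becomes an inner product.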
Predicting COVID-19 statistics using machine learning regression model: Li-MuLi-Poly
Hari Singh and Seema Bawa
Abstract
In this paper, linear regression (LR), multi-linear regression (MLR) and polynomial regression (PR) techniques are applied to propose a model, Li-MuLi-Poly. The model predicts COVID-19 deaths in the United States of America. The experiment was carried out on a machine learning model, a minimum mean square error model and a maximum likelihood ratio model. The best-fitting model was selected according to the measures of mean square error, adjusted mean square error, root mean square error (RMSE) and maximum likelihood ratio, and a statistical t-test was used to verify the results. The dataset is analyzed, cleaned and wrangled before being applied to the proposed regression model. The correlation of the selected independent parameters was determined with a heat map and the Karl Pearson correlation matrix. It was found that the LR model best fits the dataset when all the independent parameters are used in modeling; however, its RMSE and mean absolute error (MAE) are high compared with the PR models. PR models of a high degree are required to best fit the dataset when few independent parameters are considered in modeling. However, PR models of a low degree best fit the dataset when independent parameters from all dimensions are considered in modeling.
Keywords: Machine learning, Linear regression, Polynomial regression, t-Test, COVID-19, Accuracy
Introduction
The terms endemic, epidemic, outbreak and pandemic are very closely related. An endemic is a disease that has a constant presence in a particular location or region. For example, ice is endemic to Antarctica, and malaria is endemic to Africa and to some parts of India. An epidemic, however, is a disease that is localized to a region but whose new cases spread faster than expected. In an epidemic the problem gets out of control; for example, while COVID-19 was limited to the city of Wuhan, China, it was an epidemic. Going one step further, an endemic becomes an outbreak when the rise in the number of cases of the disease is greater than anticipated. If the outbreak is not controlled at this point, it becomes an epidemic. When the epidemic spreads more widely geographically, over multiple countries or continents, it becomes a pandemic [1].
The dataset to be used for analysis should be viewed from different angles for pre-processing. Multi-view methods can preserve the diverse characteristics of data well [2–6]. Many researchers have analyzed the spread patterns of diseases and tried to predict their impact so as to develop policies to combat them and prevent the damage they cause. A number of statistical models have been developed for this purpose. In this paper, mostly machine-learning-based linear and polynomial regression models have been surveyed and analyzed. A non-linear regression model for modeling and forecasting the incidence of malaria with a high confidence level and a high degree of efficiency was developed [7]. The authors applied non-linear regression analysis to three types of data (long time series, short time series and spatial data) and tested the models with statistical ANOVA tests. A support vector regression mechanism was applied to predict the number of COVID-19 cases; the non-linear models with the highest degree of non-linearity, based on the Gaussian kernel function, performed well but suffered from over-fitting [8]. Exponential, polynomial and auto-regressive integrated moving average (ARIMA) regression mechanisms were used for predicting the growth of COVID-19 cases in India. The authors traced the growth of COVID-19 cases and found that it follows a power regime, i.e. from exponential to quadratic and then from quadratic to linear. The models were fitted using p values, R-square error values and the ANOVA test, and experimentation revealed that the ARIMA models are the best [9]. An auto-regression technique was used to improve the predictive ability of linear and multi-linear regression models for predicting the death rate in India. However, the predicted death rate did not pass the test of statistical significance [10].
In another research work, the authors proposed a susceptible-infectious-recovered-dead (SIDR) model for estimating the growth in COVID-19 cases; it uses the basic reproduction number, mortality rate and recovery rate as parameters in a linear regression with least squares as the cost function. The accuracy of the model is checked on R-square and RMSE [11].
Some other research works based on advanced techniques have also been studied. A deep learning and artificial intelligence framework is used for categorizing the illness [12]. A long short-term memory (LSTM) based model …