Search 72,490 tutors
FIND TUTORS
Ask a question
0 0

Method of least squares

A magazine reported on a study of the reliability of a commercial kit to test for arsenic in groundwater. The field kit was used to test a sample of 20 groundwater wells in a country. In addition to the arsenic level (micrograms per liter) the latitude (degrees) longitude (degrees) and depth (feet) of each well was measured

Data table:
Well_ID   Latitude   Longitude   Depth     Arsenic
1             23.7714   90.6221       30          78
2             23.7764   90.6492       60          65
3             23.7832   90.6321      125        192
4             23.7944   90.6386        30          21
5             23.7644   90.6224      100         187
6             23.7916   90.6228      100         187
7             23.7674   90.6225        45         123
8             23.7694   90.6176        90         285
9             23.7732   90.6147        75         182
10           23.7894   90.6482       125         292
11           23.7852   90.6215         75         182
12           23.7856   90.6207         45           76
13           23.7745   90.6496         90           55
14           23.7479   90.6387         60         179
15           23.7734   90.6487         90         185
16           23.7681   90.6463         90         285
17           23.7754   90.6452         90         185
18           23.7881   90.6489        125        192
19           23.7872   90.6218          30          55
20           23.7818   90.6243        125         192
 
 
b.Fit the model to the data using the method of least squares. Let x1 be the latitudes, x2 be the longitudes, x3 be the depths

Answer. y= (?) +(?) x1 (?) x2+(?) x3  Roundthe constant term and the coefficients for x1 and x2. Round the coefficient for x3 to the nearest hundredth as needed
 
 
Tutors, please sign in to answer this question.

4 Answers

I am not sure what level you are from the question as asked.  The methods presented so far have been sort of down-in-the-ditches, concentrating on getting the results rather than understanding the problem.  
 
The problem states "Fit the data using the method of least squares".  It does not say to fit the model using Excel, RStudio, SPSS, or any other software product.  Be also careful because Excel computations for this kind of problem have been called into question.  
 
If your goal is to just "get the answer", then definitely use software to get the answer.  But if you want to understand least squares, then you should work it mostly by hand with only a little help from a calculator.
 
It's important to understand the framework of Statistics.  Statisticians have spent many years studying the various assumptions, and yet it's easy for a careless researcher to misuse Multiple Regression to prove hypotheses which are not valid.
 
Each Multiple Regression assumes a particular mathematical model.  If the model is not as expected, the results will not be reliable and things will go very wrong. A regression model can be linear or non-linear, and can have a number of odd twists if certain assumptions are not satisfied, or if the wrong independent variables are in the model, or if the independent variables are correlated, or if the errors are correlated.   
 
I guess that the person who wrote the problem is assuming a linear model, as specified in the assumptions described here: http://www.acastat.com/Statbook/molsassumptions.htm.  One needs to continually validate assumptions.
 
If the assumptions for a Multiple Regression with a linear model are satisfied, then the general equation for Multiple Regression with a dependent variable Y and a matrix of coefficients of the dependent to independent variables X is Y = Xβ + u, where u is a vector of the error values for the Y observations.  
 
The general solution to a Multiple Regression problem in the linear case where β-hat is the estimate of β and XT  is the transpose of the matrix X is 
 
                                        β-hat = (XTX)-1XTY   
 
This general solution is obtained by taking the derivative of the sum of squared residual values, and setting it to zero to find the minimum, which is why the technique is called "least squares".  (This technique uses the Euclidean distance as a measure of the deviations, and one should note that if a different metric for distance were used the results could be different.)
 
In this case, Y is the 20x1 vector of dependent variable values containing 78, 65, ... , 192, and X is the 20x3 matrix of the dependent variables latitude, longitude, and depth.   The vector β is the 3x1 matrix of unknown coefficients β1 β2 and β3.  
  
Matrix operations are very straightforward, and make the formulas you need to solve these problems much easier to remember.
 
Since the problem does not state any null or alternate hypothesis, it makes no sense to talk about significance in the solution to this problem.  The problem states that the purpose of taking the observations is to test the reliability of the new commercial kit, but does not specify what the criteria are for evaluating that reliability.  See http://www.uni.edu/chfasoa/reliabilityandvalidity.htm  or http://en.wikipedia.org/wiki/Reliability or http://web.utk.edu/~leon/rel/overview/reliability.html ;for a definition of the term "reliability".  It appears that "reliability" is a very vague, nearly undefined, term.   So the purpose of the curve fitting is not clear from the problem.
  
 
 
 
 
 
 
 
 
 
The regression equation is given by
 
y = -2074 x1 - 1107 x2 + 1.95 x3 + 149623
 
This is the output using Microsoft Excel multiple regression analysis:
 
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.783344535
R Square 0.61362866
Adjusted R Square 0.541184034
Standard Error 54.12807997
Observations 20

Coefficients
Intercept 149622.7039
LATITUDE -2074.359732
LONGITUDE -1106.621483
DEPTH 1.95344088

Solution using R.
 
1. Copy and paste this data in Notepad and save. Let's say you saved it as "WellData.txt".
 
2. Launch RStudio.
 
3. On the menu bar, select Tools > Import Dataset > From Text File...
 
4. Select WellData.txt and click Open.
 
5. In the RStudio Command prompt type model <- lm(WellData$Arsenic ~ WellData$Latitude + WellData$Longitude + WellData$Depth, data=WellData) and press enter. This creates a regression object and stores it in the variable model.
 
6. Now type summary(model) to see the regression summary. It should look like this
 
Call:
lm(formula = WellData$Arsenic ~ WellData$Latitude + WellData$Longitude +
WellData$Depth, data = WellData)

Residuals:
        Min           1Q  Median          3Q          Max
-111.853   -21.778    0.697    14.938    101.219

Coefficients:
                                     Estimate     Std. Error   t value    Pr(>|t|)
(Intercept)                149622.704    95674.027    1.564    0.13741
WellData$Latitude    -2074.360      1114.034  -1.862    0.08107 .
WellData$Longitude  -1106.621     1006.227   -1.100    0.28770
WellData$Depth                  1.953           0.401    4.871    0.00017 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 54.13 on 16 degrees of freedom
Multiple R-squared: 0.6136, Adjusted R-squared: 0.5412
F-statistic: 8.47 on 3 and 16 DF, p-value: 0.001343
 
-----------------------------------------------------------------
From this we have y = 149622.70 - 2074.36x1 - 1106.62x2 + 1.95x3.
Mik,
The method of least squares applied to this situation is "Multiple regression". You can use the "Data analysis" tool pack of Excel to find the equation of the regression hyperplane (y= (?) +(?) x1 (?) x2+(?) x3). Make sure that "Data Analysis" has been activated in the Excel of your computer. If it is activated, you should be able to see its menu on the far right of Excel menus when you click on the "Data" tab of Excel. If it has not been activated and depending on the version of Excel in your computer (2003/2007/2010), you can activate it by clicking on Microsoft icon (top/left) (or on the "File" menu in the 2010 version), and then successively click on "Options"--"Add-ins"--"Go". At this point the data analysis dialogue box will appear and you must check the "Data analysis tool-pack" option. It may then take a few minutes for your computer to add the data analysis tool pack.
 
Next, enter the data in five different columns. The first column representing the "Well ID" is not important and does not play any role. Suppose this is Column A of Excel. Also suppose the Columns B, C, D, and E respectively contain the X1 (latitude), X2 (longitude), X3 (depth), and Y (arsenic) values; with all 4 top cells being the labels; so that, the numeric data are in cell B2 through B21 for the latitudes, cells C2 through C21 for the longitudes, cells D2 through D21 for the depths, and cells E2 through E21 for the arsenic levels.
 
Next, click on the "Data" menu, and then click on the "Data Analysis" tab. When the Data analysis dialogue box opens, choose "Regression" in the available list. In the regression dialogue box that opens, enter columns B, C, and D as the X-values. This can be done by clicking into cell B1, dragging the mouse to the right until C1 and D1 are also included, and then dragging all the way down until all three columns representing X1 (latitude), X2 (longitude), X3 (depth) values have been included. If done correctly, the box containing the three X values should indicate something like: "Sheet 1!$B$1:$D$21". If you have a hard time dragging in with the mouse, you can simply type in "B1:D21" into the box for X (independent variables). Next, type in "E1:E21" into the box for Y (dependent variable), or use the dragging with the mouse method. Finally, Click on "OK".
 
Excel will now generate a new sheet containing all the necessary information for the multiple regression. The information that you would need to answer the questions of your problem are located at the bottom of that sheet under the "Regression Output". The value indicated as the "Intercept" is the first "?" of the equation; and the other three values indicated as the coefficients b1, b2, and b3 corresponding to the three independent variables X1 , X2, and X3 would replace the other three "?" marks in the equation.
 

Seattle Probability & Statistics tutors