2 May 2012

Statistics is as good as the user

Statistics is used for various purposes: to simplify mass data, to make comparisons easier, to bring out trends and tendencies in the data, and to reveal hidden relations between variables. All this makes decision making much easier. Let us look at each function of Statistics in detail.

  1. Statistics simplifies mass data: The use of statistical concepts helps in the simplification of complex data. Using statistical concepts, managers can make decisions more easily. Statistical methods help reduce the complexity of the data and, consequently, make any huge mass of data easier to understand.
  2. Statistics makes comparison easier: Without statistical methods and concepts, data cannot be collected and compared easily. Statistics helps us compare data collected from different sources. Grand totals, measures of central tendency, measures of dispersion, graphs and diagrams, and the coefficient of correlation all provide ample scope for comparison.
  3. Statistics brings out trends and tendencies in the data: After data are collected, it is easy to analyze the trends and tendencies in the data by using the various concepts of Statistics.
  4. Statistics brings out the hidden relations between variables: Statistical analysis helps in drawing inferences on data. Statistical analysis brings out the hidden relations between variables.
  5. Decision making becomes easier: With the proper application of Statistics and statistical software packages to the collected data, managers can make effective decisions, which can increase profits in a business.
Seeing all these functions, we can say ‘Statistics is as good as the user’.

‘Statistics is the backbone of decision-making’


Due to advanced communication networks, rapid changes in consumer behavior, the varied expectations of a variety of consumers, and new market openings, modern managers face the difficult task of making quick and appropriate decisions. Therefore, they need to depend more on quantitative techniques such as mathematical models, statistics, operations research, and econometrics.

Decision making is a key part of our day-to-day life. Even when we wish to purchase a television, we like to know the price, quality, durability, and maintainability of various brands and models before buying one. As you can see, in this scenario we are collecting data and making an optimum decision; in other words, we are using Statistics. Again, suppose a company wishes to introduce a new product: it has to collect data on market potential, consumer preferences, availability of raw materials, and the feasibility of producing the product. Hence, data collection is the backbone of any decision making process.

Many organizations find themselves data-rich but poor at drawing information from that data. Therefore, it is important to develop the ability to extract meaningful information from raw data to make better decisions, and Statistics plays an important role in this. Statistics is broadly divided into two main categories: descriptive statistics and inferential statistics. Descriptive Statistics: Descriptive statistics are used to present a general description of the data, summarized quantitatively. This is especially useful in clinical research when communicating the results of experiments. Inferential Statistics: Inferential statistics are used to make valid inferences from the data, which are helpful for effective decision making by managers or professionals.

Statistical methods such as estimation, prediction and hypothesis testing belong to inferential statistics. The researchers make deductions or conclusions from the collected data samples regarding the characteristics of a large population from which the samples are taken.
So, we can say that ‘Statistics is the backbone of decision-making’.

Measures of Central Tendency


Several different measures of central tendency are defined below.

Arithmetic Mean: The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol 'm' is used for the mean of a population, and the symbol M is used for the mean of a sample. The formula for 'm' is m = ∑X / N, where ∑X is the sum of all the numbers in the sample and N is the number of numbers in the sample. As an example, the mean of the numbers 1, 2, 3, 6, 8 is (1 + 2 + 3 + 6 + 8)/5 = 20/5 = 4, regardless of whether the numbers constitute the entire population or just a sample from the population. Table 1 (Number of touchdown passes) shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season. The mean number of touchdown passes thrown is 20.4516. Although the arithmetic mean is not the only "mean" (there is also a geometric mean), it is by far the most commonly used.

Therefore, if the term "mean" is used without specifying whether it is the arithmetic mean, the geometric mean, or some other mean, it is assumed to refer to the arithmetic mean.
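
As a minimal sketch (plain Python, no external libraries), the mean computation above looks like this:

```python
# Arithmetic mean: the sum of the numbers divided by the number of numbers.
def arithmetic_mean(numbers):
    return sum(numbers) / len(numbers)

sample = [1, 2, 3, 6, 8]
print(arithmetic_mean(sample))  # 20/5 = 4.0
```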

Median: The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores is above the median as below it. For the data in Table 1 (Number of touchdown passes), there are 31 scores. The 16th highest score (which equals 20) is the median because there are 15 scores below the 16th score and 15 scores above it. The median can also be thought of as the 50th percentile. Let's return to the made-up example of the quiz on which you scored a three, discussed previously in the module Introduction to Central Tendency and shown in Table 2 (Three possible datasets for the 5-point make-up quiz). For Dataset 1, the median is three, the same as your score. For Dataset 2, the median is 4; therefore, your score is below the median, which means you are in the lower half of the class. Finally, for Dataset 3, the median is 2. For this dataset, your score is above the median and therefore in the upper half of the distribution. Computation of the Median: When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7)/2 = 5.5.
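
The two computation rules above can be sketched as follows (plain Python):

```python
# Median: the middle number for an odd count, or the mean of the
# two middle numbers for an even count.
def median(numbers):
    s = sorted(numbers)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([2, 4, 7]))      # 4
print(median([2, 4, 7, 12]))  # (4 + 7) / 2 = 5.5
```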

Mode: The mode is the most frequently occurring value. For the data in Table 1 (Number of touchdown passes), the mode is 18, since more teams (4) had 18 touchdown passes than any other number of touchdown passes. With continuous data, such as response time measured to many decimal places, the frequency of each value is one, since no two scores will be exactly the same (see the discussion of continuous variables). Therefore, the mode of continuous data is normally computed from a grouped frequency distribution. Table 3 (Grouped frequency distribution) shows a grouped frequency distribution for the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).
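
A sketch of the mode computation; the data below are made-up numbers in which 18 occurs most often, echoing the touchdown example:

```python
from collections import Counter

# Mode: the most frequently occurring value.
def mode(numbers):
    counts = Counter(numbers)
    return counts.most_common(1)[0][0]

# Hypothetical data: 18 appears four times, more than any other value.
data = [18, 20, 18, 22, 18, 20, 18, 33]
print(mode(data))  # 18
```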

Proportions and Percentages: When the focus is on the degree to which a population possesses a particular attribute, the measure of interest is a percentage or a proportion.
  • A proportion refers to the fraction of the total that possesses a certain attribute. For example, we might ask what proportion of the women in our sample weigh less than 135 pounds. Since 3 of the 5 women weigh less than 135 pounds, the proportion would be 3/5, or 0.60.
  • A percentage is another way of expressing a proportion. A percentage is equal to the proportion times 100. In our example of the five women, the percentage of the total who weigh less than 135 pounds would be 100 x (3/5), or 60 percent.
Notation: Of the various measures, the mean and the proportion are the most important. The notation used to describe these measures appears below:
  • X: Refers to a population mean.
  • x: Refers to a sample mean.
  • P: The proportion of elements in the population that have a particular attribute.
  • p: The proportion of elements in the sample that have a particular attribute.
  • Q: The proportion of elements in the population that do not have a specified attribute. Note that Q = 1 - P.
  • q: The proportion of elements in the sample that do not have a specified attribute. Note that q = 1 - p.
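
The proportion notation can be illustrated with a short sketch; the five weights below are hypothetical, chosen so that 3 of the 5 fall below 135 pounds as in the example:

```python
# Sample proportion (p), its complement (q), and the percentage,
# using hypothetical weights in pounds for five women.
weights = [120, 130, 128, 140, 150]
p = sum(1 for w in weights if w < 135) / len(weights)  # fraction with the attribute
q = 1 - p                                              # fraction without it
percent = 100 * p                                      # proportion expressed as a percentage
print(p, q, percent)  # 0.6 0.4 60.0
```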


Analysis of Variance

Analysis of variance (ANOVA) is a statistical technique that can be used to evaluate whether there are differences between the average value, or mean, across several population groups. With this model, the response variable is continuous in nature, whereas the predictor variables are categorical. For example, in a clinical trial of hypertensive patients, ANOVA methods could be used to compare the effectiveness of three different drugs in lowering blood pressure.

Alternatively, ANOVA could be used to determine whether infant birth weight is significantly different among mothers who smoked during pregnancy relative to those who did not. In the simplest case, where two population means are being compared, ANOVA is equivalent to the independent two-sample t-test.

The analysis of variance is the process of resolving the total variation into its separate components that measure different sources of variance. If we have to test the equality of means among more than two populations, analysis of variance is used.

To test the equality of two population means we use the t-test. But if we have more than two populations, the t-test would have to be applied pairwise to all the populations. This pairwise comparison is impractical and time consuming, so we use analysis of variance instead.
In analysis of variance, all the populations of interest must have a normal distribution. We assume that all the normal populations have equal variances, and the populations from which the samples are taken are considered independent.

There are three common designs for analysis of variance. The completely randomized design is used when one factor is involved. When two factors are involved, the randomized complete block design is used. The Latin square design is a very effective method for three factors. Analysis of variance is also used in finance in several different ways, such as forecasting the movements of security prices by first determining which factors influence stock fluctuations. This analysis can provide valuable insight into the behavior of a security or market index under various conditions.
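
To make the computation concrete, here is a minimal sketch of the one-way ANOVA F statistic in plain Python; the three drug groups are made-up numbers standing in for the hypertension example, not data from the text:

```python
# One-way ANOVA F statistic from first principles (illustrative data).
def f_statistic(groups):
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    # Between-group sum of squares: variation of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: variation of observations around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical blood-pressure reductions under three drugs
drug_a = [10, 12, 11, 13]
drug_b = [14, 15, 13, 16]
drug_c = [9, 8, 10, 9]
print(round(f_statistic([drug_a, drug_b, drug_c]), 2))  # 22.75
```

A large F value suggests the group means differ by more than within-group variation would explain; in practice the F value is compared against an F distribution to obtain a p-value.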

The easiest way to understand ANOVA is through a concept known as value splitting. ANOVA splits the observed data values into components that are attributable to the different levels of the factors. Value splitting is best explained by example:

The simplest example of value splitting is when we just have one level of one factor.
Suppose we have a turning operation in a machine shop where we are turning pins to a diameter of .125 +/- .005 inches. Throughout the course of a day we take five samples of pins and obtain the following measurements: .125, .127, .124, .126, .128. We can split these data values into a common value (mean) and residuals (what's left over) as follows:

Data:        .125    .127    .124    .126    .128
= Common:    .126    .126    .126    .126    .126
+ Residual: -.001    .001   -.002    .000    .002

From these tables, also called overlays, we can easily calculate the location and spread of the data as follows:
Mean = 0.126, Standard Deviation = 0.0016
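
The value-splitting computation for the pin data can be sketched in plain Python:

```python
# Split each observation into a common value (the mean) plus a residual.
data = [0.125, 0.127, 0.124, 0.126, 0.128]
mean = sum(data) / len(data)                    # common value, 0.126
residuals = [round(x - mean, 3) for x in data]  # what's left over
# Sample standard deviation (n - 1 in the denominator)
sd = (sum((x - mean) ** 2 for x in data) / (len(data) - 1)) ** 0.5
print(mean, residuals, round(sd, 4))
```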

1 May 2012

Qualitative & Quantitative Data


Data may come from a population or from a sample. Small letters like x or y generally are used to represent data values. Most data can be put into the following categories:

Qualitative Data - Qualitative data are the result of categorizing or describing attributes of a population. Hair color, blood type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. Qualitative data are generally described by words or letters.

For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Qualitative data are not as widely used as quantitative data because many numerical techniques do not apply to qualitative data. For example, it does not make sense to find an average hair color or blood type.

Quantitative data - Quantitative data are always numbers and are usually the data of choice because there are many methods available for analyzing the data. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and the number of students who take statistics are examples of quantitative data. Quantitative data may be either discrete or continuous. All data that are the result of counting are called quantitative discrete data. These data take on only certain numerical values. If you count the number of phone calls you receive for each day of the week, you might get 0, 1, 2, 3, etc.

Example 2: Data Sample of Quantitative Continuous Data. The data are the weights of the backpacks with the books in them. You sample the same five students. The weights (in pounds) of their backpacks are 6.2, 7, 6.8, 9.1, 4.3. Notice that backpacks carrying three books can have different weights. Weights are quantitative continuous data because weights are measured.

Example 3: Data Sample of Qualitative Data The data are the colors of backpacks. Again, you sample the same five students. One student has a red backpack, two students have black backpacks, one student has a green backpack, and one student has a gray backpack. The colors red, black, black, green, and gray are qualitative data.

Confidence Interval


A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. The confidence interval is the plus-or-minus figure usually reported in newspaper or television opinion poll results. For example, if you use a confidence interval of ±4 and 47 percent of your sample picks an answer, you can be "sure" that if you had asked the question of the entire relevant population, between 43% (47 - 4) and 51% (47 + 4) would have picked that answer.

In statistics, a confidence interval (CI) is a particular kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. It is an observed interval (i.e., it is calculated from the observations), in principle different from sample to sample, that frequently includes the parameter of interest if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level or confidence coefficient. A confidence interval with a particular confidence level is intended to give the assurance that, if the statistical model is correct, then taken over all the data that might have been obtained, the procedure for constructing the interval would deliver an interval that included the true value of the parameter the proportion of the time set by the confidence level. More specifically, the term "confidence level" means that if confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will approximately match the confidence level; this is guaranteed by the reasoning underlying the construction of confidence intervals.

Example: Suppose a student measuring the boiling temperature of a certain liquid observes the readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 for 6 different samples of the liquid. He calculates the sample mean to be 101.82. If he knows that the standard deviation for this procedure is 1.2 degrees, what is the confidence interval for the population mean at a 95% confidence level?

In other words, the student wishes to estimate the true mean boiling temperature of the liquid using the results of his measurements. If the measurements follow a normal distribution, then the sample mean will have the distribution N(μ, σ²/n). Since the sample size is 6, the standard deviation of the sample mean is equal to 1.2/sqrt(6) = 0.49. The 95% confidence interval is therefore 101.82 ± 1.96 × 0.49, or approximately (100.86, 102.78).
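
The arithmetic for this example can be sketched in plain Python, assuming the usual z value of 1.96 for 95% confidence:

```python
import math

# 95% confidence interval for the mean boiling temperature,
# with known sigma = 1.2 degrees Celsius.
readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
mean = sum(readings) / len(readings)   # about 101.82
se = 1.2 / math.sqrt(len(readings))    # standard error, about 0.49
margin = 1.96 * se                     # z = 1.96 for 95% confidence
print(round(mean - margin, 2), round(mean + margin, 2))  # 100.86 102.78
```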

Confidence level: The confidence level tells you how sure you can be. It is expressed as a percentage and represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level. The confidence level is a percentage of confidence in a finding.

For example, if an insurance company's total loss reserves should be $10,000,000 in order to attain an 80% confidence level that enough money will be available to pay anticipated claims, then 8 times out of 10, after all claims have been settled, the total claims paid out will be less than $10,000,000. Conversely, 2 times out of 10 the total claims paid out will be greater than $10,000,000. In another example, a 70% confidence level of one's house burning would mean that the house would burn approximately once every 3.33 years [1 / (1 - 0.70) = 3.33]. When you put the confidence level and the confidence interval together, you can say that you are 95% sure that the true percentage of the population is between 43% and 51%.

The confidence level is thus a statistical measure of the number of times out of 100 that test results can be expected to be within a specified range. For example, a confidence level of 95% means that the result of an action will probably meet expectations 95% of the time. Most analyses of variance or correlation are described in terms of some level of confidence. The wider the confidence interval you are willing to accept, the more certain you can be that the whole population's answers would be within that range. For example, if you asked a sample of 1000 people in a city which brand of cola they preferred, and 60% said Brand A, you can be very certain that between 40% and 80% of all the people in the city actually do prefer that brand, but you cannot be so sure that between 59% and 61% of the people in the city prefer the brand.
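
As a sketch of why the wide interval is a near certainty while the narrow one is not, the usual normal-approximation margin of error for the cola poll (n = 1000, sample proportion 0.60) works out to about three percentage points:

```python
import math

# 95% margin of error for a sample proportion (normal approximation).
n, p = 1000, 0.60
se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
margin95 = 1.96 * se             # about 0.03, i.e. roughly 3 points
print(round(margin95 * 100, 1))  # 3.0
```

So a 95% interval of roughly 57% to 63% is plausible, which is why 40-80% is essentially certain to contain the true value while 59-61% is too narrow to warrant high confidence.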