Elements of Statistics
Tutor Marked Assignment (TMA)
Course Code: ECO-07
Assignment Code: ECO-07/TMA/2012-13
Total Marks: 100
Attempt all the questions

1. “All statistics are numerical statements of facts but all numerical statements of facts are not statistics”. Comment. (20)

Solution: According to this definition, numerical facts (data) should possess the following characteristics to be treated as statistics.

(i) Aggregate of facts: Single, isolated or unrelated figures are not statistics, because they are not comparable. Such figures tell nothing about any problem. For example, the age of a single student or the price of a single commodity is not statistics, because these are just isolated numbers. But when we consider the ages of a group of students, or the prices of a basket of commodities, they are statistics as they are comparable. Statistics must be expressed as an aggregate of facts relating to a particular enquiry. Thus it is not a datum but the data that represent statistics.

(ii) Affected by multiplicity of causes: Numerical facts should be affected by a number of factors to become statistics. These may include both normal and exceptional factors. For example, the yield of rice depends on a number of factors like the rainfall, fertility of the soil, method of cultivation, quality of seeds used, etc. Some of these factors are normal and some are exceptional. Hence the data relating to the yield of rice over a period of time become statistics. On the other hand, if we write the numerals 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10, they are not statistics, because they are not affected by any factors.

(iii) Numerically expressed: Statistics are quantitative phenomena. Statistical techniques deal mostly with quantitative factors rather than with qualitative aspects. So statistics should always be numerically expressed. For example, ‘there are 30 districts in Orissa’ is a numerical statement. But ‘the standard of living of the people of Orissa has improved over the years’ is not a numerical statement. Here the first statement is statistical whereas the second is not. So subjective statements relating to qualitative information like honesty, beauty, etc. are not statistics. Only statements which can be expressed numerically are statistics.

(iv) Enumerated accurately: In an enquiry, statistics (data) should be collected with a reasonable standard of accuracy, since this affects the findings of the enquiry. The degree of accuracy of statistics depends on the nature and purpose of the enquiry. Generally data are collected in two ways: by enumerating all the units of the population (complete enumeration method), or by enumerating some units (sampling method) and generalizing the result to the whole group. No doubt the first method involves more time and cost but provides more accurate information than the second. Depending on the nature of the enquiry and the degree of accuracy desired, only one of the above two methods is employed. But the collected statistics should be as accurate as possible.

(v) Collected in a systematic manner: Information (data) constitutes the basis of any statistical enquiry. It should be collected in a scientific and systematic manner. For this, the purpose of the enquiry must be decided in advance. The purpose should be specific and well defined. The information should be collected by trained, skilled and unbiased investigators. Otherwise irrelevant and unnecessary information may be collected and the very purpose of statistics is defeated.

(vi) Collected for a predetermined purpose : Statistics relating to an enquiry are always collected with a predetermined purpose. So it is essential to define clearly the purpose or the objective of the enquiry before actually collecting data. This ensures the inclusion of all essential information and the exclusion of all irrelevant and confusing data. This will make the analysis specific and result oriented.

(vii) Placed in relation to each other: Statistics should be comparable. They may be compared with respect to time of occurrence or place of collection. This requires that the data be homogeneous and placed in relation to each other, because heterogeneous data are not comparable. For example, data relating to the production of rice and the number of students taking admission in a class are not statistics, because they are not comparable. On the other hand, the food grain production of a state for the last ten years constitutes statistics as the figures are comparable. So statistical data should express some phenomenon. In other words, “All statistics are numerical statements of facts but all numerical statements of facts are not statistics”.

=====================================================================

2. (a) What is statistical table? How is it constructed? Discuss the requisites of a good statistical table.

Solution: These tables are prepared twice each year, with one volume reporting data for the 12-month period ending June 30, and the other volume reporting data for the calendar year ending December 31. Detailed statistical tables address the work of the U.S. courts of appeals, district courts and bankruptcy courts, as well as the federal probation and pretrial services system.

The Judicial Caseload Indicators table compares data for the current 12-month period to that for the same period 1, 5, and 10 years earlier.

Constructed: In general, a statistical table consists of the following parts:

(i) Table Number: Each table must be given a number. The table number helps in distinguishing one table from other tables. Usually tables are numbered according to the order of their appearance in a chapter. For example, the first table in the first chapter of a book should be given number 1.1 and the second table of the same chapter 1.2. The table number should be given at the top or towards the left of the table.

(ii) Title of the Table: Every table should have a suitable title. It should be short & clear. Title should be such that one can know the nature of the data contained in the table as well as where and when such data were collected. It is either placed just below the table number or at its right.

(iii) Caption: Caption refers to the headings of the columns. It consists of one or more column heads. A caption should be brief, concise and self-explanatory. The column heading is written in the middle of a column in small letters.

(iv) Stub: Stub refers to the headings of rows.

(v) Body: This is the most important part of a table. It contains a number of cells. Cells are formed by the intersection of rows and columns. Data are entered in these cells.

(vi) Head Note: The head-note (or prefatory note) contains the unit of measurement of data. It is usually placed just below the title or at the right hand top corner of the table.

(vii) Foot Note: A foot note is given at the bottom of a table. It helps in clarifying a point which is not clear in the table. A foot note may be keyed to the title or to any column or row heading. It is identified by symbols such as *, +, @, £, etc.

Requisites of Good Statistical Table: You have studied the parts of a statistical table. Now let us discuss the features of a statistical table. There are certain general guidelines in preparing a good statistical table. They are as follows:

1) A good table must present the data in a clear and simple manner.

2) It should have a brief and clear title. The title should be self-explanatory and should represent the description of the contents of the table.

3) The stub, stub entries, captions and caption heads should be brief and clear. The columns may be numbered to facilitate easy reference in the text.

4) The headnote should be precise and complete as it relates to the unit of the data.

5) The totals and sub-totals should be given at the appropriate places.

6) The references should be clearly stated so that the reliability of the data could be verified if needed.

7) If necessary, the derived data (ratios, percentages, averages, etc.) may also be incorporated in the tables.

8) As far as possible, abbreviations should be avoided in a statistical table. If it is essential to use abbreviations, their meaning must be explained in footnotes.

9) Wherever necessary, proper ruling should be provided in a table. Normally, the columns are separated from one another by lines. These lines make the table more readable and attractive, and also show the relations of the data more clearly. Lines are always drawn at the top and bottom of the table, and also below the captions.

10) Use of ditto mark should be avoided.

11) Columns and rows which are to be compared with one another should be placed side by side.

12) If it is necessary to emphasise the relative significance of certain categories, different kinds of type spacing and indentation should be used.

13) All the column figures should be properly aligned. Decimal points and plus-minus signs also should be in perfect alignment.

14) Generally not more than four to five characteristics may be shown at a time in a table, otherwise it will become too complex.

======================================================================

(b) What is skewness? Explain the various methods of measuring skewness. Solution: In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. The skewness value can be positive or negative, or even undefined. Qualitatively, a negative skew indicates that the tail on the left side of the probability density function is longer than the right side and the bulk of the values (possibly including the median) lie to the right of the mean. A positive skew indicates that the tail on the right side is longer than the left side and the bulk of the values lie to the left of the mean.

A zero value indicates that the values are relatively evenly distributed on both sides of the mean, typically (but not necessarily) implying a symmetric distribution. Some distributions of data, such as the bell curve are symmetric. This means that the right and the left are perfect mirror images of one another. But not every distribution of data is symmetric. Sets of data that are not symmetric are said to be asymmetric. The measure of how asymmetric a distribution can be is called skewness. As we will see, data can be skewed either to the right or to the left. The mean, median and mode are all measures of the center of a set of data. The skewness of the data can be determined by how these quantities are related to one another.

Measures of Skewness: It’s one thing to look at two sets of data and determine that one is symmetric while the other is asymmetric. It’s another to look at two sets of asymmetric data and say that one is more skewed than the other. It can be very subjective to determine which is more skewed by simply looking at the graph of the distribution. This is why there are ways to numerically calculate the measure of skewness. One measure of skewness, called Pearson’s first coefficient of skewness, is to subtract the mode from the mean, and then divide this difference by the standard deviation of the data. The reason for dividing the difference is so that we have a dimensionless quantity. This explains why data skewed to the right has positive skewness. If the data set is skewed to the right, the mean is greater than the mode, and so subtracting the mode from the mean gives a positive number. A similar argument explains why data skewed to the left has negative skewness.

Pearson’s second coefficient of skewness is also used to measure the asymmetry of a data set. For this quantity we subtract the median from the mean, multiply this number by three and then divide by the standard deviation.

1. Introduction: In an introductory level statistics course, instructors spend the first part of the course teaching students three important characteristics used when summarizing a data set: center, variability, and shape. The instructor typically begins by introducing visual tools to get a “picture” of the data.

2. Visual Displays: A textbook discussion would typically begin by showing the relative positions of the mean, median, and mode in smooth population probability density functions. The explanation will mainly refer to the positions of the mean and median. There may be comments about tail length and the role of extreme values in pulling the mean up or down. The mode usually gets scant mention, except as the “high point” in the distribution.

3. Skewness Statistics: Since Karl Pearson (1895), statisticians have studied the properties of various statistics of skewness, and have discussed their utility and limitations. This research stream covers more than a century. For an overview, see Arnold and Groeneveld (1995), Groeneveld and Meeden (1984), and Rayner, Best and Matthews (1995). Empirical studies have examined bias, mean squared error, Type I error, and power for samples of various sizes drawn from various populations.

4. Type I Error Simulation: We obtained preliminary critical values for Sk2 using Monte Carlo simulation with Minitab 16. We drew 20,000 samples of N(0,1) for n = 10 to 100 in increments of 10 and computed the sample mean (x̄), sample median (m), and sample standard deviation (s). For each sample size, we calculated Sk2 and its percentiles. Upper and lower percentiles should be the same except for sign, so we averaged their absolute values (effectively 40,000 samples).

5. Type II Error Simulation: Type II error in this context occurs when a sample from a non-normal skewed distribution does not lead to rejection of the hypothesis of a symmetric normal distribution. There are an infinite number of distributions that could be explored, including “real world” mixtures that do not resemble any single theoretical model. Just to get some idea of the comparative power of Sk2 and G1, we will illustrate using samples from two non-normal, unimodal distributions.

6. Summary and Conclusions: Visual displays (e.g., histograms) provide easily understood impressions of skewness, as do comparisons of the sample mean and median. However, students tend to take too literal a view of these comparisons, without considering the effects of binning or the role of sample size.
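The Type I error simulation in item 4 above can be imitated on a small scale. The following is only a rough sketch, not the original Minitab procedure: it assumes Sk2 means Pearson's second coefficient, 3(mean − median)/s, uses 2,000 samples per sample size instead of 20,000, and takes a percentile of |Sk2| directly rather than averaging upper and lower percentiles. The function names and the seed are illustrative choices.

```python
import random
import statistics


def sk2(sample):
    """Pearson's second coefficient of skewness: 3 * (mean - median) / s."""
    mean = statistics.fmean(sample)
    median = statistics.median(sample)
    return 3 * (mean - median) / statistics.stdev(sample)


def critical_value(n, reps, rng, percentile=95):
    """Percentile of |Sk2| over `reps` samples of size n drawn from N(0, 1)."""
    abs_values = [abs(sk2([rng.gauss(0, 1) for _ in range(n)])) for _ in range(reps)]
    cut_points = statistics.quantiles(abs_values, n=100)  # 99 cut points
    return cut_points[percentile - 1]


rng = random.Random(2024)  # fixed seed (arbitrary) so the run is repeatable
critical = {n: critical_value(n, reps=2000, rng=rng) for n in (10, 50, 100)}

for n, value in critical.items():
    print(f"n = {n:3d}: |Sk2| critical value is about {value:.3f}")
```

The critical values shrink as n grows, which is why the same observed Sk2 can be unremarkable in a small sample but significant in a large one.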

=======================================================================

3. Differentiate between the following:

(a) Primary data and Secondary data:

Solution: The data which is collected for the first time for your own use is known as primary data. The source is primary if the data is collected for the first time by you as original data. On the other hand, if you are using data which has been collected, classified and analysed by someone else, then such data is known as secondary data. The sources of secondary data are called secondary sources. For instance, national income data collected by the Government in a country is primary data for that Government. But the same data becomes secondary for those research workers who use it later. We may, thus, state that primary data is in the shape of raw materials to which statistical methods are applied for analysis.

At the same time secondary data is in the shape of finished products since it has already been treated in some form or the other by statistical methods. In case you have decided to collect primary data for your survey, you have to identify the sources from which you can collect that data. Big enquiries like population census involve very large number of persons to be surveyed but in case of small enquiries like cost of living of industrial workers in a city, the persons to be surveyed may be few.

If you have decided to use secondary data, it is necessary for you to edit and scrutinise such data. Otherwise it may not have the desired level of accuracy or it may not be suitable or adequate for your purpose. If you do not edit and scrutinise the secondary data before you use it in your survey, the results of your investigation may not be fully correct. Therefore, secondary data should always be used with great caution. Bowley writes: “It is never safe to take published statistics at their face value without knowing their meaning and limitations.”

=======================================================================

(b) Sampling and Non-Sampling errors:

Solution: Sampling Errors: The errors caused by drawing inferences about the population on the basis of samples are termed sampling errors. The sampling errors result from the bias in the selection of sample units. These errors occur because the study is based on a portion of the population. If the whole population is taken, sampling error can be eliminated. If two or more samples are taken from a population by the random sampling method, their results need not be identical, and the results of each of them may differ from the result for the population. This is because the items selected in the two samples will not be identical.
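A small simulation can make this concrete. The sketch below assumes a hypothetical population of 1,000 ages and uses the mean as the "result"; the population, the sample size of 50 and the seed are all invented for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable

# Hypothetical population: ages of 1,000 people.
population = [rng.randint(18, 60) for _ in range(1000)]
population_mean = statistics.mean(population)


def sampling_error(sample):
    """Difference between the sample mean and the population mean."""
    return statistics.mean(sample) - population_mean


sample_a = rng.sample(population, 50)
sample_b = rng.sample(population, 50)
error_a = sampling_error(sample_a)
error_b = sampling_error(sample_b)
census_error = sampling_error(population)  # complete enumeration

print(f"sample A error: {error_a:+.2f}")
print(f"sample B error: {error_b:+.2f}")
print(f"complete enumeration error: {census_error:+.2f}")
```

The two samples will generally give different errors, while taking the whole population leaves no sampling error at all.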

Thus, sampling error means precisely the difference between the sample result and that of the population when both results are obtained by using the same procedure or method of calculation. The exact amount of sampling error will differ from sample to sample. Sampling errors are inevitable even if utmost care is taken in selecting the sample. However, it is possible to minimise the sampling errors by designing the survey appropriately. Sampling errors are of two types: (i) biased sampling errors, and (ii) unbiased sampling errors.

Non-sampling Errors:

These non-sampling errors can occur in any survey, whether it be a complete enumeration or a sample survey. Non-sampling errors include biases as well as mistakes. These are not chance errors. Most of the factors causing bias in complete enumeration are similar to those described above under sampling errors. They also include careless definition of the population, a vague conception regarding the information sought, an inefficient method of interview and so on. Mistakes arise as a result of improper coding, computations and processing. More specifically, non-sampling errors may arise because of one or more of the following reasons:

i) Improper and ambiguous data specifications which are not consistent with the census or survey objectives.

ii) Inappropriate sampling methods, incomplete questionnaire and incorrect way of interviewing.

iii) Personal bias of the investigators or informants.

iv) Lack of trained and qualified investigators.

v) Errors in compilation and tabulation.

This list is not exhaustive, but it indicates some of the main possible reasons.

====================================================================

(c) Dispersion and Skewness:

Solution: Dispersion: In statistics, dispersion is the variation of a random variable or its probability distribution. It is a measure of how far the data points lie from the central value. To express this quantitatively, measures of dispersion are used in descriptive statistics. Variance, standard deviation, and inter-quartile range are the most commonly used measures of dispersion. If the data values have a certain unit, most measures of dispersion carry the same unit; the variance is expressed in the square of that unit.

Interdecile range, range, mean difference, median absolute deviation, average absolute deviation, and distance standard deviation are measures of dispersion with the same units as the data. In contrast, there are measures of dispersion which have no units, i.e. are dimensionless. Coefficient of variation, quartile coefficient of dispersion, and relative mean difference are measures of dispersion with no units. Dispersion in a system can originate from errors, such as instrumental and observational errors. Also, random variations in the sample itself can cause variations. It is important to have a quantitative idea about the variation in data before drawing other conclusions from the data set.

Skewness: In statistics, skewness is a measure of the asymmetry of a probability distribution. Skewness can be positive or negative, or in some cases undefined. It can also be considered as a measure of departure from a symmetric distribution such as the normal distribution. If the skewness is positive, then the bulk of the data points is centred to the left of the curve and the right tail is longer. If the skewness is negative, the bulk of the data points is centred towards the right of the curve and the left tail is longer. If the skewness is zero, the values are evenly balanced around the mean, as in a symmetric distribution such as the normal distribution; zero skewness by itself does not prove that the population is normally distributed.

In a symmetric distribution such as the normal distribution, the mean, median, and mode have the same value. If the skewness is not zero, this property does not hold, and the mean, mode, and median may have different values. Pearson’s first and second coefficients of skewness are commonly used for determining the skewness of distributions.

Pearson’s first skewness coefficient = (mean – mode) / (standard deviation)

Pearson’s second skewness coefficient = 3(mean – median) / (standard deviation)

In more sensitive cases, the adjusted Fisher-Pearson standardized moment coefficient is used:

$G = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{y_i - \bar{y}}{s} \right)^{3}$
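As an illustration, the three coefficients can be computed with Python's standard library. This is a minimal sketch under stated assumptions: Pearson's second coefficient is taken in its standard form 3(mean − median)/s, s is the sample standard deviation (n − 1 divisor), and both data sets are invented for the example.

```python
import statistics


def pearson_first(data):
    """(mean - mode) / standard deviation; needs a single clear mode."""
    return (statistics.mean(data) - statistics.mode(data)) / statistics.stdev(data)


def pearson_second(data):
    """3 * (mean - median) / standard deviation."""
    return 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)


def adjusted_fisher_pearson(data):
    """G = n / ((n - 1)(n - 2)) * sum(((y - mean) / s) ** 3)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (assumption)
    return n / ((n - 1) * (n - 2)) * sum(((y - mean) / s) ** 3 for y in data)


right_skewed = [1, 2, 2, 3, 10]  # hypothetical data with a long right tail
symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric data

print(pearson_first(right_skewed))
print(pearson_second(right_skewed))
print(adjusted_fisher_pearson(right_skewed))
print(pearson_second(symmetric), adjusted_fisher_pearson(symmetric))
```

All three values are positive for the right-skewed data, and the last two are zero (up to rounding) for the symmetric data.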

===================================================================

(d) Geometric mean and Harmonic mean:

Solution: Geometric mean is a kind of average of a set of numbers that is different from the arithmetic average. The geometric mean is well defined only for sets of positive real numbers. It is calculated by multiplying all the numbers (call the number of numbers n), and taking the nth root of the product. A common example of where the geometric mean is the correct choice is when averaging growth rates.

Formula:

$\text{Geometric Mean} = (X_1 \cdot X_2 \cdot X_3 \cdots X_N)^{1/N}$

where

X = Individual score

N = Sample size (Number of scores)
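The formula can be tried on growth rates, the use case mentioned above. In this minimal sketch the three yearly growth factors are hypothetical, and the product is computed through logarithms, which is equivalent to taking the Nth root of the product but avoids overflow for long series.

```python
import math


def geometric_mean(values):
    """Nth root of the product of N positive numbers, computed via logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical growth factors: +10%, +20% and -10% in three successive years.
growth_factors = [1.10, 1.20, 0.90]
avg_factor = geometric_mean(growth_factors)

print(f"average yearly growth factor: {avg_factor:.4f}")
# Applying the average factor three times reproduces the total growth.
print(f"check: {avg_factor ** 3:.4f} vs {1.10 * 1.20 * 0.90:.4f}")
```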

Harmonic Mean: Probably the least understood, the harmonic mean is best used in situations where extreme high-value outliers exist in the population. The harmonic mean can be calculated manually as the number of values divided by the sum of their reciprocals; however, most people will find it much easier to just use Excel. In Excel, the harmonic mean can be calculated by using the HARMEAN() function. There are plenty of online resources (see Wikipedia) that cover the mathematical derivation of the harmonic mean; we are going to focus on when one should use it. If the population (or sample) has a few data points that are much higher than the rest (outliers), the harmonic mean can be a more representative average to use. Unlike the arithmetic mean, the harmonic mean gives less weight to high-value outliers, providing a truer picture of the typical value.
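The effect of a high-value outlier can be seen in a short sketch. The data below are hypothetical, and the function is a plain implementation of the usual definition (number of values divided by the sum of their reciprocals), not Excel's HARMEAN itself.

```python
def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs positive values")
    return len(values) / sum(1 / v for v in values)


# Hypothetical data: four similar values and one very large outlier.
values = [20, 25, 30, 35, 1000]
arithmetic = sum(values) / len(values)
harmonic = harmonic_mean(values)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"harmonic mean:   {harmonic:.2f}")
```

The arithmetic mean is pulled far above the four typical values by the outlier, while the harmonic mean stays close to them.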

===================================================================================

4. Find median, Q3, D7, P75 from the following distribution.

Class-interval| 0-10| 10-20| 20-30| 30-40| 40-50|

Frequency| 4| 7| 10| 6| 3|

Solution:

Class- interval| Frequency| Cumulative frequency|

0-10| 4| 4|

10-20| 7| 11|

20-30| 10| 21|

30-40| 6| 27|

40-50| 3| 30|

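Median has N/2 items below it, which means 30/2 = 15 items below it. Therefore, the median lies in the 20-30 class. Now applying the formula of interpolation:

$Md = l + \frac{N/2 - c}{f} \times i$

where l = 20 (lower limit of the median class), c = 11 (cumulative frequency of the preceding class), f = 10 (frequency of the median class), i = 10 (class width), N = 30

Md = 20 + ((15 - 11) / 10) x 10

= 20 + (4/10) x 10

= 20 + 4

= 24

MEDIAN = 24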

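D7 has 7N/10 items below it, which means 7 x 30/10 = 21 items below it. The cumulative frequency reaches 21 in the 20-30 class, so D7 lies in the 20-30 class:

$D_7 = l + \frac{7N/10 - c}{f} \times i$

where l = 20, c = 11, f = 10, i = 10

D7 = 20 + ((21 - 11) / 10) x 10

= 20 + (10/10) x 10

= 20 + 10

= 30

D7 = 30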

P75 has 75N/100 items below it, which means 75 x 30/100 = 22.5 items below it. So P75 lies in the 30-40 class:

$P_{75} = l + \frac{75N/100 - c}{f} \times i$

where l = 30, c = 21, f = 6, i = 10

P75 = 30 + ((22.5 - 21) / 6) x 10

= 30 + (1.5/6) x 10

= 30 + 2.5

= 32.5

P75 = 32.5

Q3 has 3N/4 items below it, which means 3 x 30/4 = 22.5 items below it. P75 also has 22.5 items below it. So Q3 must be the same as P75.

Q3 = 32.5
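The four results can be cross-checked with a short Python sketch of the interpolation formula l + ((kN/parts − c) / f) × i. The function name `grouped_quantile` and its argument layout are illustrative choices, not part of the assignment.

```python
def grouped_quantile(classes, k, parts):
    """k-th of `parts` quantile for grouped data.

    classes: list of (lower_limit, upper_limit, frequency) in ascending order.
    """
    total = sum(freq for _, _, freq in classes)
    target = k * total / parts  # number of items below the quantile
    cumulative = 0              # cumulative frequency of preceding classes (c)
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("quantile position lies outside the distribution")


distribution = [(0, 10, 4), (10, 20, 7), (20, 30, 10), (30, 40, 6), (40, 50, 3)]

median = grouped_quantile(distribution, 1, 2)
q3 = grouped_quantile(distribution, 3, 4)
d7 = grouped_quantile(distribution, 7, 10)
p75 = grouped_quantile(distribution, 75, 100)

print(median, q3, d7, p75)
```

The values agree with the hand calculation: median 24, Q3 32.5, D7 30 and P75 32.5.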

========================================================================

5. Write short notes on the following:

(a) Probability Sampling:

Solution: Probability sampling Methods:

In the case of the probability sampling method, each and every item in the population has a known, non-zero probability or chance of being included in the sample. In the simplest case, every member of the population has an equal chance of selection into the sample. Under probability sampling, there are various methods such as:

1 Simple random sampling

2 Systematic sampling

3 Stratified sampling

4 Cluster sampling

5 Area sampling

6 Multi-stage sampling

1 Simple Random Sampling: This method is also known as chance or lottery sampling method. In this case each and every item in the population has an equal chance of inclusion in the sample and each one of the possible samples has the same probability of being selected. This is the most common method used when the population is a homogeneous group. To identify the sample unit, normally, random numbers are used.

2 Systematic Sampling: Under this method, the population is arranged in alphabetical, serial or some other order. Then the sample units appearing at fixed intervals are selected. Thus, you may select every 14th name on a list, every 10th house on the side of a street and so on. An element of randomness is introduced into this method of sampling by using random numbers to pick the first unit with which to start. Thus, in this method, the selection process starts by picking some random point in the list of the population, and units are then selected at the fixed interval until the desired number is secured.

3 Stratified Sampling: This method is generally used when the population is not a homogeneous group. Under this method, the population is divided into a number of homogeneous sub-populations or strata. While doing this, care should be taken to avoid overlapping. After stratification, the sample items are randomly selected from each stratum either on a proportionate or an equal basis.

4 Cluster Sampling: This method involves grouping the population into heterogeneous groups called ‘clusters’ and then selecting a few of such groups (or the clusters) by simple random sampling method. All the items in the selected clusters are studied for accomplishing the survey work.

5 Area Sampling: This method is very close to cluster sampling. It is generally followed when the total geographical area to be covered under the survey is spread very widely. In this sampling method, the geographical area is first divided into a number of smaller areas and then a suitable number of these smaller areas are randomly selected. All units of these selected small areas are then studied and examined for accomplishing the survey work.

6 Multi-stage Sampling: This method is suitable for big surveys extending over a considerably large geographical area or where the population is heterogeneous. For instance, in a survey you want to select some families from all over the country. Under this multi-stage sampling method, the first stage may be to randomly select a few states. At the next stage, from each sample state, you can randomly select a few districts. Then at the third stage you can select a few towns from each of the selected districts. Finally, certain families may be randomly selected within the selected towns. Thus, in this method selection is done at four stages to constitute the final sample. It may be noted that in multi-stage sampling, each and every item of the population has a chance of being selected but this chance need not be the same for all items.
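Three of the methods above (simple random, systematic and proportionate stratified sampling) can be sketched in Python. This is an illustrative sketch only: the population of 100 numbered units, the urban/rural strata and the function names are all hypothetical.

```python
import random


def simple_random_sample(population, n, rng):
    """Every unit has an equal chance of selection (lottery method)."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Random start, then every k-th unit of the ordered list."""
    k = len(population) // n  # sampling interval
    start = rng.randrange(k)  # random starting point within the first interval
    return [population[start + i * k] for i in range(n)]


def proportionate_stratified_sample(strata, fraction, rng):
    """Random selection within each stratum, in proportion to its size."""
    sample = []
    for units in strata.values():
        size = round(len(units) * fraction)
        sample.extend(rng.sample(units, size))
    return sample


rng = random.Random(7)  # fixed seed (arbitrary) so the run is repeatable
population = list(range(1, 101))  # hypothetical units numbered 1 to 100
strata = {"urban": population[:60], "rural": population[60:]}  # hypothetical strata

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = proportionate_stratified_sample(strata, 0.10, rng)

print(sorted(srs))
print(systematic)
print(sorted(stratified))
```

In the stratified sample, 6 units come from the urban stratum and 4 from the rural stratum, matching their 60:40 share of the population.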

=================================================================

(b) Statistical Derivatives:

Solution: When one or more numbers are being compared with another number, the figure which is taken as the standard for comparison is known as the base. Which type of base should be chosen depends upon the situation. Any derivative by itself is generally not meaningful for the analysis of a given problem. For instance, it is stated that a company earned 18% return on its investment during the current year.

What does this signify? You may ask whether or not this is a high rate of return. Any meaningful use of derivatives requires comparison with some standard yardstick so that their significance can be evaluated. The return of 18% can be compared either with last year’s return or with another competing firm’s return on investment, if they are comparable. While the derivatives are used to compare different groups, it is a common practice to reduce them to a common denominator and thereby the comparisons are made simple and more meaningful. Suppose, two business firms were started with a capital of Rs. 50,000 and Rs. 1,20,000 respectively.

At the end of the year, the first business firm made a profit of Rs. 20,000 and the second business firm earned a profit of Rs. 40,000. It apparently shows that the second business has made double the profit of the first business. But by reducing them to a common denominator of 100, it can be seen that the first business has made a profit of 40% of the capital and the second business firm made a profit of about 33% of the capital. The impression which you gather by looking at the absolute numbers is reversed now.
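The arithmetic behind this comparison is simple enough to check in a few lines of Python. The figures are the ones given in the example; the function name is only illustrative.

```python
def profit_percent(profit, capital):
    """Profit expressed on a common base of 100 units of capital."""
    return 100 * profit / capital


first_firm = profit_percent(20_000, 50_000)     # capital Rs. 50,000
second_firm = profit_percent(40_000, 120_000)   # capital Rs. 1,20,000

print(f"first firm:  {first_firm:.1f}% of capital")
print(f"second firm: {second_firm:.1f}% of capital")
```

Although the second firm's absolute profit is double, its profit per 100 rupees of capital is lower.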

Thus, profit as a percentage of capital is really more meaningful. The derivatives are also useful in estimating an unknown quantity. For instance, suppose the birth rate in a particular region is known and can be assumed to be fairly constant over a period of time. If you know the total number of births at a specific point of time, you can estimate the population at that point of time. Thus, the derivatives are useful in the estimation of unknown quantities, over and above simplifying the data and increasing their comparability.

========================================================================

(c) Diagrams:

Solution: The following guidelines should be kept in mind while preparing diagrams:

1) A diagram is to be prepared on the graphic axes: the ‘X’ axis and the ‘Y’ axis. However, it is not necessary to use graph paper. While taking scales on these two axes, it must be ensured that the data is being presented in a meaningful manner. The scale on the two axes should be clearly set up.

2) Whenever the data are to be presented on the ‘Y’ axis (vertical scale), the scale should start from zero. Generally, the vertical scale is not broken.

3) A diagram must always have a concise and self-explanatory title.

4) Colours and shades should be used to exhibit various components of a diagram and a key should be provided.

5) To make the diagram attractive, leave reasonable margin on all sides of the diagram. The diagram should not be too small or too big.

6) If a number of diagrams are to be prepared, it is desirable to number them for the purpose of reference.

TYPES OF DIAGRAMS: Diagrams are generally classified on the basis of length, breadth and height. Broadly, diagrams are classified as: 1) one dimensional diagram, 2) two dimensional diagram, and 3) three dimensional diagram. Besides these diagrams, the data can also be presented in the form of maps and pictographs.

========================================================================

(d) An Ideal Average: Solution: An ideal average should possess the following characteristics:

1) Easy to understand and simple to compute: It should be easy to make out an average and its computation should also be simple.

2) Rigidly defined: An average should be rigidly defined by a mathematical formula so that the same answer is derived by different persons who try to compute it. It should not depend on the personal prejudice or bias of the person computing it.

3) Based on all items in the data: For calculating an average, each and every item of the data set should be included. Not a single item should be dropped, otherwise the value of the average may change.

4) Not to be unduly affected by extreme items: A single extreme value, i.e. a maximum value or a minimum value, can unduly affect the average. A too small item can reduce the value of an average, and a too big item can inflate its value to a large extent. If the average changes with the inclusion or exclusion of an extreme item, then it is not a truly representative value of the data set.

5) Capable of further algebraic treatment: An average should be amenable to further algebraic treatment. That should add to its utility. For example, if we are given the averages of three data sets of similar type, it should be possible to obtain the combined average of all those three data sets.

6) Sampling stability: The average should have ‘sampling stability’. This means that if we take different samples from the aggregate, the average of any sample should turn out to be approximately the same as those of other samples.
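Characteristic 5 (further algebraic treatment) can be illustrated with the arithmetic mean, whose combined value follows from the group sizes and group means alone. The three groups below are hypothetical and the function name is an illustrative choice.

```python
def combined_mean(groups):
    """Combined arithmetic mean from (size, mean) pairs of separate data sets."""
    total_size = sum(size for size, _ in groups)
    return sum(size * mean for size, mean in groups) / total_size


# Hypothetical data sets: (number of items, arithmetic mean).
groups = [(10, 50.0), (20, 60.0), (30, 70.0)]
overall = combined_mean(groups)

print(f"combined mean of all 60 items: {overall:.2f}")
```

No such general shortcut exists for the median or the mode, which is one reason the arithmetic mean scores well on this criterion.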