Plan

In this investigation I will compare the readability of two articles from two different types of media, for example one article from a gossip magazine and another from a political newspaper. Alternatively, I could take two articles from the same type of media, such as two newspapers, that are aimed at different types of people. To compare the readability I could take a set number of words from each article, for example 100 words from a 300 word article and 100 words from a 500 word article.

However, taking a set number of words would be biased towards the shorter article, as a larger percentage of its words would be taken. There would also be no clear indication of which 100 words I should take from each article, whether the first hundred words or words taken randomly from the middle, and this would make the data inaccurate and biased. In view of this I will take a stratified sample of a set percentage from both of my articles. This means that if I were taking a 10% stratified sample from each article, then 30 words would be taken from the 300 word article and 50 words from the 500 word article. The size of the data set taken from each article will therefore be different, but this won’t affect the calculations, because the averages I am comparing are measured per word.
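As a rough sketch, the sample sizes for a fixed percentage could be worked out as below. The function name `stratified_sample_size` is hypothetical, and rounding to the nearest whole word is an assumption.

```python
def stratified_sample_size(total_words: int, percent: float) -> int:
    """Number of words to sample when taking the same percentage of an article.

    Rounding to the nearest whole word is an assumption of this sketch.
    """
    return round(total_words * percent / 100)


# The worked example from the plan: 10% of each article
print(stratified_sample_size(300, 10))  # 30
print(stratified_sample_size(500, 10))  # 50
```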

Once I have collected the set number of words from each article I will count the number of letters in each word and group the results in a table. I will then find the mode, median and mean of both sets of data and use these to compare the readability of the two articles. I will also plot the data on suitable graphs for ungrouped data and carry out further calculations, such as the standard deviation and interquartile range, to support my investigation. Before I start my main investigation I will first conduct a pilot to identify any problems that may occur in the main investigation.

Pilot

In my pilot I will use an article from ‘The Independent’ called ‘Teen music project wins Philip Lawrence award’. The pilot will serve as a test run to point out any problems that may show up later in the investigation and need to be resolved before the main investigation starts. The article has 639 words. Because this is only a pilot, the sample does not need to represent the article, so I want only a small number of words. I will take only 20 words from this article, which is roughly 3% of the entire article; this means I will be counting every 32nd word. I will count the number of letters in each word and group them into a chart.
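The arithmetic behind the pilot sample can be checked with a short sketch. The function name `sampling_plan` is hypothetical, and rounding the interval to the nearest whole number is an assumption.

```python
def sampling_plan(article_words: int, sample_words: int):
    """Return the percentage of the article sampled and the counting interval."""
    percentage = 100 * sample_words / article_words
    interval = round(article_words / sample_words)  # rounding is an assumption
    return percentage, interval


# The pilot article: 20 words from 639
percentage, interval = sampling_plan(639, 20)
print(f"{percentage:.1f}%", interval)  # 3.1% 32
```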

Whilst collecting my data I found a number of areas that need to be addressed, as they may also occur in the main investigation:

* Hyphenated Words – Any two or more words separated by a hyphen are to be counted as individual words.

* Numbers – Any numbers found in numeric form, for example 17, are not to be included in the count. However, if the number is written out in letters, such as seventeen, then it will be included in the count.

* Apostrophes – If a word contains an apostrophe then during the letter count of the word the apostrophe does not count as a letter.
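The three counting rules above could be sketched as follows. The function name `letter_counts` is hypothetical, and treating any token that contains a digit as a number is an assumption of this sketch; other punctuation is simply ignored when counting letters.

```python
def letter_counts(text: str) -> list[int]:
    """Count the letters in each word, following the pilot's counting rules."""
    counts = []
    # Hyphenated words: treat the hyphen as a separator, giving individual words
    for token in text.replace("-", " ").split():
        # Numbers: skip anything containing a digit (assumption: e.g. "17")
        if any(ch.isdigit() for ch in token):
            continue
        # Apostrophes (and other punctuation) do not count as letters
        letters = [ch for ch in token if ch.isalpha()]
        if letters:
            counts.append(len(letters))
    return counts


# A made-up sentence, not taken from any of the articles
print(letter_counts("The well-known singer, aged 17, didn't stop"))
# [3, 4, 5, 6, 4, 5, 4]
```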

Although the problem areas in the pilot article have been addressed, there may still be parts of the two main articles that would cause problems during the actual counting of the data. To allow for this I have read through both articles in advance to identify any issues. I found the following:

* In the ‘Kerrang!’ article the brand name ‘Playstation’ will be included in the count; however, its abbreviation ‘PS2’ will not be included. The brand name ‘Xbox’ will also not be included in the count.

* In the brand name ‘Guitar Hero II’ the ‘II’ will not be included in the count.

* In the ‘Reveal’ article the abbreviation R&B will not be included during the collecting of data.

Anything other than what is stated above is to be included in the data count. I will now begin collecting the data for my main investigation.

The two articles I will be comparing are from two different magazines: a gossip magazine called ‘Reveal’ and a music magazine called ‘Kerrang!’. The article from ‘Reveal’ is titled ‘I Beat Beyonce to No 1’ and has a total of 457 words. The article from ‘Kerrang!’ is called ‘The Astonishing Rise Of Guitar Hero’ and has a total of 651 words. There is a gap of around 200 words between the two articles, but because I am taking a percentage this shouldn’t make a difference to the final results. I am going to take 20% of the words from each article; this means I will be taking 91 words from the ‘Reveal’ article and 130 words from the ‘Kerrang!’ article. To do this I will take every 5th word and count the number of letters in it. Once I have collected the data I will record it in the frequency table below.
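Taking every 5th word can be sketched as below. The function name `every_nth_word` is hypothetical, the word lists are placeholders with the same lengths as the two articles, and counting the 5th, 10th, 15th, ... words (rather than starting from the first word) is an assumption.

```python
def every_nth_word(words, n):
    """Return the n-th, 2n-th, 3n-th, ... words from a list of words."""
    return words[n - 1::n]


# Placeholder word lists with the same lengths as the two articles
reveal_words = ["word"] * 457
kerrang_words = ["word"] * 651

reveal_sample = every_nth_word(reveal_words, 5)
kerrang_sample = every_nth_word(kerrang_words, 5)
print(len(reveal_sample), len(kerrang_sample))  # 91 130
```

Under this assumption the sample sizes agree with 20% of 457 and 20% of 651, rounded to whole words.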

Reveal Article: Number Of Letters (x), Frequency (f), fx

Kerrang! Article: Number Of Letters (x), Frequency (f), fx

From this table I can now calculate the mean, mode and median of the data, which will hopefully show a difference in the average length of a word for each article.

To find the mean of the data I need to take the sum of fx (Σfx) and divide it by the sum of f (Σf).
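A minimal sketch of this calculation is given below. The frequency table is made up for illustration and is not the data from either article; the function name `mean_from_table` is hypothetical.

```python
def mean_from_table(freq):
    """Mean of a frequency table {x: f}, calculated as Σfx / Σf."""
    sum_fx = sum(x * f for x, f in freq.items())
    sum_f = sum(freq.values())
    return sum_fx / sum_f


# Illustrative table only: number of letters (x) -> frequency (f)
table = {2: 4, 3: 8, 4: 4, 5: 4}
print(mean_from_table(table))  # Σfx = 68, Σf = 20, so the mean is 3.4
```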

To find the mode for each article I need to find which of the number of letters has the highest frequency.
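Finding the mode from a frequency table could be sketched as follows. The table is again made up for illustration, and the function name `modes` is hypothetical; it returns a list in case two word lengths share the highest frequency.

```python
def modes(freq):
    """Return every x in a frequency table {x: f} that has the highest frequency."""
    highest = max(freq.values())
    return [x for x, f in freq.items() if f == highest]


# Illustrative table only: number of letters (x) -> frequency (f)
table = {2: 4, 3: 8, 4: 4, 5: 4}
print(modes(table))  # [3]
```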

To find the median I will draw a cumulative frequency polygon, from which I will read off the median and interquartile range of both sets of data.

Median:

To find the median from the cumulative frequency polygon I need to divide the total cumulative frequency for each article by 2. To find the interquartile range I will need to find the lower quartile, which is found by dividing the cumulative frequency total by 4, and the upper quartile, which is found by dividing the cumulative frequency total by 4 and then multiplying it by 3. These values will now be marked on the cumulative frequency polygon and the corresponding numbers of letters read off.
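Reading values off a cumulative frequency polygon amounts to linear interpolation between the plotted points. The sketch below assumes the polygon starts at zero one letter below the shortest word length and uses a made-up table, so it does not reproduce the figures for either article. The function name `read_off` is hypothetical.

```python
def read_off(freq, target):
    """Interpolate the x value where the cumulative frequency reaches target.

    freq is a table {x: f} with positive frequencies. The polygon is assumed
    to start at (smallest x - 1, 0), which is an assumption of this sketch.
    """
    prev_x, prev_cf = min(freq) - 1, 0
    for x in sorted(freq):
        cf = prev_cf + freq[x]
        if cf >= target:
            # Straight-line interpolation between the two plotted points
            return prev_x + (target - prev_cf) / (cf - prev_cf) * (x - prev_x)
        prev_x, prev_cf = x, cf
    raise ValueError("target is larger than the total frequency")


# Illustrative table only: number of letters (x) -> frequency (f)
table = {2: 4, 3: 8, 4: 4, 5: 4}
total = sum(table.values())                # 20

lower_quartile = read_off(table, total / 4)      # read off at 5
median = read_off(table, total / 2)              # read off at 10
upper_quartile = read_off(table, 3 * total / 4)  # read off at 15

print(lower_quartile, median, upper_quartile)  # 2.125 2.75 3.75
print(upper_quartile - lower_quartile)         # 1.625
```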

For the ‘Kerrang!’ article the median and Interquartile range are:

Median – 4.1

Interquartile Range = Upper Quartile – Lower Quartile = 6 – 2.7 = 3.3

For the ‘Reveal’ article the Median and Interquartile Range are:

Median – 3.6

Interquartile Range = Upper Quartile – Lower Quartile = 4.2 – 2.4 = 1.8

I have used the interquartile range rather than the range of the data, as the range is not as reliable. This is because the range can be affected by very high and very low pieces of data, whereas the interquartile range only measures the spread between the first and third quartiles. This greatly reduces the effect that any outliers may have on the result.
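The difference can be illustrated with a small sketch. The word lengths below are made up, and the quartiles come from Python's `statistics.quantiles`, which calculates them slightly differently from reading a cumulative frequency polygon. The function name `spread` is hypothetical.

```python
import statistics


def spread(lengths):
    """Return (range, interquartile range) for a list of word lengths."""
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    return max(lengths) - min(lengths), q3 - q1


# Illustrative word lengths only, with and without one very long word
lengths = [2, 3, 3, 4, 4, 4, 5, 5, 6]
with_outlier = lengths + [14]

range_a, iqr_a = spread(lengths)
range_b, iqr_b = spread(with_outlier)

print(range_a, range_b)  # 4 12
print(iqr_a, iqr_b)      # the interquartile range changes far less
```

One extra 14 letter word triples the range, while the interquartile range barely moves.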

