The data shown previously is the original data on used cars. I have decided to investigate the correlation, if any, between the price of a used car and its mileage. I did this because I wanted to know why some used cars cost so much more than others.
I intend to take a sample of the data provided, simply because it would be extremely difficult to take all the data and use it.
I am going to take a sample of 50 used cars, half of the total amount of data, because this should give me enough data to recognise any correlation between price and mileage while keeping the sample as small as possible for ease.
I aim not only to investigate the correlation in the given data, but also to see how it compares with other used car data. This will show whether the given data follows the trend of other used car data or whether it is anomalous by comparison. If there is only a slight difference between the two sets of data, then it should be easy for me to justify why they are different and what could have affected them.
My HYPOTHESIS is that as the mileage on a used car increases, the price decreases. I think this will happen in both sets of data.
There are many ways in which I can take a sample of the given data. For instance, I could use a totally random sample, for example by picking cars out of a hat. However, this method might not give fair results on this occasion. For instance, 50 Fords and Nissans could be chosen purely by chance, or 50 cars under the price of £5000 could be chosen. Statistically this would not give me very good results, as it would not solve my problem. It would only tell me whether there is correlation between price and mileage in Ford and Nissan used cars, or in cars under the price of £5000.
I could use a systematic sampling method. This would mean I would have to systematically choose cars from the given list of used car data, for instance every other one or every fourth one. However, as the list of data is not ordered in any sort of strata at all, this method would also not give me statistically good enough results. For example, because the data is listed randomly, if I used every fourth one I may miss out many cars and by chance could end up with all the Fords and Nissans again, or with all the chosen cars under the price of £5000. This again would not solve my problem, for the reasons given before.
A stratified sampling method could also be used. This is a method where I would have to order the given data into strata first, and then mathematically choose a certain amount from each stratum. This would mean that if I ordered the data by the make of the car, I would take a certain number of cars from one make and then a certain number from another group, which would be a different make of car. This guarantees that I would get some cars of almost every make, and it is statistically a much better method. I could also put the data in strata of price. For example, I could group all the cars under £1000, all the cars under £2000, under £4000 etc. This would be a good idea, as it would make sure I do not just get similar data, e.g. all the Fords and Nissans. To know how many cars to choose from each stratum a mathematical formula is needed:
\[
\text{number to take from a stratum} = \frac{\text{total sample needed}}{\text{total data the sample is going to be taken from}} \times \text{number of items in the stratum}
\]
For my investigation, because I have decided to take a sample of 50 out of a total of 100, I have worked out that I would need to take half of the data from each stratum.
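The working for the stratified formula can be checked with a short sketch. The function name `stratum_allocation`, the price bands and the number of cars in each band are all assumptions invented for illustration; only the sample of 50 and the total of 100 come from the plan.

```python
def stratum_allocation(sample_size, total, stratum_sizes):
    """Apply (sample needed / total data) x (items in stratum) to every stratum."""
    return {name: sample_size / total * size
            for name, size in stratum_sizes.items()}

# Hypothetical price bands and counts (not the real data); they sum to 100.
example_strata = {
    "under 2000": 12,
    "2000 to 3999": 31,
    "4000 to 5999": 30,
    "6000 and over": 27,
}

shares = stratum_allocation(50, 100, example_strata)
for band, share in shares.items():
    print(band, share)
```

Because 50 out of 100 is one half, every share is simply half of its stratum. A stratum with an odd number of cars gives a share ending in .5, which cannot be taken as it stands.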
Overall I believe I will need to make use of all of the above methods. Firstly, I will order the given data into strata. I have decided that for my specific problem I will form the strata by price. I have done this because this way I know I will get a very good variety of prices, which will tell me if there is a certain correlation with mileage. If I were to use any other strata, it would limit the chances of getting a good variety of prices. If, for instance, I grouped the cars by make, then there would be a slight chance that all the cars chosen fall in a very tight range of price.
Once the data is ordered by price, I will need to take half from each stratum to give me my total sample of 50, or as close to it as possible. I will then have a choice to make as to which car(s) to add to my sample if the total falls below 50, or which car(s) to reject if the total goes over 50.
The next big question for me will be: which cars of each stratum will be picked? I know half of them will need to be picked, but which ones? This is where the other two methods come in. I will need to use a systematic technique to obtain the half from each stratum. Therefore I have decided to choose every other one. Where a stratum has an odd number of cars, taking every other one will give me just over or just under half of the stratum, depending on which car in the stratum I begin with. This is where the random sampling comes into place. Which car do I start on? The first, or the second? I have decided to use a dice to decide. A fair dice (1-6) will be thrown. An even number will mean I start systematically from the FIRST car in the stratum; an odd number will mean I start from the second. The dice will be THROWN FOR EACH STRATUM. I have done this to give each car as much chance as possible of being chosen.
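The dice-and-every-other-car step can be sketched as follows. This is only an illustration: the function names and the car labels are hypothetical, and starting from the second car on an odd throw is an assumption taken from the question "The first, or the second?".

```python
import random

def every_other(stratum, die_roll):
    """Take every other car from a stratum.

    An even throw starts from the first car; an odd throw is
    assumed to start from the second car.
    """
    start = 0 if die_roll % 2 == 0 else 1
    return stratum[start::2]

def sample_stratum(stratum):
    """Throw a fair dice (1-6) once for this stratum, then sample it."""
    return every_other(stratum, random.randint(1, 6))

# Hypothetical stratum of 7 cars, already ordered by price.
cars = ["car A", "car B", "car C", "car D", "car E", "car F", "car G"]

from_first = every_other(cars, 4)   # even throw: cars A, C, E, G
from_second = every_other(cars, 3)  # odd throw: cars B, D, F
```

With 7 cars the two possible starts give 4 cars or 3 cars, so for an odd-sized stratum the starting car also decides whether slightly more or slightly less than half is taken.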
I intend to do exactly the same for the second set of data, taking exactly half of the data there as well. I then intend to draw relevant diagrams, such as stem and leaf diagrams and scatter diagrams, for each so that the sets of data can be compared in as many relevant ways as possible.
From this plan, I can see there will be a limitation. When I take half of the cars from each stratum, if there is an odd number of cars, how many do I take? For example, if there are 7 cars in a group, I need to try and take half of 7 cars. I cannot take 3.5 cars.
I will deal with this limitation in as fair a way as possible. I have worked out that I will have exactly 4 groups that contain an odd number of cars. I will treat these in a systematic way: in every other such group, I will round half of the odd number up. Taking the example from above with 7 cars, if I halve 7 I get 3.5, so I will round it up to 4 and take 4 cars. Then the next time I come across a group with an odd number, I will round down, and so on for every other group. However, I can foresee a problem within this solution to the limitation:
How will I know which group to round up on? I have decided to use a fair coin. The coin will be tossed. If it is heads, I will start rounding up on the 1st group with an odd number of cars, and then on every other one from then on. If it is tails, I will start rounding up on the 2nd group with an odd number of cars in it.
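The coin toss and the alternate rounding can be sketched as below. The function name and the group sizes are hypothetical; the sizes were chosen only so that they total 100 and include exactly 4 odd groups. Rounding the 1st odd group down when the coin shows tails is an assumption that follows from starting to round up on the 2nd.

```python
def alternate_rounding(group_sizes, heads):
    """Halve every group; odd groups alternate between rounding up and down.

    heads=True rounds the 1st odd group up. heads=False (tails) is assumed
    to round the 1st odd group down so that the 2nd is rounded up.
    """
    round_up = heads
    counts = []
    for size in group_sizes:
        if size % 2 == 0:
            counts.append(size // 2)            # even group: exact half
        else:
            half_up = (size + 1) // 2           # e.g. 7 -> 4
            half_down = size // 2               # e.g. 7 -> 3
            counts.append(half_up if round_up else half_down)
            round_up = not round_up             # switch for the next odd group
    return counts

# Hypothetical group sizes: total 100, with 4 odd groups (7, 31, 15, 27).
sizes = [7, 20, 31, 15, 27]

if_heads = alternate_rounding(sizes, heads=True)   # [4, 10, 15, 8, 13]
if_tails = alternate_rounding(sizes, heads=False)  # [3, 10, 16, 7, 14]
```

With an even number of odd groups, the groups rounded up and the groups rounded down cancel out, so either result of the coin gives a total of exactly 50 in this example.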
From this plan, I cannot foresee any more problems. However, I may encounter some while actually carrying out the sampling.