The Company's PC Data Analysis process

Imagine that your boss thinks you are ready to solve a problem for a new client, and you answer: "Yes, of course!"

The client is a finance company. Their work relies heavily on computers, and every machine is different from the others. They want all their computers to have the same appearance and specifications to improve compatibility.

You need to choose the computer carefully: it should be cheap, have enough resources, and preferably be easy to maintain.

You are free to choose the data you need, and your audience will be various teams in our company and theirs.

My role

Asking

This part contains some questions, mainly for myself, to determine more exactly what I will analyze. Within a team of data analysts, this phase could look very different.

Prepare the environment

This phase is for thinking a little more specifically about the data and collecting it while keeping bias in mind.

The type of data I am looking for is somewhat unexplored, but there are four datasets on Kaggle that helped me a lot. Keep in mind that these datasets could have some bias, because their creators did not include much metadata.

The next step was to put everything into a directory structure, naming each file and folder with a standardized format such as name_day_month_year_v01.
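The naming format can be sketched in a few lines of Python. This is only an illustration: the helper name `standardized_name`, the lowercase and underscore handling, and the example date are assumptions, not part of the original project.

```python
from datetime import date


def standardized_name(name: str, day: date, version: int, ext: str = "") -> str:
    """Build a name in the name_day_month_year_v01 format (hypothetical helper)."""
    # Lowercase the name and replace spaces so it is safe on any file system.
    base = name.strip().lower().replace(" ", "_")
    # Day, month, and four-digit year, in that order.
    stamp = day.strftime("%d_%m_%Y")
    # The version is zero-padded to two digits: v01, v02, ...
    return f"{base}_{stamp}_v{version:02d}{ext}"


# Example with a made-up date.
print(standardized_name("AmazonPC", date(2023, 3, 5), 1, ".csv"))
# prints "amazonpc_05_03_2023_v01.csv"
```

Generating names with one function keeps files and folders consistent and makes later versions easy to sort.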

All of them are on Kaggle under the Creative Commons CC0 Public Domain license, so there is no need to credit the authors, but you can give them an upvote on Kaggle.

Process and cleaning data

As a programmer I could have chosen a language like Python or R, but I decided to use Microsoft Excel for this project, because the four datasets are in CSV format and it is the most efficient tool for datasets this small.

The integrity of the files is as I expected: most of the records in each dataset are not clean, that is, they contain NULLs and blanks, or the text is not coherent enough.

Therefore this phase is needed to clean the data, because the datasets contain errors. This is achieved with techniques like
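The project itself was cleaned in Excel. For readers who prefer code, the same kind of cleaning (trimming stray spaces and dropping records with NULLs or blanks) could be sketched in Python as follows. The column names, the list of NULL spellings, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Assumed spellings of "missing" values; adjust to what the real files contain.
NULL_MARKERS = {"", "null", "nan", "n/a", "none"}


def clean_rows(rows, required):
    """Trim whitespace and drop rows that miss any required column."""
    cleaned = []
    for row in rows:
        # Strip stray spaces from every cell (None becomes an empty string).
        tidy = {key: (value or "").strip() for key, value in row.items()}
        # Skip the row if a required column is blank or a NULL marker.
        if any(tidy.get(col, "").lower() in NULL_MARKERS for col in required):
            continue
        cleaned.append(tidy)
    return cleaned


# Made-up sample standing in for one of the CSV files.
sample = io.StringIO(
    "name,price,rating\n"
    " Laptop A ,500,4.5\n"
    "Laptop B,NULL,4.0\n"
    "Laptop C,300,\n"
)
rows = clean_rows(csv.DictReader(sample), required=["price", "rating"])
print(rows)
```

Only the first sample row survives, because the other two are missing a price or a rating.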

Analyze phase

I worked on the four datasets individually because their columns are not compatible enough: the only two columns they have in common are price and rating. Then I sorted each one by price in descending order.

Dataset 1

This is the AmazonPC dataset. The first thing I noticed was that some rows are accessories, so I deleted every accessory I found in the name column. There does not seem to be a trend between price and rating here.
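As a rough sketch of these steps in Python (removing accessories by name, sorting by price in descending order, and checking for a price and rating trend), one could write something like the following. The keyword list and the sample rows are invented, and the Pearson coefficient is only one possible way to measure a trend; the original analysis was done visually in Excel.

```python
from math import sqrt

# Hypothetical keywords that mark a row as an accessory.
ACCESSORY_WORDS = ("mouse", "keyboard", "cable", "case")


def drop_accessories(rows):
    """Keep only rows whose name contains none of the accessory keywords."""
    return [
        r for r in rows
        if not any(word in r["name"].lower() for word in ACCESSORY_WORDS)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient; values near 0 suggest no linear trend."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up rows, not taken from the AmazonPC dataset.
rows = [
    {"name": "Desktop PC i5", "price": 700.0, "rating": 4.2},
    {"name": "Wireless Mouse", "price": 15.0, "rating": 4.6},
    {"name": "Laptop Ryzen 5", "price": 650.0, "rating": 4.4},
    {"name": "Mini PC", "price": 300.0, "rating": 4.3},
]
pcs = drop_accessories(rows)
pcs.sort(key=lambda r: r["price"], reverse=True)  # descending by price
r = pearson([p["price"] for p in pcs], [p["rating"] for p in pcs])
print([p["name"] for p in pcs], round(r, 2))
```

Keyword matching is crude and can miss or wrongly remove rows, so the filtered list is still worth a manual check.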

Dataset 2

Dataset 3

Dataset 4

So far there are some new answers to the questions from the asking phase. The topic I am exploring is PC hardware, which included peripherals that were removed. I also know how different CPUs coexist in the market, and which ones are most similar and compatible.

Now that we have an insight from each individual dataset, it is a good idea to review all this information before sharing it with others.

Share the results

Price vs. rating scatter plot
First, this is the relationship between price and rating in the AmazonPC dataset.
Component price vs. component reviews scatter plot
This is the same comparison, but with the Components_scraper dataset. Remember that this is component data from different web stores.
Processor and its price bar chart
Keep in mind that a large part of a PC's power is determined by its processor, so it is common to use processor capacity as a measure of PC speed. Normally a low-powered processor is sold alongside reduced specifications, and vice versa.
Processor and its capacity bar chart
It is also useful to know whether the PC's size has anything to do with its processor capacity, and this is the result.
Capacity at each price scatter plot
I needed to confirm whether there really is a close relationship between processor capacity and price.
Capacity and price line diagram
This is another way to view the result. I realized that the top bar would be the best value for money.
Processor model and its price bar chart
To build the processor visualizations, I needed to group the processor generations by model in the Laptop League dataset.
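The grouping step could look roughly like this in Python. The field names (`model`, `generation`, `price`) and the sample rows are assumptions for illustration and are not taken from the Laptop League dataset, and averaging the price is just one way to summarize each group.

```python
from collections import defaultdict


def mean_price_by_model(rows):
    """Average the price of every row that shares a processor model."""
    prices = defaultdict(list)
    for row in rows:
        # Rows of different generations fall into the same model bucket.
        prices[row["model"]].append(row["price"])
    return {model: sum(vals) / len(vals) for model, vals in prices.items()}


# Made-up rows with hypothetical column names.
rows = [
    {"model": "Core i5", "generation": "10th", "price": 600.0},
    {"model": "Core i5", "generation": "11th", "price": 700.0},
    {"model": "Ryzen 5", "generation": "5000", "price": 650.0},
]
averages = mean_price_by_model(rows)
print(averages)
# prints {'Core i5': 650.0, 'Ryzen 5': 650.0}
```

The resulting dictionary has one value per model, which is the shape a bar chart needs.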
Number of reviews vs. number of reviews scatter plot; rating vs. number of reviews scatter plot; rating vs. number of ratings scatter plot
This dataset has more detailed information about the ratings, which is really interesting for this project: it lets me check whether the lower ratings correspond to the computers we are looking for.
We can see that the only relationship is between the number of reviews and the number of ratings, which does not help much. At least we know that lower prices do not mean bad products.
Price vs. rating scatter plot
Let's compare processor price against rating to see whether anything important has been hidden.
Rating of each processor bar chart
To finish, I want to see how the processors are rated according to their capacity.
Although some have very low capacity, in general the rating is uniform across all of them. Therefore there are PCs with a good rating, a good price, and good potential, and choosing a laptop seems like the better idea.

Conclusions

  1. There is no trend between PC quality and price (according to the AmazonPC, Components and Laptop League datasets)
  2. A PC's price roughly tracks its processor capacity, although you can find very good options at low cost (according to the Curry dataset)
  3. There is no relationship between the laptop size and the processor capacity (according to the Curry and Laptop League datasets)

Take action

This is the last phase of the data analysis process, where the insights and conclusions move to the real world. Everything was done to make logical changes that represent an improvement for the company. As this is a personal project

To achieve a deeper and more extensive data analysis in the future, I recommend continuing with:

  1. Looking for patterns between PC brands or components and their quality
  2. Adding more detail about the origin of the data and trying to find any bias in it

What I learned

I had almost no knowledge of the data analysis field, which is so widely mentioned in different contexts. As a programmer, I think it is a job opportunity, because it is somewhat similar in its structured thinking and in the tools used to carry out the process.

I liked the data analysis process and enjoyed making this project.