The Company's PC Data Analysis process

Imagine that your boss thinks you are ready to solve a problem for a new client, and you answer: "Yes, of course!"

The client is a finance company that relies on a large number of computers, each one different from the others. They want all their computers to have the same appearance and specifications to improve compatibility

You need to choose the computer carefully: it should be cheap, have enough resources, and preferably be easy to maintain

You are free to choose the data you need, and your audience will be various teams in our company and theirs

My role

Project management
Data research
Cleaning and documenting the process
Data analysis
Making visualizations and presenting them to coworkers

Asking

This part contains some questions, mainly for myself, to determine more exactly what I will analyze. Inside data analyst teams this phase could look very different

What themes am I exploring?
Computers, their sales, advantages and characteristics
What is the problem I am trying to solve?
Discover which PC offers the most possibilities: laptops or desktops, separately bought components or all-in-one packages
Which metrics will I use to measure the data in order to achieve the goal?
Number of sales, prices and PC reviews
Who are my stakeholders?
A wide variety of workers with distinct tasks and objectives inside our company and the client company
Who is my audience?
Probably people of all types, origins, ages and educational backgrounds
How can my knowledge help the client make decisions?
For me it is an easy topic to understand, and I have seen that all-in-one laptops are recommended

Prepare the environment

This phase is for thinking a little more specifically about the data and collecting it while keeping bias in mind

The type of data I am looking for is a little unexplored, but there are four datasets on Kaggle that helped me a lot. Keep in mind that these datasets could have some bias, because the creators did not include much metadata

Laptop league: Data from the Zenodo store
Original Kaggle dataset
AmazonPC: Data from the Amazon store
Original Kaggle dataset
PC component price comparison: Data from different pages and stores through the web
Original Kaggle dataset
Curry World laptop: Data obtained from a store called Curry, which the creator collected using Selenium and Python
Original Kaggle dataset

The next step was to put all of this in a directory system, naming each file and folder with a standardized format like name_day_month_year_v01
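
The naming format can also be generated programmatically. Below is a minimal Python sketch, assuming two-digit day and month and a two-digit version number; the helper name `standardized_name` and the example date are hypothetical and not part of the original project.

```python
from datetime import date


def standardized_name(name: str, day: date, version: int = 1) -> str:
    """Build a name following the pattern name_day_month_year_v01."""
    # Lowercase the label and join its words with underscores.
    label = "_".join(name.lower().split())
    # Zero-padding of day, month and version is an assumption.
    return f"{label}_{day.day:02d}_{day.month:02d}_{day.year}_v{version:02d}"


# Hypothetical example: a dataset saved on 5 March 2024.
example = standardized_name("Amazon PC", date(2024, 3, 5))
print(example)
```

Generating the names from one function keeps every file and folder consistent, which is the point of the standardized format.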

All of these are on Kaggle under the Creative Commons CC0 Public Domain license, so there is no need to credit the authors, but you can give them an upvote on Kaggle

Processing and cleaning data

As a programmer I could have chosen a language like Python or R, but I decided to use Microsoft Excel for this project because the four datasets are in .csv format and Excel handles such small datasets efficiently.

The integrity of the files is what I expected: most of the records inside each dataset are not clean, that is, they contain NULLs and blanks, or the text is not coherent enough

Therefore this phase is needed to clean the data, because there are errors inside the datasets. This is achieved with techniques like

Filters to quickly delete the rows with blank cells
Formatting data types to convert text to numbers
Formulas like =LEFT( , ) to extract the numbers from the rating cells
Find & Replace to remove extra characters
Column renaming, because the tables used different naming styles and the standard is all lowercase separated by underscores
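
The cleaning in this project was done in Excel. For comparison, the same steps can be sketched in Python with only the standard library. This is an illustration, not the method actually used: the column names `Product Name`, `Price` and `Rating` and the sample rows are made up, and a regular expression stands in for the =LEFT( , ) formula.

```python
import csv
import io
import re


def snake_case(header: str) -> str:
    """Lowercase a column name and separate its words with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")


def leading_number(text: str):
    """Return the first number found in text as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def clean_rows(csv_text: str) -> list:
    """Drop incomplete rows, rename columns, and convert text to numbers."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Filter step: skip rows with any blank or NULL cell.
        if any(v is None or v.strip() in ("", "NULL") for v in row.values()):
            continue
        # Column renaming step: all lowercase, separated by underscores.
        record = {snake_case(k): v.strip() for k, v in row.items()}
        # Find & Replace step: remove the currency sign and thousands separator.
        record["price"] = float(record["price"].replace("$", "").replace(",", ""))
        # Extraction step: keep only the numeric part of the rating text.
        record["rating"] = leading_number(record["rating"])
        cleaned.append(record)
    return cleaned


# Made-up sample data, only to show the shape of the input.
sample = """Product Name,Price,Rating
Laptop A,"$1,200.50",4.5 out of 5 stars
Laptop B,,3.9 out of 5 stars
Laptop C,$350,NULL
"""

print(clean_rows(sample))
```

Only the first sample row survives, because the other two contain a blank or NULL cell.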

Analyze phase

I worked on the four datasets individually because their columns are not compatible enough: the only two columns they have in common are price and rating. Then I sorted each of them by price in descending order

Dataset 1

This is the AmazonPC dataset. First I realized that some rows are accessories, so I deleted all the accessories using the name column. Here there does not seem to be a trend between price and rating
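
I judged the absence of a trend visually, from the scatter plot. One way to put a number on it would be a Pearson correlation coefficient, sketched below in Python. The accessory keywords and the product rows are hypothetical examples, not values from the AmazonPC dataset.

```python
from math import sqrt

# Hypothetical keywords used to recognize accessories in the name column.
ACCESSORY_WORDS = ("mouse", "keyboard", "charger")


def is_accessory(name: str) -> bool:
    """Return True when the product name contains an accessory keyword."""
    lowered = name.lower()
    return any(word in lowered for word in ACCESSORY_WORDS)


def pearson(xs, ys) -> float:
    """Pearson correlation: near 0 means no linear trend, near +1 or -1 a strong one."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if one of the columns is constant.
    return cov / sqrt(var_x * var_y)


# Made-up rows: (name, price, rating).
products = [
    ("Gaming Laptop", 1200.0, 4.5),
    ("Wireless Mouse", 25.0, 4.7),
    ("Office Desktop", 800.0, 4.1),
    ("Mini PC", 400.0, 4.4),
    ("Laptop Charger", 30.0, 3.9),
]

computers = [p for p in products if not is_accessory(p[0])]
r = pearson([p[1] for p in computers], [p[2] for p in computers])
print(len(computers), round(r, 2))
```

A keyword filter like this can misclassify names, so the result should still be checked by eye, as I did in the spreadsheet.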

Dataset 2

This is the Components dataset, which has categories. I would have liked to split the dataset by category, but there are eight of them
Instead I filtered the categories one by one, copying the 10 cheapest components of each to another sheet to make a table with eighty rows. In that new sheet I summed the cheapest component of every category, and the result is $114.88, not bad
Afterwards I used functions such as =IF( , ) and =AVERAGE( ) to calculate the average price of each category. The result: a common PC costs $1415.8, roughly twelve times the cheapest
Then I made one visualization with the component price and component rating columns
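
The per-category steps for Dataset 2 can be sketched in Python as follows. The category names and prices are made up, and treating the price of a common PC as the sum of the category averages is my reading of the spreadsheet steps rather than a documented formula.

```python
from collections import defaultdict
from statistics import mean


def group_prices(rows):
    """Map each category to its list of prices, sorted from cheapest."""
    by_category = defaultdict(list)
    for category, price in rows:
        by_category[category].append(price)
    return {c: sorted(prices) for c, prices in by_category.items()}


def cheapest_n(prices_by_category, n=10):
    """Keep the n cheapest prices of every category."""
    return {c: prices[:n] for c, prices in prices_by_category.items()}


def cheapest_build(prices_by_category):
    """Sum the single cheapest component of every category."""
    return sum(prices[0] for prices in prices_by_category.values())


def average_build(prices_by_category):
    """Sum the average price of every category (assumed definition)."""
    return sum(mean(prices) for prices in prices_by_category.values())


# Made-up rows: (category, price).
rows = [
    ("cpu", 300.0), ("cpu", 100.0),
    ("ram", 60.0), ("ram", 20.0), ("ram", 40.0),
]

grouped = group_prices(rows)
print(cheapest_build(grouped), average_build(grouped))
```

With the real data, `cheapest_n(grouped)` would reproduce the eighty-row table: eight categories with ten components each.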

Dataset 3

This is the Curry dataset, where the data is almost completely clean, and it is immediately visible that the best prices are for the Lenovo Pads
Here something strange happened: the visualizations had too many horizontal values, as if there were more dimensions in the data. I solved this by first making pivot tables in a new sheet: one for processor capacities and laptop sizes, and another for prices
I noticed that laptop sizes are often similar whatever the processor capacity, while processor price and capacity show a trend or relation. I mean, a higher price tends to come with more processor capacity
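
A pivot table that averages one column per value of another can be approximated in Python with a simple grouping. This is a minimal sketch; the field names `processor`, `screen_size` and `price` and the rows are hypothetical, not taken from the Curry dataset.

```python
from collections import defaultdict
from statistics import mean


def pivot_mean(rows, index, value):
    """Average `value` for each distinct `index`, like a one-field pivot table."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row[value])
    return {key: mean(values) for key, values in sorted(groups.items())}


# Made-up rows with hypothetical field names.
laptops = [
    {"processor": "i3", "screen_size": 15.6, "price": 400},
    {"processor": "i3", "screen_size": 14.0, "price": 500},
    {"processor": "i5", "screen_size": 15.6, "price": 700},
]

price_by_processor = pivot_mean(laptops, "processor", "price")
size_by_processor = pivot_mean(laptops, "processor", "screen_size")
print(price_by_processor)
print(size_by_processor)
```

Aggregating first reduces the chart to one point per processor, which is what removed the excess horizontal values in my spreadsheet.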

Dataset 4

This is the Laptop League dataset. I deleted the text "Processor: " from the cells of the processor column. It seems that the Primebook is the most economical option
I created a visualization of the price and rating columns
Here I wanted to compare processor and price again to confirm the trend, but there is an issue: the data has different generations for each processor model, and I judged that this level of detail is not necessary in this project
Again I used a pivot table, and then I manually collapsed the various generations of each model into a single model
Then I made three different pivot tables to compare rating VS number_of_rating, rating VS number_of_reviews and so on. For these it is not necessary to collapse the processor generations
To finish, I made the expected price and rating visualization, and I was very pleased with the result
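
I removed the "Processor: " prefix and collapsed the processor generations by hand. A Python sketch of the same idea follows; it assumes the generation is written in a form such as "11th Gen", which is a guess about the data, so the pattern would need adjusting to the real cell values.

```python
import re

PREFIX = "Processor: "
# Assumed spelling of the generation marker, e.g. "11th Gen" or "3rd gen".
GENERATION = re.compile(r"\b\d+(?:st|nd|rd|th)\s+Gen\s*", re.IGNORECASE)


def processor_model(cell: str) -> str:
    """Strip the column prefix and the generation, keeping only the model."""
    text = cell.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    # Remove the generation marker, then squeeze any leftover spaces.
    text = GENERATION.sub("", text)
    return " ".join(text.split())


# Made-up cell values.
cells = [
    "Processor: 11th Gen Intel Core i5",
    "Processor: Intel Core i5 12th Gen",
    "Processor: AMD Ryzen 5",
]

print(sorted({processor_model(c) for c in cells}))
```

After this step the two Intel rows share one model name and fall into the same pivot table group.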

So far there are some new answers to the questions from the asking phase. The theme I am exploring is PC hardware, which included peripherals that were removed. I also know how different CPUs coexist in the market, and which are more similar and compatible

Now that we have an insight from each individual dataset, it is a good idea to review all this information before sharing it with others

Share the results

price VS rating scatter plot: First, this is the relation between price and rating from the AmazonPC dataset

component price VS component reviews scatter plot: This is the relation between component price and reviews from the Components dataset

Conclusions

There is no trend between PC quality and price (according to the AmazonPC, Components and Laptop League datasets)
A PC's price roughly follows its processor capacity, although you can find very good options at low cost (according to the Curry dataset)
There is no relation between laptop size and processor capacity (according to the Curry and Laptop League datasets)

Take action

This is the last phase in the data analysis process, where the insights and conclusions move to the real world. Everything was done to make logical changes that represent an improvement for the company. As this is a personal project

Next, to achieve a deeper and more extensive data analysis, I recommend continuing with:

Looking for patterns between PC brands or components and their quality
Adding more details about the origin of the data and trying to find any bias within it

What I learned

I had almost no knowledge of the data analysis field, which is so widely mentioned in different contexts. As a programmer I think it is a job opportunity, because it is somewhat similar in its structured thinking and in the tools used to carry out the process

I liked the data analysis process and enjoyed making this project