Code
library(tidyverse)
library(jsonlite)
library(reticulate)
library(ravelRy)
library(IRkernel)
library(plyr)
library(dplyr)
use_condaenv("C:/Users/duckd/anaconda3/python.exe")
Powell Sheagren
This tab will go through my process of cleaning and prepping the raw data which I captures on the previous data gathering tab. I go through a few different steps and methods and eventually end up with a representative data frame.
This step is the easiest and just requires loading the document saved from the data collection phase and reconverting it to a data frame. There was only a slight complication in that when it reads in the file it takes the whole json and nests it in an array, so in order to convert back we have to double index. I’m also going to run a quick distinct function in case of repeated data.
There are many different data points which were collected with the get_patterns function. Many of them are specific to identification or excess date information about when patterns were created which I may include later but will remove for now. I’m generally looking for information about the needles and yarn needed which includes type or size on both metrics. The other information I hope to get is the pattern type which has entries such as hat, scarf, towel, dishcloth, etc. There aren’t many of those patterns which is why to get significant amounts I aim to get large amounts of data. So, on this first step I will list all of the columns and then select the ones I’m interested in.
patterns_df <- pattern_data_csv %>% select(c("comments_count", # metric of interest
"currency", "price","currency_symbol","free", # financial aspects
"published", # when the pattern was published
"difficulty_average", # recorded difficulty average
"favorites_count", "projects_count", "queued_projects_count", # other metrics of interest
"gauge", "gauge_divisor", "gauge_pattern","row_gauge", # technical gauge aspects
"id", "name", "permalink", # referencing information
"rating_average","rating_count", # ratings information
"yardage","yardage_max","yarn_weight_description", # amount and weight of yarn required for pattern
"yarn_weight", # nested dataframe with weight information, could be redundant
))
dim(patterns_df)
Variables were dropped for a couple of reasons. Some had excessive amounts of irrelevant information like the photos column (dropped in data gathering) or were too internally different for analysis like the sizes_available (dropped in data gathering) which was entered differently each time. Many were booleans relating to pattern availability which was irrelevant to a detailed database. Others, lastly, were redundant information which repackaged other variables; in these cases I used the columns that already had the original data instead of the combination. This saved me a few steps here and there. Overall, I have dropped from an initial column amount of 53 to 30, however this amount may increase as I start to expand some of the nested data frames.
Out of all of the nested dataframes there are some which just contain 3 columns of ids, values, or other without multiple rows. These I hope to go through and represent as a couple of columns. The other more dense nested dataframes will be parsed later as those often have a couple layers or many more options of data. I will start looking at instances of all the nested data frames.
pattern_needle_sizes | craft | pattern_categories | pattern_attributes | pattern_type | |
---|---|---|---|---|---|
<list> | <list> | <list> | <list> | <list> | |
1 | 7 , 7 , 4.5 , , FALSE , TRUE , , US 7 - 4.5 mm, 4.5 | 2 , Knitting, knitting | 895 , Other , other-accessories, 337 , Accessories , accessories , 301 , Categories , categories | 265 , 267 , 311 , chart , written-pattern , stripes-colorwork | FALSE, 9 , Other, other |
2 | 2 , 2.0 , 2.75 , 1 , TRUE , FALSE , C , 2.75 mm (C), 2.75 | 1 , Crochet, crochet | 306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories | 62 , 114 , 150 , 265 , 267 , lace , sleeves , straight , chart , written-pattern | TRUE , 4 , Pullover, pullover |
3 | 5 , 7 , 5 , 7 , 3.75 , 4.5 , , , FALSE , FALSE , TRUE , TRUE , F , , US 5 - 3.75 mm, US 7 - 4.5 mm , 3.75 , 4.5 | 2 , Knitting, knitting | 306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories | 3 , 10 , 181 , 205 , 211 , 257 , 265 , 267 , 286 , 88 , unisex , adult , stranded , seamless , top-down , positive-ease , chart , written-pattern, in-the-round , circular-yoke | TRUE , 4 , Pullover, pullover |
4 | 9 , 9 , 5.5 , , FALSE , TRUE , I , US 9 - 5.5 mm, 5.5 | 2 , Knitting, knitting | 306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories | 1 , 3 , 9 , 10 , 64 , 66 , 91 , 150 , 204 , 205 , 211 , 257 , 267 , 285 , 286 , 311 , male , unisex , teen , adult , ribbed , stripes , boat-neck , straight , one-piece , seamless , top-down , positive-ease , written-pattern , worked-flat , in-the-round , stripes-colorwork | TRUE , 4 , Pullover, pullover |
5 | 21 , 20 , 9 , 1½ , 2½ , 9 , 2.5 , 3 , 5.5 , , , , FALSE , FALSE , FALSE , TRUE , TRUE , TRUE , , , I , US 1½ - 2.5 mm, US 2½ - 3.0 mm, US 9 - 5.5 mm, 2.5 , 3 , 5.5 | 2 , Knitting, knitting | 339 , Scarf , scarf , 338 , Neck / Torso, neck-torso , 337 , Accessories , accessories , 301 , Categories , categories | 8 , 10 , 64 , 267 , 268 , 285 , child , adult , ribbed , written-pattern, video-tutorial , worked-flat | TRUE , 1 , Scarf, scarf |
6 | 7 , 7 , 4.5 , , FALSE , TRUE , , US 7 - 4.5 mm, 4.5 | 2 , Knitting, knitting | 306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories | 2 , 10 , 18 , 86 , 103 , 267 , 285 , 295 , female , adult , fitted , seamed , scoop-neck , written-pattern, worked-flat , other-edging | TRUE , 4 , Pullover, pullover |
Overall the path forward here is to collapse the columns instance by instance and then filter out redundant information while keeping some identification for potential merge.
For pattern_needle_size we can record both crochet and knit values and then clear out redundant information. In this process we somehow lost a couple data entries, I’ll look into this and just blast through the rest. I added a distinct because some patterns had multiple needles sets but I don’t really care so I’ll just take one of them at random.
## Example:
patterns_df_denest[5,24]
## precleaning step:
dim_accum <- c()
for(i in 1:nrow(patterns_df_denest)){
each <- patterns_df_denest[i,24][[1]] %>% class()
dim_accum <- c(dim_accum,each)
}
patterns_df_denest_prep <- patterns_df_denest
patterns_df_denest_prep$needle_check <- dim_accum
## Unnesting
patterns_df_denest1 <- patterns_df_denest_prep %>%
filter(dim_accum == "data.frame") %>%
unnest(pattern_needle_sizes, names_sep = "_nested_") %>%
distinct(id,permalink,.keep_all = TRUE) %>% select(-c(
"pattern_needle_sizes_nested_id","pattern_needle_sizes_nested_us_steel","pattern_needle_sizes_nested_crochet","pattern_needle_sizes_nested_knitting","pattern_needle_sizes_nested_pretty_metric","needle_check" # removing the other redundant columns
))
id | us | metric | us_steel | crochet | knitting | hook | name | pretty_metric | |
---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <dbl> | <chr> | <lgl> | <lgl> | <chr> | <chr> | <chr> | |
1 | 21 | 1½ | 2.5 | FALSE | TRUE | US 1½ - 2.5 mm | 2.5 | ||
2 | 20 | 2½ | 3.0 | FALSE | TRUE | US 2½ - 3.0 mm | 3 | ||
3 | 9 | 9 | 5.5 | FALSE | TRUE | I | US 9 - 5.5 mm | 5.5 |
There is a similar process for craft which may also have the side effect of allowing us to remove some of the columns from the needles dataset.
id | name | permalink | |
---|---|---|---|
<int> | <chr> | <chr> | |
1 | 2 | Knitting | knitting |
This is an important column to be able to separate out as well as this information will be acting as the main target column for all of the prediction datasets.
clothing | id | name | permalink | |
---|---|---|---|---|
<lgl> | <int> | <chr> | <chr> | |
1 | TRUE | 4 | Pullover | pullover |
We were successful although the naming convention may be changed later on as the names of the columns are fairly dense.
The pattern categories may not have as much importance later on as there are far too many categories although there may be room for visualizations in the categories.
## Example
patterns_df_denest3[[30277,29]][[1]] %>% class()
#1:nrow(patterns_df_denest3)
for(i in 1:nrow(patterns_df_denest3)){
piece <- patterns_df_denest3[[i,29]]
if(class(piece[[1]]) == "data.frame"){
piece <- piece %>%
as.data.frame() %>%
select(permalink,parent.permalink)
piece <- piece[1,]
}
else{
piece <- c(NA,NA) %>% t() %>% as.data.frame()
colnames(piece) = c("permalink","parent.permalink")
}
if(i == 1){
category_accum <- piece
} else{
category_accum <- rbind(category_accum,piece)
}
}
colnames(category_accum) <- c("category_permalink","category_parent_permalink")
nrow(category_accum)
nrow(patterns_df_denest3)
## Joining
patterns_df_denest4 <- cbind(patterns_df_denest3,category_accum) %>%
select(-pattern_categories)
Pattern attributes are another fairly difficult nested frame to consider as some patterns have multiple categories or categories are further nested. For this analysis I attempted it with the code below but have left it for future analysis.
## Example:
patterns_df_denest4[[1,29]]
## Getting list of attributes
attributes_total <- c()
for(i in 1:nrow(patterns_df_denest4)){
attributes <- patterns_df_denest4[[i,29]]$permalink
attributes_total <- c(attributes_total,attributes) %>% unique()
}
attributes_table <- data.frame(t(rep(NA,length(attributes_total))))
colnames(attributes_table) <- attributes_total[order(attributes_total)]
attributes_table <- attributes_table[-1,]
attributes_table_og <- attributes_table
for(i in 1:1){
print(i/21)
print(Sys.time())
attributes_table_i <- attributes_table_og
for(g in 1:(nrow(patterns_df_denest4)/21)){
print(g)
attributes <- patterns_df_denest4[[g,29]]$permalink
attributes_g <- data.frame(t(rep(1,length(attributes))))
colnames(attributes_g) <- attributes
attributes_table_i <- rbind.fill(attributes_table_i,attributes_g)
}
rbind(attributes_table,attributes_table_i)
}
Now that most of the nested dataframes have been unpacked, it’s time to check for column types. Some of the values are characters when I want them to be Integers so I will be going through it all and getting the desired values. The names are also horrendous given the multiple rounds of unnesting so I will go through and adjust for relevance, finally reordering the columns for ease of use in the future.
With the columns as they are we will now be shifting some of the columns to numeric measurements. Most columns are already in character form so the categorical or word based groups are not as important to adapt. The exception to this will be the yarn_weight_decription column as there are many names for sizes but no specific ranking so I will be grouping some of them together.
numeric_columns <- c(1,3,7,8,9,10,11,12,14,15,18,19,20,21,24)
for(i in 1:ncol(patterns_denested)){
if(i %in% numeric_columns){
patterns_denested[,i] <- patterns_denested[,i] %>% as.numeric()
}
}
levels(factor(patterns_denested$yarn_weight_description))
for(i in 1:nrow(patterns_denested)){
value <- patterns_denested[i,22]
if(value == "Lace"|value == "Thread"|value == "Light Fingering"){
patterns_denested[i,22] <- 0
} else if(value == "Fingering (14 wpi)"|value == "Cobweb"){
patterns_denested[i,22] <- 1
} else if(value == "Sport (12 wpi)"|value == "DK / Sport"){
patterns_denested[i,22] <- 2
} else if(value == "DK (11 wpi)"| value == ""| value == "Any gauge - designed for any gauge"){
patterns_denested[i,22] <- 3
} else if(value == "Aran / Worsted"| value == "Aran (8 wpi)"|value == "Worsted (9 wpi)"){
patterns_denested[i,22] <- 4
} else if(value == "Bulky (7 wpi)"|value == "Super Bulky (5-6 wpi)"){
patterns_denested[i,22] <- 5
} else if(value == "Jumbo (0-4 wpi)"){
patterns_denested[i,22] <- 6
}
}
patterns_denested$yarn_weight_description <- patterns_denested$yarn_weight_description %>% as.numeric()
The data has now been prepared for use in models. There will be further trimming done around outliers but that will be left to the exploratory data analysis tab where more understanding of the underlying phenomena can be found. Otherwise we have suitable columns for both target and predictor variables, categories which can represent the distribution of patterns available, and relevant information about patterns which can be used to better understand the dataset we have.
The last step is to download the cleaned data set and begin analysis from there.