Data Cleaning

Author

Powell Sheagren

Code

library(tidyverse)
library(jsonlite)
library(reticulate)
library(ravelRy)
library(IRkernel)
library(plyr)
library(dplyr)
use_condaenv("C:/Users/duckd/anaconda3/python.exe")

Introduction

This tab will go through my process of cleaning and prepping the raw data which I captures on the previous data gathering tab. I go through a few different steps and methods and eventually end up with a representative data frame.

Inserting data
Removing unnecessary columns
Splitting nested data frames
Type casting and column renaming
Conclusion

Inserting Data

This step is the easiest and just requires loading the document saved from the data collection phase and reconverting it to a data frame. There was only a slight complication in that when it reads in the file it takes the whole json and nests it in an array, so in order to convert back we have to double index. I’m also going to run a quick distinct function in case of repeated data.

Code

## Importing data
patterns_nested_json <- read_json("../data/pattern_data_json.JSON")
patterns_nested_df <- fromJSON(patterns_nested_json[[1]])
pattern_data_csv <- read.csv("../data/patterns_to_csv.csv")
pattern_text_csv <- read.csv("../data/patterns_data_text.csv")

Removing Unneeded Columns

There are many different data points which were collected with the get_patterns function. Many of them are specific to identification or excess date information about when patterns were created which I may include later but will remove for now. I’m generally looking for information about the needles and yarn needed which includes type or size on both metrics. The other information I hope to get is the pattern type which has entries such as hat, scarf, towel, dishcloth, etc. There aren’t many of those patterns which is why to get significant amounts I aim to get large amounts of data. So, on this first step I will list all of the columns and then select the ones I’m interested in.

Code

patterns_df <- pattern_data_csv %>% select(c("comments_count", # metric of interest
  "currency", "price","currency_symbol","free", # financial aspects
  "published", # when the pattern was published
  "difficulty_average", # recorded difficulty average
  "favorites_count", "projects_count", "queued_projects_count", # other metrics of interest
  "gauge", "gauge_divisor", "gauge_pattern","row_gauge", # technical gauge aspects
  "id", "name", "permalink", # referencing information
  "rating_average","rating_count", # ratings information
  "yardage","yardage_max","yarn_weight_description", # amount and weight of yarn required for pattern
  "yarn_weight", # nested dataframe with weight information, could be redundant
))

dim(patterns_df)

99921
23

Variables were dropped for a couple of reasons. Some had excessive amounts of irrelevant information like the photos column (dropped in data gathering) or were too internally different for analysis like the sizes_available (dropped in data gathering) which was entered differently each time. Many were booleans relating to pattern availability which was irrelevant to a detailed database. Others, lastly, were redundant information which repackaged other variables; in these cases I used the columns that already had the original data instead of the combination. This saved me a few steps here and there. Overall, I have dropped from an initial column amount of 53 to 30, however this amount may increase as I start to expand some of the nested data frames.

Splitting Simpler Nested Dataframes

Out of all of the nested dataframes there are some which just contain 3 columns of ids, values, or other without multiple rows. These I hope to go through and represent as a couple of columns. The other more dense nested dataframes will be parsed later as those often have a couple layers or many more options of data. I will start looking at instances of all the nested data frames.

Code

head(patterns_nested_df)
dim(patterns_nested_df)
patterns_df_denest <- cbind(patterns_df,patterns_nested_df,pattern_text_csv$notes)
dim(patterns_df_denest)

A data.frame: 6 × 5
	pattern_needle_sizes	craft	pattern_categories	pattern_attributes	pattern_type
	<list>	<list>	<list>	<list>	<list>
1	7 , 7 , 4.5 , , FALSE , TRUE , , US 7 - 4.5 mm, 4.5	2 , Knitting, knitting	895 , Other , other-accessories, 337 , Accessories , accessories , 301 , Categories , categories	265 , 267 , 311 , chart , written-pattern , stripes-colorwork	FALSE, 9 , Other, other
2	2 , 2.0 , 2.75 , 1 , TRUE , FALSE , C , 2.75 mm (C), 2.75	1 , Crochet, crochet	306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories	62 , 114 , 150 , 265 , 267 , lace , sleeves , straight , chart , written-pattern	TRUE , 4 , Pullover, pullover
3	5 , 7 , 5 , 7 , 3.75 , 4.5 , , , FALSE , FALSE , TRUE , TRUE , F , , US 5 - 3.75 mm, US 7 - 4.5 mm , 3.75 , 4.5	2 , Knitting, knitting	306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories	3 , 10 , 181 , 205 , 211 , 257 , 265 , 267 , 286 , 88 , unisex , adult , stranded , seamless , top-down , positive-ease , chart , written-pattern, in-the-round , circular-yoke	TRUE , 4 , Pullover, pullover
4	9 , 9 , 5.5 , , FALSE , TRUE , I , US 9 - 5.5 mm, 5.5	2 , Knitting, knitting	306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories	1 , 3 , 9 , 10 , 64 , 66 , 91 , 150 , 204 , 205 , 211 , 257 , 267 , 285 , 286 , 311 , male , unisex , teen , adult , ribbed , stripes , boat-neck , straight , one-piece , seamless , top-down , positive-ease , written-pattern , worked-flat , in-the-round , stripes-colorwork	TRUE , 4 , Pullover, pullover
5	21 , 20 , 9 , 1½ , 2½ , 9 , 2.5 , 3 , 5.5 , , , , FALSE , FALSE , FALSE , TRUE , TRUE , TRUE , , , I , US 1½ - 2.5 mm, US 2½ - 3.0 mm, US 9 - 5.5 mm, 2.5 , 3 , 5.5	2 , Knitting, knitting	339 , Scarf , scarf , 338 , Neck / Torso, neck-torso , 337 , Accessories , accessories , 301 , Categories , categories	8 , 10 , 64 , 267 , 268 , 285 , child , adult , ribbed , written-pattern, video-tutorial , worked-flat	TRUE , 1 , Scarf, scarf
6	7 , 7 , 4.5 , , FALSE , TRUE , , US 7 - 4.5 mm, 4.5	2 , Knitting, knitting	306 , Pullover , pullover , 319 , Sweater , sweater , 302 , Clothing , clothing , 301 , Categories, categories	2 , 10 , 18 , 86 , 103 , 267 , 285 , 295 , female , adult , fitted , seamed , scoop-neck , written-pattern, worked-flat , other-edging	TRUE , 4 , Pullover, pullover

99921
5

99921
29

Overall the path forward here is to collapse the columns instance by instance and then filter out redundant information while keeping some identification for potential merge.

Pattern_needle_sizes

For pattern_needle_size we can record both crochet and knit values and then clear out redundant information. In this process we somehow lost a couple data entries, I’ll look into this and just blast through the rest. I added a distinct because some patterns had multiple needles sets but I don’t really care so I’ll just take one of them at random.

Code

## Example:
patterns_df_denest[5,24]

## precleaning step:
dim_accum <- c()
for(i in 1:nrow(patterns_df_denest)){
  each <- patterns_df_denest[i,24][[1]] %>% class()
  dim_accum <- c(dim_accum,each)
}
patterns_df_denest_prep <- patterns_df_denest

patterns_df_denest_prep$needle_check <- dim_accum

## Unnesting
patterns_df_denest1 <- patterns_df_denest_prep %>% 
  filter(dim_accum == "data.frame") %>%
  unnest(pattern_needle_sizes, names_sep = "_nested_") %>%
  distinct(id,permalink,.keep_all = TRUE) %>% select(-c(
    "pattern_needle_sizes_nested_id","pattern_needle_sizes_nested_us_steel","pattern_needle_sizes_nested_crochet","pattern_needle_sizes_nested_knitting","pattern_needle_sizes_nested_pretty_metric","needle_check" # removing the other redundant columns
  ))

A data.frame: 3 × 9
	id	us	metric	us_steel	crochet	knitting	hook	name	pretty_metric
	<int>	<chr>	<dbl>	<chr>	<lgl>	<lgl>	<chr>	<chr>	<chr>
1	21	1½	2.5		FALSE	TRUE		US 1½ - 2.5 mm	2.5
2	20	2½	3.0		FALSE	TRUE		US 2½ - 3.0 mm	3
3	9	9	5.5		FALSE	TRUE	I	US 9 - 5.5 mm	5.5

Craft

There is a similar process for craft which may also have the side effect of allowing us to remove some of the columns from the needles dataset.

Code

## Example
patterns_df_denest1[3,28][[1]]

patterns_df_denest2 <- patterns_df_denest1 %>% 
  unnest(craft, names_sep = "_nested_") %>%
  distinct(id,permalink,.keep_all = TRUE) %>% select(-c(
    "craft_nested_id","craft_nested_permalink"
  ))

A data.frame: 1 × 3

id name permalink

<int> <chr> <chr>

1 2 Knitting knitting

A data.frame: 1 × 3
	id	name	permalink
1	2	Knitting	knitting

Pattern_type

This is an important column to be able to separate out as well as this information will be acting as the main target column for all of the prediction datasets.

Code

## Example
patterns_df_denest2[6,31][[1]]

patterns_df_denest3 <- patterns_df_denest2 %>% 
  unnest(pattern_type,names_sep = "_nested_") %>%
  distinct(id,permalink,.keep_all = TRUE) %>% select(-c(
    "pattern_type_nested_id","pattern_type_nested_name","pattern_type_nested_clothing"
  ))

A data.frame: 1 × 4

clothing id name permalink

<lgl> <int> <chr> <chr>

1 TRUE 4 Pullover pullover

A data.frame: 1 × 4
	clothing	id	name	permalink
1	TRUE	4	Pullover	pullover

We were successful although the naming convention may be changed later on as the names of the columns are fairly dense.

Pattern_categories

The pattern categories may not have as much importance later on as there are far too many categories although there may be room for visualizations in the categories.

Code

## Example
patterns_df_denest3[[30277,29]][[1]] %>% class()

#1:nrow(patterns_df_denest3)

for(i in 1:nrow(patterns_df_denest3)){
    piece <- patterns_df_denest3[[i,29]] 
    if(class(piece[[1]]) == "data.frame"){
      piece <- piece %>%
                  as.data.frame() %>%
                  select(permalink,parent.permalink)
      piece <- piece[1,]
    }
    else{
      piece <- c(NA,NA) %>% t() %>% as.data.frame()
      colnames(piece) = c("permalink","parent.permalink")
    }
                   
    if(i == 1){
      category_accum <- piece
    } else{
      category_accum <- rbind(category_accum,piece)
    }
}

colnames(category_accum) <- c("category_permalink","category_parent_permalink")

nrow(category_accum)
nrow(patterns_df_denest3)

## Joining
patterns_df_denest4 <- cbind(patterns_df_denest3,category_accum) %>%
                          select(-pattern_categories)

'list'

96537

Pattern_attributes

Pattern attributes are another fairly difficult nested frame to consider as some patterns have multiple categories or categories are further nested. For this analysis I attempted it with the code below but have left it for future analysis.

Code

## Example:
patterns_df_denest4[[1,29]]

## Getting list of attributes

attributes_total <- c()
for(i in 1:nrow(patterns_df_denest4)){
  attributes <- patterns_df_denest4[[i,29]]$permalink
  attributes_total <- c(attributes_total,attributes) %>% unique()
}

attributes_table <- data.frame(t(rep(NA,length(attributes_total))))
colnames(attributes_table) <- attributes_total[order(attributes_total)]
attributes_table <- attributes_table[-1,]
attributes_table_og <- attributes_table

for(i in 1:1){
  print(i/21)
  print(Sys.time())
  attributes_table_i <- attributes_table_og
  for(g in 1:(nrow(patterns_df_denest4)/21)){
    print(g)
    attributes <- patterns_df_denest4[[g,29]]$permalink
    attributes_g <- data.frame(t(rep(1,length(attributes))))
    colnames(attributes_g) <- attributes
    attributes_table_i <- rbind.fill(attributes_table_i,attributes_g)
  }
  rbind(attributes_table,attributes_table_i)
}

Type casting and column renaming

Now that most of the nested dataframes have been unpacked, it’s time to check for column types. Some of the values are characters when I want them to be Integers so I will be going through it all and getting the desired values. The names are also horrendous given the multiple rounds of unnesting so I will go through and adjust for relevance, finally reordering the columns for ease of use in the future.

Code

patterns_denested <- patterns_df_denest4

colnames(patterns_denested)

colnames(patterns_denested)[24:31] <- c("needle_sizes_us","needle_sizes_metric","hook_size","needle_sizes_name","craft","pattern_attributes","pattern_type","pattern_desc")

With the columns as they are we will now be shifting some of the columns to numeric measurements. Most columns are already in character form so the categorical or word based groups are not as important to adapt. The exception to this will be the yarn_weight_decription column as there are many names for sizes but no specific ranking so I will be grouping some of them together.

Code

numeric_columns <- c(1,3,7,8,9,10,11,12,14,15,18,19,20,21,24)

for(i in 1:ncol(patterns_denested)){
  if(i %in% numeric_columns){
    patterns_denested[,i] <- patterns_denested[,i] %>% as.numeric()
  }
}

levels(factor(patterns_denested$yarn_weight_description))
for(i in 1:nrow(patterns_denested)){
  value <- patterns_denested[i,22]
  if(value == "Lace"|value == "Thread"|value == "Light Fingering"){
    patterns_denested[i,22] <- 0
  } else if(value == "Fingering (14 wpi)"|value == "Cobweb"){
    patterns_denested[i,22] <- 1
  } else if(value == "Sport (12 wpi)"|value == "DK / Sport"){
    patterns_denested[i,22] <- 2
  }  else if(value == "DK (11 wpi)"| value == ""| value == "Any gauge - designed for any gauge"){
    patterns_denested[i,22] <- 3
  } else if(value == "Aran / Worsted"| value == "Aran (8 wpi)"|value == "Worsted (9 wpi)"){
    patterns_denested[i,22] <- 4
  } else if(value == "Bulky (7 wpi)"|value == "Super Bulky (5-6 wpi)"){
    patterns_denested[i,22] <- 5
  } else if(value == "Jumbo (0-4 wpi)"){
    patterns_denested[i,22] <- 6
  }
}

patterns_denested$yarn_weight_description <- patterns_denested$yarn_weight_description %>% as.numeric()

Conclusion

The data has now been prepared for use in models. There will be further trimming done around outliers but that will be left to the exploratory data analysis tab where more understanding of the underlying phenomena can be found. Otherwise we have suitable columns for both target and predictor variables, categories which can represent the distribution of patterns available, and relevant information about patterns which can be used to better understand the dataset we have.

The last step is to download the cleaned data set and begin analysis from there.

Downloading data sets

Code

patterns_df_denest_done <- patterns_denested %>% select(-pattern_attributes)
write.csv(patterns_df_denest_done,"../data/pattern_data_cleaned.csv")

Introduction

Contents

Inserting Data

Removing Unneeded Columns

Splitting Simpler Nested Dataframes

Pattern_needle_sizes

Craft

Pattern_type

Pattern_categories

Pattern_attributes

Type casting and column renaming

Conclusion

Downloading data sets