Here we get the frequency for each column using lapply and table. lapply passes each column of the data.frame to a function, where we convert the column to a factor with levels 0:5, tabulate it with table, get the proportions with prop.table, and cbind the Freq and Percent. Finally we combine the list into a matrix with do.call(cbind, ...) and rename the colnames. code :
res <- do.call(cbind, lapply(df, function(x) {
  x1 <- table(factor(x, levels=0:5,
                     labels=c('No', 'Poor', 'Somewhat Effective',
                              'Good', 'Very Good', 'NA')))
  cbind(Freq=x1, Percent=round(100*prop.table(x1), 2))}))
colnames(res) <- paste(rep(paste0('V', 1:7), each=2),
                       colnames(res), sep=".")
head(res,2)
# V1.Freq V1.Percent V2.Freq V2.Percent V3.Freq V3.Percent V4.Freq
#No 9 100 9 56.25 9 100 9
#Poor 0 0 1 6.25 0 0 0
# V4.Percent V5.Freq V5.Percent V6.Freq V6.Percent V7.Freq V7.Percent
#No 81.82 9 100 8 66.67 8 80
#Poor 0.00 0 0 0 0.00 0 0
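For comparison, the same per-column Freq/Percent table can be sketched in pandas. This is a hypothetical example frame (three survey columns scored 0-5), not the original data:

```python
import pandas as pd

# Hypothetical survey data: each column holds scores 0-5
df = pd.DataFrame({'V1': [0, 0, 1, 3], 'V2': [0, 2, 2, 5], 'V3': [0, 0, 0, 4]})

labels = ['No', 'Poor', 'Somewhat Effective', 'Good', 'Very Good', 'NA']

parts = {}
for col in df.columns:
    # Count each level 0-5, keeping empty levels, then relabel
    freq = df[col].value_counts().reindex(range(6), fill_value=0)
    freq.index = labels
    parts[f'{col}.Freq'] = freq
    parts[f'{col}.Percent'] = (100 * freq / freq.sum()).round(2)

res = pd.DataFrame(parts)
print(res)
```

As in the R version, reindexing over all six levels keeps zero-count categories in the output.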

Generate lists from dataframe with even representation between multiple categorical variables
By : silenove
Date : March 29 2020, 07:55 AM
My co-worker found a solution, and I think the solution also better explains the problem. code :
import pandas as pd
import random
import math
import itertools

def n_per_group(n, n_groups):
    """find the size of each group when splitting n people into n_groups"""
    base = math.floor(n / n_groups)
    rem = n % n_groups
    # the first `rem` groups get one extra person
    return [base + 1 if k < rem else base for k in range(n_groups)]

def assign_groups(n, n_groups):
    """split the n people into n_groups pretty evenly, and randomize"""
    n_per = n_per_group(n, n_groups)
    groups = list(itertools.chain(*[size * [g] for size, g in zip(n_per, range(n_groups))]))
    random.shuffle(groups)
    return groups

def group_diff(df, g1, g2):
    """calculate the between-group score difference"""
    a = df.loc[df['group']==g1, ~df.columns.isin(('A','group'))].sum()
    b = df.loc[df['group']==g2, ~df.columns.isin(('A','group'))].sum()
    return abs(a-b).sum()

def swap_groups(df, row1, row2):
    """swap the groups of the people in row1 and row2"""
    r1group = df.loc[row1,'group']
    r2group = df.loc[row2,'group']
    df.loc[row2,'group'] = r1group
    df.loc[row1,'group'] = r2group
    return df

def row_to_group(df, row):
    """get the group associated to a given row"""
    return df.loc[row,'group']

def swap_and_score(df, row1, row2):
    """
    given two rows, calculate the between-group scores
    originally, and if we swap rows. If the score difference
    is minimized by swapping, return the swapped df, otherwise
    return the original (swap back)
    """
    g1 = row_to_group(df, row1)
    g2 = row_to_group(df, row2)
    s1 = group_diff(df, g1, g2)
    df = swap_groups(df, row1, row2)
    s2 = group_diff(df, g1, g2)
    if s1 > s2:
        return df
    else:
        return swap_groups(df, row1, row2)

def pairwise_scores(df):
    d = []
    for i in range(n_groups):
        for j in range(i+1, n_groups):
            d.append(group_diff(df, i, j))
    return d

# one-hot encode and copy
df_dum = pd.get_dummies(df, columns=['B', 'C', 'D']).copy(deep=True)
# drop extra cols as needed

groups = assign_groups(n, n_groups)
df_dum['group'] = groups

# iterate
for _ in range(5000):
    rows = random.choices(list(range(n)), k=2)
    df_dum = swap_and_score(df_dum, rows[0], rows[1])
print(pairwise_scores(df_dum))

df['group'] = df_dum.group
df['orig_groups'] = groups

for i in range(n_groups):
    for j in range(i+1, n_groups):
        a = df_dum.loc[df_dum['group']==3, ~df_dum.columns.isin(('A','group'))].sum()
        b = df_dum.loc[df_dum['group']==0, ~df_dum.columns.isin(('A','group'))].sum()
        print(a-b)
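As a quick sanity check, the two assignment helpers can be exercised on their own. This is a self-contained sketch (repeating the helper definitions so it runs standalone); the key invariant is that the group sizes always sum to n:

```python
import math
import random
import itertools

def n_per_group(n, n_groups):
    """find the size of each group when splitting n people into n_groups"""
    base = math.floor(n / n_groups)
    rem = n % n_groups
    # the first `rem` groups get one extra person
    return [base + 1 if k < rem else base for k in range(n_groups)]

def assign_groups(n, n_groups):
    """split n people into n_groups almost evenly, then shuffle the labels"""
    sizes = n_per_group(n, n_groups)
    groups = list(itertools.chain(*[size * [g] for size, g in zip(sizes, range(n_groups))]))
    random.shuffle(groups)
    return groups

sizes = n_per_group(10, 3)
print(sizes)          # group sizes differ by at most one
labels = assign_groups(10, 3)
print(len(labels), sorted(set(labels)))
```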

How to generate dummy variables from two categorical variables?
By : Regis Dantas
Date : March 29 2020, 07:55 AM
Have you tried looking at xi (https://www.stata.com/manuals13/rxi.pdf)? It will create dummies for each of the categorical variables and for the interaction of those two.
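Outside Stata, a rough pandas analogue of xi's dummies-plus-interaction behaviour is to one-hot encode each variable and also encode the combined category. The column names `a` and `b` here are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical frame with two categorical variables
df = pd.DataFrame({'a': ['x', 'y', 'x'], 'b': ['p', 'p', 'q']})

# Dummies for each variable separately
dummies = pd.get_dummies(df, columns=['a', 'b'])

# Interaction dummies: one column per observed (a, b) combination
inter = pd.get_dummies(df['a'] + '#' + df['b'], prefix='a_b')

out = pd.concat([dummies, inter], axis=1)
print(out.columns.tolist())
```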

Create dummy variables from all categorical variables in a dataframe
By : user2146026
Date : March 29 2020, 07:55 AM
I need to one-hot encode all categorical columns in a dataframe. A one-liner with the fastDummies package also works. code :
fastDummies::dummy_cols(customers)
id gender mood outcome gender_male gender_female mood_happy mood_sad
1 10 male happy 1 1 0 1 0
2 20 female sad 1 0 1 0 1
3 30 female happy 0 0 1 1 0
4 40 male sad 0 1 0 0 1
5 50 female happy 0 0 1 1 0
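The same keep-originals-plus-dummies layout that fastDummies::dummy_cols produces can be sketched in pandas with get_dummies and concat (recreating the customers frame shown above):

```python
import pandas as pd

customers = pd.DataFrame({
    'id': [10, 20, 30, 40, 50],
    'gender': ['male', 'female', 'female', 'male', 'female'],
    'mood': ['happy', 'sad', 'happy', 'sad', 'happy'],
    'outcome': [1, 1, 0, 0, 0],
})

# One-hot encode only the categorical columns, then keep the originals
# alongside the dummies, as dummy_cols does
dummies = pd.get_dummies(customers[['gender', 'mood']])
result = pd.concat([customers, dummies], axis=1)
print(result)
```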

Reshape dataframe from categorical variables to only binary variables
By : user2679496
Date : March 29 2020, 07:55 AM
What you're trying to create are called dummy variables, and in R those are created using model.matrix(). Your specific application is a little special, however, so some extra fiddling is required. code :
dtf <- data.frame(id=20:24,
                  f=c("a", "b", "c", "a", "b"),
                  g=c("A", "C", NA, "B", "A"),
                  h=c("P", "R", "Q", NA, "Q"))
# (the first column is not a categorical variable, hence not included)
dtf2 <- dtf[-1]
# Preallocate a list of the appropriate length
l <- vector("list", ncol(dtf2))
# Loop over each column in dtf2
for (j in 1:ncol(dtf2)) {
  # Make sure to include NA as a level
  data <- dtf2[j]
  data[] <- factor(dtf2[,j], exclude=NULL)
  # Generate contrasts that include all levels
  cont <- contrasts(data[[1]], contrasts=FALSE)
  # Create dummy variables using the above contrasts, excluding intercept.
  # Formula syntax is the same as in e.g. lm(), except the response
  # variable (term to the left of ~) is not included.
  # '-1' means no intercept, '.' means all variables
  modmat <- model.matrix(~ -1 + ., data=data,
                         contrasts.arg=setNames(list(cont), names(data)))
  # Find columns corresponding to the NA level
  nacols <- grep(".*NA$", colnames(modmat))
  # Only do the operations if an NA column was found
  if (length(nacols) > 0) {
    narows <- rowSums(modmat[, nacols, drop=FALSE]) > 0
    modmat[narows,] <- NA
    modmat <- modmat[, -nacols]
  }
  l[[j]] <- modmat
}
data.frame(dtf[1], do.call(cbind, l))
# id fa fb fc gA gB gC hP hQ hR
# 1 20 1 0 0 1 0 0 1 0 0
# 2 21 0 1 0 0 0 1 0 0 1
# 3 22 0 0 1 NA NA NA 0 1 0
# 4 23 1 0 0 0 1 0 NA NA NA
# 5 24 0 1 0 1 0 0 0 1 0
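The NA-propagating behaviour above can be approximated in pandas as well: one-hot encode each column, then blank out the dummy row wherever the source value was missing. This is a sketch mirroring the dtf frame, not part of the original answer:

```python
import numpy as np
import pandas as pd

dtf = pd.DataFrame({
    'id': [20, 21, 22, 23, 24],
    'f': ['a', 'b', 'c', 'a', 'b'],
    'g': ['A', 'C', None, 'B', 'A'],
    'h': ['P', 'R', 'Q', None, 'Q'],
})

pieces = [dtf[['id']]]
for col in ['f', 'g', 'h']:
    dummies = pd.get_dummies(dtf[col], prefix=col).astype(float)
    # Propagate missingness: a missing source value blanks the whole row
    dummies.loc[dtf[col].isna()] = np.nan
    pieces.append(dummies)

out = pd.concat(pieces, axis=1)
print(out)
```

As in the model.matrix version, a missing value turns every dummy for that variable into NA rather than into all zeros.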

Analysis of changes of categorical variables in the dataframe
By : user3534446
Date : March 29 2020, 07:55 AM
The simplest thing you can do is to create origin-destination tuples by zipping each user column with its shifted self, and then pass the tuples to a Counter object. code :
import pandas as pd
from collections import Counter

df.fillna(method='ffill', inplace=True)

# Create a counter object and pass it the origin-destination tuples
counter = Counter()
for col in df.columns:
    routes = list(zip(df[col].shift(1, fill_value=df[col][0]), df[col]))
    # keep only actual changes of country
    routes = [(k, v) for k, v in routes if k != v]
    counter.update(routes)

counter.most_common(3)
# [(('Spain', 'USA'), 3),
#  (('Portugal', 'Spain'), 2),
#  (('Bulgaria', 'Portugal'), 1)]
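A self-contained run of the same idea, on a small hypothetical travel log (one column per user, one row per period):

```python
import pandas as pd
from collections import Counter

# Hypothetical travel log: each column is one user's country per period
df = pd.DataFrame({
    'user1': ['Spain', 'USA', 'USA'],
    'user2': ['Portugal', 'Spain', 'USA'],
})

counter = Counter()
for col in df.columns:
    # Pair each value with its predecessor; fill the first slot with itself
    routes = list(zip(df[col].shift(1, fill_value=df[col][0]), df[col]))
    # Keep only actual changes
    routes = [(k, v) for k, v in routes if k != v]
    counter.update(routes)

print(counter.most_common(2))
```

Filling the first shifted slot with the column's own first value ensures the initial row never counts as a transition.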

