# Generate crosstabulations from dataframe of categorical variables in survey

By : user2956089
Date : November 22 2020, 10:54 AM
Here, we get the frequency for each column using `lapply` and `table`. `lapply` iterates over the data.frame as a list; for each column we convert it to a factor with levels `0:5` (labelled 'No' through 'NA') and tabulate it with `table`. `prop.table` gives the proportions, `cbind` joins the Freq and Percent columns, `do.call(cbind, ...)` turns the list back into a single matrix, and finally we rebuild the column names.
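For comparison, the same Freq/Percent crosstab idea can be sketched in pandas. This is only an illustration with made-up data and column names, not part of the original answer:

```python
import pandas as pd

# Hypothetical survey data: each column holds response codes 0-5
df = pd.DataFrame({"V1": [0, 1, 0, 2], "V2": [0, 0, 5, 3]})

labels = ["No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA"]

def freq_percent(col):
    # Count each code 0-5 (including codes that never occur), then add percentages
    freq = col.value_counts().reindex(range(6), fill_value=0)
    freq.index = labels
    return pd.DataFrame({"Freq": freq,
                         "Percent": (100 * freq / freq.sum()).round(2)})

# One Freq/Percent pair per column, joined side by side like the R result
res = pd.concat({c: freq_percent(df[c]) for c in df.columns}, axis=1)
print(res)
```

`pd.concat` with a dict produces a two-level column index (`("V1", "Freq")`, `("V1", "Percent")`, ...), which plays the role of the pasted `V1.Freq`/`V1.Percent` names in the R version.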
``````
res <- do.call(cbind, lapply(df, function(x) {
  x1 <- table(factor(x, levels = 0:5,
                     labels = c('No', 'Poor', 'Somewhat Effective',
                                'Good', 'Very Good', 'NA')))
  cbind(Freq = x1, Percent = round(100 * prop.table(x1), 2))
}))
colnames(res) <- paste(rep(paste0('V', 1:7), each = 2),
                       colnames(res), sep = ".")

#     V1.Freq V1.Percent V2.Freq V2.Percent V3.Freq V3.Percent V4.Freq
#No         9        100       9      56.25       9        100       9
#Poor       0          0       1       6.25       0          0       0
#     V4.Percent V5.Freq V5.Percent V6.Freq V6.Percent V7.Freq V7.Percent
#No        81.82       9        100       8      66.67       8         80
#Poor       0.00       0          0       0       0.00       0          0
``````

## Generate lists from dataframe with even representation between multiple categorical variables

By : silenove
Date : March 29 2020, 07:55 AM
My co-worker found a solution, and I think the solution also better explains the problem.
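The core of the listing below is a greedy local search: repeatedly pick two random rows, swap their group labels, and keep the swap only if the between-group difference shrinks. A minimal, self-contained sketch of that idea (illustrative names, plain scalar scores instead of the dummy-column sums used below):

```python
import random

def balance(values, iters=2000, seed=0):
    """Split values into two groups with roughly equal sums via greedy swaps."""
    rng = random.Random(seed)
    groups = [i % 2 for i in range(len(values))]  # alternate initial assignment

    def diff(g):
        s0 = sum(v for v, k in zip(values, g) if k == 0)
        s1 = sum(v for v, k in zip(values, g) if k == 1)
        return abs(s0 - s1)

    for _ in range(iters):
        i, j = rng.randrange(len(values)), rng.randrange(len(values))
        before = diff(groups)
        groups[i], groups[j] = groups[j], groups[i]  # propose a swap
        if diff(groups) >= before:
            groups[i], groups[j] = groups[j], groups[i]  # revert: no improvement
    return groups
```

Because a swap is kept only when it strictly reduces the imbalance, the difference never gets worse than the initial assignment; the full answer applies the same accept/revert loop to one-hot-encoded category counts.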
code :
``````
import pandas as pd
import random
import math
import itertools

# n, n_groups and df (with columns 'A', 'B', 'C', 'D') are assumed
# to be defined earlier, as in the question.

def n_per_group(n, n_groups):
    """find the size of each group when splitting n people into n_groups"""
    base = math.floor(n / n_groups)
    rem = n % n_groups
    return [base + 1 if k < rem else base for k in range(n_groups)]

def assign_groups(n, n_groups):
    """split the n people into n_groups pretty evenly, and randomize"""
    n_per = n_per_group(n, n_groups)
    groups = list(itertools.chain(*[count * [g] for count, g in
                                    zip(n_per, range(n_groups))]))
    random.shuffle(groups)
    return groups

def group_diff(df, g1, g2):
    """calculate the between-group score difference"""
    a = df.loc[df['group'] == g1, ~df.columns.isin(('A', 'group'))].sum()
    b = df.loc[df['group'] == g2, ~df.columns.isin(('A', 'group'))].sum()
    return abs(a - b).sum()

def swap_groups(df, row1, row2):
    """swap the groups of the people in row1 and row2"""
    r1group = df.loc[row1, 'group']
    r2group = df.loc[row2, 'group']
    df.loc[row2, 'group'] = r1group
    df.loc[row1, 'group'] = r2group
    return df

def row_to_group(df, row):
    """get the group associated with a given row"""
    return df.loc[row, 'group']

def swap_and_score(df, row1, row2):
    """
    given two rows, calculate the between-group scores
    originally, and if we swap the rows. If the score difference
    is reduced by swapping, return the swapped df, otherwise
    return the original (swap back)
    """
    g1 = row_to_group(df, row1)
    g2 = row_to_group(df, row2)
    s1 = group_diff(df, g1, g2)
    df = swap_groups(df, row1, row2)
    s2 = group_diff(df, g1, g2)
    if s1 > s2:
        return df
    else:
        return swap_groups(df, row1, row2)

def pairwise_scores(df):
    d = []
    for i in range(n_groups):
        for j in range(i + 1, n_groups):
            d.append(group_diff(df, i, j))
    return d

# one-hot encode and copy
df_dum = pd.get_dummies(df, columns=['B', 'C', 'D']).copy(deep=True)

# drop extra cols as needed

groups = assign_groups(n, n_groups)
df_dum['group'] = groups

# iterate: propose random swaps and keep the ones that help
for _ in range(5000):
    row1, row2 = random.choices(list(range(n)), k=2)
    df_dum = swap_and_score(df_dum, row1, row2)

print(pairwise_scores(df_dum))

df['group'] = df_dum.group
df['orig_groups'] = groups

# inspect the remaining per-column differences between each pair of groups
for i in range(n_groups):
    for j in range(i + 1, n_groups):
        a = df_dum.loc[df_dum['group'] == i, ~df_dum.columns.isin(('A', 'group'))].sum()
        b = df_dum.loc[df_dum['group'] == j, ~df_dum.columns.isin(('A', 'group'))].sum()
        print(a - b)
``````

## How to generate dummy variables from two categorical variables?

By : Regis Dantas
Date : March 29 2020, 07:55 AM
Have you tried looking at Stata's xi (https://www.stata.com/manuals13/rxi.pdf)? It will create dummies for each of the two categorical variables and for their interaction. So you can do:
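For readers working in pandas rather than Stata, a rough equivalent of `xi i.state*i.year` can be sketched like this (the `state`/`year` data here is made up): build an interaction column, then one-hot encode all three.

```python
import pandas as pd

# Hypothetical data standing in for the Stata variables
df = pd.DataFrame({"state": ["CA", "NY", "CA"],
                   "year": [2000, 2001, 2001]})

# Interaction term: one category per (state, year) combination
df["state_year"] = df["state"].astype(str) + "_" + df["year"].astype(str)

# Dummies for each categorical variable and for the interaction
dummies = pd.get_dummies(df, columns=["state", "year", "state_year"])
print(dummies.columns.tolist())
```

Unlike Stata's `xi`, this keeps every level (no base level is dropped); pass `drop_first=True` to `get_dummies` if you need reference coding for a regression.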
code :
`````` xi i.state*i.year
``````

## Create dummy variables from all categorical variables in a dataframe

By : user2146026
Date : March 29 2020, 07:55 AM
The question asks how to one-hot encode all categorical columns in a dataframe. This is also possible as a one-liner with the fastDummies package.
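The same encoding is a near one-liner in pandas as well, shown here on a small made-up frame mirroring the example below; note that `pd.get_dummies` replaces the encoded columns, so they are concatenated back to keep the originals alongside the dummies, as fastDummies does:

```python
import pandas as pd

# Hypothetical frame mirroring the `customers` example
customers = pd.DataFrame({
    "id": [10, 20, 30],
    "gender": ["male", "female", "female"],
    "mood": ["happy", "sad", "happy"],
})

# Keep the original columns and append one dummy column per category level
encoded = pd.concat(
    [customers, pd.get_dummies(customers[["gender", "mood"]])], axis=1)
print(encoded.columns.tolist())
```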
code :
``````
fastDummies::dummy_cols(customers)

  id gender  mood outcome gender_male gender_female mood_happy mood_sad
1 10   male happy       1           1             0          1        0
2 20 female   sad       1           0             1          0        1
3 30 female happy       0           0             1          1        0
4 40   male   sad       0           1             0          0        1
5 50 female happy       0           0             1          1        0
``````

## Reshape dataframe from categorical variables to only binary variables

By : user2679496
Date : March 29 2020, 07:55 AM
What you're trying to create are called dummy variables, and in R those are created using model.matrix(). Your specific application is a little special, however, so some extra fiddling is required.
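The distinctive part of this answer is that missing values propagate: a row with NA in a categorical column gets NA in all of that column's dummies, rather than all zeros. That behaviour can be sketched in pandas too (single made-up column for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column standing in for `g` in the R example
dtf = pd.DataFrame({"g": ["A", "C", np.nan, "B", "A"]})

# One dummy per observed level; NaN rows encode as all zeros by default
dum = pd.get_dummies(dtf["g"], prefix="g", dtype=float)

# Propagate missingness: rows where g is NaN become NaN across all g-dummies
dum[dtf["g"].isna()] = np.nan
print(dum)
```

This mirrors the R code's step of finding the NA rows via the `fNA`-style columns and blanking them out, just expressed as a direct boolean mask.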
code :
``````
dtf <- data.frame(id = 20:24,
                  f = c("a", "b", "c", "a", "b"),
                  g = c("A", "C", NA, "B", "A"),
                  h = c("P", "R", "Q", NA, "Q"))

# (the first column is not a categorical variable, hence not included)
dtf2 <- dtf[-1]

# Pre-allocate a list of the appropriate length
l <- vector("list", ncol(dtf2))

# Loop over each column in dtf2
for (j in 1:ncol(dtf2)) {
  # Make sure to include NA as a level
  data <- dtf2[j]
  data[] <- factor(dtf2[, j], exclude = NULL)

  # Generate contrasts that include all levels (no level is dropped)
  cont <- lapply(data, contrasts, contrasts = FALSE)

  # Create dummy variables using the above contrasts, excluding intercept
  # Formula syntax is the same as in e.g. lm(), except the response
  # variable (term to the left of ~) is not included.
  # '-1' means no intercept, '.' means all variables
  modmat <- model.matrix(~ -1 + ., data = data, contrasts.arg = cont)

  # Find the columns generated from the NA level
  nacols <- grep("NA$", colnames(modmat))

  # Only do the operations if an NA column was found
  if (length(nacols) > 0) {
    # Rows flagged in an NA column get NA across all dummies
    narows <- rowSums(modmat[, nacols, drop = FALSE]) > 0
    modmat[narows, ] <- NA
    modmat <- modmat[, -nacols]
  }

  l[[j]] <- modmat
}

data.frame(dtf, do.call(cbind, l))
#   id fa fb fc gA gB gC hP hQ hR
# 1 20  1  0  0  1  0  0  1  0  0
# 2 21  0  1  0  0  0  1  0  0  1
# 3 22  0  0  1 NA NA NA  0  1  0
# 4 23  1  0  0  0  1  0 NA NA NA
# 5 24  0  1  0  1  0  0  0  1  0
``````

## Analysis of changes of categorical variables in the dataframe

By : user3534446
Date : March 29 2020, 07:55 AM
The simplest thing you can do is to create origin-destination tuples by zipping each user column with its shifted self, and then pass the tuples to a Counter object.
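The zip-with-shifted-self trick is easiest to see on a plain list first (made-up locations standing in for one user column): pair each value with the next one, drop the self-transitions, and count.

```python
from collections import Counter

# Hypothetical location history for a single user
locations = ["Spain", "Spain", "USA", "USA", "Portugal"]

# Pair each location with the one that follows it ...
pairs = zip(locations, locations[1:])
# ... keep only actual moves (origin != destination) ...
routes = [(a, b) for a, b in pairs if a != b]
# ... and count how often each route occurs
counts = Counter(routes)
print(counts)
```

The pandas version below does exactly this per column, using `shift` instead of list slicing so the pairing stays index-aligned.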
code :
``````
import pandas as pd
from collections import Counter

# Carry the last seen location forward over missing values
df.fillna(method='ffill', inplace=True)

# Create a counter object and pass it the origin-destination tuples
counter = Counter()
for col in df.columns:
    # Pair each value with the previous one; the first row pairs with itself
    routes = list(zip(df[col].shift(1, fill_value=df[col].iloc[0]), df[col]))
    # Drop self-transitions
    routes = [(k, v) for k, v in routes if k != v]
    counter.update(routes)
counter.most_common(3)
``````
``````
counter.most_common(3)
Out:
[(('Spain', 'USA'), 3),
 (('Portugal', 'Spain'), 2),
 (('Bulgaria', 'Portugal'), 1)]
``````