
Generate crosstabulations from dataframe of categorical variables in survey



By : user2956089
Date : November 22 2020, 10:54 AM
Here we get the frequency for each column using lapply and table. lapply hands each column of the data.frame to the function as a list element; inside the function the column is converted to a factor with levels 0:5 (so categories that never occur still get a count of 0) and then tabulated with table. prop.table gives the proportions, cbind puts Freq and Percent side by side, do.call(cbind, ...) binds the per-column results into one table, and finally the column names are rewritten so each pair is prefixed with its variable name.
code :
res <- do.call(cbind, lapply(df, function(x) {
  x1 <- table(factor(x, levels = 0:5,
                     labels = c('No', 'Poor', 'Somewhat Effective',
                                'Good', 'Very Good', 'NA')))
  cbind(Freq = x1, Percent = round(100 * prop.table(x1), 2))
}))
colnames(res) <- paste(rep(paste0('V', 1:7), each = 2),
                       colnames(res), sep = ".")

  head(res,2)
  #     V1.Freq V1.Percent V2.Freq V2.Percent V3.Freq V3.Percent V4.Freq
  #No         9        100       9      56.25       9        100       9
  #Poor       0          0       1       6.25       0          0       0
  #     V4.Percent V5.Freq V5.Percent V6.Freq V6.Percent V7.Freq V7.Percent
  #No        81.82       9        100       8      66.67       8         80
  #Poor       0.00       0          0       0       0.00       0          0
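
For readers working in Python, the same count-then-percent idea can be sketched with pandas. This is only a rough analogue of the R answer, not a translation of it: the data frame, the column names V1 and V2, and the helper freq_table are made up for illustration.

```python
import pandas as pd

# Hypothetical survey data: two items scored 0-5 (not the asker's data).
df = pd.DataFrame({"V1": [0, 0, 1, 5], "V2": [0, 2, 2, 3]})

labels = ["No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA"]

def freq_table(col):
    """Frequency and percentage for one item (hypothetical helper)."""
    # Count every code 0..5, including codes that never occur.
    freq = col.value_counts().reindex(range(6), fill_value=0)
    percent = (100 * freq / freq.sum()).round(2)
    out = pd.DataFrame({"Freq": freq, "Percent": percent})
    out.index = labels
    return out

# One Freq/Percent pair per survey item, side by side.
res = pd.concat({name: freq_table(df[name]) for name in df.columns}, axis=1)
# Flatten the two-level column index into names like "V1.Freq".
res.columns = [f"{item}.{stat}" for item, stat in res.columns]
print(res.head(2))
```

The reindex call plays the role of the fixed factor levels in the R code: without it, categories with no responses would be missing from the table.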


Generate lists from dataframe with even representation between multiple categorical variables



By : silenove
Date : March 29 2020, 07:55 AM
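My coworker found a solution, and I think it also explains the problem better. The idea is to start from a random, roughly even assignment of people to groups, then repeatedly pick two rows, swap their groups, and keep the swap only if it reduces the between-group difference in the one-hot encoded variables.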
code :
import pandas as pd
import random
import math
import itertools

def n_per_group(n, n_groups):
    """find the size of each group when splitting n people into n_groups"""
    # base size per group, plus the number of people left over
    base, rem = divmod(n, n_groups)
    # the first `rem` groups take one extra person so the sizes sum to n
    return [base + 1 if k < rem else base for k in range(n_groups)]

def assign_groups(n, n_groups):
    """split the n people in n_groups pretty evenly, and randomize"""
    n_per = n_per_group(n ,n_groups)
    groups = list(itertools.chain(*[i[0]*[i[1]] for i in zip(n_per,list(range(n_groups)))]))
    random.shuffle(groups)
    return groups

def group_diff(df, g1, g2):
    """calculate the between group score difference"""
    a = df.loc[df['group']==g1, ~df.columns.isin(('A','group'))].sum()
    b = df.loc[df['group']==g2, ~df.columns.isin(('A','group'))].sum()
    #print(a)
    return abs(a-b).sum()

def swap_groups(df, row1, row2):
    """swap the groups of the people in row1 and row2"""
    r1group = df.loc[row1,'group']
    r2group = df.loc[row2,'group']
    df.loc[row2,'group'] = r1group
    df.loc[row1,'group'] = r2group
    return df

def row_to_group(df, row):
    """get the group associated to a given row"""
    return df.loc[row,'group']

def swap_and_score(df, row1, row2):
    """
    given two rows, calculate the between group scores
    originally, and if we swap rows. If the score difference
    is minimized by swapping, return the swapped df, otherwise
    return the orignal (swap back)
    """
    #orig = df
    g1 = row_to_group(df,row1)
    g2 = row_to_group(df,row2)
    s1 = group_diff(df,g1,g2)
    df = swap_groups(df, row1, row2)
    s2 = group_diff(df,g1,g2)
    #print(s1,s2)
    if s1>s2:
        #print('swap')
        return df
    else:
        return swap_groups(df, row1, row2)

def pairwise_scores(df):
    d = []
    for i in range(n_groups):
        for j in range(i+1,n_groups):
            d.append(group_diff(df,i,j))
    return d

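# df, n (the number of rows in df) and n_groups are not defined in this
# snippet; they must already exist before this point
# one hot encode and copy
df_dum = pd.get_dummies(df, columns=['B', 'C', 'D']).copy(deep=True)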

#drop extra cols as needed

groups = assign_groups(n, n_groups)
df_dum['group'] = groups

# iterate
for _ in range(5000):
    rows = random.choices(list(range(n)),k=2)
    #print(rows)
    df_dum = swap_and_score(df_dum,rows[0],rows[1])
    #print(pairwise_scores(df))

print(pairwise_scores(df_dum))

df['group'] = df_dum.group
df['orig_groups'] = groups

# print the per-column difference for every pair of groups
for i in range(n_groups):
    for j in range(i+1,n_groups):
        a = df_dum.loc[df_dum['group']==i, ~df_dum.columns.isin(('A','group'))].sum()
        b = df_dum.loc[df_dum['group']==j, ~df_dum.columns.isin(('A','group'))].sum()
        print(a-b)
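
The starting assignment (even group sizes, then a shuffle) can be checked in isolation before running the swap loop. The sketch below uses only the standard library; split_sizes and random_assignment are illustrative names, not part of the answer's code, and the seed argument is an added assumption to make runs repeatable.

```python
import random

def split_sizes(n, n_groups):
    """Sizes of n_groups groups that differ by at most one and sum to n."""
    base, rem = divmod(n, n_groups)
    return [base + 1 if k < rem else base for k in range(n_groups)]

def random_assignment(n, n_groups, seed=None):
    """Return a shuffled list of n group labels with near-equal group sizes."""
    rng = random.Random(seed)
    labels = [g for g, size in enumerate(split_sizes(n, n_groups))
              for _ in range(size)]
    rng.shuffle(labels)
    return labels

sizes = split_sizes(10, 3)
groups = random_assignment(10, 3, seed=0)
print(sizes)   # [4, 3, 3]
print(groups)  # ten labels drawn from 0, 1, 2 in shuffled order
```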
How to generate dummy variables from two categorical variables?



By : Regis Dantas
Date : March 29 2020, 07:55 AM
Have you tried looking at xi (https://www.stata.com/manuals13/rxi.pdf)? It creates dummies for each of the categorical variables and for the interaction of the two. For example, the following creates dummies for state, for year, and for their interaction:
code :
 xi i.state*i.year
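
For comparison, a similar set of columns can be built in Python with pandas. This is a loose analogue rather than a reproduction of what xi does: the data, the state_x_year prefix, and the choice to keep every level are assumptions made for illustration (pass drop_first=True to get_dummies if a base level should be dropped).

```python
import pandas as pd

# Hypothetical panel with two categorical variables.
df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY"],
    "year": [2000, 2001, 2000, 2001],
})

# Main-effect dummies for each variable (year is cast to string so it is
# treated as a category rather than a number).
main = pd.get_dummies(df[["state", "year"]].astype(str), dtype=int)

# Interaction dummies: one column per observed state-year combination.
pair = df["state"].astype(str) + "_" + df["year"].astype(str)
interaction = pd.get_dummies(pair, prefix="state_x_year", dtype=int)

dummies = pd.concat([main, interaction], axis=1)
print(dummies.columns.tolist())
```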
Create dummy variables from all categorical variables in a dataframe



By : user2146026
Date : March 29 2020, 07:55 AM
To one-hot encode all categorical columns in a dataframe, there is also a one-liner with the fastDummies package. It appends one dummy column per level and keeps the original columns.
code :
fastDummies::dummy_cols(customers)

  id gender  mood outcome gender_male gender_female mood_happy mood_sad
1 10   male happy       1           1             0          1        0
2 20 female   sad       1           0             1          0        1
3 30 female happy       0           0             1          1        0
4 40   male   sad       0           1             0          0        1
5 50 female happy       0           0             1          1        0
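
A pandas counterpart, for anyone doing the same thing in Python. The customers frame below is rebuilt from the printed output above; picking the categorical columns with select_dtypes(exclude="number") is one possible convention, not something the answer prescribes.

```python
import pandas as pd

# Data rebuilt from the printed example above.
customers = pd.DataFrame({
    "id": [10, 20, 30, 40, 50],
    "gender": ["male", "female", "female", "male", "female"],
    "mood": ["happy", "sad", "happy", "sad", "happy"],
    "outcome": [1, 1, 0, 0, 0],
})

# Treat every non-numeric column as categorical.
cat_cols = customers.select_dtypes(exclude="number").columns.tolist()

# Build the dummies and append them, keeping the original columns.
dummy_cols = pd.get_dummies(customers[cat_cols], columns=cat_cols, dtype=int)
encoded = pd.concat([customers, dummy_cols], axis=1)
print(encoded)
```

Note that pandas orders the dummy columns by sorted level name, so gender_female comes before gender_male here.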
Reshape dataframe from categorical variables to only binary variables



By : user2679496
Date : March 29 2020, 07:55 AM
What you're trying to create are called dummy variables, and in R those are created using model.matrix(). Your specific application is a little special, however, because rows with a missing value should come out as NA in every dummy column of that variable, so some extra fiddling is required.
code :
dtf <- data.frame(id=20:24, 
                  f=c("a", "b", "c", "a", "b"), 
                  g=c("A", "C", NA, "B", "A"),
                  h=c("P", "R", "Q", NA, "Q"))

# (the first column is not a categorical variable, hence not included)
dtf2 <- dtf[-1]

# Pre-allocate a list of the appropriate length
l <- vector("list", ncol(dtf2))

# Loop over each column in dtf2
for (j in 1:ncol(dtf2)) {
    # Make sure to include NA as a level 
    data <- dtf2[j]
    data[] <- factor(dtf2[,j], exclude=NULL)

    # Generate contrasts that include all levels
    cont <- contrasts(data[[1]], contrasts=FALSE)

    # Create dummy variables using the above contrasts, excluding intercept
    # Formula syntax is the same as in e.g. lm(), except the response
    # variable (term to the left of ~) is not included. 
    # '-1' means no intercept, '.' means all variables
    modmat <- model.matrix(~ -1+., data=data, contrasts.arg=cont)

    # Find rows to fill with NA
    nacols <- grep(".*NA$", colnames(modmat))

    # Only do the operations if an NA-column was found
    if (length(nacols) > 0) {
       narows <- rowSums(modmat[, nacols, drop=FALSE]) > 0
       modmat[narows,] <- NA
       modmat <- modmat[,-nacols]
    }

    l[[j]] <- modmat
}

data.frame(dtf[1], do.call(cbind, l))
#   id fa fb fc gA gB gC hP hQ hR
# 1 20  1  0  0  1  0  0  1  0  0
# 2 21  0  1  0  0  0  1  0  0  1
# 3 22  0  0  1 NA NA NA  0  1  0
# 4 23  1  0  0  0  1  0 NA NA NA
# 5 24  0  1  0  1  0  0  0  1  0
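
The same result can be sketched in Python, assuming pandas is available. It is not a port of the model.matrix() approach: get_dummies simply leaves rows with a missing value as all zeros, and those rows are then overwritten with NaN. The data is copied from the R example above.

```python
import numpy as np
import pandas as pd

# Same data as the R example above.
dtf = pd.DataFrame({
    "id": range(20, 25),
    "f": ["a", "b", "c", "a", "b"],
    "g": ["A", "C", None, "B", "A"],
    "h": ["P", "R", "Q", None, "Q"],
})

parts = []
for col in ["f", "g", "h"]:
    # Float dtype so the dummy columns can hold NaN.
    d = pd.get_dummies(dtf[col], prefix=col, prefix_sep="", dtype=float)
    # Rows where the source value is missing become NaN in every dummy column.
    d.loc[dtf[col].isna(), :] = np.nan
    parts.append(d)

result = pd.concat([dtf[["id"]]] + parts, axis=1)
print(result)
```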
Analysis of changes of categorical variables in the dataframe



By : user3534446
Date : March 29 2020, 07:55 AM
The simplest thing you can do is to create origin-destination tuples by zipping each user column with its shifted self, and then pass the tuples to a Counter object. Pairs where origin and destination are equal are dropped, so only actual changes are counted.
code :
import pandas as pd
from collections import Counter

# Forward-fill missing values so that gaps do not register as changes
df = df.ffill()

# Create a counter object and pass it the origin-destination tuples
counter = Counter()
for col in df.columns:
    # Pair each value with the previous one; the first row is paired with itself
    routes = list(zip(df[col].shift(1, fill_value=df[col].iloc[0]), df[col]))
    routes = [(k, v) for k, v in routes if k != v]
    counter.update(routes)
counter.most_common(3)
Out[76]: 
[(('Spain', 'USA'), 3),
 (('Portugal', 'Spain'), 2),
 (('Bulgaria', 'Portugal'), 1)]
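
A self-contained toy run of the same counting idea follows. The two-user data frame is invented for illustration, and this variant pairs each value with its successor using plain list slicing instead of shift, which gives the same transitions.

```python
import pandas as pd
from collections import Counter

# Hypothetical data: one column per user, one row per time step.
df = pd.DataFrame({
    "user1": ["Spain", "Spain", "USA"],
    "user2": ["Portugal", "Spain", "USA"],
})

counter = Counter()
for col in df.columns:
    values = df[col].tolist()
    # Pair each value with its successor and keep only actual changes.
    counter.update((a, b) for a, b in zip(values, values[1:]) if a != b)

print(counter.most_common(1))  # [(('Spain', 'USA'), 2)]
```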