I need to find the sum of the columns in every row. You can define the schema and try the approach below. input: code :
A,1,5,45,25,20
B,5,50,5,23,12
C,1,25,4,15,23
A = LOAD 'input' USING PigStorage(',') AS(f1:chararray,f2:int,f3:int,f4:int,f5:int,f6:int);
B = FOREACH A GENERATE f1,SUM(TOBAG(f2..));
DUMP B;
(A,96)
(B,95)
(C,68)
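For readers without a Pig installation, the same per-row sum can be sketched in plain Python; the `rows` list below just mirrors the input file above.

```python
# Python sketch of the Pig script above: sum the numeric fields of each
# comma-separated row, keeping the first field as a label.
rows = [
    "A,1,5,45,25,20",
    "B,5,50,5,23,12",
    "C,1,25,4,15,23",
]

def row_sums(lines):
    out = []
    for line in lines:
        parts = line.split(",")
        # parts[0] is the label; the remaining fields are integers
        out.append((parts[0], sum(int(p) for p in parts[1:])))
    return out

print(row_sums(rows))
# [('A', 96), ('B', 95), ('C', 68)]
```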

Python : Separating a .txt file into columns and finding the most frequent data item in one of the columns
By : Snehal
Date : March 29 2020, 07:55 AM
I'm sure there is a more succinct way of doing it, but this should get you started: code :
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
# (DataFrame.sort was removed from pandas; sort_values is the current API)
tag_counts = tag_counts.sum().sort_values('Count', ascending=False).reset_index()
# keep only the top-ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index
# (.ix was removed from pandas; use .loc):
top_tags.loc['5b11f4cea62d471e81fca69a8278c7da'][0]
# 'Grunge'
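The snippet above assumes an existing `artists_tags` frame with columns `ArtistID`, `Tag`, and `Count`; a minimal self-contained version with made-up data looks like this:

```python
import pandas as pd

# Toy stand-in for artists_tags with the assumed columns
artists_tags = pd.DataFrame({
    "ArtistID": ["a1", "a1", "a2", "a2"],
    "Tag": ["rock", "grunge", "pop", "jazz"],
    "Count": [3, 7, 5, 2],
})
# sum counts per (artist, tag), sort descending, keep the first tag per artist
tag_counts = (artists_tags.groupby(["ArtistID", "Tag"])["Count"]
              .sum()
              .sort_values(ascending=False)
              .reset_index())
top_tags = tag_counts.groupby("ArtistID").first()
print(top_tags.loc["a1", "Tag"])  # grunge
```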

2D Matrix: Finding and deleting columns that are subsets of other columns
By : Jack27
Date : March 29 2020, 07:55 AM
Since the A matrices I'm actually dealing with are 5000x5000 and sparse with about 4% density, I decided to try a sparse matrix approach combined with Python's "set" objects. Overall it's much faster than my original solution, but I feel like my process of going from matrix A to the list of sets D is not as fast as it could be. Any ideas on how to do this better are appreciated. Solution code :
import numpy as np

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])
rows, cols = A.shape
drops = np.zeros(cols).astype(bool)
# sparse nonzero elements
C = np.nonzero(A)
# create a list of sets containing the row indices of the nonzero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])
# find subsets, ignoring columns that are known to already be subsets
for i in range(cols):
    if drops[i]:
        continue
    col1 = D[i]
    for j in range(i + 1, cols):
        col2 = D[j]
        if col2.issubset(col1):
            # I tried `if drops[j]: continue` here, but that was slower
            print("%d is a subset of %d" % (j, i))
            drops[j] = True
        elif col1.issubset(col2):
            print("%d is a subset of %d" % (i, j))
            drops[i] = True
            break
B = A[:, ~drops]
print(B)
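As for speeding up the A-to-D conversion: one simpler sketch builds each column's index set directly with `np.flatnonzero`, one column at a time, rather than looping over every nonzero entry (whether this is faster on a 5000x5000 matrix would need benchmarking).

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 1],
              [1, 0, 1]])
cols = A.shape[1]
# One set of nonzero row indices per column, built column-wise
D = [set(map(int, np.flatnonzero(A[:, j]))) for j in range(cols)]
print(D)  # [{0, 2}, {1}, {1, 2}]
```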

Taking two excel sheets in same workbook and finding same values in certain columns and copy data from other columns
By : user5649904
Date : March 29 2020, 07:55 AM
Thanks to Scott; here is what I was trying to do. I guess I was kind of close:
=INDEX(Sheet2!L$3:L$14119,MATCH($E3,Sheet2!$K$3:$K$14119,0))
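If the two sheets ever end up in pandas, the same INDEX/MATCH lookup can be sketched as a left merge. The frames and column names below are hypothetical stand-ins: `sheet1` holds the lookup keys (column E) and `sheet2` maps keys (column K) to values (column L).

```python
import pandas as pd

sheet1 = pd.DataFrame({"E": ["x", "y", "z"]})
sheet2 = pd.DataFrame({"K": ["y", "z"], "L": [10, 20]})
# A left merge keeps every sheet1 row; unmatched keys get NaN,
# much like INDEX/MATCH returning #N/A
result = sheet1.merge(sheet2, left_on="E", right_on="K", how="left")
print(result["L"].tolist())  # [nan, 10.0, 20.0]
```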

Change value at columns when finding values at column on the right for multiple pairs of columns without loop
By : thomas120
Date : March 29 2020, 07:55 AM
You can locate all 4 and 5 values with ismember, then circshift the resulting boolean to the left and replace with NaN. code :
bool = ismember(data, [4 5]);
% shift the mask one column to the left (negative shift in MATLAB)
shifted = circshift(bool, [0 -1]);
data(shifted) = NaN;
data(circshift(ismember(data, [4 5]), [0 -1])) = NaN;
to_be_nan = ismember(data(:,2:end), [4 5]);
to_be_nan(:,end+1) = false;
data(to_be_nan) = NaN;
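A NumPy sketch of the same idea, using `np.isin` in place of ismember: a value is set to NaN when the column immediately to its right holds a 4 or 5, and a False column is padded on the right so the mask keeps `data`'s shape (mirroring the last MATLAB variant rather than the wrap-around circshift).

```python
import numpy as np

data = np.array([[1.0, 4.0, 7.0],
                 [2.0, 8.0, 5.0]])
# mask of positions whose right-hand neighbour is 4 or 5
to_be_nan = np.isin(data[:, 1:], [4, 5])
to_be_nan = np.hstack([to_be_nan, np.zeros((data.shape[0], 1), dtype=bool)])
data[to_be_nan] = np.nan
print(data)
# [[nan  4.  7.]
#  [ 2. nan  5.]]
```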

Finding duplicates across multiple columns repeated in more than certain number of columns
By : Fangpeng Liu
Date : March 29 2020, 07:55 AM
With subtle changes to akrun's code, I found what I wanted: names(table(unlist(d1)))[table(unlist(d1)) >= 3]
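The R one-liner flattens all columns, tabulates the values, and keeps those appearing at least 3 times overall; a Python sketch of the same idea, with a hypothetical dict of columns standing in for the data frame d1:

```python
from collections import Counter

# Hypothetical stand-in for d1: three columns of values
d1 = {
    "c1": ["a", "b", "c"],
    "c2": ["a", "d", "a"],
    "c3": ["b", "e", "a"],
}
# flatten all columns and count occurrences, like table(unlist(d1))
counts = Counter(v for col in d1.values() for v in col)
# keep values appearing in at least 3 cells
frequent = sorted(v for v, n in counts.items() if n >= 3)
print(frequent)  # ['a']
```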

