Is there an analysis speed or memory usage advantage to using HDF5 for large array storage (instead of flat binary files)?

By : user2954335
Date : November 22 2020, 01:01 AM
HDF5 advantages: organization, flexibility, interoperability
Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), optional arbitrary metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it's very useful in practice.
code :
In [1]: import numpy as np

In [2]: arr = np.arange(4*6*6).reshape(4,6,6)

In [3]: arr
Out[3]:
array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]],

       [[108, 109, 110, 111, 112, 113],
        [114, 115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124, 125],
        [126, 127, 128, 129, 130, 131],
        [132, 133, 134, 135, 136, 137],
        [138, 139, 140, 141, 142, 143]]])
In [4]: arr.ravel(order='C')
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])
In [5]: arr[0,:,:]
Out[5]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])
In [6]: arr[:,:,0]
Out[6]:
array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102],
       [108, 114, 120, 126, 132, 138]])
nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))

chunked = np.hstack([arr[chunk].ravel() for chunk in slices])
chunked
array([  0,   1,   6,   7,  36,  37,  42,  43,   2,   3,   8,   9,  38,
        39,  44,  45,   4,   5,  10,  11,  40,  41,  46,  47,  12,  13,
        18,  19,  48,  49,  54,  55,  14,  15,  20,  21,  50,  51,  56,
        57,  16,  17,  22,  23,  52,  53,  58,  59,  24,  25,  30,  31,
        60,  61,  66,  67,  26,  27,  32,  33,  62,  63,  68,  69,  28,
        29,  34,  35,  64,  65,  70,  71,  72,  73,  78,  79, 108, 109,
       114, 115,  74,  75,  80,  81, 110, 111, 116, 117,  76,  77,  82,
        83, 112, 113, 118, 119,  84,  85,  90,  91, 120, 121, 126, 127,
        86,  87,  92,  93, 122, 123, 128, 129,  88,  89,  94,  95, 124,
       125, 130, 131,  96,  97, 102, 103, 132, 133, 138, 139,  98,  99,
       104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])
In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0,  1],
        [ 6,  7]],

       [[36, 37],
        [42, 43]]])
Writing a chunked dataset with some metadata attached is then just:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'
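Reading it back is symmetric; a minimal sketch (the dataset and attribute names match the snippet above):

import h5py

with h5py.File('test.hdf', 'r') as infile:
    dset = infile['a_descriptive_name']
    print(dset.attrs['some key'])   # -> 'Did you want some metadata?'
    print(dset.chunks)              # the chunk shape h5py auto-selected
    corner = dset[:10, :10, :10]    # reads only the chunks overlapping the slice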
The first benchmark script reads a single slice, in either orientation, from the chunked HDF5 file (this is the chunked_hdf.py timed below):

import sys
import h5py

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    # The file handle is deliberately not closed: the returned dataset is
    # sliced immediately and the process exits (fine for a one-shot benchmark).
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['seismic_volume']

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()
For comparison, the same slices read from the raw flat-binary volume through a memory-mapped numpy array (memmapped_array.py):

import numpy as np
import sys

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    big_binary_filename = '/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol'
    shape = 621, 4991, 2600
    header_len = 3072

    # Fortran order: the first axis varies fastest in memory. Fixing the last
    # axis (a z-slice) therefore reads one contiguous span, while fixing the
    # first axis (an x-slice) touches bytes strided across the entire file.
    data = np.memmap(filename=big_binary_filename, mode='r', offset=header_len,
                     order='F', shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    # Copy into a real in-memory array to force the bytes to actually be read.
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()
Cold-cache timings (clear_cache.sh drops the OS disk cache between runs):

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py z
python chunked_hdf.py z  0.64s user 0.28s system 3% cpu 23.800 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py x
python chunked_hdf.py x  0.12s user 0.30s system 1% cpu 21.856 total
jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py z
python memmapped_array.py z  0.07s user 0.04s system 28% cpu 0.385 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py x
python memmapped_array.py x  2.46s user 37.24s system 0% cpu 3:35:26.85 total

In short: the chunked HDF5 file pays a roughly constant cold-cache cost (~22 seconds) in either slice direction, while the memory-mapped flat file is far faster in its contiguous direction (under half a second) but takes over three and a half hours in the strided one.
goes out of memory when saving large array with HDF5 (Python, PyTables)

By : C.C
Date : March 29 2020, 07:55 AM
When you are using PyTables, the HDF5 file is kept in memory until the file is closed (see more here: In-memory HDF5 files).
I recommend you have a look at the append and flush methods of PyTables, as I think they are exactly what you want. Be aware that flushing the buffer on every loop iteration will significantly reduce performance, due to the constant I/O involved.
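As a minimal sketch (the file name, row width, and batch loop are illustrative, not from the question), an extendable EArray lets you append batches to disk instead of accumulating them in memory:

import numpy as np
import tables

with tables.open_file('big.h5', mode='w') as f:
    # A first axis of length 0 marks the extendable dimension.
    earray = f.create_earray(f.root, 'data',
                             atom=tables.Float64Atom(),
                             shape=(0, 1000))

    for step in range(100):        # illustrative producer loop
        batch = np.random.random((50, 1000))
        earray.append(batch)       # written through to the file, not held in RAM
        if step % 10 == 9:
            f.flush()              # flush occasionally, not on every iteration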
h5py: How to index over multiple large HDF5 files without loading all their content into memory

By : Melika Ghobakhloo
Date : March 29 2020, 07:55 AM
HDF5 introduced the concept of a "Virtual Dataset" (VDS); note that it does not work with versions before 1.10.
I have no experience with the VDS feature, but the h5py docs go into more detail, and the h5py git repository has an example, reproduced here:
code :
'''A simple example of building a virtual dataset.
This makes four 'source' HDF5 files, each with a 1D dataset of 100 numbers.
Then it makes a single 4x100 virtual dataset in a separate file, exposing
the four sources as one dataset.
'''

import h5py
import numpy as np

# Create source files (1.h5 to 4.h5)
for n in range(1, 5):
    with h5py.File('{}.h5'.format(n), 'w') as f:
        d = f.create_dataset('data', (100,), 'i4')
        d[:] = np.arange(100) + n

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i4')

for n in range(1, 5):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, 'data', shape=(100,))
    layout[n - 1] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=-5)
    print("Virtual dataset:")
    print(f['data'][:, :10])
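Since VDS requires HDF5 >= 1.10, it is worth checking which library your h5py build links against before trying this:

import h5py

print(h5py.version.version)       # the h5py version itself
print(h5py.version.hdf5_version)  # the linked HDF5 library; must be >= 1.10 for VDS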
Database vs flat files in python (need speed but can't fit in memory) to be used with generator for NN training

By : Ionel Roman
Date : March 29 2020, 07:55 AM
This is a good use case for writing a custom generator and then using Keras' model.fit_generator. Here's something I wrote the other day in conjunction with Pandas.
Note that I first split my main dataframe into training and validation splits (merged was my original dataframe), but you may have to move things around on disk and specify them when selecting in the generator.
code :
msk = np.random.rand(len(merged)) < 0.8
train = merged[msk]
valid = merged[~msk]

def train_generator(batch_size):
    while True:
        # Sample a fresh batch on every iteration; sampling outside the loop
        # would yield the identical batch forever.
        sample_rows = train[train['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)
model.fit_generator(train_generator(32),
                    steps_per_epoch=10000, epochs=20,
                    validation_data=valid_generator(32), validation_steps=500)
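valid_generator is referenced above but never shown; a minimal counterpart, assuming the validation files follow the same on-disk layout (npf and labels are the same external objects used in train_generator), might be:

def valid_generator(batch_size):
    # Mirrors train_generator, but draws samples from the validation split.
    while True:
        sample_rows = valid[valid['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)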
What is the advantage of saving `.npz` files instead of `.npy` in python, regarding speed, memory and look-up?

By : Puja Bhatia
Date : March 29 2020, 07:55 AM
There are two parts to the explanation that answer your question.
I. NPY vs. NPZ
code :
def _savez(file, args, kwds, compress, allow_pickle=True, pickle_kwargs=None):
    ...
    if compress:
        compression = zipfile.ZIP_DEFLATED
    else:
        compression = zipfile.ZIP_STORED
    ...


def savez(file, *args, **kwds):
    _savez(file, args, kwds, False)


def savez_compressed(file, *args, **kwds):
    _savez(file, args, kwds, True)
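So savez writes a ZIP_STORED archive (one .npy member per array, no compression), and only savez_compressed deflates. A quick illustration of the three options (file names are arbitrary):

import numpy as np

a = np.random.random((1000, 1000))

np.save('a.npy', a)                              # one raw array, no archive overhead
np.savez('arrs.npz', a=a, b=a * 2)               # zip archive, ZIP_STORED (uncompressed)
np.savez_compressed('arrs_c.npz', a=a, b=a * 2)  # zip archive, ZIP_DEFLATED

with np.load('arrs.npz') as npz:
    print(npz.files)    # ['a', 'b'] -- member names, read without loading data
    b = npz['b']        # each member array is loaded lazily, on first access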
MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems

By : Lamsion Chan
Date : March 29 2020, 07:55 AM
I work on a large software system that has used both mechanisms for storing attachments and other content. The first iteration of the system stored all data in BLOBs in the DB. I cursed it at the time: as a programmer, I could write side scripts to immediately operate on the data and change it whenever I wanted to.
Fast-forward about 10 years. I still manage the same software, but the architecture has changed and it was rewritten with filesystem pointers. I curse it now and wish it were back in the DB. With the added benefit of several years, and of having worked with this application at much greater capacity in many more and much larger deployments, I feel my opinion now is better educated. Promotion or system migration of the application requires extensive scripting and the copying of millions of files. On one occasion we changed the OS, and all the file pointers had the wrong directory separator; another time the name of the server holding the files changed, and we had to write and schedule simple SQL update statements with the DBA over the weekend to fix it. Another problem is that the filesystem and the DB records get out of sync. Why is uncertain, but after thousands of days of operation, non-transactional systems (the filesystem and the DB don't share a transactional context) simply drift out of sync. Sometimes files mysteriously go missing.