Nested data in Parquet with Python

I have a file that has one JSON per line. Here is a sample:

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

I want to create a parquet file with columns such as:

product.id, product.price, product.specs.voltage, product.specs.color, user

I know that parquet has a nested encoding using the Dremel algorithm, but I haven't been able to use it in python (not sure why).

I'm a heavy pandas and dask user, so the pipeline I'm trying to construct is json data -> dask -> parquet -> pandas, although if anyone has a simple example of creating and reading these nested encodings in parquet using Python I think that would be good enough :D

EDIT

So, after digging in the PRs I found this: https://github.com/dask/fastparquet/pull/177

which is basically what I want to do. Although, I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?

dask version: 0.15.1
fastparquet version: 0.1.1

Asked By: Daniel Severo

Source

Answer #1:

Implementing the conversions on both the read and write path for arbitrary Parquet nested data is quite complicated to get right -- implementing the shredding and reassembly algorithm with associated conversions to some Python data structures. We have this on the roadmap in Arrow / parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only support for simple structs and lists/arrays are supported now). It is important to have this functionality because other systems that use Parquet, like Impala, Hive, Presto, Drill, and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.

This can be analogously implemented in fastparquet as well, but it's going to be a lot of work (and test cases to write) no matter how you slice it.

I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.

Answered By: Wes McKinney

Answer #2:

I believe this feature has finally been added in arrow/pyarrow 2.0.0:

https://issues.apache.org/jira/browse/ARROW-1644

https://arrow.apache.org/docs/python/json.html

Answered By: Pylander

Answer #3:

This is not exactly the right answer, but it can helps.

We could try to convert your dictionary to a pandas DataFrame, and after this write this to .parquet file:

import pandas as pd
from fastparquet import write, ParquetFile

d = {
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

df_test = pd.DataFrame(d)
write('file_test.parquet', df_test)

This would raise and error:

ValueError: Can't infer object conversion type: 0                                   abcdef
1                                    19.99
2    {'voltage': '110v', 'color': 'white'}
Name: product, dtype: object

So a easy solution is to convert the product column to lists:

df_test['product'] = df_test['product'].apply(lambda x: [x])

# this should now works
write('file_test.parquet', df_test)

# and now compare the file with the initial DataFrame
ParquetFile('file_test.parquet').to_pandas().explode('product')
    index            product                                 user
0   id               abcdef                             Daniel Severo
1   price             19.99                             Daniel Severo
2   specs   {'voltage': '110v', 'color': 'white'}       Daniel Severo

Answered By: igorkf

The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .