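Viewing Parquet Files from the Command Line

Introduction

Parquet files are usually read with Spark or Flink. For troubleshooting or learning, however, writing a Spark/Flink program just to take a quick look at a single Parquet file is tedious. This article introduces two tools for reading and analyzing Parquet files that run from the command line. Both are simple to use and require no programming.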

Using parquet-cli

Project page and download

Project page: https://github.com/apache/parquet-java.git

Download: https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.1/parquet-cli-1.14.1-runtime.jar

Official usage and documentation: https://github.com/apache/parquet-java/tree/master/parquet-cli

Usage

Command format:

hadoop jar parquet-cli-1.14.1-runtime.jar <command> <path to local Parquet file>
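
To avoid retyping the jar name on every call, the command format above can be wrapped in a small shell function. This is only a sketch: the function name pq and the PARQUET_CLI_JAR variable are made up for this example and are not part of parquet-cli. It assumes hadoop is on the PATH and that the jar sits in the current directory unless PARQUET_CLI_JAR says otherwise.

```shell
# Location of the parquet-cli runtime jar (hypothetical variable name).
# Defaults to the file name used in this article.
PARQUET_CLI_JAR="${PARQUET_CLI_JAR:-parquet-cli-1.14.1-runtime.jar}"

# pq: hypothetical helper that forwards all arguments to parquet-cli.
pq() {
  if [ "$#" -lt 1 ]; then
    echo "usage: pq <command> [args...]" >&2
    return 2
  fi
  # Same invocation as the command format above; "$@" keeps each
  # argument intact, including paths that contain spaces.
  hadoop jar "$PARQUET_CLI_JAR" "$@"
}
```

With this in place, pq help runs the same command as hadoop jar parquet-cli-1.14.1-runtime.jar help.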

Show help:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar help

Usage: parquet [options] [command] [command options]

  Options:

    -v, --verbose, --debug
        Print extra debugging information

  Commands:

    help
        Retrieves details on the functions of other commands
    meta
        Print a Parquet file's metadata
    pages
        Print page summaries for a Parquet file
    dictionary
        Print dictionaries for a Parquet column
    check-stats
        Check Parquet files for corrupt page and column stats (PARQUET-251)
    schema
        Print the Avro schema for a file
    csv-schema
        Build a schema from a CSV data sample
    convert-csv
        Create a file from CSV data
    convert
        Create a Parquet file from a data file
    to-avro
        Create an Avro file from a data file
    cat
        Print the first N records from a file
    head
        Print the first N records from a file
    column-index
        Prints the column and offset indexes of a Parquet file
    column-size
        Print the column sizes of a parquet file
    prune
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
    trans-compression
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Translate the compression from one to another (It doesn't support bloom filter feature yet).
    masking
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Replace columns with masked values and write to a new Parquet file
    footer
        Print the Parquet file footer in json format
    bloom-filter
        Check bloom filters for a Parquet column
    scan
        Scan all records from a file
    rewrite
        Rewrite one or more Parquet files to a new Parquet file

  Examples:

    # print information for meta
    parquet help meta

  See 'parquet help <command>' for more information on a specific command.

Examples

The examples below use a Parquet file that backs a Hudi table to show how parquet-cli is used.

Print the schema of the Parquet file:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar schema ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{
  "type" : "record",
  "name" : "hudi_student_record",
  "namespace" : "hoodie.hudi_student",
  "fields" : [ {
    "name" : "_hoodie_commit_time",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_commit_seqno",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_record_key",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_partition_path",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_file_name",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "id",
    "type" : "int"
  }, {
    "name" : "name",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "tel",
    "type" : [ "null", "int" ],
    "default" : null
  } ]
}

Print the records in the Parquet file:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar cat ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
{"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
{"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
{"_hoodie_commit_time": "20240710084244349", "_hoodie_commit_seqno": "20240710084244349_0_7", "_hoodie_record_key": "2", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 2, "name": "Mary", "tel": 222222}
{"_hoodie_commit_time": "20240710083659244", "_hoodie_commit_seqno": "20240710083659244_0_3", "_hoodie_record_key": "5", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 5, "name": "Tom", "tel": 666666}
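
Because cat prints one JSON object per line, its output can be piped into ordinary text tools. The sketch below pulls a single string-valued field out of each record using sed. The helper name pq_field is invented for this example, and the pattern assumes the exact `"key": "value"` spacing shown above and values without escaped double quotes; for anything more demanding a real JSON parser is the safer choice.

```shell
# pq_field: hypothetical helper. Reads JSON-lines records on stdin and
# prints the value of one string-valued field per record.
# $1 is the field name, e.g. name or _hoodie_record_key.
pq_field() {
  # The leading double quote in the pattern ensures that "name" does not
  # also match the tail of "_hoodie_file_name".
  sed -n 's/.*"'"$1"'": "\([^"]*\)".*/\1/p'
}
```

The output of the cat command can be piped straight into it, for example `... cat <file> | pq_field name` to list only the name column.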

Print the first 3 records of the Parquet file:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar head -n 3 ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
{"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
{"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}

Print the metadata of the Parquet file:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar meta ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet

File path:  ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
  hoodie_bloom_filter_type_code: DYNAMIC_V0
    org.apache.hudi.bloomfilter: // omitted (too long)
          hoodie_min_record_key: 1
            parquet.avro.schema: {"type":"record","name":"hudi_student_record","namespace":"hoodie.hudi_student","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"int"},{"name":"name","type":["null","string"],"default":null},{"name":"tel","type":["null","int"],"default":null}]}
              writer.model.name: avro
          hoodie_max_record_key: 5
Schema:
message hoodie.hudi_student.hudi_student_record {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required int32 id;
  optional binary name (STRING);
  optional int32 tel;
}


Row group 0:  count: 5  152.20 B records  start: 4  total(compressed): 761 B total(uncompressed):702 B
--------------------------------------------------------------------------------
                        type      encodings count     avg size   nulls   min / max
_hoodie_commit_time     BINARY    G   _     5         19.60 B    0       "20240710083659244" / "20240710084413943"
_hoodie_commit_seqno    BINARY    G   _     5         21.80 B    0       "20240710083659244_0_3" / "20240710084413943_0_11"
_hoodie_record_key      BINARY    G   _     5         12.60 B    0       "1" / "5"
_hoodie_partition_path  BINARY    G _ R     5         18.80 B    0       "" / ""
_hoodie_file_name       BINARY    G _ R     5         31.20 B    0       "ba74ba57-d45c-43c7-9ddb-7..." / "ba74ba57-d45c-43c7-9ddb-7..."
id                      INT32     G   _     5         11.40 B    0       "1" / "5"
name                    BINARY    G   _     5         16.00 B    0       "Jessy" / "Tom"
tel                     INT32     G _ R     5         20.80 B    0       "111111" / "666666"
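
Each "Row group N:" line in the meta output carries a count: field with the number of records in that row group. As a rough sketch, the total record count of a file can be obtained by summing those fields with awk. The helper name pq_row_count is made up for this example, and it relies on the line layout shown above, which may differ in other parquet-cli versions.

```shell
# pq_row_count: hypothetical helper. Reads `meta` output on stdin and
# prints the sum of the "count:" values of all row-group summary lines.
pq_row_count() {
  awk '/^Row group [0-9]+:/ {
         # The number of records follows the "count:" token.
         for (i = 1; i < NF; i++) if ($i == "count:") total += $(i + 1)
       }
       END { print total + 0 }'
}
```

Piping the meta output above into pq_row_count would print 5, matching the five records returned by cat.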
    

Using parquet-tools

Download

Download the jar file:

wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.2/parquet-tools-1.11.2.jar

Usage

hadoop jar parquet-tools-1.11.2.jar <command> <path to Parquet file in HDFS>

The commands are used in the same way as with parquet-cli above, so they are not repeated here.
