PySpark: how to parse nested JSON

0

Question

I have a JSON file with the following schema:

root
 |-- count: long (nullable = true)
 |-- results: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- auto_task_assignment: boolean (nullable = true)
 |    |    |-- deleted_at: string (nullable = true)
 |    |    |-- has_issues: boolean (nullable = true)
 |    |    |-- has_timetable: boolean (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- opening_hours: string (nullable = true)
 |    |    |-- phone_number: string (nullable = true)
 |    |    |-- position_id: long (nullable = true)
 |    |    |-- show_technical_time: boolean (nullable = true)
 |    |    |-- structure_id: long (nullable = true)
 |    |    |-- subcontract_number: string (nullable = true)
 |    |    |-- task_modification: boolean (nullable = true)
 |    |    |-- updated_at: string (nullable = true)

I want to parse the results array to get a DataFrame with all the columns shown in the schema. When I try to do it with a select statement, I get an error:

df.select("results.*").show()

Error message:

AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(results)`

Could you please help me flatten this JSON?

Sample data:

{'count': 11, 'next': None, 'previous': None, 'results': [{'id': 1, 'name': 'Samodzielny Publiczny Szpital Kliniczny Nr 1 PUM', 'external_id': None, 'structure_id': 1, 'address': '71-252 Szczecin, Ul. Unii Lubelskiej 1 ', 'phone_number': '+48123456789', 'opening_hours': 'pn-pt: 9:00-17:00', 'deleted_at': '2021-05-27T13:02:12.026410+02:00', 'updated_at': '2021-05-27T13:02:12.026417+02:00', 'position_id': None, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': False, 'has_issues': False, 'show_technical_time': False, 'subcontract_number': None}, {'id': 2, 'name': 'Szpital polowy we wrocławiu', 'external_id': None, 'structure_id': 2, 'address': 'North Montytown, 0861 Greenholt Crescent', 'phone_number': '+48505505505', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-18T16:15:06.608476+01:00', 'position_id': 49, 'has_timetable': True, 'auto_task_assignment': False, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919....191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919191919919; 191919191991; 19991919919; 1919919 191919919; 1919191-255c'}, {'id': 3, 'name': 'Test', 'external_id': None, 'structure_id': 17, 'address': 'ul. Śliczna', 'phone_number': '+48500100107', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-04T14:22:04.712607+01:00', 'position_id': 33, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '07001234'}]}
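
For context, a minimal sketch of how such a file could be loaded in PySpark, assuming the actual file on disk is valid JSON stored as a single object (the sample above is shown as a Python repr) and that "data.json" is a placeholder path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine=True is needed because the whole payload is one JSON object
# spread over multiple lines rather than one object per line.
df = spark.read.json("data.json", multiLine=True)
df.printSchema()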

I have found a solution using a pandas DataFrame, but my goal is to do it with Spark:

import pandas as pd

# Build the DataFrame row by row from the list of dicts in df['results'].
enum = 0
for i in df['results']:
    if enum == 0:
        df2 = pd.DataFrame(i, index=[0])
        enum += 1
    else:
        df2 = df2.append(i, ignore_index=True)
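
For reference, a more direct pandas sketch (assuming df here is the parsed JSON dict from the sample above) would use json_normalize, which flattens the results list in one call and copies the top-level count onto every row:

import pandas as pd

# record_path selects the list of dicts to expand into rows;
# meta copies the top-level count value onto each row.
df2 = pd.json_normalize(df, record_path="results", meta=["count"])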

The expected output keeps the count column, repeating the same value in every row, and extracts all of the columns from the results struct. The expected schema is shown below:

root
 |-- count: long (nullable = true)
 |-- address: string (nullable = true)
 |-- auto_task_assignment: boolean (nullable = true)
 |-- deleted_at: string (nullable = true)
 |-- has_issues: boolean (nullable = true)
 |-- has_timetable: boolean (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- opening_hours: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- position_id: long (nullable = true)
 |-- show_technical_time: boolean (nullable = true)
 |-- structure_id: long (nullable = true)
 |-- subcontract_number: string (nullable = true)
 |-- task_modification: boolean (nullable = true)
 |-- updated_at: string (nullable = true)
Tags: pyspark, python
2021-11-23 20:40:10
1

Best answer

2

You need to explode the results array before un-nesting the struct fields.

from pyspark.sql import functions as F

df.withColumn("results", F.explode(F.col("results"))).select("results.*").show()
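
The expected schema also keeps the top-level count column, which the one-liner above drops. A minimal sketch that preserves it alongside the exploded struct fields (same F as above; flat_df is just an illustrative name):

from pyspark.sql import functions as F

# One row per element of the results array, keeping count on every row
# and un-nesting all of the struct fields.
flat_df = (
    df.withColumn("results", F.explode("results"))
      .select("count", "results.*")
)
flat_df.printSchema()
flat_df.show(truncate=False)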

2021-11-23 21:40:16
