PySpark: how to parse nested JSON

0

Question

I have a JSON file with the following schema:

root
 |-- count: long (nullable = true)
 |-- results: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- auto_task_assignment: boolean (nullable = true)
 |    |    |-- deleted_at: string (nullable = true)
 |    |    |-- has_issues: boolean (nullable = true)
 |    |    |-- has_timetable: boolean (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- opening_hours: string (nullable = true)
 |    |    |-- phone_number: string (nullable = true)
 |    |    |-- position_id: long (nullable = true)
 |    |    |-- show_technical_time: boolean (nullable = true)
 |    |    |-- structure_id: long (nullable = true)
 |    |    |-- subcontract_number: string (nullable = true)
 |    |    |-- task_modification: boolean (nullable = true)
 |    |    |-- updated_at: string (nullable = true)

I want to parse the results array to get a DataFrame with all the columns shown in the schema. When I try to do it with a select statement, I get an error:

df.select("results.*").show()

Error message:

AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(results)`

Could you please help me flatten this JSON?

Sample data:

{'count': 11, 'next': None, 'previous': None, 'results': [{'id': 1, 'name': 'Samodzielny Publiczny Szpital Kliniczny Nr 1 PUM', 'external_id': None, 'structure_id': 1, 'address': '71-252 Szczecin, Ul. Unii Lubelskiej 1 ', 'phone_number': '+48123456789', 'opening_hours': 'pn-pt: 9:00-17:00', 'deleted_at': '2021-05-27T13:02:12.026410+02:00', 'updated_at': '2021-05-27T13:02:12.026417+02:00', 'position_id': None, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': False, 'has_issues': False, 'show_technical_time': False, 'subcontract_number': None}, {'id': 2, 'name': 'Szpital polowy we wrocławiu', 'external_id': None, 'structure_id': 2, 'address': 'North Montytown, 0861 Greenholt Crescent', 'phone_number': '+48505505505', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-18T16:15:06.608476+01:00', 'position_id': 49, 'has_timetable': True, 'auto_task_assignment': False, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919....191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919191919919; 191919191991; 19991919919; 1919919 191919919; 1919191-255c'}, {'id': 3, 'name': 'Test', 'external_id': None, 'structure_id': 17, 'address': 'ul. Śliczna', 'phone_number': '+48500100107', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-04T14:22:04.712607+01:00', 'position_id': 33, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '07001234'}]}
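
For context, a minimal sketch of how such a file could be loaded in PySpark, assuming the actual file on disk is valid JSON stored as a single object (the sample above is shown as a Python repr) and that "data.json" is a placeholder path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine=True is needed because the whole payload is one JSON object
# spread over multiple lines rather than one object per line.
df = spark.read.json("data.json", multiLine=True)
df.printSchema()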

I have found a solution using a pandas DataFrame, but my goal is to do it with Spark:

import pandas as pd

# Build the DataFrame row by row from the list of dicts in df['results'].
enum = 0
for i in df['results']:
    if enum == 0:
        df2 = pd.DataFrame(i, index=[0])
        enum += 1
    else:
        df2 = df2.append(i, ignore_index=True)
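
For reference, a more direct pandas sketch (assuming df here is the parsed JSON dict from the sample above) would use json_normalize, which flattens the results list in one call and copies the top-level count onto every row:

import pandas as pd

# record_path selects the list of dicts to expand into rows;
# meta copies the top-level count value onto each row.
df2 = pd.json_normalize(df, record_path="results", meta=["count"])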

The expected output keeps the count column, repeating the same value in every row, and extracts all of the columns from the results struct. The expected schema is shown below:

root
 |-- count: long (nullable = true)
 |-- address: string (nullable = true)
 |-- auto_task_assignment: boolean (nullable = true)
 |-- deleted_at: string (nullable = true)
 |-- has_issues: boolean (nullable = true)
 |-- has_timetable: boolean (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- opening_hours: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- position_id: long (nullable = true)
 |-- show_technical_time: boolean (nullable = true)
 |-- structure_id: long (nullable = true)
 |-- subcontract_number: string (nullable = true)
 |-- task_modification: boolean (nullable = true)
 |-- updated_at: string (nullable = true)
Tags: pyspark, python
2021-11-23 20:40:10
1

Best answer

2

You need to explode the results array before un-nesting the struct fields.

from pyspark.sql import functions as F

df.withColumn("results", F.explode(F.col("results"))).select("results.*").show()
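
The expected schema also keeps the top-level count column, which the one-liner above drops. A minimal sketch that preserves it alongside the exploded struct fields (same F as above; flat_df is just an illustrative name):

from pyspark.sql import functions as F

# One row per element of the results array, keeping count on every row
# and un-nesting all of the struct fields.
flat_df = (
    df.withColumn("results", F.explode("results"))
      .select("count", "results.*")
)
flat_df.printSchema()
flat_df.show(truncate=False)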

2021-11-23 21:40:16
