Tengo un archivo json con el siguiente esquema:
root
|-- count: long (nullable = true)
|-- results: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- address: string (nullable = true)
| | |-- auto_task_assignment: boolean (nullable = true)
| | |-- deleted_at: string (nullable = true)
| | |-- has_issues: boolean (nullable = true)
| | |-- has_timetable: boolean (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- opening_hours: string (nullable = true)
| | |-- phone_number: string (nullable = true)
| | |-- position_id: long (nullable = true)
| | |-- show_technical_time: boolean (nullable = true)
| | |-- structure_id: long (nullable = true)
| | |-- subcontract_number: string (nullable = true)
| | |-- task_modification: boolean (nullable = true)
| | |-- updated_at: string (nullable = true)
Quiero analizar los resultados de la matriz para obtener DataFrame con todas las columnas que se muestran en el esquema
Al tratar de usar la instrucción select, me ha dado un error.
df.select("results.*").show()
mensaje de error:
AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(results)`
Podría por favor ayudarme cómo filtrar este json?
datos de ejemplo:
{'count': 11, 'next': None, 'previous': None, 'results': [{'id': 1, 'name': 'Samodzielny Publiczny Szpital Kliniczny Nr 1 PUM', 'external_id': None, 'structure_id': 1, 'address': '71-252 Szczecin, Ul. Unii Lubelskiej 1 ', 'phone_number': '+48123456789', 'opening_hours': 'pn-pt: 9:00-17:00', 'deleted_at': '2021-05-27T13:02:12.026410+02:00', 'updated_at': '2021-05-27T13:02:12.026417+02:00', 'position_id': None, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': False, 'has_issues': False, 'show_technical_time': False, 'subcontract_number': None}, {'id': 2, 'name': 'Szpital polowy we wrocławiu', 'external_id': None, 'structure_id': 2, 'address': 'North Montytown, 0861 Greenholt Crescent', 'phone_number': '+48505505505', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-18T16:15:06.608476+01:00', 'position_id': 49, 'has_timetable': True, 'auto_task_assignment': False, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919....191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919191919919; 191919191991; 19991919919; 1919919 191919919; 1919191-255c'}, {'id': 3, 'name': 'Test', 'external_id': None, 'structure_id': 17, 'address': 'ul. Śliczna', 'phone_number': '+48500100107', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-04T14:22:04.712607+01:00', 'position_id': 33, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '07001234'}]}
He encontrado una solución con los Pandas DataFrame, pero mi objetivo es hacerlo con Chispa
enum = 0
for i in df['results']:
if enum == 0 :
df2 = pd.DataFrame(i, index=[0])
enum=+1
else:
df2 = df2.append(i, ignore_index=True)
Salida que se espera es mantener el número de columnas que se repita el mismo valor en cada fila y extraer todas las columnas de los resultados de la estructura, que se espera esquema de abajo:
root
|-- count: long (nullable = true)
|-- address: string (nullable = true)
|-- auto_task_assignment: boolean (nullable = true)
|-- deleted_at: string (nullable = true)
|-- has_issues: boolean (nullable = true)
|-- has_timetable: boolean (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- opening_hours: string (nullable = true)
|-- phone_number: string (nullable = true)
|-- position_id: long (nullable = true)
|-- show_technical_time: boolean (nullable = true)
|-- structure_id: long (nullable = true)
|-- subcontract_number: string (nullable = true)
|-- task_modification: boolean (nullable = true)
|-- updated_at: string (nullable = true)