jq is a lightweight and flexible command-line JSON processor. You can use jq on a local development machine to slice, filter, map, and transform the JSON data that Unstructured outputs, in much the same way that tools such as sed, awk, and grep let you work with text.
To get jq, see the Download jq page.
jq is not owned or supported by Unstructured. For questions about jq and feature requests for future versions of jq, see the Issues tab of the jq repository on GitHub.
The following command examples use jq with the spring-weather.html.json file in the example-docs directory within the Unstructured-IO/unstructured repository on GitHub.
Find the element with a type of Address, and print the value of the element's text field.
jq '.[]
| select(.type == "Address")
| .text' spring-weather.html.json
# Output:
#
# "Silver Spring, MD 20910"
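To try this filter without downloading the example file, the following sketch runs it against a small inline sample. The two-element sample array is hypothetical and only mimics the shape of the elements that Unstructured outputs. The -r option, a standard jq flag, prints the string without the surrounding JSON quotes.

```shell
# Hypothetical two-element sample that mimics the shape of Unstructured's output.
sample='[{"type":"Title","text":"National Weather Service"},{"type":"Address","text":"Silver Spring, MD 20910"}]'

# -r prints raw strings instead of JSON-quoted strings.
address=$(printf '%s' "$sample" | jq -r '.[] | select(.type == "Address") | .text')

echo "$address"
# Output:
#
# Silver Spring, MD 20910
```

Raw output is usually the more convenient form when the value is passed on to another shell command.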
Find all elements with a type of Title, and print the text field of each found element as a string in a JSON array.
jq '[
.[]
| select(.type == "Title")
| .text]' spring-weather.html.json
# Output:
#
# [
# "News Around NOAA",
# "National Program",
# "Are You Weather-Ready for the Spring?",
# "Weather.gov >",
# "News Around NOAA > Are You Weather-Ready for the Spring?",
# "US Dept of Commerce",
# "National Oceanic and Atmospheric Administration",
# "National Weather Service",
# "News Around NOAA",
# "1325 East West Highway",
# "Comments? Questions? Please Contact Us.",
# "Disclaimer",
# "Information Quality",
# "Help",
# "Glossary",
# "Privacy Policy",
# "Freedom of Information Act (FOIA)",
# "About Us",
# "Career Opportunities"
# ]
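The output above lists "News Around NOAA" twice. If repeated titles are not wanted, one option is to pipe the array through jq's built-in unique filter, which removes duplicates and sorts the remaining values. The sketch below uses a hypothetical inline sample rather than the example file, and the -c option prints the result on a single line.

```shell
# Hypothetical sample with a repeated Title element.
sample='[{"type":"Title","text":"News Around NOAA"},{"type":"NarrativeText","text":"Spring is here."},{"type":"Title","text":"Help"},{"type":"Title","text":"News Around NOAA"}]'

# Collect the Title texts into an array, then remove duplicates (unique also sorts).
titles=$(printf '%s' "$sample" | jq -c '[.[] | select(.type == "Title") | .text] | unique')

echo "$titles"
# Output:
#
# ["Help","News Around NOAA"]
```

Because unique sorts its result, the titles no longer appear in document order.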
Find all elements with a type of Title. Of these, find the ones that have a text field that contains the phrase Contact Us, and print the contents of each found element's metadata.link_urls field.
jq '.[]
| select(.type == "Title")
| select(.text
| contains("Contact Us"))
| .metadata.link_urls' spring-weather.html.json
# Output:
#
# [
# "https://www.weather.gov/news/contact"
# ]
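As a variation on the command above, the two select calls can be combined into one with and, and the URLs can be printed one per line instead of as an array. The sketch below assumes a hypothetical inline sample in which one matching element has no link_urls field. The trailing []? iterates over the array when it exists and produces nothing when the field is missing, instead of raising an error.

```shell
# Hypothetical sample: two titles mention "Contact Us", but only one has link_urls.
sample='[{"type":"Title","text":"Comments? Questions? Please Contact Us.","metadata":{"link_urls":["https://www.weather.gov/news/contact"]}},{"type":"Title","text":"Contact Us","metadata":{}},{"type":"Title","text":"Help","metadata":{}}]'

# One select with both conditions; []? skips elements whose link_urls is missing.
urls=$(printf '%s' "$sample" | jq -r '.[]
  | select(.type == "Title" and (.text | contains("Contact Us")))
  | .metadata.link_urls[]?')

echo "$urls"
# Output:
#
# https://www.weather.gov/news/contact
```

Note that contains, when applied to strings, performs a case-sensitive substring match.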
Find all elements with a type of ListItem. Of these, find the ones that have a text field that contains the phrase Weather Safety (the "i" flag passed to test makes the match case-insensitive).
For each item in metadata.link_texts, print the item's value as the key, followed by the matching item in metadata.link_urls as the value. Trim any leading and trailing whitespace from all keys and values. Wrap the output in a JSON array.
jq '[
.[]
| select(.type == "ListItem")
| select(.text | test("Weather Safety"; "i"))
| [.metadata.link_texts, .metadata.link_urls]
| transpose[]
| {
(.[0] | gsub("^\\s+|\\s+$"; "")) : (.[1] | gsub("^\\s+|\\s+$"; ""))
}
]' spring-weather.html.json
# Output:
#
# [
# {
# "Weather Safety": "http://www.weather.gov/safetycampaign"
# },
# {
# "Air Quality": "https://www.weather.gov/safety/airquality"
# },
# {
# "Beach Hazards": "https://www.weather.gov/safety/beachhazards"
# },
# {
# "Cold": "https://www.weather.gov/safety/cold"
# },
# {
# "Cold Water": "https://www.weather.gov/safety/coldwater"
# },
# {
# "Drought": "https://www.weather.gov/safety/drought"
# },
# {
# "Floods": "https://www.weather.gov/safety/flood"
# },
# {
# "Fog": "https://www.weather.gov/safety/fog"
# },
# {
# "Heat": "https://www.weather.gov/safety/heat"
# },
# {
# "Hurricanes": "https://www.weather.gov/safety/hurricane"
# },
# {
# "Lightning Safety": "https://www.weather.gov/safety/lightning"
# },
# {
# "Rip Currents": "https://www.weather.gov/safety/ripcurrent"
# },
# {
# "Safe Boating": "https://www.weather.gov/safety/safeboating"
# },
# {
# "Space Weather": "https://www.weather.gov/safety/space"
# },
# {
# "Sun (Ultraviolet Radiation)": "https://www.weather.gov/safety/heat-uv"
# },
# {
# "Thunderstorms & Tornadoes": "https://www.weather.gov/safety/thunderstorm"
# },
# {
# "Tornado": "https://www.weather.gov/safety/tornado"
# },
# {
# "Tsunami": "https://www.weather.gov/safety/tsunami"
# },
# {
# "Wildfire": "https://www.weather.gov/safety/wildfire"
# },
# {
# "Wind": "https://www.weather.gov/safety/wind"
# },
# {
# "Winter": "https://www.weather.gov/safety/winter"
# }
# ]
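The command above produces an array of single-key objects. If a single lookup object is more convenient, one option is to pipe that array through jq's built-in add filter, which merges the objects into one. The sketch below uses a hypothetical inline sample with two links. If the same link text appears more than once, the later URL replaces the earlier one.

```shell
# Hypothetical ListItem element with padded link text, mimicking Unstructured's output.
sample='[{"type":"ListItem","text":"Weather Safety Air Quality","metadata":{"link_texts":[" Weather Safety ","Air Quality"],"link_urls":["http://www.weather.gov/safetycampaign","https://www.weather.gov/safety/airquality"]}}]'

# Same filter as above, followed by add to merge the single-key objects into one object.
merged=$(printf '%s' "$sample" | jq -c '[
  .[]
  | select(.type == "ListItem")
  | select(.text | test("Weather Safety"; "i"))
  | [.metadata.link_texts, .metadata.link_urls]
  | transpose[]
  | {(.[0] | gsub("^\\s+|\\s+$"; "")): (.[1] | gsub("^\\s+|\\s+$"; ""))}
] | add')

# Look up one URL by its trimmed link text.
air_quality=$(printf '%s' "$merged" | jq -r '."Air Quality"')
echo "$air_quality"
# Output:
#
# https://www.weather.gov/safety/airquality
```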
Additional resources